gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2021-01-05	Internal changes.	Andrei Vagin
	PiperOrigin-RevId: 350159657
2020-10-29	Keep magic constants out of netstack	Kevin Krakauer
	PiperOrigin-RevId: 339721152
2020-10-23	Rewrite reference leak checker without finalizers.	Dean Deng
	Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. - Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
2020-10-19	Remove legacy bazel configurations.	Adin Scannell
	Using the newer bazel rules necessitates a transition from proto1 to proto2. In order to resolve the incompatibility between proto2 and gogoproto, the cri runtimeoptions proto must be vendored. Further, some of the semantics of bazel caching changed during the transition. It is now necessary to: - Ensure that :gopath depends only on pure library targets, as the propagation of go_binary build attributes (pure, static) will affected the generated files (though content remains the same, there are conflicts with respect to the gopath). - Update bazel.mk to include the possibility of binaries in the bazel-out directory, as it will now put runsc and others there. This required some refinements to the mechanism of extracting paths, since some the existing regex resulted in false positives. - Change nogo rules to prevent escape generation on binary targets. For some reason, the newer version of bazel attempted to run the nogo analysis on the binary targets, which fails due to the fact that objdump does not work on the final binary. This must be due to a change in the semantics of aspects in bazel3. PiperOrigin-RevId: 337958324
2020-09-17	Add VFS2 overlay support in runsc	Fabricio Voznika
	All tests under runsc are passing with overlay enabled. Updates #1487, #1199 PiperOrigin-RevId: 332181267
2020-09-15	Add support for OCI seccomp filters in the sandbox.	Ian Lewis
	OCI configuration includes support for specifying seccomp filters. In runc, these filter configurations are converted into seccomp BPF programs and loaded into the kernel via libseccomp. runsc needs to be a static binary so, for runsc, we cannot rely on a C library and need to implement the functionality in Go. The generator added here implements basic support for taking OCI seccomp configuration and converting it into a seccomp BPF program with the same behavior as a program generated by libseccomp. - New conditional operations were added to pkg/seccomp to support operations available in OCI. - AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect that syscalls matching the conditionals result in the provided action not simply SCMP_RET_ALLOW. - BuildProgram in pkg/seccomp no longer panics if provided an empty list of rules. It now builds a program with the architecture sanity check only. - ProgramBuilder now allows adding labels that are unused. However, backwards jumps are still not permitted. Fixes #510 PiperOrigin-RevId: 331938697
2020-09-04	Simplify FD handling for container start/exec	Fabricio Voznika
	VFS1 and VFS2 host FDs have different dupping behavior, making error prone to code for both. Change the contract so that FDs are released as they are used, so the caller can simple defer a block that closes all remaining files. This also addresses handling of partial failures. With this fix, more VFS2 tests can be enabled. Updates #1487 PiperOrigin-RevId: 330112266
2020-08-19	Move boot.Config to its own package	Fabricio Voznika
	Updates #3494 PiperOrigin-RevId: 327548511
2020-07-22	Support for receiving outbound packets in AF_PACKET.	Bhasker Hariharan
	Updates #173 PiperOrigin-RevId: 322665518
2020-07-13	Merge pull request #2672 from amscanne:shim-integrated	gVisor bot
	PiperOrigin-RevId: 321053634
2020-06-25	Moved FUSE device under the fuse directory	Ridwan Sharif

2020-06-24	Port /dev/net/tun device to VFS2.	Nicolas Lacasse
	Updates #2912 #1035 PiperOrigin-RevId: 318162565
2020-06-23	Port /dev/tty device to VFS2.	Nicolas Lacasse
	Support is limited to the functionality that exists in VFS1. Updates #2923 #1035 PiperOrigin-RevId: 317981417
2020-06-11	Add //pkg/sentry/fsimpl/overlay.	Jamie Liu
	Major differences from existing overlay filesystems: - Linux allows lower layers in an overlay to require revalidation, but not the upper layer. VFS1 allows the upper layer in an overlay to require revalidation, but not the lower layer. VFS2 does not allow any layers to require revalidation. (Now that vfs.MkdirOptions.ForSyntheticMountpoint exists, no uses of overlay in VFS1 are believed to require upper layer revalidation; in particular, the requirement that the upper layer support the creation of "trusted." extended attributes for whiteouts effectively required the upper filesystem to be tmpfs in most cases.) - Like VFS1, but unlike Linux, VFS2 overlay does not attempt to make mutations of the upper layer atomic using a working directory and features like RENAME_WHITEOUT. (This may change in the future, since not having a working directory makes error recovery for some operations, e.g. rmdir, particularly painful.) - Like Linux, but unlike VFS1, VFS2 represents whiteouts using character devices with rdev == 0; the equivalent of the whiteout attribute on directories is xattr trusted.overlay.opaque = "y"; and there is no equivalent to the whiteout attribute on non-directories since non-directories are never merged with lower layers. - Device and inode numbers work as follows: - In Linux, modulo the xino feature and a special case for when all layers are the same filesystem: - Directories use the overlay filesystem's device number and an ephemeral inode number assigned by the overlay. - Non-directories that have been copied up use the device and inode number assigned by the upper filesystem. - Non-directories that have not been copied up use a per-(overlay, layer)-pair device number and the inode number assigned by the lower filesystem. - In VFS1, device and inode numbers always come from the lower layer unless "whited out"; this has the adverse effect of requiring interaction with the lower filesystem even for non-directory files that exist on the upper layer. - In VFS2, device and inode numbers are assigned as in Linux, except that xino and the samefs special case are not supported. - Like Linux, but unlike VFS1, VFS2 does not attempt to maintain memory mapping coherence across copy-up. (This may have to change in the future, as users may be dependent on this property.) - Like Linux, but unlike VFS1, VFS2 uses the overlayfs mounter's credentials when interacting with the overlay's layers, rather than the caller's. - Like Linux, but unlike VFS1, VFS2 permits multiple lower layers in an overlay. - Like Linux, but unlike VFS1, VFS2's overlay filesystem is application-mountable. Updates #1199 PiperOrigin-RevId: 316019067
2020-05-06	Fix runsc syscall documentation generation.	Adin Scannell
	We can register any number of tables with any number of architectures, and need not limit the definitions to the architecture in question. This allows runsc to generate documentation for all architectures simultaneously. Similarly, this simplifies the VFSv2 patching process. PiperOrigin-RevId: 310224827
2020-05-04	Add TTY support on VFS2 to runsc	Fabricio Voznika
	Updates #1623, #1487 PiperOrigin-RevId: 309777922
2020-04-30	FIFO QDisc implementation	Bhasker Hariharan
	Updates #231 PiperOrigin-RevId: 309323808
2020-04-24	VFS2: Get HelloWorld image tests to pass with VFS2	Zach Koopmans
	This change includes: - Modifications to loader_test.go to get TestCreateMountNamespace to pass with VFS2. - Changes necessary to get TestHelloWorld in image tests to pass with VFS2. This means runsc can run the hello-world container with docker on VSF2. Note: Containers that use sockets will not run with these changes. See "//test/image/...". Any tests here with sockets currently fail (which is all of them but HelloWorld). PiperOrigin-RevId: 308363072
2020-04-23	Simplify Docker test infrastructure.	Adin Scannell
	This change adds a layer of abstraction around the internal Docker APIs, and eliminates all direct dependencies on Dockerfiles in the infrastructure. A subsequent change will automated the generation of local images (with efficient caching). Note that this change drops the use of bazel container rules, as that experiment does not seem to be viable. PiperOrigin-RevId: 308095430
2020-04-22	Move user home detection to its own library.	Nicolas Lacasse
	PiperOrigin-RevId: 307977689
2020-04-17	Get /bin/true to run on VFS2	Zach Koopmans
	Included: - loader_test.go RunTest and TestStartSignal VFS2 - container_test.go TestAppExitStatus on VFS2 - experimental flag added to runsc to turn on VFS2 Note: shared mounts are not yet supported. PiperOrigin-RevId: 307070753
2020-04-14	Merge pull request #2212 from aaronlu:dup_stdioFDs	gVisor bot
	PiperOrigin-RevId: 306477639
2020-02-20	Initial network namespace support.	gVisor bot
	TCP/IP will work with netstack networking. hostinet doesn't work, and sockets will have the same behavior as it is now. Before the userspace is able to create device, the default loopback device can be used to test. /proc/net and /sys/net will still be connected to the root network stack; this is the same behavior now. Issue #1833 PiperOrigin-RevId: 296309389
2020-01-28	Add vfs.FileDescription to FD table	Fabricio Voznika
	FD table now holds both VFS1 and VFS2 types and uses the correct one based on what's set. Parts of this CL are just initial changes (e.g. sys_read.go, runsc/main.go) to serve as a template for the remaining changes. Updates #1487 Updates #1623 PiperOrigin-RevId: 292023223
2020-01-27	Update package locations.	Adin Scannell
	Because the abi will depend on the core types for marshalling (usermem, context, safemem, safecopy), these need to be flattened from the sentry directory. These packages contain no sentry-specific details. PiperOrigin-RevId: 291811289
2020-01-27	Standardize on tools directory.	Adin Scannell
	PiperOrigin-RevId: 291745021
2020-01-09	New sync package.	Ian Gudger
	* Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387
2019-12-06	Merge pull request #1233 from xiaobo55x:compatLog	gVisor bot
	PiperOrigin-RevId: 284305935
2019-12-03	Enable runsc compatLog support on arm64.	Haibo Xu
	Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I3fd5e552f5f03b5144ed52647f75af3b8253b1d6
2019-11-13	Enable runsc/boot support on arm64.	Haibo Xu
	This patch also include a minor change to replace syscall.Dup2 with syscall.Dup3 which was missed in a previous commit(ref a25a976). Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3
2019-11-04	Add NETLINK_KOBJECT_UEVENT socket support	Michael Pratt
	NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events. gVisor doesn't have any device events, so our sockets don't need to do anything once created. systemd's device manager needs to be able to create one of these sockets. It also wants to install a BPF filter on the socket. Since we'll never send any messages, the filter would never be invoked, thus we just fake it out. Fixes #1117 Updates #1119 PiperOrigin-RevId: 278405893
2019-10-07	Rename epsocket to netstack.	Kevin Krakauer
	PiperOrigin-RevId: 273365058
2019-09-25	Remove centralized registration of protocols.	Kevin Krakauer
	Also removes the need for protocol names. PiperOrigin-RevId: 271186030
2019-09-23	Always set HOME env var with `runsc exec`.	Nicolas Lacasse
	We already do this for `runsc run`, but need to do the same for `runsc exec`. PiperOrigin-RevId: 270793459
2019-08-02	Stops container if gofer is killed	Fabricio Voznika
	Each gofer now has a goroutine that polls on the FDs used to communicate with the sandbox. The respective gofer is destroyed if any of the FDs is closed. Closes #601 PiperOrigin-RevId: 261383725
2019-07-03	Avoid importing platforms from many source files	Andrei Vagin
	PiperOrigin-RevId: 256494243
2019-07-02	Remove map from fd_map, change to fd_table.	Adin Scannell
	This renames FDMap to FDTable and drops the kernel.FD type, which had an entire package to itself and didn't serve much use (it was freely cast between types, and served as more of an annoyance than providing any protection.) Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup operation, and 10-15 ns per concurrent lookup operation of savings. This also fixes two tangential usage issues with the FDMap. Namely, non-atomic use of NewFDFrom and associated calls to Remove (that are both racy and fail to drop the reference on the underlying file.) PiperOrigin-RevId: 256285890
2019-06-28	Add finalizer on AtomicRefCount to check for leaks.	Ian Gudger
	PiperOrigin-RevId: 255711454
2019-06-13	Update canonical repository.	Adin Scannell
	This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620
2019-06-13	Set the HOME environment variable (fixes #293)	Ian Lewis
	runsc will now set the HOME environment variable as required by POSIX. The user's home directory is retrieved from the /etc/passwd file located on the container's file system during boot. PiperOrigin-RevId: 253120627
2019-06-12	gvisor/runsc: apply seccomp filters before parsing a state file	Andrei Vagin
	PiperOrigin-RevId: 252869983
2019-06-11	Add support to mount pod shared tmpfs mounts	Fabricio Voznika
	Parse annotations containing 'gvisor.dev/spec/mount' that gives hints about how mounts are shared between containers inside a pod. This information can be used to better inform how to mount these volumes inside gVisor. For example, a volume that is shared between containers inside a pod can be bind mounted inside the sandbox, instead of being two independent mounts. For now, this information is used to allow the same tmpfs mounts to be shared between containers which wasn't possible before. PiperOrigin-RevId: 252704037
2019-06-07	Move //pkg/sentry/memutil to //pkg/memutil.	Jamie Liu
	PiperOrigin-RevId: 252124156
2019-03-14	Decouple filemem from platform and move it to pgalloc.MemoryFile.	Jamie Liu
	This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-05	Add uncaught signal message to the user log	Fabricio Voznika
	This help troubleshoot cases where the container is killed and the app logs don't show the reason. PiperOrigin-RevId: 236982883 Change-Id: I361892856a146cea5b04abaa3aedbf805e123724
2019-02-22	Rename ping endpoints to icmp endpoints.	Kevin Krakauer
	PiperOrigin-RevId: 235248572 Change-Id: I5b0538b6feb365a98712c2a2d56d856fe80a8a09
2019-01-31	Remove license comments	Michael Pratt
	Nothing reads them and they can simply get stale. Generated with: $ sed -i "s/licenses($.$)./licenses(\1)/" **/BUILD PiperOrigin-RevId: 231818945 Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2018-12-10	Open source system call tests.	Brian Geffon
	PiperOrigin-RevId: 224886231 Change-Id: I0fccb4d994601739d8b16b1d4e6b31f40297fb22
2018-11-20	Internal change.	Nicolas Lacasse
	PiperOrigin-RevId: 221848471 Change-Id: I882fbe5ce7737048b2e1f668848e9c14ed355665
2018-11-13	Internal change.	Nicolas Lacasse
	PiperOrigin-RevId: 221343626 Change-Id: I03d57293a555cf4da9952a81803b9f8463173c89