gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2021-06-29	[syserror] Change syserror to linuxerr for E2BIG, EADDRINUSE, and EINVAL	Zach Koopmans
	Remove three syserror entries duplicated in linuxerr. Because of the linuxerr.Equals method, this is a mere change of return values from syserror to linuxerr definitions. Done with only these three errnos as CLs removing all grow to a significantly large size. PiperOrigin-RevId: 382173835
2021-06-22	[syserror] Add conversions to linuxerr with temporary Equals method.	Zach Koopmans
	Add Equals method to compare syserror and unix.Errno errors to linuxerr errors. This will facilitate removal of syserror definitions in a followup, and finding needed conversions from unix.Errno to linuxerr. PiperOrigin-RevId: 380909667
2021-06-16	[syserror] Refactor linuxerr and error package.	Zach Koopmans
	Move Error struct to pkg/errors package for use in multiple places. Move linuxerr static definitions under pkg/errors/linuxerr. Add a lookup list for quick lookup of errors.Error by errno. This is useful when converting syserror errors and unix.Errno/syscall.Errrno values to errors.Error. Update benchmarks routines to include conversions. The below benchmarks show *errors.Error usage to be comparable to using unix.Errno. BenchmarkAssignUnix BenchmarkAssignUnix-32 787875022 1.284 ns/op BenchmarkAssignLinuxerr BenchmarkAssignLinuxerr-32 1000000000 1.209 ns/op BenchmarkAssignSyserror BenchmarkAssignSyserror-32 759269229 1.429 ns/op BenchmarkCompareUnix BenchmarkCompareUnix-32 1000000000 1.310 ns/op BenchmarkCompareLinuxerr BenchmarkCompareLinuxerr-32 1000000000 1.241 ns/op BenchmarkCompareSyserror BenchmarkCompareSyserror-32 147196165 8.248 ns/op BenchmarkSwitchUnix BenchmarkSwitchUnix-32 373233556 3.664 ns/op BenchmarkSwitchLinuxerr BenchmarkSwitchLinuxerr-32 476323929 3.294 ns/op BenchmarkSwitchSyserror BenchmarkSwitchSyserror-32 39293408 29.62 ns/op BenchmarkReturnUnix BenchmarkReturnUnix-32 1000000000 0.5042 ns/op BenchmarkReturnLinuxerr BenchmarkReturnLinuxerr-32 1000000000 0.8152 ns/op BenchmarkConvertUnixLinuxerr BenchmarkConvertUnixLinuxerr-32 739948875 1.547 ns/op BenchmarkConvertUnixLinuxerrZero BenchmarkConvertUnixLinuxerrZero-32 977733974 1.489 ns/op PiperOrigin-RevId: 379806801
2021-06-01	Move sync generics to their own packages	Tamir Duberstein
	The presence of multiple packages in a single directory sometimes confuses `go mod`, producing output like: go: downloading gvisor.dev/gvisor v0.0.0-20210601174640-77dc0f5bc94d $GOMODCACHE/gvisor.dev/gvisor@v0.0.0-20210601174640-77dc0f5bc94d/pkg/linewriter/linewriter.go:21:2: found packages sync (aliases.go) and seqatomic (generic_atomicptr_unsafe.go) in $GOMODCACHE/gvisor.dev/gvisor@v0.0.0-20210601174640-77dc0f5bc94d/pkg/sync imports.go:67:2: found packages tcp (accept.go) and rcv (rcv_test.go) in $GOMODCACHE/gvisor.dev/gvisor@v0.0.0-20210601174640-77dc0f5bc94d/pkg/tcpip/transport/tcp PiperOrigin-RevId: 376956213
2021-04-02	Implement cgroupfs.	Rahat Mahmood
	A skeleton implementation of cgroupfs. It supports trivial cpu and memory controllers with no support for hierarchies. PiperOrigin-RevId: 366561126
2021-03-29	[syserror] Split usermem package	Zach Koopmans
	Split usermem package to help remove syserror dependency in go_marshal. New hostarch package contains code not dependent on syserror. PiperOrigin-RevId: 365651233
2021-03-03	[op] Replace syscall package usage with golang.org/x/sys/unix in pkg/.	Ayush Ranjan
	The syscall package has been deprecated in favor of golang.org/x/sys. Note that syscall is still used in the following places: - pkg/sentry/socket/hostinet/stack.go: some netlink related functionalities are not yet available in golang.org/x/sys. - syscall.Stat_t is still used in some places because os.FileInfo.Sys() still returns it and not unix.Stat_t. Updates #214 PiperOrigin-RevId: 360701387
2021-01-22	Implement F_GETLK fcntl.	Dean Deng
	Fixes #5113. PiperOrigin-RevId: 353313374
2020-11-13	Check for misuse of kernel.Task as context.Context.	Jamie Liu
	Checks in Task.block() and Task.Value() are conditional on race detection being enabled, since these functions are relatively hot. Checks in Task.SleepStart() and Task.UninterruptibleSleepStart() are enabled unconditionally, since these functions are not thought to lie on any critical paths, and misuse of these functions is required for b/168241471 to manifest. PiperOrigin-RevId: 342342175
2020-11-12	Rename kernel.TaskContext to kernel.TaskImage.	Jamie Liu
	This reduces confusion with context.Context (which is also relevant to kernel.Tasks) and is consistent with existing function kernel.LoadTaskImage(). PiperOrigin-RevId: 342167298
2020-10-23	Rewrite reference leak checker without finalizers.	Dean Deng
	Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. - Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
2020-10-19	[vfs2] Fix fork reference leaks.	Dean Deng
	PiperOrigin-RevId: 337919424
2020-10-14	Fix shm reference leak.	Dean Deng
	All shm segments in an IPC namespace should be released once that namespace is destroyed. Add reference counting to IPCNamespace so that once the last task with a reference on it exits, we can trigger a destructor that will clean up all shm segments that have not been explicitly freed by the application. PiperOrigin-RevId: 337032977
2020-10-02	Convert uses of the binary package in kernel to go-marshal.	Rahat Mahmood
	PiperOrigin-RevId: 335077195
2020-09-24	Fix socket record leak in VFS2	Tiwei Bie
	VFS2 socket record is not removed from the system-wide socket table when the socket is released, which will lead to a memory leak. This patch fixes this issue. Fixes: #3874 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-24	Rename kernel.SocketEntry to kernel.SocketRecord	Tiwei Bie
	SocketEntry can be confusing with the template types as the 'Entry' is usually used as a suffix for list element types, e.g. socketEntry in the same package. Suggested by Dean (@dean-deng). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-15	Enable automated marshalling for the syscall package.	Rahat Mahmood
	PiperOrigin-RevId: 331940975
2020-09-11	Move the 'marshal' and 'primitive' packages to the 'pkg' directory.	Rahat Mahmood
	PiperOrigin-RevId: 331256608
2020-08-25	Use new reference count utility throughout gvisor.	Dean Deng
	This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377
2020-08-25	Expose basic coverage information to userspace through kcov interface.	Dean Deng
	In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block, which updates the memory-mapped coverage information. The Go coverage tool does not allow us to inject arbitrary instructions into basic blocks, but it does provide data that we can convert to a kcov-like format and transfer them to userspace through a memory mapping. Note that this is not a strict implementation of kcov, which is especially tricky to do because we do not have the same coverage tools available in Go that that are available for the actual Linux kernel. In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block to write program counters to the kcov memory mapping. In Go, however, coverage tools only give us a count of basic blocks as they are executed. Every time we return to userspace, we collect the coverage information and write out PCs for each block that was executed, providing userspace with the illusion that the kcov data is always up to date. For convenience, we also generate a unique synthetic PC for each block instead of using actual PCs. Finally, we do not provide thread-specific coverage data (each kcov instance only contains PCs executed by the thread owning it); instead, we will supply data for any file specified by -- instrumentation_filter. Also, fix issue in nogo that was causing pkg/coverage:coverage_nogo compilation to fail. PiperOrigin-RevId: 328426526
2020-08-17	Remove weak references from unix sockets.	Dean Deng
	The abstract socket namespace no longer holds any references on sockets. Instead, TryIncRef() is used when a socket is being retrieved in BoundEndpoint(). Abstract sockets are now responsible for removing themselves from the namespace they are in, when they are destroyed. Updates #1486. PiperOrigin-RevId: 327064173
2020-07-23	Add task work mechanism.	Dean Deng
	Like task_work in Linux, this allows us to register callbacks to be executed before returning to userspace. This is needed for kcov support, which requires coverage information to be up-to-date whenever we are in user mode. We will provide coverage data through the kcov interface to enable coverage-directed fuzzing in syzkaller. One difference from Linux is that task work cannot queue work before the transition to userspace that it precedes; queued work will be picked up before the next transition. PiperOrigin-RevId: 322889984
2020-06-23	Support for saving pointers to fields in the state package.	Adin Scannell
	Previously, it was not possible to encode/decode an object graph which contained a pointer to a field within another type. This was because the encoder was previously unable to disambiguate a pointer to an object and a pointer within the object. This CL remedies this by constructing an address map tracking the full memory range object occupy. The encoded Refvalue message has been extended to allow references to children objects within another object. Because the encoding process may learn about object structure over time, we cannot encode any objects under the entire graph has been generated. This CL also updates the state package to use standard interfaces intead of reflection-based dispatch in order to improve performance overall. This includes a custom wire protocol to significantly reduce the number of allocations and take advantage of structure packing. As part of these changes, there are a small number of minor changes in other places of the code base: * The lists used during encoding are changed to use intrusive lists with the objectEncodeState directly, which required that the ilist Len() method is updated to work properly with the ElementMapper mechanism. * A bug is fixed in the list code wherein Remove() called on an element that is already removed can corrupt the list (removing the element if there's only a single element). Now the behavior is correct. * Standard error wrapping is introduced. * Compressio was updated to implement the new wire.Reader and wire.Writer inteface methods directly. The lack of a ReadByte and WriteByte caused issues not due to interface dispatch, but because underlying slices for a Read or Write call through an interface would always escape to the heap! * Statify has been updated to support the new APIs. See README.md for a description of how the new mechanism works. PiperOrigin-RevId: 318010298
2020-06-16	Port aio to VFS2.	Nicolas Lacasse
	In order to make sure all aio goroutines have stopped during S/R, a new WaitGroup was added to TaskSet, analagous to runningGoroutines. This WaitGroup is incremented with each aio goroutine, and waited on during kernel.Pause. The old VFS1 aio code was changed to use this new WaitGroup, rather than fs.Async. The only uses of fs.Async are now inode and mount Release operations, which do not call fs.Async recursively. This fixes a lock-ordering violation that can cause deadlocks. Updates #1035. PiperOrigin-RevId: 316689380
2020-05-14	Port memfd_create to vfs2 and finish implementation of file seals.	Nicolas Lacasse
	Closes #2612. PiperOrigin-RevId: 311548074
2020-05-07	Move pkg/sentry/vfs/{eventfd,timerfd} to new packages in pkg/sentry/fsimpl.	Nicolas Lacasse
	They don't depend on anything in VFS2, so they should be their own packages. PiperOrigin-RevId: 310416807
2020-04-16	Implement pipe(2) and pipe2(2) for VFS2.	Jamie Liu
	Updates #1035 PiperOrigin-RevId: 306968644
2020-04-03	Add FileDescriptionImpl for Unix sockets.	Dean Deng
	This change involves several steps: - Refactor the VFS1 unix socket implementation to share methods between VFS1 and VFS2 where possible. Re-implement the rest. - Override the default PRead, Read, PWrite, Write, Ioctl, Release methods in FileDescriptionDefaultImpl. - Add functions to create and initialize a new Dentry/Inode and FileDescription for a Unix socket file. Updates #1476 PiperOrigin-RevId: 304689796
2020-03-31	Add socket filesystem and global disconnected socket mount for VFS2.	Dean Deng
	A socket mount where anonymous sockets will reside is added to the VirtualFilesystem. Socketfs is built on top of kernfs. Updates #1476, #1478, #1484, #1485. PiperOrigin-RevId: 304095251
2020-02-14	Enable automated marshalling for struct stat.	gVisor bot
	This requires fixing a few build issues for non-am64 platforms. PiperOrigin-RevId: 295196922
2020-02-14	Plumb VFS2 inside the Sentry	gVisor bot
	- Added fsbridge package with interface that can be used to open and read from VFS1 and VFS2 files. - Converted ELF loader to use fsbridge - Added VFS2 types to FSContext - Added vfs.MountNamespace to ThreadGroup Updates #1623 PiperOrigin-RevId: 295183950
2020-02-05	Add notes to relevant tests.	Adin Scannell
	These were out-of-band notes that can help provide additional context and simplify automated imports. PiperOrigin-RevId: 293525915
2020-01-28	Add vfs.FileDescription to FD table	Fabricio Voznika
	FD table now holds both VFS1 and VFS2 types and uses the correct one based on what's set. Parts of this CL are just initial changes (e.g. sys_read.go, runsc/main.go) to serve as a template for the remaining changes. Updates #1487 Updates #1623 PiperOrigin-RevId: 292023223
2020-01-27	Update package locations.	Adin Scannell
	Because the abi will depend on the core types for marshalling (usermem, context, safemem, safecopy), these need to be flattened from the sentry directory. These packages contain no sentry-specific details. PiperOrigin-RevId: 291811289
2020-01-27	Standardize on tools directory.	Adin Scannell
	PiperOrigin-RevId: 291745021
2020-01-09	New sync package.	Ian Gudger
	* Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387
2019-11-21	Import and structure cleanup.	Adin Scannell
	PiperOrigin-RevId: 281795269
2019-10-16	Reorder BUILD license and load functions in gvisor.	Kevin Krakauer
	PiperOrigin-RevId: 275139066
2019-09-23	internal BUILD file cleanup.	gVisor bot
	PiperOrigin-RevId: 270680704
2019-09-19	Job control: controlling TTYs and foreground process groups.	Kevin Krakauer
	Adresses a deadlock with the rolled back change: https://github.com/google/gvisor/commit/b6a5b950d28e0b474fdad160b88bc15314cf9259 Creating a session from an orphaned process group was causing a lock to be acquired twice by a single goroutine. This behavior is addressed, and a test (OrphanRegression) has been added to pty.cc. Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 270088599
2019-09-12	Remove go_test from go_stateify and go_marshal	Michael Pratt
	They are no-ops, so the standard rule works fine. PiperOrigin-RevId: 268776264
2019-08-30	Automated rollback of changelist 261387276	Bhasker Hariharan
	PiperOrigin-RevId: 266491264
2019-08-02	Job control: controlling TTYs and foreground process groups.	Kevin Krakauer
	(Don't worry, this is mostly tests.) Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 261387276
2019-07-09	build: add nogo for static validation	Adin Scannell
	PiperOrigin-RevId: 257297820
2019-07-02	Remove map from fd_map, change to fd_table.	Adin Scannell
	This renames FDMap to FDTable and drops the kernel.FD type, which had an entire package to itself and didn't serve much use (it was freely cast between types, and served as more of an annoyance than providing any protection.) Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup operation, and 10-15 ns per concurrent lookup operation of savings. This also fixes two tangential usage issues with the FDMap. Namely, non-atomic use of NewFDFrom and associated calls to Remove (that are both racy and fail to drop the reference on the underlying file.) PiperOrigin-RevId: 256285890
2019-06-13	Update canonical repository.	Adin Scannell
	This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620
2019-06-10	Store more information in the kernel socket table.	Rahat Mahmood
	Store enough information in the kernel socket table to distinguish between different types of sockets. Previously we were only storing the socket family, but this isn't enough to classify sockets. For example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets are SOCK_DGRAM sockets with a particular protocol. Instead of creating more sub-tables, flatten the socket table and provide a filtering mechanism based on the socket entry. Also generate and store a socket entry index ("sl" in linux) which allows us to output entries in a stable order from procfs. PiperOrigin-RevId: 252495895
2019-04-01	Save/restore simple devices.	Rahat Mahmood
	We weren't saving simple devices' last allocated inode numbers, which caused inode number reuse across S/R. PiperOrigin-RevId: 241414245 Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-03-14	Decouple filemem from platform and move it to pgalloc.MemoryFile.	Jamie Liu
	This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-02-20	Make some ptrace commands x86-only	Haibo Xu
	Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I9751f859332d433ca772d6b9733f5a5a64398ec7 PiperOrigin-RevId: 234877624