gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2020-09-29	Merge pull request #3875 from btw616:fix/issue-3874	gVisor bot
	PiperOrigin-RevId: 334428344
2020-09-27	Clean up kcov.	Dean Deng
	Previously, we did not check the kcov mode when performing task work. As a result, disabling kcov did not do anything. Also avoid expensive atomic RMW when consuming coverage data. We don't need the swap if the value is already zero (which is most of the time), and it is ok if there are slight inconsistencies due to a race between coverage data generation (incrementing the value) and consumption (reading a nonzero value and writing zero). PiperOrigin-RevId: 334049207
2020-09-24	Add basic stateify annotations.	Adin Scannell
	Updates #1663 PiperOrigin-RevId: 333539293
2020-09-24	Fix socket record leak in VFS2	Tiwei Bie
	VFS2 socket record is not removed from the system-wide socket table when the socket is released, which will lead to a memory leak. This patch fixes this issue. Fixes: #3874 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-24	Rename kernel.SocketEntry to kernel.SocketRecord	Tiwei Bie
	SocketEntry can be confusing with the template types as the 'Entry' is usually used as a suffix for list element types, e.g. socketEntry in the same package. Suggested by Dean (@dean-deng). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-22	Handle EOF properly in splice/sendfile.	Dean Deng
	Use HandleIOErrorVFS2 instead of custom error handling. PiperOrigin-RevId: 333227581
2020-09-17	Complete vfs2 implementation of fallocate.	Dean Deng
	This change includes overlay, special regular gofer files, and hostfs. Fixes #3589. PiperOrigin-RevId: 332330860
2020-09-16	Rename marshal.Task to marshal.CopyContext.	Rahat Mahmood
	CopyContext is a better name for the interface because from go-marshal's perspective, the interface has nothing to do with a task. A kernel.Task happens to implement the interface, but so can other things like MemoryManager and IO sequences. PiperOrigin-RevId: 331959678
2020-09-15	Enable automated marshalling for the syscall package.	Rahat Mahmood
	PiperOrigin-RevId: 331940975
2020-09-15	Add support for OCI seccomp filters in the sandbox.	Ian Lewis
	OCI configuration includes support for specifying seccomp filters. In runc, these filter configurations are converted into seccomp BPF programs and loaded into the kernel via libseccomp. runsc needs to be a static binary so, for runsc, we cannot rely on a C library and need to implement the functionality in Go. The generator added here implements basic support for taking OCI seccomp configuration and converting it into a seccomp BPF program with the same behavior as a program generated by libseccomp. - New conditional operations were added to pkg/seccomp to support operations available in OCI. - AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect that syscalls matching the conditionals result in the provided action not simply SCMP_RET_ALLOW. - BuildProgram in pkg/seccomp no longer panics if provided an empty list of rules. It now builds a program with the architecture sanity check only. - ProgramBuilder now allows adding labels that are unused. However, backwards jumps are still not permitted. Fixes #510 PiperOrigin-RevId: 331938697
2020-09-15	Release FDTable lock before dropping the fds.	Nayana Bidari
	This is needed for SO_LINGER, where close() is blocked for linger timeout and we are holding the FDTable lock for the entire timeout which will not allow us to create/delete other fds. We have to release the locks and then drop the fds. PiperOrigin-RevId: 331844185
2020-09-14	Correct FDSize in /proc/[pid]/status.	Jamie Liu
	In Linux, FDSize is fs/proc/array.c:task_state() => struct fdtable::max_fds, which is set to the underlying array's length in fs/file.c:alloc_fdtable(). Follow-up changes: - Remove FDTable.GetRefs() and FDTable.GetRefsVFS2(), which are unused. - Reset FDTable.used to 0 during restore, since the subsequent calls to FDTable.setAll() increment it again, causing its value to be doubled. (After this CL, FDTable.used is only used to avoid reallocation in FDTable.GetFDs(), so this fix is not very visible.) PiperOrigin-RevId: 331588190
2020-09-11	Move the 'marshal' and 'primitive' packages to the 'pkg' directory.	Rahat Mahmood
	PiperOrigin-RevId: 331256608
2020-09-08	Fix the use after nil check on args.MountNamespaceVFS2	Tiwei Bie
	The args.MountNamespaceVFS2 is used again after the nil check, instead, mntnsVFS2 which holds the expected reference should be used. This patch fixes this issue. Fixes: #3855 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-01	Fix panic when calling dup2().	Nayana Bidari
	PiperOrigin-RevId: 329572337
2020-08-27	Fix vfs2 pipe behavior when splicing to a non-pipe.	Dean Deng
	Fixes *.sh Java runtime tests, where splice()-ing from a pipe to /dev/zero would not actually empty the pipe. There was no guarantee that the data would actually be consumed on a splice operation unless the output file's implementation of Write/PWrite actually called VFSPipeFD.CopyIn. Now, whatever bytes are "written" are consumed regardless of whether CopyIn is called or not. Furthermore, the number of bytes in the IOSequence for reads is now capped at the amount of data actually available. Before, splicing to /dev/zero would always return the requested splice size without taking the actual available data into account. This change also refactors the case where an input file is spliced into an output pipe so that it follows a similar pattern, which is arguably cleaner anyway. Updates #3576. PiperOrigin-RevId: 328843954
2020-08-27	[go-marshal] Support for usermem.IOOpts.	Ayush Ranjan
	PiperOrigin-RevId: 328839759
2020-08-25	Use new reference count utility throughout gvisor.	Dean Deng
	This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377
2020-08-25	Expose basic coverage information to userspace through kcov interface.	Dean Deng
	In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block, which updates the memory-mapped coverage information. The Go coverage tool does not allow us to inject arbitrary instructions into basic blocks, but it does provide data that we can convert to a kcov-like format and transfer them to userspace through a memory mapping. Note that this is not a strict implementation of kcov, which is especially tricky to do because we do not have the same coverage tools available in Go that that are available for the actual Linux kernel. In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block to write program counters to the kcov memory mapping. In Go, however, coverage tools only give us a count of basic blocks as they are executed. Every time we return to userspace, we collect the coverage information and write out PCs for each block that was executed, providing userspace with the illusion that the kcov data is always up to date. For convenience, we also generate a unique synthetic PC for each block instead of using actual PCs. Finally, we do not provide thread-specific coverage data (each kcov instance only contains PCs executed by the thread owning it); instead, we will supply data for any file specified by -- instrumentation_filter. Also, fix issue in nogo that was causing pkg/coverage:coverage_nogo compilation to fail. PiperOrigin-RevId: 328426526
2020-08-21	Pass overlay credentials via context in copy up.	Nicolas Lacasse
	Some VFS operations (those which operate on FDs) get their credentials via the context instead of via an explicit creds param. For these cases, we must pass the overlay credentials on the context. PiperOrigin-RevId: 327881259
2020-08-20	Consistent precondition formatting	Michael Pratt
	Our "Preconditions:" blocks are very useful to determine the input invariants, but they are bit inconsistent throughout the codebase, which makes them harder to read (particularly cases with 5+ conditions in a single paragraph). I've reformatted all of the cases to fit in simple rules: 1. Cases with a single condition are placed on a single line. 2. Cases with multiple conditions are placed in a bulleted list. This format has been added to the style guide. I've also mentioned "Postconditions:", though those are much less frequently used, and all uses already match this style. PiperOrigin-RevId: 327687465
2020-08-18	Move ERESTART* error definitions to syserror package.	Dean Deng
	This is needed to avoid circular dependencies between the vfs and kernel packages. PiperOrigin-RevId: 327355524
2020-08-17	Remove weak references from unix sockets.	Dean Deng
	The abstract socket namespace no longer holds any references on sockets. Instead, TryIncRef() is used when a socket is being retrieved in BoundEndpoint(). Abstract sockets are now responsible for removing themselves from the namespace they are in, when they are destroyed. Updates #1486. PiperOrigin-RevId: 327064173
2020-08-07	Add context.FullStateChanged()	Andrei Vagin
	It indicates that the Sentry has changed the state of the thread and next calls of PullFullState() has to do nothing. PiperOrigin-RevId: 325567415
2020-08-03	Add callbacks to support lazy loading/restoring thread states	Andrei Vagin
	PiperOrigin-RevId: 324748508
2020-08-03	Plumbing context.Context to DecRef() and Release().	Nayana Bidari
	context is passed to DecRef() and Release() which is needed for SO_LINGER implementation. PiperOrigin-RevId: 324672584
2020-07-27	Move platform.File in memmap	Andrei Vagin
	The subsequent systrap changes will need to import memmap from the platform package. PiperOrigin-RevId: 323409486
2020-07-23	Add AfterFunc to tcpip.Clock	Sam Balana
	Changes the API of tcpip.Clock to also provide a method for scheduling and rescheduling work after a specified duration. This change also implements the AfterFunc method for existing implementations of tcpip.Clock. This is the groundwork required to mock time within tests. All references to CancellableTimer has been replaced with the tcpip.Job interface, allowing for custom implementations of scheduling work. This is a BREAKING CHANGE for clients that implement their own tcpip.Clock or use tcpip.CancellableTimer. Migration plan: 1. Add AfterFunc(d, f) to tcpip.Clock 2. Replace references of tcpip.CancellableTimer with tcpip.Job 3. Replace calls to tcpip.CancellableTimer#StopLocked with tcpip.Job#Cancel 4. Replace calls to tcpip.CancellableTimer#Reset with tcpip.Job#Schedule 5. Replace calls to tcpip.NewCancellableTimer with tcpip.NewJob. PiperOrigin-RevId: 322906897
2020-07-23	Implement get/set_robust_list.	Nicolas Lacasse
	PiperOrigin-RevId: 322904430
2020-07-23	Add task work mechanism.	Dean Deng
	Like task_work in Linux, this allows us to register callbacks to be executed before returning to userspace. This is needed for kcov support, which requires coverage information to be up-to-date whenever we are in user mode. We will provide coverage data through the kcov interface to enable coverage-directed fuzzing in syzkaller. One difference from Linux is that task work cannot queue work before the transition to userspace that it precedes; queued work will be picked up before the next transition. PiperOrigin-RevId: 322889984
2020-07-15	Merge pull request #3165 from ridwanmsharif:ridwanmsharif/fuse-off-by-default	gVisor bot
	PiperOrigin-RevId: 321411758
2020-07-13	Disable debug time adjustment logging	Fabricio Voznika
	When --debug is enabled, the following log messages are printed every second filling up the log: D0430 18:04:42.823775 129561 parameters.go:238] Clock(Monotonic): error: 46 ns, adjusted frequency from 3591713733 Hz to 3591714196 Hz D0430 18:04:42.823870 129561 parameters.go:238] Clock(Realtime): error: 36 ns, adjusted frequency from 3591714003 Hz to 3591714169 Hz D0430 18:04:42.823892 129561 timekeeper.go:209] Updating VDSO parameters: {monotonicReady:1 monotonicBaseCycles:15758797714254696 monotonicBaseRef:29000233837 monotonicFrequency:3591714196 realtimeReady:1 realtimeBaseCycles:15758797714610880 realtimeBaseRef:1588269882823867374 realtimeFrequency:3591714169} Info and warning messages for larger changes are kept the same. PiperOrigin-RevId: 321048523
2020-07-09	Gate FUSE behind a runsc flag	Ridwan Sharif
	This change gates all FUSE commands (by gating /dev/fuse) behind a runsc flag. In order to use FUSE commands, use the --fuse flag with the --vfs2 flag. Check if FUSE is enabled by running dmesg in the sandbox.
2020-07-01	Port fallocate to VFS2.	Zach Koopmans
	PiperOrigin-RevId: 319283715
2020-07-01	Complete async signal delivery support in vfs2.	Dean Deng
	- Support FIOASYNC, FIO{SET,GET}OWN, SIOC{G,S}PGRP (refactor getting/setting owner in the process). - Unset signal recipient when setting owner with pid == 0 and valid owner type. Updates #2923. PiperOrigin-RevId: 319231420
2020-06-27	Port GETOWN, SETOWN fcntls to vfs2.	Dean Deng
	Also make some fixes to vfs1's F_SETOWN. The fcntl test now entirely passes on vfs2. Fixes #2920. PiperOrigin-RevId: 318669529
2020-06-26	Require CAP_SYS_ADMIN in the root user namespace for TTY theft	Kevin Krakauer
	PiperOrigin-RevId: 318563543
2020-06-25	Avoid an allocation in epoll	Tamir Duberstein
	PiperOrigin-RevId: 318346153
2020-06-24	Remove waiter.Entry.Context	Tamir Duberstein
	This field is redundant since state can be stored in the callback. PiperOrigin-RevId: 318134855
2020-06-23	Support for saving pointers to fields in the state package.	Adin Scannell
	Previously, it was not possible to encode/decode an object graph which contained a pointer to a field within another type. This was because the encoder was previously unable to disambiguate a pointer to an object and a pointer within the object. This CL remedies this by constructing an address map tracking the full memory range object occupy. The encoded Refvalue message has been extended to allow references to children objects within another object. Because the encoding process may learn about object structure over time, we cannot encode any objects under the entire graph has been generated. This CL also updates the state package to use standard interfaces intead of reflection-based dispatch in order to improve performance overall. This includes a custom wire protocol to significantly reduce the number of allocations and take advantage of structure packing. As part of these changes, there are a small number of minor changes in other places of the code base: * The lists used during encoding are changed to use intrusive lists with the objectEncodeState directly, which required that the ilist Len() method is updated to work properly with the ElementMapper mechanism. * A bug is fixed in the list code wherein Remove() called on an element that is already removed can corrupt the list (removing the element if there's only a single element). Now the behavior is correct. * Standard error wrapping is introduced. * Compressio was updated to implement the new wire.Reader and wire.Writer inteface methods directly. The lack of a ReadByte and WriteByte caused issues not due to interface dispatch, but because underlying slices for a Read or Write call through an interface would always escape to the heap! * Statify has been updated to support the new APIs. See README.md for a description of how the new mechanism works. PiperOrigin-RevId: 318010298
2020-06-17	Implement POSIX locks	Fabricio Voznika
	- Change FileDescriptionImpl Lock/UnlockPOSIX signature to take {start,length,whence}, so the correct offset can be calculated in the implementations. - Create PosixLocker interface to make it possible to share the same locking code from different implementations. Closes #1480 PiperOrigin-RevId: 316910286
2020-06-16	Port aio to VFS2.	Nicolas Lacasse
	In order to make sure all aio goroutines have stopped during S/R, a new WaitGroup was added to TaskSet, analagous to runningGoroutines. This WaitGroup is incremented with each aio goroutine, and waited on during kernel.Pause. The old VFS1 aio code was changed to use this new WaitGroup, rather than fs.Async. The only uses of fs.Async are now inode and mount Release operations, which do not call fs.Async recursively. This fixes a lock-ordering violation that can cause deadlocks. Updates #1035. PiperOrigin-RevId: 316689380
2020-06-16	Miscellaneous VFS2 fixes.	Jamie Liu
	PiperOrigin-RevId: 316627764
2020-06-12	vfs2: implement fcntl(fd, F_SETFL, flags)	Andrei Vagin
	PiperOrigin-RevId: 316148074
2020-06-10	Merge pull request #2763 from ↵	gVisor bot
	gaurav1086:sentry_kernel_timekeeper_use_buffered_channel PiperOrigin-RevId: 315803553
2020-06-09	sentry: use defer wg.Done() unconditionally	Gaurav Singh
	Signed-off-by: Gaurav Singh <gaurav1086@gmail.com>
2020-06-09	Implement flock(2) in VFS2	Fabricio Voznika
	LockFD is the generic implementation that can be embedded in FileDescriptionImpl implementations. Unique lock ID is maintained in vfs.FileDescription and is created on demand. Updates #1480 PiperOrigin-RevId: 315604825
2020-06-05	Implement mount(2) and umount2(2) for VFS2.	Rahat Mahmood
	This is mostly syscall plumbing, VFS2 already implements the internals of mounts. In addition to the syscall defintions, the following mount-related mechanisms are updated: - Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2 filesystems don't implement node-level timestamps yet. - Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs. - Plumb mount namespace ownership, which is necessary for checking appropriate capabilities during mount(2). Updates #1035 PiperOrigin-RevId: 315035352
2020-06-05	Unshare files on exec	Andrei Vagin
	The current task can share its fdtable with a few other tasks, but after exec, this should be a completely separate process. PiperOrigin-RevId: 314999565
2020-05-29	Implement IN_EXCL_UNLINK inotify option in vfs2.	Dean Deng
	Limited to tmpfs. Inotify support in other filesystem implementations to follow. Updates #1479 PiperOrigin-RevId: 313828648