summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/kernel
AgeCommit message (Collapse)Author
2020-09-15Add support for OCI seccomp filters in the sandbox.Ian Lewis
OCI configuration includes support for specifying seccomp filters. In runc, these filter configurations are converted into seccomp BPF programs and loaded into the kernel via libseccomp. runsc needs to be a static binary so, for runsc, we cannot rely on a C library and need to implement the functionality in Go. The generator added here implements basic support for taking OCI seccomp configuration and converting it into a seccomp BPF program with the same behavior as a program generated by libseccomp. - New conditional operations were added to pkg/seccomp to support operations available in OCI. - AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect that syscalls matching the conditionals result in the provided action not simply SCMP_RET_ALLOW. - BuildProgram in pkg/seccomp no longer panics if provided an empty list of rules. It now builds a program with the architecture sanity check only. - ProgramBuilder now allows adding labels that are unused. However, backwards jumps are still not permitted. Fixes #510 PiperOrigin-RevId: 331938697
2020-09-15Release FDTable lock before dropping the fds.Nayana Bidari
This is needed for SO_LINGER, where close() is blocked for linger timeout and we are holding the FDTable lock for the entire timeout which will not allow us to create/delete other fds. We have to release the locks and then drop the fds. PiperOrigin-RevId: 331844185
2020-09-14Correct FDSize in /proc/[pid]/status.Jamie Liu
In Linux, FDSize is fs/proc/array.c:task_state() => struct fdtable::max_fds, which is set to the underlying array's length in fs/file.c:alloc_fdtable(). Follow-up changes: - Remove FDTable.GetRefs() and FDTable.GetRefsVFS2(), which are unused. - Reset FDTable.used to 0 during restore, since the subsequent calls to FDTable.setAll() increment it again, causing its value to be doubled. (After this CL, FDTable.used is only used to avoid reallocation in FDTable.GetFDs(), so this fix is not very visible.) PiperOrigin-RevId: 331588190
2020-09-11Move the 'marshal' and 'primitive' packages to the 'pkg' directory.Rahat Mahmood
PiperOrigin-RevId: 331256608
2020-09-08Fix the use after nil check on args.MountNamespaceVFS2Tiwei Bie
The args.MountNamespaceVFS2 is used again after the nil check, instead, mntnsVFS2 which holds the expected reference should be used. This patch fixes this issue. Fixes: #3855 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-01Fix panic when calling dup2().Nayana Bidari
PiperOrigin-RevId: 329572337
2020-08-27Fix vfs2 pipe behavior when splicing to a non-pipe.Dean Deng
Fixes *.sh Java runtime tests, where splice()-ing from a pipe to /dev/zero would not actually empty the pipe. There was no guarantee that the data would actually be consumed on a splice operation unless the output file's implementation of Write/PWrite actually called VFSPipeFD.CopyIn. Now, whatever bytes are "written" are consumed regardless of whether CopyIn is called or not. Furthermore, the number of bytes in the IOSequence for reads is now capped at the amount of data actually available. Before, splicing to /dev/zero would always return the requested splice size without taking the actual available data into account. This change also refactors the case where an input file is spliced into an output pipe so that it follows a similar pattern, which is arguably cleaner anyway. Updates #3576. PiperOrigin-RevId: 328843954
2020-08-27[go-marshal] Support for usermem.IOOpts.Ayush Ranjan
PiperOrigin-RevId: 328839759
2020-08-25Use new reference count utility throughout gvisor.Dean Deng
This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377
2020-08-25Expose basic coverage information to userspace through kcov interface.Dean Deng
In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block, which updates the memory-mapped coverage information. The Go coverage tool does not allow us to inject arbitrary instructions into basic blocks, but it does provide data that we can convert to a kcov-like format and transfer them to userspace through a memory mapping. Note that this is not a strict implementation of kcov, which is especially tricky to do because we do not have the same coverage tools available in Go that that are available for the actual Linux kernel. In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block to write program counters to the kcov memory mapping. In Go, however, coverage tools only give us a count of basic blocks as they are executed. Every time we return to userspace, we collect the coverage information and write out PCs for each block that was executed, providing userspace with the illusion that the kcov data is always up to date. For convenience, we also generate a unique synthetic PC for each block instead of using actual PCs. Finally, we do not provide thread-specific coverage data (each kcov instance only contains PCs executed by the thread owning it); instead, we will supply data for any file specified by -- instrumentation_filter. Also, fix issue in nogo that was causing pkg/coverage:coverage_nogo compilation to fail. PiperOrigin-RevId: 328426526
2020-08-21Pass overlay credentials via context in copy up.Nicolas Lacasse
Some VFS operations (those which operate on FDs) get their credentials via the context instead of via an explicit creds param. For these cases, we must pass the overlay credentials on the context. PiperOrigin-RevId: 327881259
2020-08-20Consistent precondition formattingMichael Pratt
Our "Preconditions:" blocks are very useful to determine the input invariants, but they are bit inconsistent throughout the codebase, which makes them harder to read (particularly cases with 5+ conditions in a single paragraph). I've reformatted all of the cases to fit in simple rules: 1. Cases with a single condition are placed on a single line. 2. Cases with multiple conditions are placed in a bulleted list. This format has been added to the style guide. I've also mentioned "Postconditions:", though those are much less frequently used, and all uses already match this style. PiperOrigin-RevId: 327687465
2020-08-18Move ERESTART* error definitions to syserror package.Dean Deng
This is needed to avoid circular dependencies between the vfs and kernel packages. PiperOrigin-RevId: 327355524
2020-08-17Remove weak references from unix sockets.Dean Deng
The abstract socket namespace no longer holds any references on sockets. Instead, TryIncRef() is used when a socket is being retrieved in BoundEndpoint(). Abstract sockets are now responsible for removing themselves from the namespace they are in, when they are destroyed. Updates #1486. PiperOrigin-RevId: 327064173
2020-08-07Add context.FullStateChanged()Andrei Vagin
It indicates that the Sentry has changed the state of the thread and next calls of PullFullState() has to do nothing. PiperOrigin-RevId: 325567415
2020-08-03Add callbacks to support lazy loading/restoring thread statesAndrei Vagin
PiperOrigin-RevId: 324748508
2020-08-03Plumbing context.Context to DecRef() and Release().Nayana Bidari
context is passed to DecRef() and Release() which is needed for SO_LINGER implementation. PiperOrigin-RevId: 324672584
2020-07-27Move platform.File in memmapAndrei Vagin
The subsequent systrap changes will need to import memmap from the platform package. PiperOrigin-RevId: 323409486
2020-07-23Add AfterFunc to tcpip.ClockSam Balana
Changes the API of tcpip.Clock to also provide a method for scheduling and rescheduling work after a specified duration. This change also implements the AfterFunc method for existing implementations of tcpip.Clock. This is the groundwork required to mock time within tests. All references to CancellableTimer has been replaced with the tcpip.Job interface, allowing for custom implementations of scheduling work. This is a BREAKING CHANGE for clients that implement their own tcpip.Clock or use tcpip.CancellableTimer. Migration plan: 1. Add AfterFunc(d, f) to tcpip.Clock 2. Replace references of tcpip.CancellableTimer with tcpip.Job 3. Replace calls to tcpip.CancellableTimer#StopLocked with tcpip.Job#Cancel 4. Replace calls to tcpip.CancellableTimer#Reset with tcpip.Job#Schedule 5. Replace calls to tcpip.NewCancellableTimer with tcpip.NewJob. PiperOrigin-RevId: 322906897
2020-07-23Implement get/set_robust_list.Nicolas Lacasse
PiperOrigin-RevId: 322904430
2020-07-23Add task work mechanism.Dean Deng
Like task_work in Linux, this allows us to register callbacks to be executed before returning to userspace. This is needed for kcov support, which requires coverage information to be up-to-date whenever we are in user mode. We will provide coverage data through the kcov interface to enable coverage-directed fuzzing in syzkaller. One difference from Linux is that task work cannot queue work before the transition to userspace that it precedes; queued work will be picked up before the next transition. PiperOrigin-RevId: 322889984
2020-07-15Merge pull request #3165 from ridwanmsharif:ridwanmsharif/fuse-off-by-defaultgVisor bot
PiperOrigin-RevId: 321411758
2020-07-13Disable debug time adjustment loggingFabricio Voznika
When --debug is enabled, the following log messages are printed every second filling up the log: D0430 18:04:42.823775 129561 parameters.go:238] Clock(Monotonic): error: 46 ns, adjusted frequency from 3591713733 Hz to 3591714196 Hz D0430 18:04:42.823870 129561 parameters.go:238] Clock(Realtime): error: 36 ns, adjusted frequency from 3591714003 Hz to 3591714169 Hz D0430 18:04:42.823892 129561 timekeeper.go:209] Updating VDSO parameters: {monotonicReady:1 monotonicBaseCycles:15758797714254696 monotonicBaseRef:29000233837 monotonicFrequency:3591714196 realtimeReady:1 realtimeBaseCycles:15758797714610880 realtimeBaseRef:1588269882823867374 realtimeFrequency:3591714169} Info and warning messages for larger changes are kept the same. PiperOrigin-RevId: 321048523
2020-07-09Gate FUSE behind a runsc flagRidwan Sharif
This change gates all FUSE commands (by gating /dev/fuse) behind a runsc flag. In order to use FUSE commands, use the --fuse flag with the --vfs2 flag. Check if FUSE is enabled by running dmesg in the sandbox.
2020-07-01Port fallocate to VFS2.Zach Koopmans
PiperOrigin-RevId: 319283715
2020-07-01Complete async signal delivery support in vfs2.Dean Deng
- Support FIOASYNC, FIO{SET,GET}OWN, SIOC{G,S}PGRP (refactor getting/setting owner in the process). - Unset signal recipient when setting owner with pid == 0 and valid owner type. Updates #2923. PiperOrigin-RevId: 319231420
2020-06-27Port GETOWN, SETOWN fcntls to vfs2.Dean Deng
Also make some fixes to vfs1's F_SETOWN. The fcntl test now entirely passes on vfs2. Fixes #2920. PiperOrigin-RevId: 318669529
2020-06-26Require CAP_SYS_ADMIN in the root user namespace for TTY theftKevin Krakauer
PiperOrigin-RevId: 318563543
2020-06-25Avoid an allocation in epollTamir Duberstein
PiperOrigin-RevId: 318346153
2020-06-24Remove waiter.Entry.ContextTamir Duberstein
This field is redundant since state can be stored in the callback. PiperOrigin-RevId: 318134855
2020-06-23Support for saving pointers to fields in the state package.Adin Scannell
Previously, it was not possible to encode/decode an object graph which contained a pointer to a field within another type. This was because the encoder was previously unable to disambiguate a pointer to an object and a pointer within the object. This CL remedies this by constructing an address map tracking the full memory range object occupy. The encoded Refvalue message has been extended to allow references to children objects within another object. Because the encoding process may learn about object structure over time, we cannot encode any objects under the entire graph has been generated. This CL also updates the state package to use standard interfaces intead of reflection-based dispatch in order to improve performance overall. This includes a custom wire protocol to significantly reduce the number of allocations and take advantage of structure packing. As part of these changes, there are a small number of minor changes in other places of the code base: * The lists used during encoding are changed to use intrusive lists with the objectEncodeState directly, which required that the ilist Len() method is updated to work properly with the ElementMapper mechanism. * A bug is fixed in the list code wherein Remove() called on an element that is already removed can corrupt the list (removing the element if there's only a single element). Now the behavior is correct. * Standard error wrapping is introduced. * Compressio was updated to implement the new wire.Reader and wire.Writer inteface methods directly. The lack of a ReadByte and WriteByte caused issues not due to interface dispatch, but because underlying slices for a Read or Write call through an interface would always escape to the heap! * Statify has been updated to support the new APIs. See README.md for a description of how the new mechanism works. PiperOrigin-RevId: 318010298
2020-06-17Implement POSIX locksFabricio Voznika
- Change FileDescriptionImpl Lock/UnlockPOSIX signature to take {start,length,whence}, so the correct offset can be calculated in the implementations. - Create PosixLocker interface to make it possible to share the same locking code from different implementations. Closes #1480 PiperOrigin-RevId: 316910286
2020-06-16Port aio to VFS2.Nicolas Lacasse
In order to make sure all aio goroutines have stopped during S/R, a new WaitGroup was added to TaskSet, analagous to runningGoroutines. This WaitGroup is incremented with each aio goroutine, and waited on during kernel.Pause. The old VFS1 aio code was changed to use this new WaitGroup, rather than fs.Async. The only uses of fs.Async are now inode and mount Release operations, which do not call fs.Async recursively. This fixes a lock-ordering violation that can cause deadlocks. Updates #1035. PiperOrigin-RevId: 316689380
2020-06-16Miscellaneous VFS2 fixes.Jamie Liu
PiperOrigin-RevId: 316627764
2020-06-12vfs2: implement fcntl(fd, F_SETFL, flags)Andrei Vagin
PiperOrigin-RevId: 316148074
2020-06-10Merge pull request #2763 from ↵gVisor bot
gaurav1086:sentry_kernel_timekeeper_use_buffered_channel PiperOrigin-RevId: 315803553
2020-06-09sentry: use defer wg.Done() unconditionallyGaurav Singh
Signed-off-by: Gaurav Singh <gaurav1086@gmail.com>
2020-06-09Implement flock(2) in VFS2Fabricio Voznika
LockFD is the generic implementation that can be embedded in FileDescriptionImpl implementations. Unique lock ID is maintained in vfs.FileDescription and is created on demand. Updates #1480 PiperOrigin-RevId: 315604825
2020-06-05Implement mount(2) and umount2(2) for VFS2.Rahat Mahmood
This is mostly syscall plumbing, VFS2 already implements the internals of mounts. In addition to the syscall defintions, the following mount-related mechanisms are updated: - Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2 filesystems don't implement node-level timestamps yet. - Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs. - Plumb mount namespace ownership, which is necessary for checking appropriate capabilities during mount(2). Updates #1035 PiperOrigin-RevId: 315035352
2020-06-05Unshare files on execAndrei Vagin
The current task can share its fdtable with a few other tasks, but after exec, this should be a completely separate process. PiperOrigin-RevId: 314999565
2020-05-29Implement IN_EXCL_UNLINK inotify option in vfs2.Dean Deng
Limited to tmpfs. Inotify support in other filesystem implementations to follow. Updates #1479 PiperOrigin-RevId: 313828648
2020-05-29Port inotify to vfs2, with support in tmpfs.Dean Deng
Support in other filesystem impls is still needed. Unlike in Linux and vfs1, we need to plumb inotify down to each filesystem implementation in order to keep track of links/inode structures properly. IN_EXCL_UNLINK still needs to be implemented, as well as a few inotify hooks that are not present in either vfs1 or vfs2. Those will be addressed in subsequent changes. Updates #1479. PiperOrigin-RevId: 313781995
2020-05-26Implement splice(2) and tee(2) for VFS2.Jamie Liu
Updates #138 PiperOrigin-RevId: 313326354
2020-05-15Minor formatting updates for gvisor.dev.Adin Scannell
* Aggregate architecture Overview in "What is gVisor?" as it makes more sense in one place. * Drop "user-space kernel" and use "application kernel". The term "user-space kernel" is confusing when some platform implementation do not run in user-space (instead running in guest ring zero). * Clear up the relationship between the Platform page in the user guide and the Platform page in the architecture guide, and ensure they are cross-linked. * Restore the call-to-action quick start link in the main page, and drop the GitHub link (which also appears in the top-right). * Improve image formatting by centering all doc and blog images, and move the image captions to the alt text. PiperOrigin-RevId: 311845158
2020-05-14Port memfd_create to vfs2 and finish implementation of file seals.Nicolas Lacasse
Closes #2612. PiperOrigin-RevId: 311548074
2020-05-07Allocate device numbers for VFS2 filesystems.Jamie Liu
Updates #1197, #1198, #1672 PiperOrigin-RevId: 310432006
2020-05-07Move pkg/sentry/vfs/{eventfd,timerfd} to new packages in pkg/sentry/fsimpl.Nicolas Lacasse
They don't depend on anything in VFS2, so they should be their own packages. PiperOrigin-RevId: 310416807
2020-05-06Fix runsc syscall documentation generation.Adin Scannell
We can register any number of tables with any number of architectures, and need not limit the definitions to the architecture in question. This allows runsc to generate documentation for all architectures simultaneously. Similarly, this simplifies the VFSv2 patching process. PiperOrigin-RevId: 310224827
2020-05-04Fix flaky monotonic time.Adin Scannell
This change ensures that even platforms with some TSC issues (e.g. KVM), can get reliable monotonic time by applied a lower bound on each read. PiperOrigin-RevId: 309773801
2020-04-27Don't leak vfs.MountNamespace reference if kernel.TaskSet.NewTask fails.Jamie Liu
PiperOrigin-RevId: 308617610