summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/kernel
AgeCommit message (Collapse)Author
2020-06-23Support for saving pointers to fields in the state package.Adin Scannell
Previously, it was not possible to encode/decode an object graph which contained a pointer to a field within another type. This was because the encoder was previously unable to disambiguate a pointer to an object and a pointer within the object. This CL remedies this by constructing an address map tracking the full memory range object occupy. The encoded Refvalue message has been extended to allow references to children objects within another object. Because the encoding process may learn about object structure over time, we cannot encode any objects under the entire graph has been generated. This CL also updates the state package to use standard interfaces intead of reflection-based dispatch in order to improve performance overall. This includes a custom wire protocol to significantly reduce the number of allocations and take advantage of structure packing. As part of these changes, there are a small number of minor changes in other places of the code base: * The lists used during encoding are changed to use intrusive lists with the objectEncodeState directly, which required that the ilist Len() method is updated to work properly with the ElementMapper mechanism. * A bug is fixed in the list code wherein Remove() called on an element that is already removed can corrupt the list (removing the element if there's only a single element). Now the behavior is correct. * Standard error wrapping is introduced. * Compressio was updated to implement the new wire.Reader and wire.Writer inteface methods directly. The lack of a ReadByte and WriteByte caused issues not due to interface dispatch, but because underlying slices for a Read or Write call through an interface would always escape to the heap! * Statify has been updated to support the new APIs. See README.md for a description of how the new mechanism works. PiperOrigin-RevId: 318010298
2020-06-17Implement POSIX locksFabricio Voznika
- Change FileDescriptionImpl Lock/UnlockPOSIX signature to take {start,length,whence}, so the correct offset can be calculated in the implementations. - Create PosixLocker interface to make it possible to share the same locking code from different implementations. Closes #1480 PiperOrigin-RevId: 316910286
2020-06-16Port aio to VFS2.Nicolas Lacasse
In order to make sure all aio goroutines have stopped during S/R, a new WaitGroup was added to TaskSet, analagous to runningGoroutines. This WaitGroup is incremented with each aio goroutine, and waited on during kernel.Pause. The old VFS1 aio code was changed to use this new WaitGroup, rather than fs.Async. The only uses of fs.Async are now inode and mount Release operations, which do not call fs.Async recursively. This fixes a lock-ordering violation that can cause deadlocks. Updates #1035. PiperOrigin-RevId: 316689380
2020-06-16Miscellaneous VFS2 fixes.Jamie Liu
PiperOrigin-RevId: 316627764
2020-06-12vfs2: implement fcntl(fd, F_SETFL, flags)Andrei Vagin
PiperOrigin-RevId: 316148074
2020-06-10Merge pull request #2763 from ↵gVisor bot
gaurav1086:sentry_kernel_timekeeper_use_buffered_channel PiperOrigin-RevId: 315803553
2020-06-09sentry: use defer wg.Done() unconditionallyGaurav Singh
Signed-off-by: Gaurav Singh <gaurav1086@gmail.com>
2020-06-09Implement flock(2) in VFS2Fabricio Voznika
LockFD is the generic implementation that can be embedded in FileDescriptionImpl implementations. Unique lock ID is maintained in vfs.FileDescription and is created on demand. Updates #1480 PiperOrigin-RevId: 315604825
2020-06-05Implement mount(2) and umount2(2) for VFS2.Rahat Mahmood
This is mostly syscall plumbing, VFS2 already implements the internals of mounts. In addition to the syscall defintions, the following mount-related mechanisms are updated: - Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2 filesystems don't implement node-level timestamps yet. - Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs. - Plumb mount namespace ownership, which is necessary for checking appropriate capabilities during mount(2). Updates #1035 PiperOrigin-RevId: 315035352
2020-06-05Unshare files on execAndrei Vagin
The current task can share its fdtable with a few other tasks, but after exec, this should be a completely separate process. PiperOrigin-RevId: 314999565
2020-05-29Implement IN_EXCL_UNLINK inotify option in vfs2.Dean Deng
Limited to tmpfs. Inotify support in other filesystem implementations to follow. Updates #1479 PiperOrigin-RevId: 313828648
2020-05-29Port inotify to vfs2, with support in tmpfs.Dean Deng
Support in other filesystem impls is still needed. Unlike in Linux and vfs1, we need to plumb inotify down to each filesystem implementation in order to keep track of links/inode structures properly. IN_EXCL_UNLINK still needs to be implemented, as well as a few inotify hooks that are not present in either vfs1 or vfs2. Those will be addressed in subsequent changes. Updates #1479. PiperOrigin-RevId: 313781995
2020-05-26Implement splice(2) and tee(2) for VFS2.Jamie Liu
Updates #138 PiperOrigin-RevId: 313326354
2020-05-15Minor formatting updates for gvisor.dev.Adin Scannell
* Aggregate architecture Overview in "What is gVisor?" as it makes more sense in one place. * Drop "user-space kernel" and use "application kernel". The term "user-space kernel" is confusing when some platform implementation do not run in user-space (instead running in guest ring zero). * Clear up the relationship between the Platform page in the user guide and the Platform page in the architecture guide, and ensure they are cross-linked. * Restore the call-to-action quick start link in the main page, and drop the GitHub link (which also appears in the top-right). * Improve image formatting by centering all doc and blog images, and move the image captions to the alt text. PiperOrigin-RevId: 311845158
2020-05-14Port memfd_create to vfs2 and finish implementation of file seals.Nicolas Lacasse
Closes #2612. PiperOrigin-RevId: 311548074
2020-05-07Allocate device numbers for VFS2 filesystems.Jamie Liu
Updates #1197, #1198, #1672 PiperOrigin-RevId: 310432006
2020-05-07Move pkg/sentry/vfs/{eventfd,timerfd} to new packages in pkg/sentry/fsimpl.Nicolas Lacasse
They don't depend on anything in VFS2, so they should be their own packages. PiperOrigin-RevId: 310416807
2020-05-06Fix runsc syscall documentation generation.Adin Scannell
We can register any number of tables with any number of architectures, and need not limit the definitions to the architecture in question. This allows runsc to generate documentation for all architectures simultaneously. Similarly, this simplifies the VFSv2 patching process. PiperOrigin-RevId: 310224827
2020-05-04Fix flaky monotonic time.Adin Scannell
This change ensures that even platforms with some TSC issues (e.g. KVM), can get reliable monotonic time by applied a lower bound on each read. PiperOrigin-RevId: 309773801
2020-04-27Don't leak vfs.MountNamespace reference if kernel.TaskSet.NewTask fails.Jamie Liu
PiperOrigin-RevId: 308617610
2020-04-25Enable automated marshalling for signals and the arch package.Rahat Mahmood
PiperOrigin-RevId: 308472331
2020-04-24Move hostfs mount to Kernel struct.Dean Deng
This is needed to set up host fds passed through a Unix socket. Note that the host package depends on kernel, so we cannot set up the hostfs mount directly in Kernel.Init as we do for sockfs and pipefs. Also, adjust sockfs to make its setup look more like hostfs's and pipefs's. PiperOrigin-RevId: 308274053
2020-04-23Enable automated marshalling for mempolicy syscalls.Rahat Mahmood
PiperOrigin-RevId: 308170679
2020-04-23Enable automated marshalling for epoll events.Rahat Mahmood
Ensure we use the correct architecture-specific defintion of epoll event, and use go-marshal for serialization. PiperOrigin-RevId: 308145677
2020-04-23Merge pull request #1819 from lubinszARM:pr_signal_2gVisor bot
PiperOrigin-RevId: 308100771
2020-04-22Specify a memory file in platform.New().Andrei Vagin
PiperOrigin-RevId: 307941984
2020-04-17Get /bin/true to run on VFS2Zach Koopmans
Included: - loader_test.go RunTest and TestStartSignal VFS2 - container_test.go TestAppExitStatus on VFS2 - experimental flag added to runsc to turn on VFS2 Note: shared mounts are not yet supported. PiperOrigin-RevId: 307070753
2020-04-16Implement pipe(2) and pipe2(2) for VFS2.Jamie Liu
Updates #1035 PiperOrigin-RevId: 306968644
2020-04-16Make ExtractErrno a functionFabricio Voznika
PiperOrigin-RevId: 306891171
2020-04-13Merge pull request #2168 from xiaobo55x:ptrace_testgVisor bot
PiperOrigin-RevId: 306306809
2020-04-13Port socket-related syscalls to VFS2.Dean Deng
Note that most kinds of sockets are not yet supported in VFS2 (only Unix sockets are partially supported at the moment), so these syscalls will still generally fail. Enabling them allows us to begin running socket tests for VFS2 as more features are ported over. Updates #1476, #1478, #1484, #1485. PiperOrigin-RevId: 306292294
2020-04-13Remove obsolete TODOs for b/38173783Jon Budd
The comments in the ticket indicate that this behavior is fine and that the ticket should be closed, so we shouldn't need pointers to the ticket. PiperOrigin-RevId: 306266071
2020-04-10Add logging message for noNewPrivileges OCI option.Ian Lewis
noNewPrivileges is ignored if set to false since gVisor assumes that PR_SET_NO_NEW_PRIVS is always enabled. PiperOrigin-RevId: 305991947
2020-04-10Remove TODO from kernel.StracerFabricio Voznika
The dependency strace=>kernel grew over time. strace also depends on task's FD table and FSContext. It could be fixed with some interfaces the other way, but then we're trading an interface for another, and kernel.Stracer is likely cleaner. Closes #155 PiperOrigin-RevId: 305909678
2020-04-09Merge pull request #2253 from amscanne:nogogVisor bot
PiperOrigin-RevId: 305807868
2020-04-10Enable syscall ptrace test on arm64.Haibo Xu
Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I5bb8fa7d580d173b1438d6465e1adb442216c8fa
2020-04-08Clean up TODOsFabricio Voznika
PiperOrigin-RevId: 305592245
2020-04-08Fix all copy locks violations.Adin Scannell
This required minor restructuring of how system call tables were saved and restored, but it makes way more sense this way. Updates #2243
2020-04-06Port timerfd to VFS2.Nicolas Lacasse
PiperOrigin-RevId: 305067208
2020-04-04Record VFS2 sockets in global socket map.Dean Deng
Updates #1476, #1478, #1484, #1485. PiperOrigin-RevId: 304845354
2020-04-03Refactor software GSO code.Bhasker Hariharan
Software GSO implementation currently has a complicated code path with implicit assumptions that all packets to WritePackets carry same Data and it does this to avoid allocations on the path etc. But this makes it hard to reuse the WritePackets API. This change breaks all such assumptions by introducing a new Vectorised View API ReadToVV which can be used to cleanly split a VV into multiple independent VVs. Further this change also makes packet buffers linkable to form an intrusive list. This allows us to get rid of the array of packet buffers that are passed in the WritePackets API call and replace it with a list of packet buffers. While this code does introduce some more allocations in the benchmarks it doesn't cause any degradation. Updates #231 PiperOrigin-RevId: 304731742
2020-04-03Add FileDescriptionImpl for Unix sockets.Dean Deng
This change involves several steps: - Refactor the VFS1 unix socket implementation to share methods between VFS1 and VFS2 where possible. Re-implement the rest. - Override the default PRead, Read, PWrite, Write, Ioctl, Release methods in FileDescriptionDefaultImpl. - Add functions to create and initialize a new Dentry/Inode and FileDescription for a Unix socket file. Updates #1476 PiperOrigin-RevId: 304689796
2020-04-03Ensure EOF is handled propertly during splice.Adin Scannell
PiperOrigin-RevId: 304684417
2020-03-31Implement automated marshalling for slices of Marshallable types.Rahat Mahmood
PiperOrigin-RevId: 304119255
2020-03-31Arm64 signal#2: signal support in arch moduleBin Lu
SA_RESTORER is always used on Intel platform. But this flag is optional on other platforms. The vdso is enabled, so we can use the sigreturn trampolines the vdso provides instead on Arm platform. Signed-off-by: Bin Lu <bin.lu@arm.com>
2020-03-31Add socket filesystem and global disconnected socket mount for VFS2.Dean Deng
A socket mount where anonymous sockets will reside is added to the VirtualFilesystem. Socketfs is built on top of kernfs. Updates #1476, #1478, #1484, #1485. PiperOrigin-RevId: 304095251
2020-03-26Support owner matching for iptables.Nayana Bidari
This feature will match UID and GID of the packet creator, for locally generated packets. This match is only valid in the OUTPUT and POSTROUTING chains. Forwarded packets do not have any socket associated with them. Packets from kernel threads do have a socket, but usually no owner.
2020-03-19Remove workMu from tcpip.Endpoint.Bhasker Hariharan
workMu is removed and e.mu is now a mutex that supports TryLock. The packet processing path tries to lock the mutex and if its locked it will just queue the packet and move on. The endpoint.UnlockUser() will process any backlog of packets before unlocking the socket. This simplifies the locking inside tcp endpoints a lot. Further the endpoint.LockUser() implements spinning as long as the lock is not held by another syscall goroutine. This ensures low latency as not spinning leads to the task thread being put to sleep if the lock is held by the packet dispatch path. This is suboptimal as the lower layer rarely holds the lock for long so implementing spinning here helps. If the lock is held by another task goroutine then we just proceed to call LockUser() and the task could be put to sleep. The protocol goroutines themselves just call e.mu.Lock() and block if the lock is currently not available. Updates #231, #357 PiperOrigin-RevId: 301808349
2020-03-18Fix FDTable.NewFDVFS2Fabricio Voznika
It was looking at VFS1 table to determine where to allocate the next FD from. Updates #1035 PiperOrigin-RevId: 301678858
2020-03-16fdtable: don't try to zap fdtable entry if close is called for non-existing fdAndrei Vagin
FDTable.setAll is used to zap entries, but it grows the table up to a specified fd. Reported-by: syzbot+9e281b0750d2d4caa190@syzkaller.appspotmail.com PiperOrigin-RevId: 301280000