path: root/pkg/sentry
Age | Commit message | Author
2021-10-27 | Synchronize access to cpuset controller bitmaps. | Rahat Mahmood
Reported-by: syzbot+39d434b96cf7c29a66ad@syzkaller.appspotmail.com Reported-by: syzbot+7c38bce6353d91facca3@syzkaller.appspotmail.com PiperOrigin-RevId: 406024052
2021-10-27 | Record counts of packets with unknown L3/L4 numbers | Nick Brown
Previously, we recorded a single aggregated count. These per-protocol counts can help us debug field issues when frames are dropped for this reason. PiperOrigin-RevId: 405913911
2021-10-26 | Simplify vfs.NewDisconnectedMount signature and callpoints. | Ayush Ranjan
vfs.NewDisconnectedMount has no error paths. It's much prettier without the error return value. Also simplify MountDisconnected, which would immediately drop the refs taken by NewDisconnectedMount; instead, make it call newMount directly. PiperOrigin-RevId: 405767966
2021-10-26 | platform/kvm: map vdso and vvar into a guest address space | Andrei Vagin
Right now, each vdso call triggers a vmexit. VDSO and VVAR pages are mapped with VM_IO, and get_user_pages fails for such VMAs. KVM was not able to handle this case up to the v4.8 kernel. This problem was fixed by add6a0cd1c5ba ("KVM: MMU: try to fix up page faults before giving up"). For unknown reasons, it still doesn't work with nested virtualization.
Before: BenchmarkKernelVDSO-6 252519 4598 ns/op
After: BenchmarkKernelVDSO-6 34431957 34.91 ns/op
PiperOrigin-RevId: 405715941
2021-10-26 | Obtain ref on root dentry in mqfs.GetFilesystem. | Ayush Ranjan
As documented in FilesystemType.GetFilesystem, a reference should be taken on the returned dentry and filesystem by the GetFilesystem implementation. mqfs did not do that. Additionally, clean up and clarify ref counting of dentry, filesystem and mount in mqfs. Reported-by: syzbot+a2c54bfb6e1525228e5f@syzkaller.appspotmail.com Reported-by: syzbot+ccd305cdab11cfebbfff@syzkaller.appspotmail.com PiperOrigin-RevId: 405700565
2021-10-26 | Ensure statfs::f_namelen is set by VFS2 gofer statfs/fstatfs. | Jamie Liu
VFS1 discards the value of f_namelen returned by the filesystem and returns NAME_MAX unconditionally instead, so it doesn't run into this. Also set f_frsize for completeness. PiperOrigin-RevId: 405579707
2021-10-25 | Do not leak non-permission mode bits in mq_open(2). | Ayush Ranjan
As caught by syzkaller, we were leaking non-permission bits while passing the user-generated mode. DynamicBytesFile panics in this case. Reported-by: syzbot+5abe52d47d56a5a98c89@syzkaller.appspotmail.com PiperOrigin-RevId: 405481392
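The idea behind this fix can be sketched in Go as follows. This is only an illustration, not the actual gVisor change: the helper name `permissionBits` and the `0o777` mask are assumptions made for this example.

```go
package main

import "fmt"

// permMask keeps only the rwx bits for user, group, and other.
// Assumption: the real change may use a different mask or constant.
const permMask = 0o777

// permissionBits is a hypothetical helper that drops file-type, setuid,
// setgid, and sticky bits from a user-supplied mode.
func permissionBits(mode uint32) uint32 {
	return mode & permMask
}

func main() {
	// 0o104755 is a regular-file type bit (0o100000) plus setuid (0o4000)
	// plus the permission bits 0o755.
	fmt.Printf("%#o\n", permissionBits(0o104755))
}
```

Masking at the syscall boundary means code further down only ever sees a mode that carries permission bits.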
2021-10-21 | Merge pull request #6345 from sudo-sturbia:mq/syscalls | gVisor bot
PiperOrigin-RevId: 404901660
2021-10-20 | Report correct error when restore fails | Fabricio Voznika
When file corruption is detected, report vfs.ErrCorruption to distinguish corruption error from other restore errors. Updates #1035 PiperOrigin-RevId: 404588445
2021-10-19 | Fix typo in FIXME | Fabricio Voznika
PiperOrigin-RevId: 404400399
2021-10-19 | Do not return non-nil *lisafs.Inode to doCreateAt on error. | Ayush Ranjan
lisafs.ClientFile.MkdirAt is allowed to return a non-nil Inode and a non-nil error on an RPC error. The caller must not use the returned (invalid) Inode on error. But a code path in the gofer client does end up using it. More specifically, when the Mkdir RPC fails and we end up creating a synthetic dentry for a mountpoint, we end up returning the (invalid) non-nil Inode to filesystem.doCreateAt implementation which thinks that a remote file was created. But that non-nil Inode is actually invalid because the RPC failed. Things go downhill from there. Update client to not use childDirInode if RPC failed. PiperOrigin-RevId: 404396573
2021-10-19 | Stub cpuset cgroup control files. | Rahat Mahmood
PiperOrigin-RevId: 404382475
2021-10-18 | conntrack: use tcpip.Clock instead of time.Time | Kevin Krakauer
- We should be using a monotonic clock
- This will make future testing easier
Updates #6748. PiperOrigin-RevId: 404072318
2021-10-18 | Report ramdiskfs usage correctly | Fabricio Voznika
Updates #1035 PiperOrigin-RevId: 404072231
2021-10-18 | Support distinction for RWMutex and read-only locks. | Adin Scannell
Fixes #6590 PiperOrigin-RevId: 404007524
2021-10-14 | Report total memory based on limit or host | Fabricio Voznika
gVisor previously reported the lower of the cgroup limit or 2GB as total memory. This may cause applications to make bad decisions based on the amount of memory available to them when more than 2GB is required. This change reports the lower of the cgroup limit or the host total memory inside the sandbox. This is also more in line with Docker, which always reports host total memory. Note that reporting the cgroup limit is strictly better than host total memory when there is a limit set. Fixes #5608 PiperOrigin-RevId: 403241608
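The selection rule described in this commit can be illustrated with a small Go sketch. The function name `reportedTotalMemory` is hypothetical, and treating a limit of 0 as "no limit set" is an assumption made for this example. It is not how the actual change represents an unset limit.

```go
package main

import "fmt"

// reportedTotalMemory is a hypothetical helper that returns the lower of the
// cgroup memory limit and the host total memory.
// Assumption: a cgroupLimit of 0 means that no limit is set.
func reportedTotalMemory(cgroupLimit, hostTotal uint64) uint64 {
	if cgroupLimit == 0 || cgroupLimit > hostTotal {
		return hostTotal
	}
	return cgroupLimit
}

func main() {
	const gib = uint64(1) << 30
	// Limit below the host total: the limit is reported.
	fmt.Println(reportedTotalMemory(4*gib, 16*gib) / gib)
	// No limit set: the host total is reported.
	fmt.Println(reportedTotalMemory(0, 16*gib) / gib)
}
```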
2021-10-12 | Remove state:"nosave"/"zerovalue" annotations from all waiter.Queues. | Jamie Liu
Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields, which meant that it couldn't handle intrusive linked lists, which meant that it couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a result, VFS1 unregisters all epoll waiters before saving and re-registers them after loading, and waitable VFS1 file implementations tag their waiter.Queues state:"nosave" (causing them to be skipped by the save/restore machinery) or state:"zerovalue" (causing them to only be checked for zero-value-equality on save). VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2 epoll assumes that waiter.Queues *will be* saved and loaded correctly, and VFS2 file implementations do not tag waiter.Queues. Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used by both VFS1 and VFS2 (the latter via signalfd); as a result of the above, tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll. Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2), and remove these tags from all waiter.Queues. Also clean up after the epoll test added by cl/402323053, which implied this issue (by instantiating DisableSave in the new test) without reporting it. PiperOrigin-RevId: 402596216
2021-10-11 | Merge pull request #6428 from dillanzhou:fix_epoll_vfs2 | gVisor bot
PiperOrigin-RevId: 402323053
2021-10-08 | Remove ring0 floating point save/load functions on amd64. | Jamie Liu
ring0.Save/LoadFloatingPoint() are only usable if the caller can ensure that Go will not clobber floating point registers before/after calling them respectively. Due to regabig in Go 1.17, this is no longer the case; regabig (among other things) maintains a zeroed XMM15 during ABIInternal execution, including by zeroing it after ABI0-to-ABIInternal transitions. In ring0.sysenter/exception, this happens in ring0.kernelSyscall/kernelException.abi0 respectively; in ring0.CPU.SwitchToUser, this happens after returning from ring0.sysret/iret.abi0. Delete these functions and do floating point save/load in assembly. While arm64 doesn't appear to be immediately affected (so this CL permits us to resume usage of Go 1.17), its use of Save/LoadFloatingPoint() still seems to be incorrect for the same fundamental reason (Go code can't sanely assume what registers the Go compiler will or won't use) and should be fixed eventually. PiperOrigin-RevId: 401895658
2021-10-08 | Remove redundant slice copy in lisafs gofer client. | Ayush Ranjan
listXattr() was doing redundant work. Remove it. PiperOrigin-RevId: 401871315
2021-10-08 | Disallow "trusted" namespace xattr in VFS2 gofer client. | Ayush Ranjan
Allowing this namespace leads to a lot of GetXattr RPCs to the gofer process when the gofer filesystem is the lower layer of an overlay. The overlay filesystem aggressively queries for "trusted.overlay.opaque", which in practice is never found in the lower layer gofer but leads to a lot of wasted work. A consequence is that a mutable gofer upper layer is not supported anymore, but that is still consistent with VFS1. We can revisit when the need arises. PiperOrigin-RevId: 401860585
2021-10-07 | add convenient wrapper for eventfd | Kevin Krakauer
The same create/write/read pattern is copied around several places. It's easier to understand in a package with names and comments, and we can reuse the smart blocking code in package rawfile. PiperOrigin-RevId: 401647108
2021-10-07 | Add a new metric to detect the number of spurious loss recoveries. | Nayana Bidari
- Implements RFC 3522 (Eifel detection algorithm) to detect if the connection entered loss recovery unnecessarily.
- Added a new metric to count the total number of spurious loss recoveries.
- Added tests to verify the new metric.
PiperOrigin-RevId: 401637359
2021-10-07 | tests: use a proper path to the kvm device | Andrei Vagin
PiperOrigin-RevId: 401624134
2021-10-07 | Store timestamps as time.Time | Tamir Duberstein
Rather than boiling down to an integer eagerly, do it as late as possible. PiperOrigin-RevId: 401599308
2021-10-06 | Create null entry connection on first IPTables hook | Ghanan Gowripalan
...all connections should be tracked by ConnTrack, so create a no-op connection entry on the first hook into IPTables (Prerouting or Output) and let NAT targets modify the connection entry if they need to instead of letting the NAT target create their own connection entry. This also prepares for "twice-NAT" where a packet may have both DNAT and SNAT performed on it (which requires the ability to update ConnTrack entries). Updates #5696. PiperOrigin-RevId: 401360377
2021-10-06 | Add global lisafs kernel flag. | Ayush Ranjan
PiperOrigin-RevId: 401296116
2021-10-01 | Merge pull request #6551 from sudo-sturbia:msgqueue/procfs | gVisor bot
PiperOrigin-RevId: 400258924
2021-09-30 | kernel: print PID in addition to TID in task log messages | Andrei Vagin
For multithreaded processes, it is hard to read logs without knowing task PIDs. Also print a decimal return code for syscalls. A hex return code is useful for system calls that return addresses. For other syscalls, the decimal form is more readable. PiperOrigin-RevId: 400035449
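A rough Go sketch of the two formatting choices described in this commit follows. The helpers `taskPrefix` and `formatReturn` and the exact `[pid:tid]` layout are hypothetical. They illustrate the idea only and are not taken from the gVisor logging code.

```go
package main

import "fmt"

// taskPrefix is a hypothetical helper that builds a log prefix carrying both
// the process ID and the thread ID, so threads of one process can be grouped.
func taskPrefix(pid, tid int) string {
	return fmt.Sprintf("[%d:%d] ", pid, tid)
}

// formatReturn is a hypothetical helper that formats a syscall return value:
// hex for syscalls that return addresses, decimal for everything else.
func formatReturn(ret uint64, returnsAddress bool) string {
	if returnsAddress {
		return fmt.Sprintf("%#x", ret)
	}
	return fmt.Sprintf("%d", ret)
}

func main() {
	// An address-returning syscall such as mmap reads better in hex.
	fmt.Println(taskPrefix(10, 12) + "mmap = " + formatReturn(0x7f0000000000, true))
	// A byte count such as the result of read reads better in decimal.
	fmt.Println(taskPrefix(10, 12) + "read = " + formatReturn(42, false))
}
```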
2021-09-28 | Move `safecopy.ReplaceSignalHandler` into `sighandling` package. | Etienne Perot
PiperOrigin-RevId: 399560357
2021-09-28 | Implement stubs for mq_open(2) and mq_unlink(2). | Zyad A. Ali
Support mq_open and mq_unlink, and enable syscall tests. Updates #136
2021-09-28 | Implement Registry.Remove. | Zyad A. Ali
Remove implements the behaviour of mq_unlink(2). Updates #136
2021-09-28 | Use one mutex for both Registry and RegistryImpl. | Zyad A. Ali
Updates #136
2021-09-28 | Implement Registry.FindOrCreate. | Zyad A. Ali
FindOrCreate implements the behaviour of mq_open(2). Updates #136
2021-09-28 | Return FDs in RegistryImpl functions and use Views. | Zyad A. Ali
Update RegistryImpl functions to return file descriptions, instead of queues, and use Views in queue inodes. Updates #136
2021-09-28 | Define mq.View and use it for mqfs.queueFD. | Zyad A. Ali
View makes it easier to handle O_RDONLY, O_WRONLY, and O_RDWR options in mq_open(2). Updates #136
2021-09-28 | Initialize POSIX queues' registry after creating a new IPCNamespace. | Zyad A. Ali
Updates #136
2021-09-28 | Move filesystem creation from GetFilesystem to RegistryImpl. | Zyad A. Ali
Move root dentry and filesystem creation from GetFilesystem to NewRegistryImpl, create IPCNamespace.InitPosixQueues to create a new mqueue filesystem for each ipc namespace, and update GetFilesystem to retrieve fs and root dentry from IPCNamespace and return them. Updates #136
2021-09-27 | Move `sighandling` package out of `sentry`. | Etienne Perot
PiperOrigin-RevId: 399295737
2021-09-27 | Add procfs files for SysV message queues. | Zyad A. Ali
2021-09-24 | Update the comment for Task.netns | Andrei Vagin
Task.netns can be accessed atomically, so Task.mu isn't needed to access it. PiperOrigin-RevId: 398773947
2021-09-24 | Merge pull request #6647 from avagin:task-netns | gVisor bot
PiperOrigin-RevId: 398763161
2021-09-23 | kernel: allow to access Task.netns without taking Task.mu | Andrei Vagin
This allows avoiding unnecessary lock-ordering dependencies on task.mu.
2021-09-23 | Create the cgroupfs mount point in sysfs. | Rahat Mahmood
Create the /sys/fs/cgroup directory when cgroups are available. This creates the empty directory to serve as the mountpoint; actually mounting cgroups is left to the launcher/userspace. This is consistent with Linux behaviour. Without this mountpoint, getdents(2) on /sys/fs indicates an empty directory even if the launcher mounts cgroupfs at /sys/fs/cgroup. The launcher can't create the mountpoint directory since sysfs doesn't support mkdir. PiperOrigin-RevId: 398596698
2021-09-23 | Merge pull request #6573 from avagin:kvm-seccomp-mmap | gVisor bot
PiperOrigin-RevId: 398572735
2021-09-23 | Pass AddressableEndpoint to IPTables | Ghanan Gowripalan
...instead of an address. This allows a later change to more precisely select an address based on the NAT type (source vs. destination NAT). PiperOrigin-RevId: 398559901
2021-09-22 | Add Execve and ExitNotifyParent checkpoints. | Jamie Liu
Call sites for the two checkpoints aren't added yet. PiperOrigin-RevId: 398375903
2021-09-22 | kvm: check that safecopy is handled correctly in the guest ring0 | Andrei Vagin
Signed-off-by: Andrei Vagin <avagin@google.com>
2021-09-22 | kvm: trap mmap syscalls to map new regions to the guest | Andrei Vagin
We install seccomp rules so that the SIGSYS signal is generated for each mmap system call. Then our signal handler executes the real mmap syscall and, if a new region is created, maps it to the guest. Signed-off-by: Andrei Vagin <avagin@google.com>
2021-09-22 | kvm/arm: calculate virtual-to-physical mappings only once | Andrei Vagin