summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry
AgeCommit message (Collapse)Author
2019-05-17Fix gofer rename ctime and cleanup stat_times testMichael Pratt
There is a lot of redundancy that we can simplify in the stat_times test. This will make it easier to add new tests. However, the simplification reveals that cached uattrs on goferfs don't properly update ctime on rename. PiperOrigin-RevId: 248773425 Change-Id: I52662728e1e9920981555881f9a85f9ce04041cf
2019-05-15gofer: don't call hostfile.Close if hostFile is nilAndrei Vagin
PiperOrigin-RevId: 248437159 Change-Id: Ife71f6ca032fca59ec97a82961000ed0af257101
2019-05-14Start of support for /proc/pid/cgroup file.Nicolas Lacasse
PiperOrigin-RevId: 248263378 Change-Id: Ic057d2bb0b6212110f43ac4df3f0ac9bf931ab98
2019-05-14Remove false commentMichael Pratt
PiperOrigin-RevId: 248249285 Change-Id: I9b6d267baa666798b22def590ff20c9a118efd47
2019-05-10Add pgalloc.DelayedEvictionManual.Jamie Liu
PiperOrigin-RevId: 247667272 Change-Id: I16b04e11bb93f50b7e05e888992303f730e4a877
2019-05-09Implement fallocate(2)Fabricio Voznika
Closes #225 PiperOrigin-RevId: 247508791 Change-Id: I04f47cf2770b30043e5a272aba4ba6e11d0476cc
2019-05-08Set the FilesytemType in MountSource from the Filesystem.Nicolas Lacasse
And stop storing the Filesystem in the MountSource. This allows us to decouple the MountSource filesystem type from the name of the filesystem. PiperOrigin-RevId: 247292982 Change-Id: I49cbcce3c17883b7aa918ba76203dfd6d1b03cc8
2019-05-07Remove defers from gofer.contextFileFabricio Voznika
Most are single line methods in hot paths. PiperOrigin-RevId: 247050267 Change-Id: I428d78723fe00b57483185899dc8fa9e1f01e2ea
2019-05-06Ensure all uses of MM.brk occur under MM.mappingMu in MM.Brk().Jamie Liu
PiperOrigin-RevId: 246921386 Change-Id: I71d8908858f45a9a33a0483470d0240eaf0fd012
2019-05-03gofer: don't leak file descriptorsAndrei Vagin
Fixes #219 PiperOrigin-RevId: 246568639 Change-Id: Ic7afd15dde922638d77f6429c508d1cbe2e4288a
2019-04-30Update reference to old typeMichael Pratt
PiperOrigin-RevId: 246036806 Change-Id: I5554a43a1f8146c927402db3bf98488a2da0fbe7
2019-04-30Implement async MemoryFile eviction, and use it in CachingInodeOperations.Jamie Liu
This feature allows MemoryFile to delay eviction of "optional" allocations, such as unused cached file pages. Note that this incidentally makes CachingInodeOperations writeback asynchronous, in the sense that it doesn't occur until eviction; this is necessary because between when a cached page becomes evictable and when it's evicted, file writes (via CachingInodeOperations.Write) may dirty the page. As currently implemented, this feature won't meaningfully impact steady-state memory usage or caching; the reclaimer goroutine will schedule eviction as soon as it runs out of other work to do. Future CLs increase caching by adding constraints on when eviction is scheduled. PiperOrigin-RevId: 246014822 Change-Id: Ia85feb25a2de92a48359eb84434b6ec6f9bea2cb
2019-04-29Implement the MSG_CTRUNC msghdr flag for Unix sockets.Ian Gudger
Updates google/gvisor#206 PiperOrigin-RevId: 245880573 Change-Id: Ifa715e98d47f64b8a32b04ae9378d6cd6bd4025e
2019-04-29Change copyright notice to "The gVisor Authors"Michael Pratt
Based on the guidelines at https://opensource.google.com/docs/releasing/authors/. 1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./' 2. Manual fixup of "Google Inc" references. 3. Add AUTHORS file. Authors may request to be added to this file. 4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS. Fixes #209 PiperOrigin-RevId: 245823212 Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29Allow and document bug ids in gVisor codebase.Nicolas Lacasse
PiperOrigin-RevId: 245818639 Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-29createAt should return all errors from FindInode except ENOENT.Nicolas Lacasse
Previously, createAt was eating all errors from FindInode except for EACCES and proceeding with the creation. This is incorrect, as FindInode can return many other errors (like ENAMETOOLONG) that should stop creation. This CL changes createAt to return all errors encountered except for ENOENT, which we can ignore because we are about to create the thing. PiperOrigin-RevId: 245773222 Change-Id: I1b317021de70f0550fb865506f6d8147d4aebc56
2019-04-26kvm: remove non-sane sanity checkAdin Scannell
Apparently some platforms don't have pSize < vSize. Fixes #208 PiperOrigin-RevId: 245480998 Change-Id: I2a98229912f4ccbfcd8e79dfa355104f14275a9c
2019-04-26Fix reference counting bug in /proc/PID/fdinfo/.Kevin Krakauer
PiperOrigin-RevId: 245452217 Change-Id: I7164d8f57fe34c17e601079eb9410a6d95af1869
2019-04-25Perform explicit CPUID and FP state compatibility checks on restoreMichael Pratt
PiperOrigin-RevId: 245341004 Change-Id: Ic4d581039d034a8ae944b43e45e84eb2c3973657
2019-04-25Don't enforce NAME_MAX in fs.Dirent.walk().Jamie Liu
Maximum filename length is filesystem-dependent, and obtained via statfs::f_namelen. This limit is usually 255 bytes (NAME_MAX), but not always. For example, VFAT supports filenames of up to 255... UCS-2 characters, which Linux conservatively takes to mean UTF-8-encoded bytes: fs/fat/inode.c:fat_statfs(), FAT_LFN_LEN * NLS_MAX_CHARSET_SIZE. As a result, Linux's VFS does not enforce NAME_MAX: $ rg --maxdepth=1 '\WNAME_MAX\W' fs/ include/linux/ fs/libfs.c 38: buf->f_namelen = NAME_MAX; 64: if (dentry->d_name.len > NAME_MAX) include/linux/relay.h 74: char base_filename[NAME_MAX]; /* saved base filename */ include/linux/fscrypt.h 149: * filenames up to NAME_MAX bytes, since base64 encoding expands the length. include/linux/exportfs.h 176: * understanding that it is already pointing to a a %NAME_MAX+1 sized Remove this check from core VFS, and add it to ramfs (and by extension tmpfs), where it is actually applicable: mm/shmem.c:shmem_dir_inode_operations.lookup == simple_lookup *does* enforce NAME_MAX. PiperOrigin-RevId: 245324748 Change-Id: I17567c4324bfd60e31746a5270096e75db963fac
2019-04-22Bugfix: fix fstatat symbol link to dirWei Zhang
For a symbol link to some directory, eg. `/tmp/symlink -> /tmp/dir` `fstatat("/tmp/symlink")` should return symbol link data, but `fstatat("/tmp/symlink/")` (symlink with trailing slash) should return directory data it points following linux behaviour. Currently fstatat() a symlink with trailing slash will get "not a directory" error which is wrong. Signed-off-by: Wei Zhang <zhangwei198900@gmail.com> Change-Id: I63469b1fb89d083d1c1255d32d52864606fbd7e2 PiperOrigin-RevId: 244783916
2019-04-22Fix doc typoMichael Pratt
PiperOrigin-RevId: 244773890 Change-Id: I2d0cd7789771276ba545b38efff6d3e24133baaa
2019-04-22Clean up state error handlingMichael Pratt
PiperOrigin-RevId: 244773836 Change-Id: I32223f79d2314fe1ac4ddfc63004fc22ff634adf
2019-04-19Add support for the MSG_TRUNC msghdr flag.Ian Gudger
The MSG_TRUNC flag is set in the msghdr when a message is truncated. Fixes google/gvisor#200 PiperOrigin-RevId: 244440486 Change-Id: I03c7d5e7f5935c0c6b8d69b012db1780ac5b8456
2019-04-18Format struct pollfd in poll(2)/ppoll(2)Michael Pratt
I0410 15:40:38.854295 3776 x:0] [ 1] poll_test E poll(0x2b00bfb5c020 [{FD: 0x3 anon_inode:[eventfd], Events: POLLOUT, REvents: ...}], 0x1, 0x1) I0410 15:40:38.854348 3776 x:0] [ 1] poll_test X poll(0x2b00bfb5c020 [{FD: 0x3 anon_inode:[eventfd], Events: POLLOUT|POLLERR|POLLHUP, REvents: POLLOUT}], 0x1, 0x1) = 0x1 (10.765?s) PiperOrigin-RevId: 244269879 Change-Id: If07ba54a486fdeaaedfc0123769b78d1da862307
2019-04-18Only emit unimplemented syscall events for unsupported values.Ian Gudger
Only emit unimplemented syscall events for setting SO_OOBINLINE and SO_LINGER when attempting to set unsupported values. PiperOrigin-RevId: 244229675 Change-Id: Icc4562af8f733dd75a90404621711f01a32a9fc1
2019-04-17Don't allow sigtimedwait to catch unblockable signalsMichael Pratt
The existing logic attempting to do this is incorrect. Unary ^ has higher precedence than &^, so mask always has UnblockableSignals cleared, allowing dequeueSignalLocked to dequeue unblockable signals (which allows userspace to ignore them). Switch the logic so that unblockable signals are always masked. PiperOrigin-RevId: 244058487 Change-Id: Ib19630ac04068a1fbfb9dc4a8eab1ccbdb21edc3
2019-04-17Use FD limit and file size limit from hostFabricio Voznika
FD limit and file size limit is read from the host, instead of using hard-coded defaults, given that they effect the sandbox process. Also limit the direct cache to use no more than half if the available FDs. PiperOrigin-RevId: 244050323 Change-Id: I787ad0fdf07c49d589e51aebfeae477324fe26e6
2019-04-17Convert poll/select to operate more directly on linux.PollFDMichael Pratt
Current, doPoll copies the user struct pollfd array into a []syscalls.PollFD, which contains internal kdefs.FD and waiter.EventMask types. While these are currently binary-compatible with the Linux versions, we generally discourage copying directly to internal types (someone may inadvertantly change kdefs.FD to uint64). Instead, copy directly to a []linux.PollFD, which will certainly be binary compatible. Most of syscalls/polling.go is included directly into syscalls/linux/sys_poll.go, as it can then operate directly on linux.PollFD. The additional syscalls.PollFD type is providing little value. I've also added explicit conversion functions for waiter.EventMask, which creates the possibility of a different binary format. PiperOrigin-RevId: 244042947 Change-Id: I24e5b642002a32b3afb95a9dcb80d4acd1288abf
2019-04-11Format FDs in strace logsMichael Pratt
Normal files display their path in the current mount namespace: I0410 10:57:54.964196 216336 x:0] [ 1] ls X read(0x3 /proc/filesystems, 0x55cee3bdb2c0 "nodev\t9p\nnodev\tdevpts \nnodev\tdevtmpfs\nnodev\tproc\nnodev\tramdiskfs\nnodev\tsysfs\nnodev\ttmpfs\n", 0x1000) = 0x58 (24.462?s) AT_FDCWD includes the CWD: I0411 12:58:48.278427 1526 x:0] [ 1] stat_test E newfstatat(AT_FDCWD /home/prattmic, 0x55ea719b564e /proc/self, 0x7ef5cefc2be8, 0x0) Sockets (and other non-vfs files) display an inode number (like /proc/PID/fd): I0410 10:54:38.909123 207684 x:0] [ 1] nc E bind(0x3 socket:[1], 0x55b5a1652040 {Family: AF_INET, Addr: , Port: 8080}, 0x10) I also fixed a few syscall args that should be Path. PiperOrigin-RevId: 243169025 Change-Id: Ic7dda6a82ae27062fe2a4a371557acfd6a21fa2a
2019-04-11Use open fids when fstat()ing gofer files.Jamie Liu
PiperOrigin-RevId: 243018347 Change-Id: I1e5b80607c1df0747482abea61db7fcf24536d37
2019-04-10Internal changeMichael Pratt
PiperOrigin-RevId: 242978508 Change-Id: I0ea59ac5ba1dd499e87c53f2e24709371048679b
2019-04-10Fix uses of RootFromContext.Nicolas Lacasse
RootFromContext can return a dirent with reference taken, or nil. We must call DecRef if (and only if) a real dirent is returned. PiperOrigin-RevId: 242965515 Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
2019-04-10DATA RACE in fs.(*Dirent).fullNameYong He
add renameMu.Lock when oldParent == newParent in order to avoid data race in following report: WARNING: DATA RACE Read at 0x00c000ba2160 by goroutine 405: gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).fullName() pkg/sentry/fs/dirent.go:246 +0x6c gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).FullName() pkg/sentry/fs/dirent.go:356 +0x8b gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*FDMap).String() pkg/sentry/kernel/fd_map.go:135 +0x1e0 fmt.(*pp).handleMethods() GOROOT/src/fmt/print.go:603 +0x404 fmt.(*pp).printArg() GOROOT/src/fmt/print.go:686 +0x255 fmt.(*pp).doPrintf() GOROOT/src/fmt/print.go:1003 +0x33f fmt.Fprintf() GOROOT/src/fmt/print.go:188 +0x7f gvisor.googlesource.com/gvisor/pkg/log.(*Writer).Emit() pkg/log/log.go:121 +0x89 gvisor.googlesource.com/gvisor/pkg/log.GoogleEmitter.Emit() pkg/log/glog.go:162 +0x1acc gvisor.googlesource.com/gvisor/pkg/log.(*GoogleEmitter).Emit() <autogenerated>:1 +0xe1 gvisor.googlesource.com/gvisor/pkg/log.(*BasicLogger).Debugf() pkg/log/log.go:177 +0x111 gvisor.googlesource.com/gvisor/pkg/log.Debugf() pkg/log/log.go:235 +0x66 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Debugf() pkg/sentry/kernel/task_log.go:48 +0xfe gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).DebugDumpState() pkg/sentry/kernel/task_log.go:66 +0x11f gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute() pkg/sentry/kernel/task_run.go:272 +0xc80 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run() pkg/sentry/kernel/task_run.go:91 +0x24b Previous write at 0x00c000ba2160 by goroutine 423: gvisor.googlesource.com/gvisor/pkg/sentry/fs.Rename() pkg/sentry/fs/dirent.go:1628 +0x61f gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1.1() pkg/sentry/syscalls/linux/sys_file.go:1864 +0x1f8 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt( gvisor.googlesource.com/g/linux/sys_file.go:51 +0x20f gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1() pkg/sentry/syscalls/linux/sys_file.go:1852 +0x218 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt() pkg/sentry/syscalls/linux/sys_file.go:51 +0x20f gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt() pkg/sentry/syscalls/linux/sys_file.go:1840 +0x180 gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Rename() pkg/sentry/syscalls/linux/sys_file.go:1873 +0x60 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall() pkg/sentry/kernel/task_syscall.go:165 +0x17a gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke() pkg/sentry/kernel/task_syscall.go:283 +0xb4 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter() pkg/sentry/kernel/task_syscall.go:244 +0x10c gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall() pkg/sentry/kernel/task_syscall.go:219 +0x1e3 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute() pkg/sentry/kernel/task_run.go:215 +0x15a9 gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run() pkg/sentry/kernel/task_run.go:91 +0x24b Reported-by: syzbot+e1babbf756fab380dfff@syzkaller.appspotmail.com Change-Id: Icd2620bb3ea28b817bf0672d454a22b9d8ee189a PiperOrigin-RevId: 242938741
2019-04-10Allow threads with CAP_SYS_RESOURCE to raise hard rlimits.Kevin Krakauer
PiperOrigin-RevId: 242919489 Change-Id: Ie3267b3bcd8a54b54bc16a6556369a19e843376f
2019-04-10Start saving MountSource.DirentCache.Nicolas Lacasse
DirentCache is already a savable type, and it ensures that it is empty at the point of Save. There is no reason not to save it along with the MountSource. This did uncover an issue where not all MountSources were properly flushed before Save. If a mount point has an open file and is then unmounted, we save the MountSource without flushing it first. This CL also fixes that by flushing all MountSources for all open FDs on Save. PiperOrigin-RevId: 242906637 Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
2019-04-10Fixed /proc/cpuinfo permissionsShiva Prasanth
This also applies these permissions to other static proc files. Change-Id: I4167e585fed49ad271aa4e1f1260babb3239a73d PiperOrigin-RevId: 242898575
2019-04-09syscalls: sendfile: limit the count to MAX_RW_COUNTLi Qiang
From sendfile spec and also the linux kernel code, we should limit the count arg to 'MAX_RW_COUNT'. This patch export 'MAX_RW_COUNT' in kernel pkg and use it in the implementation of sendfile syscall. Signed-off-by: Li Qiang <pangpei.lq@antfin.com> Change-Id: I1086fec0685587116984555abd22b07ac233fbd2 PiperOrigin-RevId: 242745831
2019-04-09Add TCP checksum verification.Bhasker Hariharan
PiperOrigin-RevId: 242704699 Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f
2019-04-08Export kernel.SignalInfoPriv.Jamie Liu
Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks. PiperOrigin-RevId: 242562428 Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90
2019-04-08Intermediate ram fs dirs should be writable.Nicolas Lacasse
We construct a ramfs tree of "scaffolding" directories for all mount points, so that a directory exists that each mount point can be mounted over. We were creating these directories without write permissions, which meant that they were not wribable even when underlayed under a writable filesystem. They should be writable. PiperOrigin-RevId: 242507789 Change-Id: I86645e35417560d862442ff5962da211dbe9b731
2019-04-05Use string type for extended attribute values, instead of []byte.Nicolas Lacasse
Strings are a better fit for this usage because they are immutable in Go, and can contain arbitrary bytes. It also allows us to avoid casting bytes to string (and the associated allocation) in the hot path when checking for overlay whiteouts. PiperOrigin-RevId: 242208856 Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
2019-04-04gvisor: Add support for the MS_NOEXEC mount optionAndrei Vagin
https://github.com/google/gvisor/issues/145 PiperOrigin-RevId: 242044115 Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
2019-04-04Remove defer from trivial ThreadID methodsMichael Pratt
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid, respectively, where removing defer saves ~100ns. This may be a small improvement to application logging, which may call gettid/getpid frequently. PiperOrigin-RevId: 242039616 Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
2019-04-04BUILD: Add useful go_path targetAdin Scannell
Change-Id: Ibd6d8a1a63826af6e62a0f0669f8f0866c8091b4 PiperOrigin-RevId: 242037969
2019-04-03Only CopyOut CPU when it changesMichael Pratt
This will save copies when preemption is not caused by a CPU migration. PiperOrigin-RevId: 241844399 Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
2019-04-03Don't release d.mu in checks for child-existence.Nicolas Lacasse
Dirent.exists() is called in Create to check whether a child with the given name already exists. Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu while calling d.Inode.Lookup. During this existence check, a racing Rename() can acquire d.mu and create a new child of the dirent with the same name. (Note that the source and destination of the rename must be in the same directory, otherwise renameMu will be taken preventing the race.) In this case, d.exists() can return false, even though a child with the same name actually does exist. This CL changes d.exists() so that it does not release d.mu while walking, thus preventing the race with Rename. It also adds comments noting that lockForRename may not take renameMu if the source and destination are in the same directory, as this is a bit surprising (at least it was to me). PiperOrigin-RevId: 241842579 Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
2019-04-03Cache ThreadGroups in PIDNamespaceMichael Pratt
If there are thousands of threads, ThreadGroupsAppend becomes very expensive as it must iterate over all Tasks to find the ThreadGroup leaders. Reduce the cost by maintaining a map of ThreadGroups which can be used to grab them all directly. The one somewhat visible change is to convert PID namespace init children zapping to a group-directed SIGKILL, as Linux did in 82058d668465 "signal: Use group_send_sig_info to kill all processes in a pid namespace". In a benchmark that creates N threads which sleep for two minutes, we see approximately this much CPU time in ThreadGroupsAppend: Before: 1 thread: 0ms 1024 threads: 30ms - 9130ms 4096 threads: 50ms - 2000ms 8192 threads: 18160ms 16384 threads: 17210ms After: 1 thread: 0ms 1024 threads: 0ms 4096 threads: 0ms 8192 threads: 0ms 16384 threads: 0ms The profiling is actually extremely noisy (likely due to cache effects), as some runs show almost no samples at 1024, 4096 threads, but obviously this does not scale to lots of threads. PiperOrigin-RevId: 241828039 Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
2019-04-03Fix index out of bounds in tty implementation.Kevin Krakauer
The previous implementation revolved around runes instead of bytes, which caused weird behavior when converting between the two. For example, peekRune would read the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for ?. However, peekRune also returned the length of the byte (1). When calling utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte character ?. tl;dr: I apparently didn't understand runes when I wrote this. PiperOrigin-RevId: 241789081 Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
2019-04-03Addresses data race in tty implementation.Kevin Krakauer
Also makes the safemem reading and writing inline, as it makes it easier to see what locks are held. PiperOrigin-RevId: 241775201 Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c