summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/kernel
AgeCommit message (Collapse)Author
2019-04-11Use open fids when fstat()ing gofer files.Jamie Liu
PiperOrigin-RevId: 243018347 Change-Id: I1e5b80607c1df0747482abea61db7fcf24536d37
2019-04-10Internal changeMichael Pratt
PiperOrigin-RevId: 242978508 Change-Id: I0ea59ac5ba1dd499e87c53f2e24709371048679b
2019-04-10Fix uses of RootFromContext.Nicolas Lacasse
RootFromContext can return a dirent with reference taken, or nil. We must call DecRef if (and only if) a real dirent is returned. PiperOrigin-RevId: 242965515 Change-Id: Ie2b7b4cb19ee09b6ccf788b71f3fd7efcdf35a11
2019-04-10Allow threads with CAP_SYS_RESOURCE to raise hard rlimits.Kevin Krakauer
PiperOrigin-RevId: 242919489 Change-Id: Ie3267b3bcd8a54b54bc16a6556369a19e843376f
2019-04-10Start saving MountSource.DirentCache.Nicolas Lacasse
DirentCache is already a savable type, and it ensures that it is empty at the point of Save. There is no reason not to save it along with the MountSource. This did uncover an issue where not all MountSources were properly flushed before Save. If a mount point has an open file and is then unmounted, we save the MountSource without flushing it first. This CL also fixes that by flushing all MountSources for all open FDs on Save. PiperOrigin-RevId: 242906637 Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
2019-04-09syscalls: sendfile: limit the count to MAX_RW_COUNTLi Qiang
From sendfile spec and also the linux kernel code, we should limit the count arg to 'MAX_RW_COUNT'. This patch export 'MAX_RW_COUNT' in kernel pkg and use it in the implementation of sendfile syscall. Signed-off-by: Li Qiang <pangpei.lq@antfin.com> Change-Id: I1086fec0685587116984555abd22b07ac233fbd2 PiperOrigin-RevId: 242745831
2019-04-08Export kernel.SignalInfoPriv.Jamie Liu
Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks. PiperOrigin-RevId: 242562428 Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90
2019-04-04Remove defer from trivial ThreadID methodsMichael Pratt
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid, respectively, where removing defer saves ~100ns. This may be a small improvement to application logging, which may call gettid/getpid frequently. PiperOrigin-RevId: 242039616 Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
2019-04-03Only CopyOut CPU when it changesMichael Pratt
This will save copies when preemption is not caused by a CPU migration. PiperOrigin-RevId: 241844399 Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
2019-04-03Cache ThreadGroups in PIDNamespaceMichael Pratt
If there are thousands of threads, ThreadGroupsAppend becomes very expensive as it must iterate over all Tasks to find the ThreadGroup leaders. Reduce the cost by maintaining a map of ThreadGroups which can be used to grab them all directly. The one somewhat visible change is to convert PID namespace init children zapping to a group-directed SIGKILL, as Linux did in 82058d668465 "signal: Use group_send_sig_info to kill all processes in a pid namespace". In a benchmark that creates N threads which sleep for two minutes, we see approximately this much CPU time in ThreadGroupsAppend: Before: 1 thread: 0ms 1024 threads: 30ms - 9130ms 4096 threads: 50ms - 2000ms 8192 threads: 18160ms 16384 threads: 17210ms After: 1 thread: 0ms 1024 threads: 0ms 4096 threads: 0ms 8192 threads: 0ms 16384 threads: 0ms The profiling is actually extremely noisy (likely due to cache effects), as some runs show almost no samples at 1024, 4096 threads, but obviously this does not scale to lots of threads. PiperOrigin-RevId: 241828039 Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
2019-04-02Set options on the correct Task in PTRACE_SEIZE.Jamie Liu
$ docker run --rm --runtime=runsc -it --cap-add=SYS_PTRACE debian bash -c "apt-get update && apt-get install strace && strace ls" ... Setting up strace (4.15-2) ... execve("/bin/ls", ["ls"], [/* 6 vars */]) = 0 brk(NULL) = 0x5646d8c1e000 uname({sysname="Linux", nodename="114ef93d2db3", ...}) = 0 ... PiperOrigin-RevId: 241643321 Change-Id: Ie4bce27a7fb147eef07bbae5895c6ef3f529e177
2019-04-02Fix more data races in shm debug messages.Rahat Mahmood
PiperOrigin-RevId: 241630409 Change-Id: Ie0df5f5a2f20c2d32e615f16e2ba43c88f963181
2019-04-01Save/restore simple devices.Rahat Mahmood
We weren't saving simple devices' last allocated inode numbers, which caused inode number reuse across S/R. PiperOrigin-RevId: 241414245 Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-04-01gvisor: convert ilist to ilist:generic_listAndrei Vagin
ilist:generic_list works faster (cl/240185278) and the code looks cleaner without type casting. PiperOrigin-RevId: 241381175 Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
2019-03-29Use kernel.Task.CopyScratchBuffer in syscalls/linux where possible.Jamie Liu
PiperOrigin-RevId: 241072126 Change-Id: Ib4d9f58f550732ac4c5153d3cf159a5b1a9749da
2019-03-29Treat fsync errors during save as SaveRejection errors.Nicolas Lacasse
PiperOrigin-RevId: 241055485 Change-Id: I70259e9fef59bdf9733b35a2cd3319359449dd45
2019-03-29Treat ENOSPC as a state-file error during save.Nicolas Lacasse
PiperOrigin-RevId: 241028806 Change-Id: I770bf751a2740869a93c3ab50370a727ae580470
2019-03-28set task's name when forkchris.zn
When fork a child process, the name filed of TaskContext is not set. It results in that when we cat /proc/{pid}/status, the name filed is null. Like this: Name: State: S (sleeping) Tgid: 28 Pid: 28 PPid: 26 TracerPid: 0 FDSize: 8 VmSize: 89712 kB VmRSS: 6648 kB Threads: 1 CapInh: 00000000a93d35fb CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 00000000a93d35fb Seccomp: 0 Change-Id: I5d469098c37cedd19da16b7ffab2e546a28a321e PiperOrigin-RevId: 240893304
2019-03-25Call memmap.Mappable.Translate with more conservative usermem.AccessType.Jamie Liu
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate (although it unsets AccessType.Write if the vma is private). This somewhat simplifies handling of pmas, since it means only COW-break needs to replace existing pmas. However, it also means that a MAP_SHARED mapping of a file opened O_RDWR dirties the file, regardless of the mapping's permissions and whether or not the mapping is ever actually written to with I/O that ignores permissions (e.g. ptrace(PTRACE_POKEDATA)). To fix this: - Change the pma-getting path to request only the permissions that are required for the calling access. - Change memmap.Mappable.Translate to take requested permissions, and return allowed permissions. This preserves the existing behavior in the common cases where the memmap.Mappable isn't fsutil.CachingInodeOperations and doesn't care if the translated platform.File pages are written to. - Change the MM.getPMAsLocked path to support permission upgrading of pmas outside of copy-on-write. PiperOrigin-RevId: 240196979 Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
2019-03-25epoll: use ilist:generic_list instead of ilist:ilistAndrei Vagin
ilist:generic_list works faster than ilist:ilist. Here is a beanchmark test to measure performance of epoll_wait, when readyList isn't empty. It shows about 30% better performance with these changes. Benchmark Time(ns) CPU(ns) Iterations Before: BM_EpollAllEvents 46725 46899 14286 After: BM_EpollAllEvents 33167 33300 18919 PiperOrigin-RevId: 240185278 Change-Id: I3e33f9b214db13ab840b91613400525de5b58d18
2019-03-22Implement PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN.Jamie Liu
PiperOrigin-RevId: 239803092 Change-Id: I42d612ed6a889e011e8474538958c6de90c6fcab
2019-03-20gvisor: don't allocate a new credential object on forkAndrei Vagin
A credential object is immutable, so we don't need to copy it for a new task. PiperOrigin-RevId: 239519266 Change-Id: I0632f641fdea9554779ac25d84bee4231d0d18f2
2019-03-18Remove racy access to shm fields.Rahat Mahmood
PiperOrigin-RevId: 239016776 Change-Id: Ia7af4258e7c69b16a4630a6f3278aa8e6b627746
2019-03-14Decouple filemem from platform and move it to pgalloc.MemoryFile.Jamie Liu
This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-05Priority-inheritance futex implementationFabricio Voznika
It is Implemented without the priority inheritance part given that gVisor defers scheduling decisions to Go runtime and doesn't have control over it. PiperOrigin-RevId: 236989545 Change-Id: I714c8ca0798743ecf3167b14ffeb5cd834302560
2019-03-01Add semctl(GETPID) syscallFabricio Voznika
Also added unimplemented notification for semctl(2) commands. PiperOrigin-RevId: 236340672 Change-Id: I0795e3bd2e6d41d7936fabb731884df426a42478
2019-02-20Make some ptrace commands x86-onlyHaibo Xu
Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I9751f859332d433ca772d6b9733f5a5a64398ec7 PiperOrigin-RevId: 234877624
2019-02-19Set rax to syscall number on SECCOMP_RET_TRAP.Jamie Liu
PiperOrigin-RevId: 234690475 Change-Id: I1cbfb5aecd4697a4a26ec8524354aa8656cc3ba1
2019-02-19Fix clone(CLONE_NEWUSER).Jamie Liu
- Use new user namespace for namespace creation checks. - Ensure userns is never nil since it's used by other namespaces. PiperOrigin-RevId: 234673175 Change-Id: I4b9d9d1e63ce4e24362089793961a996f7540cd9
2019-02-14Don't allow writing or reading to TTY unless process group is in foreground.Nicolas Lacasse
If a background process tries to read from a TTY, linux sends it a SIGTTIN unless the signal is blocked or ignored, or the process group is an orphan, in which case the syscall returns EIO. See drivers/tty/n_tty.c:n_tty_read()=>job_control(). If a background process tries to write a TTY, set the termios, or set the foreground process group, linux then sends a SIGTTOU. If the signal is ignored or blocked, linux allows the write. If the process group is an orphan, the syscall returns EIO. See drivers/tty/tty_io.c:tty_check_change(). PiperOrigin-RevId: 234044367 Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
2019-02-07Implement /proc/net/unix.Rahat Mahmood
PiperOrigin-RevId: 232948478 Change-Id: Ib830121e5e79afaf5d38d17aeef5a1ef97913d23
2019-02-07Implement semctl(2) SETALL and GETALLFabricio Voznika
PiperOrigin-RevId: 232914984 Change-Id: Id2643d7ad8e986ca9be76d860788a71db2674cda
2019-01-31Move package sync to third_partyMichael Pratt
PiperOrigin-RevId: 231889261 Change-Id: I482f1df055bcedf4edb9fe3fe9b8e9c80085f1a0
2019-01-31Remove license commentsMichael Pratt
Nothing reads them and they can simply get stale. Generated with: $ sed -i "s/licenses(\(.*\)).*/licenses(\1)/" **/BUILD PiperOrigin-RevId: 231818945 Change-Id: Ibc3f9838546b7e94f13f217060d31f4ada9d4bf0
2019-01-14Remove fs.Handle, ramfs.Entry, and all the DeprecatedFileOperations.Nicolas Lacasse
More helper structs have been added to the fsutil package to make it easier to implement fs.InodeOperations and fs.FileOperations. PiperOrigin-RevId: 229305982 Change-Id: Ib6f8d3862f4216745116857913dbfa351530223b
2019-01-09Minor memevent fixes.Jamie Liu
- Call MemoryEvents.done.Add(1) outside of MemoryEvents.run() so that if MemoryEvents.Stop() => MemoryEvents.done.Wait() is called before the goroutine starts running, it still waits for the goroutine to stop. - Use defer to call MemoryEvents.done.Done() in MemoryEvents.run() so that it's called even if the goroutine panics. PiperOrigin-RevId: 228623307 Change-Id: I1b0459e7999606c1a1a271b16092b1ca87005015
2019-01-08Improve loader related error messages returned to users.Brian Geffon
PiperOrigin-RevId: 228382827 Change-Id: Ica1d30e0df826bdd77f180a5092b2b735ea5c804
2019-01-08Grant no initial capabilities to non-root UIDs.Jamie Liu
See modified comment in auth.NewUserCredentials(); compare to the behavior of setresuid(2) as implemented by //pkg/sentry/kernel/task_identity.go:kernel.Task.setKUIDsUncheckedLocked(). PiperOrigin-RevId: 228381765 Change-Id: I45238777c8f63fcf41b99fce3969caaf682fe408
2018-12-26Add EventChannel messages for uncaught signals.Brian Geffon
PiperOrigin-RevId: 226936778 Change-Id: I2a6dda157c55d39d81e1b543ab11a58a0bfe5c05
2018-12-18Add BPFAction type with StringerFabricio Voznika
PiperOrigin-RevId: 226018694 Change-Id: I98965e26fe565f37e98e5df5f997363ab273c91b
2018-12-14Move fdnotifier package to reduce internal confusion.Adin Scannell
PiperOrigin-RevId: 225632398 Change-Id: I909e7e2925aa369adc28e844c284d9a6108e85ce
2018-12-12Filesystems shouldn't be saving references to Platform.Rahat Mahmood
Platform objects are not savable, storing references to them in filesystem datastructures would cause save to fail if someone actually passed in a Platform. Current implementations work because everywhere a Platform is expected, we currently pass in a Kernel object which embeds Platform and thus satisfies the interface. Eliminate this indirection and save pointers to Kernel directly. PiperOrigin-RevId: 225288336 Change-Id: Ica399ff43f425e15bc150a0d7102196c3d54a2ab
2018-12-12Fix a data race on Shm.key.Rahat Mahmood
PiperOrigin-RevId: 225240907 Change-Id: Ie568ce3cd643f3e4a0eaa0444f4ed589dcf6031f
2018-12-12Pass information about map writableness to filesystems.Rahat Mahmood
This is necessary to implement file seals for memfds. PiperOrigin-RevId: 225239394 Change-Id: Ib3f1ab31385afc4b24e96cd81a05ef1bebbcbb70
2018-12-10Add type safety to shm ids and keys.Rahat Mahmood
PiperOrigin-RevId: 224864380 Change-Id: I49542279ad56bf15ba462d3de1ef2b157b31830a
2018-12-10Validate FS_BASE in Task.CloneMichael Pratt
arch_prctl already verified that the new FS_BASE was canonical, but Task.Clone did not. Centralize these checks in the arch packages. Failure to validate could cause an error in PTRACE_SET_REGS when we try to switch to the app. PiperOrigin-RevId: 224862398 Change-Id: Iefe63b3f9aa6c4810326b8936e501be3ec407f14
2018-12-06Add counters for memory events.Rahat Mahmood
Also ensure an event is emitted at startup. PiperOrigin-RevId: 224372065 Change-Id: I5f642b6d6b13c6468ee8f794effe285fcbbf29cf
2018-12-04Max link traversals should be for an entire path.Brian Geffon
The number of symbolic links that are allowed to be followed are for a full path and not just a chain of symbolic links. PiperOrigin-RevId: 224047321 Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-11-20Use RET_KILL_PROCESS if available in kernelFabricio Voznika
RET_KILL_THREAD doesn't work well for Go because it will kill only the offending thread and leave the process hanging. RET_TRAP can be masked out and it's not guaranteed to kill the process. RET_KILL_PROCESS is available since 4.14. For older kernel, continue to use RET_TRAP as this is the best option (likely to kill process, easy to debug). PiperOrigin-RevId: 222357867 Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20Fix recursive read lock taken on TaskSetFabricio Voznika
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock in R mode and could deadlock if a writer comes in between. PiperOrigin-RevId: 222313551 Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa