summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/kernel
AgeCommit message (Collapse)Author
2018-12-04Max link traversals should be for an entire path.Brian Geffon
The number of symbolic links that are allowed to be followed are for a full path and not just a chain of symbolic links. PiperOrigin-RevId: 224047321 Change-Id: I5e3c4caf66a93c17eeddcc7f046d1e8bb9434a40
2018-11-20Use RET_KILL_PROCESS if available in kernelFabricio Voznika
RET_KILL_THREAD doesn't work well for Go because it will kill only the offending thread and leave the process hanging. RET_TRAP can be masked out and it's not guaranteed to kill the process. RET_KILL_PROCESS is available since 4.14. For older kernel, continue to use RET_TRAP as this is the best option (likely to kill process, easy to debug). PiperOrigin-RevId: 222357867 Change-Id: Icc1d7d731274b16c2125b7a1ba4f7883fbdb2cbd
2018-11-20Fix recursive read lock taken on TaskSetFabricio Voznika
SyncSyscallFiltersToThreadGroup and Task.TheadID() both acquired TaskSet RWLock in R mode and could deadlock if a writer comes in between. PiperOrigin-RevId: 222313551 Change-Id: I4221057d8d46fec544cbfa55765c9a284fe7ebfa
2018-11-20Update futex to use usermem abstractions.Adin Scannell
This eliminates the indirection that existed in task_futex. PiperOrigin-RevId: 221832498 Change-Id: Ifb4c926d493913aa6694e193deae91616a29f042
2018-11-15Advertise vsyscall support via /proc/<pid>/maps.Rahat Mahmood
Also update test utilities for probing vsyscall support and add a metric to see if vsyscalls are actually used in sandboxes. PiperOrigin-RevId: 221698834 Change-Id: I57870ecc33ea8c864bd7437833f21aa1e8117477
2018-11-08Create stubs for syscalls upto Linux 4.4.Rahat Mahmood
Create syscall stubs for missing syscalls upto Linux 4.4 and advertise a kernel version of 4.4. PiperOrigin-RevId: 220667680 Change-Id: Idbdccde538faabf16debc22f492dd053a8af0ba7
2018-11-01Prevent premature destruction of shm segments.Rahat Mahmood
Shm segments can be marked for lazy destruction via shmctl(IPC_RMID), which destroys a segment once it is no longer attached to any processes. We were unconditionally decrementing the segment refcount on shmctl(IPC_RMID) which allowed a user to force a segment to be destroyed by repeatedly calling shmctl(IPC_RMID), with outstanding memory maps to the segment. This is problematic because the memory released by a segment destroyed this way can be reused by a different process while remaining accessible by the process with outstanding maps to the segment. PiperOrigin-RevId: 219713660 Change-Id: I443ab838322b4fb418ed87b2722c3413ead21845
2018-10-23Fix panic on creation of zero-len shm segments.Rahat Mahmood
Attempting to create a zero-len shm segment causes a panic since we try to allocate a zero-len filemem region. The existing code had a guard to disallow this, but the check didn't encode the fact that requesting a private segment implies a segment creation regardless of whether IPC_CREAT is explicitly specified. PiperOrigin-RevId: 218405743 Change-Id: I30aef1232b2125ebba50333a73352c2f907977da
2018-10-23Track paths and provide a rename hook.Adin Scannell
This change also adds extensive testing to the p9 package via mocks. The sanity checks and type checks are moved from the gofer into the core package, where they can be more easily validated. PiperOrigin-RevId: 218296768 Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
2018-10-20Add more unimplemented syscall eventsFabricio Voznika
Added events for *ctl syscalls that may have multiple different commands. For runsc, each syscall event is only logged once. For *ctl syscalls, use the cmd as identifier, not only the syscall number. PiperOrigin-RevId: 218015941 Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
2018-10-19Use correct company name in copyright headerIan Gudger
PiperOrigin-RevId: 217951017 Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-17Check thread group CPU timers in the CPU clock ticker.Jamie Liu
This reduces the number of goroutines and runtime timers when ITIMER_VIRTUAL or ITIMER_PROF are enabled, or when RLIMIT_CPU is set. This also ensures that thread group CPU timers only advance if running tasks are observed at the time the CPU clock advances, mostly eliminating the possibility that a CPU timer expiration observes no running tasks and falls back to the group leader. PiperOrigin-RevId: 217603396 Change-Id: Ia24ce934d5574334857d9afb5ad8ca0b6a6e65f4
2018-10-17runsc: Support job control signals for the root container.Nicolas Lacasse
Now containers run with "docker run -it" support control characters like ^C and ^Z. This required refactoring our signal handling a bit. Signals delivered to the "runsc boot" process are turned into loader.Signal calls with the appropriate delivery mode. Previously they were always sent directly to PID 1. PiperOrigin-RevId: 217566770 Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-17Fix PTRACE_GETREGSET write sizeMichael Pratt
The existing logic is backwards and writes iov_len == 0 for a full write. PiperOrigin-RevId: 217560377 Change-Id: I5a39c31bf0ba9063a8495993bfef58dc8ab7c5fa
2018-10-17Move Unix transport out of netstackIan Gudger
PiperOrigin-RevId: 217557656 Change-Id: I63d27635b1a6c12877279995d2d9847b6a19da9b
2018-10-15Merge host.endpoint into host.ConnectedEndpointIan Gudger
host.endpoint contained duplicated logic from the sockerpair implementation and host.ConnectedEndpoint. Remove host.endpoint in favor of a host.ConnectedEndpoint wrapped in a socketpair end. PiperOrigin-RevId: 217240096 Change-Id: I4a3d51e3fe82bdf30e2d0152458b8499ab4c987c
2018-10-10When creating a new process group, add it to the session.Nicolas Lacasse
PiperOrigin-RevId: 216554791 Change-Id: Ia6b7a2e6eaad80a81b2a8f2e3241e93ebc2bda35
2018-10-08Implement shared futexes.Jamie Liu
- Shared futex objects on shared mappings are represented by Mappable + offset, analogous to Linux's use of inode + offset. Add type futex.Key, and change the futex.Manager bucket API to use futex.Keys instead of addresses. - Extend the futex.Checker interface to be able to return Keys for memory mappings. It returns Keys rather than just mappings because whether the address or the target of the mapping is used in the Key depends on whether the mapping is MAP_SHARED or MAP_PRIVATE; this matters because using mapping target for a futex on a MAP_PRIVATE mapping causes it to stop working across COW-breaking. - futex.Manager.WaitComplete depends on atomic updates to futex.Waiter.addr to determine when it has locked the right bucket, which is much less straightforward for struct futex.Waiter.key. Switch to an atomically-accessed futex.Waiter.bucket pointer. - futex.Manager.Wake now needs to take a futex.Checker to resolve addresses for shared futexes. CLONE_CHILD_CLEARTID requires the exit path to perform a shared futex wakeup (Linux: kernel/fork.c:mm_release() => sys_futex(tsk->clear_child_tid, FUTEX_WAKE, ...)). This is a problem because futexChecker is in the syscalls/linux package. Move it to kernel. PiperOrigin-RevId: 216207039 Change-Id: I708d68e2d1f47e526d9afd95e7fed410c84afccf
2018-10-03Fix panic if FIOASYNC callback is registered and triggered without targetIan Gudger
PiperOrigin-RevId: 215674589 Change-Id: I4f8871b64c570dc6da448d2fe351cec8a406efeb
2018-10-03Add S/R support for FIOASYNCIan Gudger
PiperOrigin-RevId: 215655197 Change-Id: I668b1bc7c29daaf2999f8f759138bcbb09c4de6f
2018-10-01runsc: Support job control signals in "exec -it".Nicolas Lacasse
Terminal support in runsc relies on host tty file descriptors that are imported into the sandbox. Application tty ioctls are sent directly to the host fd. However, those host tty ioctls are associated in the host kernel with a host process (in this case runsc), and the host kernel intercepts job control characters like ^C and send signals to the host process. Thus, typing ^C into a "runsc exec" shell will send a SIGINT to the runsc process. This change makes "runsc exec" handle all signals, and forward them into the sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is associated with a particular container process in the sandbox, the signal must be associated with the same container process. One big difficulty is that the signal should not necessarily be sent to the sandbox process started by "exec", but instead must be sent to the foreground process group for the tty. For example, we may exec "bash", and from bash call "sleep 100". A ^C at this point should SIGINT sleep, not bash. To handle this, tty files inside the sandbox must keep track of their foreground process group, which is set/get via ioctls. When an incoming ContainerSignal urpc comes in, we look up the foreground process group via the tty file. Unfortunately, this means we have to expose and cache the tty file in the Loader. Note that "runsc exec" now handles signals properly, but "runs run" does not. That will come in a later CL, as this one is complex enough already. Example: root@:/usr/local/apache2# sleep 100 ^C root@:/usr/local/apache2# sleep 100 ^Z [1]+ Stopped sleep 100 root@:/usr/local/apache2# fg sleep 100 ^C root@:/usr/local/apache2# PiperOrigin-RevId: 215334554 Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
2018-09-27Implement 'runsc kill --all'Fabricio Voznika
In order to implement kill --all correctly, the Sentry needs to track all tasks that belong to a given container. This change introduces ContainerID to the task, that gets inherited by all children. 'kill --all' then iterates over all tasks comparing the ContainerID field to find all processes that need to be signalled. PiperOrigin-RevId: 214841768 Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
2018-09-17runsc: Enable waiting on exited processes.Kevin Krakauer
This makes `runsc wait` behave more like waitpid()/wait4() in that: - Once a process has run to completion, you can wait on it and get its exit code. - Processes not waited on will consume memory (like a zombie process) PiperOrigin-RevId: 213358916 Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
2018-09-17Allow kernel.(*Task).Block to accept an extract only channelIan Gudger
PiperOrigin-RevId: 213328293 Change-Id: I4164133e6f709ecdb89ffbb5f7df3324c273860a
2018-09-14Fix interaction between rt_sigtimedwait and ignored signals.Jamie Liu
PiperOrigin-RevId: 213011782 Change-Id: I716c6ea3c586b0c6c5a892b6390d2d11478bc5af
2018-09-13Plumb monotonic time to netstackIan Gudger
Netstack needs to be portable, so this seems to be preferable to using raw system calls. PiperOrigin-RevId: 212917409 Change-Id: I7b2073e7db4b4bf75300717ca23aea4c15be944c
2018-09-13Extend memory usage events to report mapped memory usage.Rahat Mahmood
PiperOrigin-RevId: 212887555 Change-Id: I3545383ce903cbe9f00d9b5288d9ef9a049b9f4f
2018-09-07Add 'Starting gVisor...' message to syslogFabricio Voznika
This allows applications to verify they are running with gVisor. It also helps debugging when running with a mix of container runtimes. Closes #54 PiperOrigin-RevId: 212059457 Change-Id: I51d9595ee742b58c1f83f3902ab2e2ecbd5cedec
2018-09-07Use root abstract socket namespace for execFabricio Voznika
PiperOrigin-RevId: 211999211 Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d
2018-09-06createProcessArgs.RootFromContext should return process Root if it exists.Nicolas Lacasse
It was always returning the MountNamespace root, which may be different from the process Root if the process is in a chroot environment. PiperOrigin-RevId: 211862181 Change-Id: I63bfeb610e2b0affa9fdbdd8147eba3c39014480
2018-09-04Distinguish Element and Linker for ilist.Adin Scannell
Furthermore, allow for the specification of an ElementMapper. This allows a single "Element" type to exist on multiple inline lists, and work without having to embed the entry type. This is a requisite change for supporting a per-Inode list of Dirents. PiperOrigin-RevId: 211467497 Change-Id: If2768999b43e03fdaecf8ed15f435fe37518d163
2018-08-31Document more task-goroutine-owned fields in kernel.Task.Jamie Liu
Task.creds can only be changed by the task's own set*id and execve syscalls, and Task namespaces can only be changed by the task's own unshare/setns syscalls. PiperOrigin-RevId: 211156279 Change-Id: I94d57105d34e8739d964400995a8a5d76306b2a0
2018-08-31Disintegrate kernel.TaskResources.Jamie Liu
This allows us to call kernel.FDMap.DecRef without holding mutexes cleanly. PiperOrigin-RevId: 211139657 Change-Id: Ie59d5210fb9282e1950e2e40323df7264a01bcec
2018-08-31Delete the long-obsolete kernel.TaskMaybe interface.Jamie Liu
PiperOrigin-RevId: 211131855 Change-Id: Ia7799561ccd65d16269e0ae6f408ab53749bca37
2018-08-28fasync: don't keep mutex after returnIan Gudger
PiperOrigin-RevId: 210637533 Change-Id: I3536c3f9efb54732a0d8ada8bc299142b2c1682f
2018-08-27Add /proc/sys/kernel/shm[all,max,mni].Brian Geffon
PiperOrigin-RevId: 210459956 Change-Id: I51859b90fa967631e0a54a390abc3b5541fbee66
2018-08-23Implement POSIX per-process interval timers.Jamie Liu
PiperOrigin-RevId: 210021612 Change-Id: If7c161e6fd08cf17942bfb6bc5a8d2c4e271c61e
2018-08-21Fix races in kernel.(*Task).Value()Ian Gudger
PiperOrigin-RevId: 209627180 Change-Id: Idc84afd38003427e411df6e75abfabd9174174e1
2018-08-15runsc fsgofer: Support dynamic serving of filesystems.Kevin Krakauer
When multiple containers run inside a sentry, each container has its own root filesystem and set of mounts. Containers are also added after sentry boot rather than all configured and known at boot time. The fsgofer needs to be able to serve the root filesystem of each container. Thus, it must be possible to add filesystems after the fsgofer has already started. This change: * Creates a URPC endpoint within the gofer process that listens for requests to serve new content. * Enables the sentry, when starting a new container, to add the new container's filesystem. * Mounts those new filesystems at separate roots within the sentry. PiperOrigin-RevId: 208903248 Change-Id: Ifa91ec9c8caf5f2f0a9eead83c4a57090ce92068
2018-08-09Fix missing O_LARGEFILE from O_CREAT filesMichael Pratt
Cleanup some more syscall.O_* references while we're here. PiperOrigin-RevId: 208133460 Change-Id: I48db71a38f817e4f4673977eafcc0e3874eb9a25
2018-08-07Hold TaskSet.mu in Task.Parent.Jamie Liu
PiperOrigin-RevId: 207766238 Change-Id: Id3b66d8fe1f44c3570f67fa5ae7ba16021e35be1
2018-08-07sentry: make epoll.pollEntry wait for the file operation in restore.Zhaozhong Ni
PiperOrigin-RevId: 207737935 Change-Id: I3a301ece1f1d30909715f36562474e3248b6a0d5
2018-08-02Automated rollback of changelist 207037226Zhaozhong Ni
PiperOrigin-RevId: 207125440 Change-Id: I6c572afb4d693ee72a0c458a988b0e96d191cd49
2018-08-02Add seccomp(2) support.Brian Geffon
Add support for the seccomp syscall and the flag SECCOMP_FILTER_FLAG_TSYNC. PiperOrigin-RevId: 207101507 Change-Id: I5eb8ba9d5ef71b0e683930a6429182726dc23175
2018-08-01Automated rollback of changelist 207007153Michael Pratt
PiperOrigin-RevId: 207037226 Change-Id: I8b5f1a056d4f3eab17846f2e0193bb737ecb5428
2018-08-01stateify: convert all packages to use explicit mode.Zhaozhong Ni
PiperOrigin-RevId: 207007153 Change-Id: Ifedf1cc3758dc18be16647a4ece9c840c1c636c9
2018-07-31proc: show file flags in fdinfoAndrei Vagin
Currently, there is an attempt to print FD flags, but they are not decoded into a number, so we see something like this: /criu # cat /proc/self/fdinfo/0 flags: {%!o(bool=000false)} Actually, fdinfo has to contain file flags. Change-Id: Idcbb7db908067447eb9ae6f2c3cfb861f2be1a97 PiperOrigin-RevId: 206794498
2018-07-27stateify: support explicit annotation mode; convert refs and stack packages.Zhaozhong Ni
We have been unnecessarily creating too many savable types implicitly. PiperOrigin-RevId: 206334201 Change-Id: Idc5a3a14bfb7ee125c4f2bb2b1c53164e46f29a8
2018-07-19kernel: mutations on creds now require a copy.Adin Scannell
PiperOrigin-RevId: 205315612 Change-Id: I9a0a1e32c8abfb7467a38743b82449cc92830316
2018-07-16Add CPUID faulting for ptrace and KVM.Adin Scannell
PiperOrigin-RevId: 204858314 Change-Id: I8252bf8de3232a7a27af51076139b585e73276d4