summaryrefslogtreecommitdiffhomepage
path: root/runsc/boot/controller.go
AgeCommit message (Collapse)Author
2021-03-18Skip /dev submount hack on VFS2.Jamie Liu
containerd usually configures both /dev and /dev/shm as tmpfs mounts, e.g.: ``` "mounts": [ ... { "destination": "/dev", "type": "tmpfs", "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ] }, ... { "destination": "/dev/shm", "type": "tmpfs", "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/shm", "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=67108864" ] }, ... ``` (This is mostly consistent with how Linux is usually configured, except that /dev is conventionally devtmpfs, not regular tmpfs. runc/libcontainer implements OCI-runtime-spec-undocumented behavior to create /dev/{ptmx,fd,stdin,stdout,stderr} in non-bind /dev mounts. runsc silently switches /dev to devtmpfs. In VFS1, this is necessary to get device files like /dev/null at all, since VFS1 doesn't support real device special files, only what is hardcoded in devfs. VFS2 does support device special files, but using devtmpfs is the easiest way to get pre-created files in /dev.) runsc ignores many /dev submounts in the spec, including /dev/shm. In VFS1, this appears to be to avoid introducing a submount overlay for /dev, and is mostly fine since the typical mode for the /dev/shm mount is ~consistent with the mode of the /dev/shm directory provided by devfs (modulo the sticky bit). In VFS2, this is vestigial (VFS2 does not use submount overlays), and devtmpfs' /dev/shm mode is correct for the mount point but not the mount. So turn off this behavior for VFS2. After this change: ``` $ docker run --rm -it ubuntu:focal ls -lah /dev/shm total 0 drwxrwxrwt 2 root root 40 Mar 18 00:16 . drwxr-xr-x 5 root root 360 Mar 18 00:16 .. $ docker run --runtime=runsc --rm -it ubuntu:focal ls -lah /dev/shm total 0 drwxrwxrwx 1 root root 0 Mar 18 00:16 . dr-xr-xr-x 1 root root 0 Mar 18 00:16 .. $ docker run --runtime=runsc-vfs2 --rm -it ubuntu:focal ls -lah /dev/shm total 0 drwxrwxrwt 2 root root 40 Mar 18 00:16 . drwxr-xr-x 5 root root 320 Mar 18 00:16 .. ``` Fixes #5687 PiperOrigin-RevId: 363699385
2021-03-06[op] Replace syscall package usage with golang.org/x/sys/unix in runsc/.Ayush Ranjan
The syscall package has been deprecated in favor of golang.org/x/sys. Note that syscall is still used in some places because the following don't seem to have an equivalent in unix package: - syscall.SysProcIDMap - syscall.Credential Updates #214 PiperOrigin-RevId: 361381490
2021-02-22Fix `runsc kill --pid`Fabricio Voznika
Previously, loader.signalProcess was inconsitently using both root and container's PID namespace to find the process. It used root namespace for the exec'd process and container's PID namespace for other processes. This fixes the code to use the root PID namespace across the board, which is the same PID reported in `runsc ps` (or soon will after https://github.com/google/gvisor/pull/5519). PiperOrigin-RevId: 358836297
2021-01-05Add benchmarks targets to BuildKite.Adin Scannell
This includes minor fix-ups: * Handle SIGTERM in runsc debug, to exit gracefully. * Fix cmd.debug.go opening all profiles as RDONLY. * Fix the test name in fio_test.go, and encode the block size in the test. PiperOrigin-RevId: 350205718
2020-12-29Make profiling commands synchronous.Adin Scannell
This allows for a model of profiling when you can start collection, and it will terminate when the sandbox terminates. Without this synchronous call, it is effectively impossible to collect length blocking and mutex profiles. PiperOrigin-RevId: 349483418
2020-11-17Add support for TTY in multi-containerFabricio Voznika
Fixes #2714 PiperOrigin-RevId: 342950412
2020-11-05Fix failure setting OOM score adjustmentFabricio Voznika
When OOM score adjustment needs to be set, all the containers need to be loaded to find all containers that belong to the sandbox. However, each load signals the container to ensure it is still alive. OOM score adjustment is set during creation and deletion of every container, generating a flood of signals to all containers. The fix removes the signal check when it's not needed. There is also a race fetching OOM score adjustment value from the parent when the sandbox exits at the same time (the time it took to signal containers above made this window quite large). The fix is to store the original value in the sandbox state file and use it when the value needs to be restored. Also add more logging and made the existing ones more consistent to help with debugging. PiperOrigin-RevId: 340940799
2020-10-23Support VFS2 save/restore.Jamie Liu
Inode number consistency checks are now skipped in save/restore tests for reasons described in greatest detail in StatTest.StateDoesntChangeAfterRename. They pass in VFS1 due to the bug described in new test case SimpleStatTest.DifferentFilesHaveDifferentDeviceInodeNumberPairs. Fixes #1663 PiperOrigin-RevId: 338776148
2020-09-04Simplify FD handling for container start/execFabricio Voznika
VFS1 and VFS2 host FDs have different dupping behavior, making error prone to code for both. Change the contract so that FDs are released as they are used, so the caller can simple defer a block that closes all remaining files. This also addresses handling of partial failures. With this fix, more VFS2 tests can be enabled. Updates #1487 PiperOrigin-RevId: 330112266
2020-08-19Move boot.Config to its own packageFabricio Voznika
Updates #3494 PiperOrigin-RevId: 327548511
2020-08-05Stop profiling when the sentry exitsFabricio Voznika
Also removes `--profile-goroutine` because it's equivalent to `debug --stacks`. PiperOrigin-RevId: 325061502
2020-07-14Prepare boot.Loader to support multi-container TTYFabricio Voznika
- Combine process creation code that is shared between root and subcontainer processes - Move root container information into a struct for clarity Updates #2714 PiperOrigin-RevId: 321204798
2020-02-26add profile optionmoricho
2020-02-20Initial network namespace support.gVisor bot
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets will have the same behavior as it is now. Before the userspace is able to create device, the default loopback device can be used to test. /proc/net and /sys/net will still be connected to the root network stack; this is the same behavior now. Issue #1833 PiperOrigin-RevId: 296309389
2019-12-06Add runtime tracing.Adin Scannell
This adds meaningful annotations to the trace generated by the runtime/trace package. PiperOrigin-RevId: 284290115
2019-11-01Allow the watchdog to detect when the sandbox is stuck during setup.Nicolas Lacasse
The watchdog currently can find stuck tasks, but has no way to tell if the sandbox is stuck before the application starts executing. This CL adds a startup timeout and action to the watchdog. If Start() is not called before the given timeout (if non-zero), then the watchdog will take the action. PiperOrigin-RevId: 277970577
2019-10-23fix typokevin.xu
fix a typo
2019-10-23remove duplicated periodkevin.xu
remove a duplicated period
2019-10-16Fix problem with open FD when copy up is triggered in overlayfsFabricio Voznika
Linux kernel before 4.19 doesn't implement a feature that updates open FD after a file is open for write (and is copied to the upper layer). Already open FD will continue to read the old file content until they are reopened. This is especially problematic for gVisor because it caches open files. Flag was added to force readonly files to be reopenned when the same file is open for write. This is only needed if using kernels prior to 4.19. Closes #1006 It's difficult to really test this because we never run on tests on older kernels. I'm adding a test in GKE which uses kernels with the overlayfs problem for 1.14 and lower. PiperOrigin-RevId: 275115289
2019-10-08Remove stale TODOFabricio Voznika
PiperOrigin-RevId: 273630282
2019-10-07Rename epsocket to netstack.Kevin Krakauer
PiperOrigin-RevId: 273365058
2019-08-08netstack: Don't start endpoint goroutines too soon on restore.Rahat Mahmood
Endpoint protocol goroutines were previously started as part of loading the endpoint. This is potentially too soon, as resources used by these goroutine may not have been loaded. Protocol goroutines may perform meaningful work as soon as they're started (ex: incoming connect) which can cause them to indirectly access resources that haven't been loaded yet. This CL defers resuming all protocol goroutines until the end of restore. PiperOrigin-RevId: 262409429
2019-08-02Stops container if gofer is killedFabricio Voznika
Each gofer now has a goroutine that polls on the FDs used to communicate with the sandbox. The respective gofer is destroyed if any of the FDs is closed. Closes #601 PiperOrigin-RevId: 261383725
2019-08-02Remove kernel.mounts.Nicolas Lacasse
We can get the mount namespace from the CreateProcessArgs in all cases where we need it. This also gets rid of kernel.Destroy method, since the only thing it was doing was DecRefing the mounts. Removing the need to call kernel.SetRootMountNamespace also allowed for some more simplifications in the container fs setup code. PiperOrigin-RevId: 261357060
2019-06-24Allow to change logging options using 'runsc debug'Fabricio Voznika
New options are: runsc debug --strace=off|all|function1,function2 runsc debug --log-level=warning|info|debug runsc debug --log-packets=true|false Updates #407 PiperOrigin-RevId: 254843128
2019-06-13Update canonical repository.Adin Scannell
This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620
2019-06-12gvisor/runsc: apply seccomp filters before parsing a state fileAndrei Vagin
PiperOrigin-RevId: 252869983
2019-06-11Add support to mount pod shared tmpfs mountsFabricio Voznika
Parse annotations containing 'gvisor.dev/spec/mount' that gives hints about how mounts are shared between containers inside a pod. This information can be used to better inform how to mount these volumes inside gVisor. For example, a volume that is shared between containers inside a pod can be bind mounted inside the sandbox, instead of being two independent mounts. For now, this information is used to allow the same tmpfs mounts to be shared between containers which wasn't possible before. PiperOrigin-RevId: 252704037
2019-06-03Refactor container FS setupFabricio Voznika
No change in functionaly. Added containerMounter object to keep state while the mounts are processed. This will help upcoming changes to share mounts per-pod. PiperOrigin-RevId: 251350096
2019-06-03Remove 'clearStatus' option from container.Wait*PID()Fabricio Voznika
clearStatus was added to allow detached execution to wait on the exec'd process and retrieve its exit status. However, it's not currently used. Both docker and gvisor-containerd-shim wait on the "shim" process and retrieve the exit status from there. We could change gvisor-containerd-shim to use waits, but it will end up also consuming a process for the wait, which is similar to having the shim process. Closes #234 PiperOrigin-RevId: 251349490
2019-05-30Add support for collecting execution trace to runsc.Bhasker Hariharan
Updates #220 PiperOrigin-RevId: 250532302
2019-05-15Cleanup around urpc file payload handlingFabricio Voznika
urpc always closes all files once the RPC function returns. PiperOrigin-RevId: 248406857 Change-Id: I400a8562452ec75c8e4bddc2154948567d572950
2019-05-03Fix runsc restore to be compatible with docker start --checkpoint ...Andrei Vagin
Change-Id: I02b30de13f1393df66edf8829fedbf32405d18f8 PiperOrigin-RevId: 246621192
2019-04-29Change copyright notice to "The gVisor Authors"Michael Pratt
Based on the guidelines at https://opensource.google.com/docs/releasing/authors/. 1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./' 2. Manual fixup of "Google Inc" references. 3. Add AUTHORS file. Authors may request to be added to this file. 4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS. Fixes #209 PiperOrigin-RevId: 245823212 Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
2019-04-29Allow and document bug ids in gVisor codebase.Nicolas Lacasse
PiperOrigin-RevId: 245818639 Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
2019-04-02Remove obsolete TODO.Kevin Krakauer
PiperOrigin-RevId: 241637164 Change-Id: I65476a739cf38f1818dc47f6ce60638dec8b77a8
2019-04-02Change bug number for duplicate bug.Kevin Krakauer
PiperOrigin-RevId: 241567897 Change-Id: I580eac04f52bb15f4aab7df9822c4aa92e743021
2019-03-14Decouple filemem from platform and move it to pgalloc.MemoryFile.Jamie Liu
This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-03-11Add profiling commands to runscFabricio Voznika
Example: runsc debug --root=<dir> \ --profile-heap=/tmp/heap.prof \ --profile-cpu=/tmp/cpu.prod --profile-delay=30 \ <container ID> PiperOrigin-RevId: 237848456 Change-Id: Icff3f20c1b157a84d0922599eaea327320dad773
2019-01-29runsc: reap a sandbox process only in sandbox.Wait()Andrei Vagin
PiperOrigin-RevId: 231504064 Change-Id: I585b769aef04a3ad7e7936027958910a6eed9c8d
2019-01-18Scrub runsc error messagesFabricio Voznika
Removed "error" and "failed to" prefix that don't add value from messages. Adjusted a few other messages. In particular, when the container fail to start, the message returned is easier for humans to read: $ docker run --rm --runtime=runsc alpine foobar docker: Error response from daemon: OCI runtime start failed: <path> did not terminate sucessfully: starting container: starting root container [foobar]: starting sandbox: searching for executable "foobar", cwd: "/", $PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin": no such file or directory Closes #77 PiperOrigin-RevId: 230022798 Change-Id: I83339017c70dae09e4f9f8e0ea2e554c4d5d5cd1
2018-12-28Simplify synchronization between runsc and sandbox processFabricio Voznika
Make 'runsc create' join cgroup before creating sandbox process. This removes the need to synchronize platform creation and ensure that sandbox process is charged to the right cgroup from the start. PiperOrigin-RevId: 227166451 Change-Id: Ieb4b18e6ca0daf7b331dc897699ca419bc5ee3a2
2018-12-07sentry: turn "dynamically-created" procfs files into static creation.Zhaozhong Ni
PiperOrigin-RevId: 224600982 Change-Id: I547253528e24fb0bb318fc9d2632cb80504acb34
2018-11-09Close donated files if containerManager.Start() failsFabricio Voznika
PiperOrigin-RevId: 220869535 Change-Id: I9917e5daf02499f7aab6e2aa4051c54ff4461b9a
2018-11-07Add more logging to controller.goFabricio Voznika
PiperOrigin-RevId: 220519632 Change-Id: Iaeec007fc1aa3f0b72569b288826d45f2534c4bf
2018-11-05Fix race between start and destroyFabricio Voznika
Before this change, a container starting up could race with destroy (aka delete) and leave processes behind. Now, whenever a container is created, Loader.processes gets a new entry. Start now expects the entry to be there, and if it's not it means that the container was deleted. I've also fixed Loader.waitPID to search for the process using the init process's PID namespace. We could use a few more tests for signal and wait. I'll send them in another cl. PiperOrigin-RevId: 220224290 Change-Id: I15146079f69904dc07d43c3b66cc343a2dab4cc4
2018-11-05Log when external signal is receivedFabricio Voznika
PiperOrigin-RevId: 220204591 Change-Id: I21a9c6f5c12a376d18da5d10c1871837c4f49ad2
2018-10-19Use correct company name in copyright headerIan Gudger
PiperOrigin-RevId: 217951017 Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
2018-10-17runsc: Support job control signals for the root container.Nicolas Lacasse
Now containers run with "docker run -it" support control characters like ^C and ^Z. This required refactoring our signal handling a bit. Signals delivered to the "runsc boot" process are turned into loader.Signal calls with the appropriate delivery mode. Previously they were always sent directly to PID 1. PiperOrigin-RevId: 217566770 Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
2018-10-01runsc: Support job control signals in "exec -it".Nicolas Lacasse
Terminal support in runsc relies on host tty file descriptors that are imported into the sandbox. Application tty ioctls are sent directly to the host fd. However, those host tty ioctls are associated in the host kernel with a host process (in this case runsc), and the host kernel intercepts job control characters like ^C and send signals to the host process. Thus, typing ^C into a "runsc exec" shell will send a SIGINT to the runsc process. This change makes "runsc exec" handle all signals, and forward them into the sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is associated with a particular container process in the sandbox, the signal must be associated with the same container process. One big difficulty is that the signal should not necessarily be sent to the sandbox process started by "exec", but instead must be sent to the foreground process group for the tty. For example, we may exec "bash", and from bash call "sleep 100". A ^C at this point should SIGINT sleep, not bash. To handle this, tty files inside the sandbox must keep track of their foreground process group, which is set/get via ioctls. When an incoming ContainerSignal urpc comes in, we look up the foreground process group via the tty file. Unfortunately, this means we have to expose and cache the tty file in the Loader. Note that "runsc exec" now handles signals properly, but "runs run" does not. That will come in a later CL, as this one is complex enough already. Example: root@:/usr/local/apache2# sleep 100 ^C root@:/usr/local/apache2# sleep 100 ^Z [1]+ Stopped sleep 100 root@:/usr/local/apache2# fg sleep 100 ^C root@:/usr/local/apache2# PiperOrigin-RevId: 215334554 Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39