diff options
Diffstat (limited to 'g3doc')
-rw-r--r-- | g3doc/proposals/runtime_dedicate_os_thread.md | 188 | ||||
-rw-r--r-- | g3doc/user_guide/containerd/configuration.md | 87 | ||||
-rw-r--r-- | g3doc/user_guide/containerd/quick_start.md | 4 | ||||
-rw-r--r-- | g3doc/user_guide/install.md | 19 |
4 files changed, 269 insertions, 29 deletions
diff --git a/g3doc/proposals/runtime_dedicate_os_thread.md b/g3doc/proposals/runtime_dedicate_os_thread.md new file mode 100644 index 000000000..dc70055b0 --- /dev/null +++ b/g3doc/proposals/runtime_dedicate_os_thread.md @@ -0,0 +1,188 @@ +# `runtime.DedicateOSThread` + +Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that +this may be difficult to support in the Go runtime due to issues with GC. + +## Summary + +Allow goroutines to bind to kernel threads in a way that allows their scheduling +to be kernel-managed rather than runtime-managed. + +## Objectives + +* Reduce Go runtime overhead in the gVisor sentry (#2184). + +* Minimize intrusiveness of changes to the Go runtime. + +## Background + +In Go, execution contexts are referred to as goroutines, which the runtime calls +Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the +runtime) on which Gs are executed, as well as a pool of "virtual processors" +(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually, +each M requires a P in order to execute Gs, limiting the number of concurrently +executing goroutines to `runtime.GOMAXPROCS()`. + +The `runtime.LockOSThread` function temporarily locks the invoking goroutine to +its current thread. It is primarily useful for interacting with OS or non-Go +library facilities that are per-thread. It does not reduce interactions with the +Go runtime scheduler: locked Ms relinquish their P when they become blocked, and +only continue execution after another M "chooses" their locked G to run and +donates their P to the locked M instead. + +## Problems + +### Context Switch Overhead + +Most goroutines in the gVisor sentry are task goroutines, which back application +threads. Task goroutines spend large amounts of time blocked on syscalls that +execute untrusted application code. When invoking said syscall (which varies by +gVisor platform), the task goroutine may interact with the Go runtime in one of +three ways: + +* It can invoke the syscall without informing the runtime. In this case, the + task goroutine will continue to hold its P during the syscall, limiting the + number of application threads that can run concurrently to + `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler + is known to scale poorly with `GOMAXPROCS`; see #1942 and + https://github.com/golang/go/issues/28808. It also means that preemption of + application threads must be driven by sentry or runtime code, which is + strictly slower than kernel-driven preemption (since the sentry must invoke + another syscall to preempt the application thread). + +* It can call `runtime.entersyscallblock` before invoking the syscall, and + `runtime.exitsyscall` after the syscall returns. In this case, the task + goroutine will release its P while the syscall is executing. This allows the + number of threads concurrently executing application code to exceed + `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to + hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)` + syscall) and on syscall exit (to acquire a new P). It also drastically + increases the number of threads that concurrently interact with the runtime + scheduler, which is also problematic for performance (both in terms of CPU + utilization and in terms of context switch latency); see #205. + +- It can call `runtime.entersyscall` before invoking the syscall, and + `runtime.exitsyscall` after the syscall returns. In this case, the task + goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to + steal it on behalf of another M after a 20us delay. This mitigates the + context switch latency problem when there are few task goroutines and the + interval between switches to application code (i.e. the interval between + application syscalls, page faults, or signal delivery) is short. (Cynically, + this means that it's most effective in microbenchmarks). However, the delay + before a P is stolen can also be problematic for performance when there are + both many task goroutines switching to application code (lazily releasing + their Ps) *and* many task goroutines switching to sentry code (contending + for Ps), which is likely in larger heterogeneous workloads. + +### Blocking Overhead + +Task goroutines block on behalf of application syscalls like `futex` and +`epoll_wait` by receiving from a Go channel. (Future work may convert task +goroutine blocking to use the `syncevent` package to avoid overhead associated +with channels and `select`, but this does not change how blocking interacts with +the Go runtime scheduler.) + +If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then +when the task goroutine is unblocked (by e.g. an application `FUTEX_WAKE`, +signal delivery, or a timeout) by sending to the blocked channel, +`runtime.ready` migrates the unblocked G to the unblocking P. In most cases, +this implies that every application thread block/unblock cycle results in a +migration of the thread between Ps, and therefore Ms, and therefore cores, +resulting in reduced application performance due to loss of CPU caches. +Furthermore, in most cases, the unblocking P cannot immediately switch to the +unblocked G (instead resuming execution of its current application thread after +completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often +requiring that another P steal the unblocked G before it can resume execution. + +If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the +G will remain locked to its M, avoiding the core migration described above; +however, wakeup latency is significantly increased since, as described in +"Background", the G still needs to be selected by the scheduler before it can +run, and the M that selects the G then needs to transfer its P to the locked M, +incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling. + +## Proposal + +We propose to add a function, tentatively called `DedicateOSThread`, to the Go +`runtime` package, documented as follows: + +```go +// DedicateOSThread wires the calling goroutine to its current operating system +// thread, and exempts it from counting against GOMAXPROCS. The calling +// goroutine will always execute in that thread, and no other goroutine will +// execute in it, until the calling goroutine has made as many calls to +// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits +// without unlocking the thread, the thread will be terminated. +// +// DedicateOSThread should only be used by long-lived goroutines that usually +// block due to blocking system calls, rather than interaction with other +// goroutines. +func DedicateOSThread() +``` + +Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the +invoking G to a M), but additionally locks the invoking M to a P. Ps locked by +`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual +number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number +of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`). +Corollaries: + +* If `runtime.ready` observes that a readied G is locked to a M locked to a P, + it immediately wakes the locked M without migrating the G to the readying P + or waiting for a future call to `runtime.schedule` to select the readied G + in `runtime.findrunnable`. + +* `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of + locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and + `runtime.exitsyscall` skip re-acquisition of Ps if one is locked. + +* sysmon does not attempt to preempt Gs that are locked to Ps, avoiding + fruitless overhead from `tgkill` syscalls and signal delivery. + +* `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that + unlocked Ps be tracked in a separate array). `runtime.findrunnable` on + locked Ps skip the global run queue, work stealing, and possibly netpoll. + +* New goroutines created by goroutines with locked Ps are enqueued on the + global run queue rather than the invoking P's local run queue. + +While gVisor's use case does not strictly require that the association is +reversible (with `runtime.UndedicateOSThread`), such a feature is required to +allow reuse of locked Ms, which is likely to be critical for performance. + +## Alternatives Considered + +* Make the runtime scale well with `GOMAXPROCS`. While we are also + concurrently investigating this problem, this would not address the issues + of increased preemption cost or blocking overhead. + +* Make the runtime scale well with number of Ms. It is unclear if this is + actually feasible, and would not address blocking overhead. + +* Make P-locking part of `LockOSThread`'s behavior. This would likely + introduce performance regressions in existing uses of `LockOSThread` that do + not fit this usage pattern. In particular, since `DedicateOSThread` + transitions the invoker's P from "counted against `GOMAXPROCS`" to "not + counted against `GOMAXPROCS`", it may need to wake another M to run a new P + (that is counted against `GOMAXPROCS`), and the converse applies to + `UndedicateOSThread`. + +* Rewrite the gVisor sentry in a language that does not force userspace + scheduling. This is a last resort due to the amount of code involved. + +## Related Issues + +The proposed functionality is directly analogous to `spawn_blocking` in Rust +async runtimes +[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html) +and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html). + +Outside of gVisor: + +* https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a + use case for this feature in go-delve, where the goroutine that would use + this feature spends much of its time blocked in `ptrace` syscalls. + +* This feature may improve performance in the use case described in + https://github.com/golang/go/issues/18237, given the prominence of + syscall.Syscall in the profile given in that bug report. diff --git a/g3doc/user_guide/containerd/configuration.md b/g3doc/user_guide/containerd/configuration.md index 5d485c24b..bb65aa514 100644 --- a/g3doc/user_guide/containerd/configuration.md +++ b/g3doc/user_guide/containerd/configuration.md @@ -4,41 +4,56 @@ This document describes how to configure runtime options for `containerd-shim-runsc-v1`. This follows the [Containerd Quick Start](./quick_start.md) and requires containerd 1.2 or later. -### Update `/etc/containerd/config.toml` to point to a configuration file for `containerd-shim-runsc-v1`. +## Shim Configuration -`containerd-shim-runsc-v1` supports a few different configuration options based -on the version of containerd that is used. For versions >= 1.3, it supports a -configurable `ConfigPath` in the containerd runtime configuration. +The shim can be provided with a configuration file containing options to the +shim itself as well as a set of flags to runsc. Here is a quick example: + +```shell +cat <<EOF | sudo tee /etc/containerd/runsc.toml +option = "value" +[runsc_config] + flag = "value" +``` + +The set of options that can be configured can be found in +[options.go](https://github.com/google/gvisor/blob/master/pkg/shim/v2/options.go). +Values under `[runsc_config]` can be used to set arbitrary flags to runsc. +`flag = "value"` is converted to `--flag="value"` when runsc is invoked. Run +`runsc flags` so see which flags are available + +Next, containerd needs to be configured to send the configuration file to the +shim. + +### Containerd 1.3+ + +Starting in 1.3, containerd supports a configurable `ConfigPath` in the runtime +configuration. Here is an example: ```shell cat <<EOF | sudo tee /etc/containerd/config.toml disabled_plugins = ["restart"] -[plugins.linux] - shim_debug = true [plugins.cri.containerd.runtimes.runsc] runtime_type = "io.containerd.runsc.v1" [plugins.cri.containerd.runtimes.runsc.options] TypeUrl = "io.containerd.runsc.v1.options" - # containerd 1.3 only! ConfigPath = "/etc/containerd/runsc.toml" EOF ``` -When you are done restart containerd to pick up the new configuration files. +When you are done, restart containerd to pick up the changes. ```shell sudo systemctl restart containerd ``` -### Configure `/etc/containerd/runsc.toml` +### Containerd 1.2 -> Note: For containerd 1.2, the config file should named `config.toml` and -> located in the runtime root. By default, this is `/run/containerd/runsc`. +For containerd 1.2, the config file is not configurable. It should be named +`config.toml` and located in the runtime root. By default, this is +`/run/containerd/runsc`. -The set of options that can be configured can be found in -[options.go](https://github.com/google/gvisor/blob/master/pkg/shim/v2/options/options.go). - -#### Example: Enable the KVM platform +### Example: Enable the KVM platform gVisor enables the use of a number of platforms. This example shows how to configure `containerd-shim-runsc-v1` to use gvisor with the KVM platform. @@ -49,11 +64,42 @@ Find out more about platform in the ```shell cat <<EOF | sudo tee /etc/containerd/runsc.toml [runsc_config] -platform = "kvm" + platform = "kvm" +EOF +``` + +## Debug + +When `shim_debug` is enabled in `/etc/containerd/config.toml`, containerd will +forward shim logs to its own log. You can additionally set `level = "debug"` to +enable debug logs. To see the logs run `sudo journalctl -u containerd`. Here is +a containerd configuration file that enables both options: + +```shell +cat <<EOF | sudo tee /etc/containerd/config.toml +disabled_plugins = ["restart"] +[debug] + level = "debug" +[plugins.linux] + shim_debug = true +[plugins.cri.containerd.runtimes.runsc] + runtime_type = "io.containerd.runsc.v1" +[plugins.cri.containerd.runtimes.runsc.options] + TypeUrl = "io.containerd.runsc.v1.options" + ConfigPath = "/etc/containerd/runsc.toml" EOF ``` -### Example: Enable gVisor debug logging +It can be hard to separate containerd messages from the shim's though. To create +a log file dedicated to the shim, you can set the `log_path` and `log_level` +values in the shim configuration file: + +- `log_path` is the directory where the shim logs will be created. `%ID%` is + the path is replaced with the container ID. +- `log_level` sets the logs level. It is normally set to "debug" as there is + not much interesting happening with other log levels. + +### Example: Enable shim and gVisor debug logging gVisor debug logging can be enabled by setting the `debug` and `debug-log` flag. The shim will replace "%ID%" with the container ID, and "%COMMAND%" with the @@ -63,8 +109,9 @@ Find out more about debugging in the [debugging guide](../debugging.md). ```shell cat <<EOF | sudo tee /etc/containerd/runsc.toml +log_path = "/var/log/runsc/%ID%/shim.log" +log_level = "debug" [runsc_config] - debug=true - debug-log=/var/log/%ID%/gvisor.%COMMAND%.log -EOF + debug = "true" + debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log" ``` diff --git a/g3doc/user_guide/containerd/quick_start.md b/g3doc/user_guide/containerd/quick_start.md index b6a3186d8..a98fe5c4a 100644 --- a/g3doc/user_guide/containerd/quick_start.md +++ b/g3doc/user_guide/containerd/quick_start.md @@ -1,7 +1,7 @@ # Containerd Quick Start -This document describes how to install and configure `containerd-shim-runsc-v1` -using the containerd runtime handler support on `containerd` 1.2 or later. +This document describes how to use `containerd-shim-runsc-v1` with the +containerd runtime handler support on `containerd` 1.2 or later. > ⚠️ NOTE: If you are using Kubernetes and set up your cluster using kubeadm you > may run into issues. See the [FAQ](../FAQ.md#runtime-handler) for details. diff --git a/g3doc/user_guide/install.md b/g3doc/user_guide/install.md index abb9e8582..c3ced9d61 100644 --- a/g3doc/user_guide/install.md +++ b/g3doc/user_guide/install.md @@ -13,15 +13,19 @@ To download and install the latest release manually follow these steps: ( set -e URL=https://storage.googleapis.com/gvisor/releases/release/latest - wget ${URL}/runsc ${URL}/runsc.sha512 - sha512sum -c runsc.sha512 - rm -f runsc.sha512 - sudo mv runsc /usr/local/bin - sudo chmod a+rx /usr/local/bin/runsc + wget ${URL}/runsc ${URL}/runsc.sha512 \ + ${URL}/gvisor-containerd-shim ${URL}/gvisor-containerd-shim.sha512 \ + ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512 + sha512sum -c runsc.sha512 \ + -c gvisor-containerd-shim.sha512 \ + -c containerd-shim-runsc-v1.sha512 + rm -f *.sha512 + chmod a+rx runsc gvisor-containerd-shim containerd-shim-runsc-v1 + sudo mv runsc gvisor-containerd-shim containerd-shim-runsc-v1 /usr/local/bin ) ``` -To install gVisor with Docker, run the following commands: +To install gVisor as a Docker runtime, run the following commands: ```bash /usr/local/bin/runsc install @@ -165,5 +169,6 @@ You can use this link with the steps described in Note that `apt` installation of a specific point release is not supported. After installation, try out `runsc` by following the -[Docker Quick Start](./quick_start/docker.md) or +[Docker Quick Start](./quick_start/docker.md), +[Containerd QuickStart](./containerd/quick_start.md), or [OCI Quick Start](./quick_start/oci.md). |