summaryrefslogtreecommitdiffhomepage
path: root/g3doc
diff options
context:
space:
mode:
Diffstat (limited to 'g3doc')
-rw-r--r--g3doc/proposals/runtime_dedicate_os_thread.md188
-rw-r--r--g3doc/user_guide/containerd/configuration.md87
-rw-r--r--g3doc/user_guide/containerd/quick_start.md4
-rw-r--r--g3doc/user_guide/install.md19
4 files changed, 269 insertions, 29 deletions
diff --git a/g3doc/proposals/runtime_dedicate_os_thread.md b/g3doc/proposals/runtime_dedicate_os_thread.md
new file mode 100644
index 000000000..dc70055b0
--- /dev/null
+++ b/g3doc/proposals/runtime_dedicate_os_thread.md
@@ -0,0 +1,188 @@
+# `runtime.DedicateOSThread`
+
+Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
+this may be difficult to support in the Go runtime due to issues with GC.
+
+## Summary
+
+Allow goroutines to bind to kernel threads in a way that allows their scheduling
+to be kernel-managed rather than runtime-managed.
+
+## Objectives
+
+* Reduce Go runtime overhead in the gVisor sentry (#2184).
+
+* Minimize intrusiveness of changes to the Go runtime.
+
+## Background
+
+In Go, execution contexts are referred to as goroutines, which the runtime calls
+Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the
+runtime) on which Gs are executed, as well as a pool of "virtual processors"
+(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually,
+each M requires a P in order to execute Gs, limiting the number of concurrently
+executing goroutines to `runtime.GOMAXPROCS()`.
+
+The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
+its current thread. It is primarily useful for interacting with OS or non-Go
+library facilities that are per-thread. It does not reduce interactions with the
+Go runtime scheduler: locked Ms relinquish their P when they become blocked, and
+only continue execution after another M "chooses" their locked G to run and
+donates their P to the locked M instead.
+
+## Problems
+
+### Context Switch Overhead
+
+Most goroutines in the gVisor sentry are task goroutines, which back application
+threads. Task goroutines spend large amounts of time blocked on syscalls that
+execute untrusted application code. When invoking said syscall (which varies by
+gVisor platform), the task goroutine may interact with the Go runtime in one of
+three ways:
+
+* It can invoke the syscall without informing the runtime. In this case, the
+ task goroutine will continue to hold its P during the syscall, limiting the
+ number of application threads that can run concurrently to
+ `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler
+ is known to scale poorly with `GOMAXPROCS`; see #1942 and
+ https://github.com/golang/go/issues/28808. It also means that preemption of
+ application threads must be driven by sentry or runtime code, which is
+ strictly slower than kernel-driven preemption (since the sentry must invoke
+ another syscall to preempt the application thread).
+
+* It can call `runtime.entersyscallblock` before invoking the syscall, and
+ `runtime.exitsyscall` after the syscall returns. In this case, the task
+ goroutine will release its P while the syscall is executing. This allows the
+ number of threads concurrently executing application code to exceed
+ `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
+ hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)`
+ syscall) and on syscall exit (to acquire a new P). It also drastically
+ increases the number of threads that concurrently interact with the runtime
+ scheduler, which is also problematic for performance (both in terms of CPU
+ utilization and in terms of context switch latency); see #205.
+
+- It can call `runtime.entersyscall` before invoking the syscall, and
+ `runtime.exitsyscall` after the syscall returns. In this case, the task
+ goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to
+ steal it on behalf of another M after a 20us delay. This mitigates the
+ context switch latency problem when there are few task goroutines and the
+ interval between switches to application code (i.e. the interval between
+ application syscalls, page faults, or signal delivery) is short. (Cynically,
+ this means that it's most effective in microbenchmarks). However, the delay
+ before a P is stolen can also be problematic for performance when there are
+ both many task goroutines switching to application code (lazily releasing
+ their Ps) *and* many task goroutines switching to sentry code (contending
+ for Ps), which is likely in larger heterogeneous workloads.
+
+### Blocking Overhead
+
+Task goroutines block on behalf of application syscalls like `futex` and
+`epoll_wait` by receiving from a Go channel. (Future work may convert task
+goroutine blocking to use the `syncevent` package to avoid overhead associated
+with channels and `select`, but this does not change how blocking interacts with
+the Go runtime scheduler.)
+
+If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
+when the task goroutine is unblocked (by e.g. an application `FUTEX_WAKE`,
+signal delivery, or a timeout) by sending to the blocked channel,
+`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
+this implies that every application thread block/unblock cycle results in a
+migration of the thread between Ps, and therefore Ms, and therefore cores,
+resulting in reduced application performance due to loss of CPU caches.
+Furthermore, in most cases, the unblocking P cannot immediately switch to the
+unblocked G (instead resuming execution of its current application thread after
+completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often
+requiring that another P steal the unblocked G before it can resume execution.
+
+If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
+G will remain locked to its M, avoiding the core migration described above;
+however, wakeup latency is significantly increased since, as described in
+"Background", the G still needs to be selected by the scheduler before it can
+run, and the M that selects the G then needs to transfer its P to the locked M,
+incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.
+
+## Proposal
+
+We propose to add a function, tentatively called `DedicateOSThread`, to the Go
+`runtime` package, documented as follows:
+
+```go
+// DedicateOSThread wires the calling goroutine to its current operating system
+// thread, and exempts it from counting against GOMAXPROCS. The calling
+// goroutine will always execute in that thread, and no other goroutine will
+// execute in it, until the calling goroutine has made as many calls to
+// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
+// without unlocking the thread, the thread will be terminated.
+//
+// DedicateOSThread should only be used by long-lived goroutines that usually
+// block due to blocking system calls, rather than interaction with other
+// goroutines.
+func DedicateOSThread()
+```
+
+Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
+invoking G to a M), but additionally locks the invoking M to a P. Ps locked by
+`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
+number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
+of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
+Corollaries:
+
+* If `runtime.ready` observes that a readied G is locked to a M locked to a P,
+ it immediately wakes the locked M without migrating the G to the readying P
+ or waiting for a future call to `runtime.schedule` to select the readied G
+ in `runtime.findrunnable`.
+
+* `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
+ locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
+ `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.
+
+* sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
+ fruitless overhead from `tgkill` syscalls and signal delivery.
+
+* `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
+ unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
+ locked Ps skip the global run queue, work stealing, and possibly netpoll.
+
+* New goroutines created by goroutines with locked Ps are enqueued on the
+ global run queue rather than the invoking P's local run queue.
+
+While gVisor's use case does not strictly require that the association is
+reversible (with `runtime.UndedicateOSThread`), such a feature is required to
+allow reuse of locked Ms, which is likely to be critical for performance.
+
+## Alternatives Considered
+
+* Make the runtime scale well with `GOMAXPROCS`. While we are also
+ concurrently investigating this problem, this would not address the issues
+ of increased preemption cost or blocking overhead.
+
+* Make the runtime scale well with number of Ms. It is unclear if this is
+ actually feasible, and would not address blocking overhead.
+
+* Make P-locking part of `LockOSThread`'s behavior. This would likely
+ introduce performance regressions in existing uses of `LockOSThread` that do
+ not fit this usage pattern. In particular, since `DedicateOSThread`
+ transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
+ counted against `GOMAXPROCS`", it may need to wake another M to run a new P
+ (that is counted against `GOMAXPROCS`), and the converse applies to
+ `UndedicateOSThread`.
+
+* Rewrite the gVisor sentry in a language that does not force userspace
+ scheduling. This is a last resort due to the amount of code involved.
+
+## Related Issues
+
+The proposed functionality is directly analogous to `spawn_blocking` in Rust
+async runtimes
+[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
+and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).
+
+Outside of gVisor:
+
+* https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a
+ use case for this feature in go-delve, where the goroutine that would use
+ this feature spends much of its time blocked in `ptrace` syscalls.
+
+* This feature may improve performance in the use case described in
+ https://github.com/golang/go/issues/18237, given the prominence of
+ syscall.Syscall in the profile given in that bug report.
diff --git a/g3doc/user_guide/containerd/configuration.md b/g3doc/user_guide/containerd/configuration.md
index 5d485c24b..bb65aa514 100644
--- a/g3doc/user_guide/containerd/configuration.md
+++ b/g3doc/user_guide/containerd/configuration.md
@@ -4,41 +4,56 @@ This document describes how to configure runtime options for
`containerd-shim-runsc-v1`. This follows the
[Containerd Quick Start](./quick_start.md) and requires containerd 1.2 or later.
-### Update `/etc/containerd/config.toml` to point to a configuration file for `containerd-shim-runsc-v1`.
+## Shim Configuration
-`containerd-shim-runsc-v1` supports a few different configuration options based
-on the version of containerd that is used. For versions >= 1.3, it supports a
-configurable `ConfigPath` in the containerd runtime configuration.
+The shim can be provided with a configuration file containing options to the
+shim itself as well as a set of flags to runsc. Here is a quick example:
+
+```shell
+cat <<EOF | sudo tee /etc/containerd/runsc.toml
+option = "value"
+[runsc_config]
+ flag = "value"
+```
+
+The set of options that can be configured can be found in
+[options.go](https://github.com/google/gvisor/blob/master/pkg/shim/v2/options.go).
+Values under `[runsc_config]` can be used to set arbitrary flags to runsc.
+`flag = "value"` is converted to `--flag="value"` when runsc is invoked. Run
+`runsc flags` so see which flags are available
+
+Next, containerd needs to be configured to send the configuration file to the
+shim.
+
+### Containerd 1.3+
+
+Starting in 1.3, containerd supports a configurable `ConfigPath` in the runtime
+configuration. Here is an example:
```shell
cat <<EOF | sudo tee /etc/containerd/config.toml
disabled_plugins = ["restart"]
-[plugins.linux]
- shim_debug = true
[plugins.cri.containerd.runtimes.runsc]
runtime_type = "io.containerd.runsc.v1"
[plugins.cri.containerd.runtimes.runsc.options]
TypeUrl = "io.containerd.runsc.v1.options"
- # containerd 1.3 only!
ConfigPath = "/etc/containerd/runsc.toml"
EOF
```
-When you are done restart containerd to pick up the new configuration files.
+When you are done, restart containerd to pick up the changes.
```shell
sudo systemctl restart containerd
```
-### Configure `/etc/containerd/runsc.toml`
+### Containerd 1.2
-> Note: For containerd 1.2, the config file should named `config.toml` and
-> located in the runtime root. By default, this is `/run/containerd/runsc`.
+For containerd 1.2, the config file is not configurable. It should be named
+`config.toml` and located in the runtime root. By default, this is
+`/run/containerd/runsc`.
-The set of options that can be configured can be found in
-[options.go](https://github.com/google/gvisor/blob/master/pkg/shim/v2/options/options.go).
-
-#### Example: Enable the KVM platform
+### Example: Enable the KVM platform
gVisor enables the use of a number of platforms. This example shows how to
configure `containerd-shim-runsc-v1` to use gvisor with the KVM platform.
@@ -49,11 +64,42 @@ Find out more about platform in the
```shell
cat <<EOF | sudo tee /etc/containerd/runsc.toml
[runsc_config]
-platform = "kvm"
+ platform = "kvm"
+EOF
+```
+
+## Debug
+
+When `shim_debug` is enabled in `/etc/containerd/config.toml`, containerd will
+forward shim logs to its own log. You can additionally set `level = "debug"` to
+enable debug logs. To see the logs run `sudo journalctl -u containerd`. Here is
+a containerd configuration file that enables both options:
+
+```shell
+cat <<EOF | sudo tee /etc/containerd/config.toml
+disabled_plugins = ["restart"]
+[debug]
+ level = "debug"
+[plugins.linux]
+ shim_debug = true
+[plugins.cri.containerd.runtimes.runsc]
+ runtime_type = "io.containerd.runsc.v1"
+[plugins.cri.containerd.runtimes.runsc.options]
+ TypeUrl = "io.containerd.runsc.v1.options"
+ ConfigPath = "/etc/containerd/runsc.toml"
EOF
```
-### Example: Enable gVisor debug logging
+It can be hard to separate containerd messages from the shim's though. To create
+a log file dedicated to the shim, you can set the `log_path` and `log_level`
+values in the shim configuration file:
+
+- `log_path` is the directory where the shim logs will be created. `%ID%` is
+ the path is replaced with the container ID.
+- `log_level` sets the logs level. It is normally set to "debug" as there is
+ not much interesting happening with other log levels.
+
+### Example: Enable shim and gVisor debug logging
gVisor debug logging can be enabled by setting the `debug` and `debug-log` flag.
The shim will replace "%ID%" with the container ID, and "%COMMAND%" with the
@@ -63,8 +109,9 @@ Find out more about debugging in the [debugging guide](../debugging.md).
```shell
cat <<EOF | sudo tee /etc/containerd/runsc.toml
+log_path = "/var/log/runsc/%ID%/shim.log"
+log_level = "debug"
[runsc_config]
- debug=true
- debug-log=/var/log/%ID%/gvisor.%COMMAND%.log
-EOF
+ debug = "true"
+ debug-log = "/var/log/runsc/%ID%/gvisor.%COMMAND%.log"
```
diff --git a/g3doc/user_guide/containerd/quick_start.md b/g3doc/user_guide/containerd/quick_start.md
index b6a3186d8..a98fe5c4a 100644
--- a/g3doc/user_guide/containerd/quick_start.md
+++ b/g3doc/user_guide/containerd/quick_start.md
@@ -1,7 +1,7 @@
# Containerd Quick Start
-This document describes how to install and configure `containerd-shim-runsc-v1`
-using the containerd runtime handler support on `containerd` 1.2 or later.
+This document describes how to use `containerd-shim-runsc-v1` with the
+containerd runtime handler support on `containerd` 1.2 or later.
> ⚠️ NOTE: If you are using Kubernetes and set up your cluster using kubeadm you
> may run into issues. See the [FAQ](../FAQ.md#runtime-handler) for details.
diff --git a/g3doc/user_guide/install.md b/g3doc/user_guide/install.md
index abb9e8582..c3ced9d61 100644
--- a/g3doc/user_guide/install.md
+++ b/g3doc/user_guide/install.md
@@ -13,15 +13,19 @@ To download and install the latest release manually follow these steps:
(
set -e
URL=https://storage.googleapis.com/gvisor/releases/release/latest
- wget ${URL}/runsc ${URL}/runsc.sha512
- sha512sum -c runsc.sha512
- rm -f runsc.sha512
- sudo mv runsc /usr/local/bin
- sudo chmod a+rx /usr/local/bin/runsc
+ wget ${URL}/runsc ${URL}/runsc.sha512 \
+ ${URL}/gvisor-containerd-shim ${URL}/gvisor-containerd-shim.sha512 \
+ ${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
+ sha512sum -c runsc.sha512 \
+ -c gvisor-containerd-shim.sha512 \
+ -c containerd-shim-runsc-v1.sha512
+ rm -f *.sha512
+ chmod a+rx runsc gvisor-containerd-shim containerd-shim-runsc-v1
+ sudo mv runsc gvisor-containerd-shim containerd-shim-runsc-v1 /usr/local/bin
)
```
-To install gVisor with Docker, run the following commands:
+To install gVisor as a Docker runtime, run the following commands:
```bash
/usr/local/bin/runsc install
@@ -165,5 +169,6 @@ You can use this link with the steps described in
Note that `apt` installation of a specific point release is not supported.
After installation, try out `runsc` by following the
-[Docker Quick Start](./quick_start/docker.md) or
+[Docker Quick Start](./quick_start/docker.md),
+[Containerd QuickStart](./containerd/quick_start.md), or
[OCI Quick Start](./quick_start/oci.md).