# `runtime.DedicateOSThread`

Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
this may be difficult to support in the Go runtime due to issues with GC.

## Summary

Allow goroutines to bind to kernel threads so that their scheduling is
kernel-managed rather than runtime-managed.

## Objectives

*   Reduce Go runtime overhead in the gVisor sentry (#2184).

*   Minimize intrusiveness of changes to the Go runtime.

## Background

In Go, execution contexts are referred to as goroutines, which the runtime calls
Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the
runtime) on which Gs are executed, as well as a pool of "virtual processors"
(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually,
each M requires a P in order to execute Gs, limiting the number of concurrently
executing goroutines to `runtime.GOMAXPROCS()`.
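
For reference, `runtime.GOMAXPROCS` is the public knob controlling the size of
the P pool; calling it with an argument less than 1 queries the current value
without changing it:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) reports the current limit without changing it; by
	// default it equals the number of logical CPUs.
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))

	// Setting GOMAXPROCS caps the number of Ps, and therefore the number
	// of goroutines executing Go code at any instant. The previous value
	// is returned.
	prev := runtime.GOMAXPROCS(2)
	fmt.Println("previous:", prev)
}
```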

The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
its current thread. It is primarily useful for interacting with OS or non-Go
library facilities that are per-thread. It does not reduce interactions with the
Go runtime scheduler: locked Ms relinquish their P when they become blocked, and
only continue execution after another M "chooses" their locked G to run and
donates their P to the locked M instead.
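
For illustration, a minimal sketch of the existing `LockOSThread` pattern (the
per-thread work here is only a placeholder):

```go
package main

import "runtime"

// lockAndRun pins the calling goroutine to its current OS thread for the
// duration of fn. This is useful when fn depends on per-thread OS state, such
// as a thread-local signal mask or a namespace entered via setns. Note that
// this only binds the G to an M; the M still needs a P to execute Go code.
func lockAndRun(fn func()) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()
	fn()
}

func main() {
	lockAndRun(func() {
		// Per-thread work goes here.
	})
}
```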

## Problems

### Context Switch Overhead

Most goroutines in the gVisor sentry are task goroutines, which back application
threads. Task goroutines spend large amounts of time blocked on syscalls that
execute untrusted application code. When invoking said syscall (which varies by
gVisor platform), the task goroutine may interact with the Go runtime in one of
three ways (the first and third are sketched just after this list):

*   It can invoke the syscall without informing the runtime. In this case, the
    task goroutine will continue to hold its P during the syscall, limiting the
    number of application threads that can run concurrently to
    `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler
    is known to scale poorly with `GOMAXPROCS`; see #1942 and
    https://github.com/golang/go/issues/28808. It also means that preemption of
    application threads must be driven by sentry or runtime code, which is
    strictly slower than kernel-driven preemption (since the sentry must invoke
    another syscall to preempt the application thread).

*   It can call `runtime.entersyscallblock` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine will release its P while the syscall is executing. This allows the
    number of threads concurrently executing application code to exceed
    `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
    hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)`
    syscall) and on syscall exit (to acquire a new P). It also drastically
    increases the number of threads that concurrently interact with the runtime
    scheduler, which is also problematic for performance (both in terms of CPU
    utilization and in terms of context switch latency); see #205.

*   It can call `runtime.entersyscall` before invoking the syscall, and
    `runtime.exitsyscall` after the syscall returns. In this case, the task
    goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to
    steal it on behalf of another M after a 20us delay. This mitigates the
    context switch latency problem when there are few task goroutines and the
    interval between switches to application code (i.e. the interval between
    application syscalls, page faults, or signal delivery) is short. (Cynically,
    this means that it's most effective in microbenchmarks.) However, the delay
    before a P is stolen can also be problematic for performance when there are
    both many task goroutines switching to application code (lazily releasing
    their Ps) *and* many task goroutines switching to sentry code (contending
    for Ps), which is likely in larger heterogeneous workloads.
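
As a rough point of reference, the first and third modes correspond to the
standard library's `syscall.RawSyscall` and `syscall.Syscall` wrappers
respectively; the second (`runtime.entersyscallblock`) has no exported
equivalent. A minimal sketch, using `getpid` purely as a placeholder syscall:

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// First mode: RawSyscall does not inform the runtime, so the calling
	// M keeps its P for the duration of the syscall.
	pid1, _, _ := syscall.RawSyscall(syscall.SYS_GETPID, 0, 0, 0)

	// Third mode: Syscall brackets the raw syscall with
	// runtime.entersyscall/exitsyscall, "lazily releasing" the P so that
	// sysmon can steal it if the syscall blocks for more than ~20us.
	pid2, _, _ := syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)

	// Second mode (entersyscallblock) has no exported wrapper in the
	// syscall package; it is only reachable from within the runtime.
	fmt.Println(pid1, pid2)
}
```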

### Blocking Overhead

Task goroutines block on behalf of application syscalls like `futex` and
`epoll_wait` by receiving from a Go channel. (Future work may convert task
goroutine blocking to use the `syncevent` package to avoid overhead associated
with channels and `select`, but this does not change how blocking interacts with
the Go runtime scheduler.)
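
For illustration, a simplified sketch of this blocking pattern; the `waiter`
type and its methods below are hypothetical stand-ins, not gVisor's actual
implementation:

```go
package sketch

import "time"

// waiter is a hypothetical per-task wakeup primitive built on a Go channel,
// standing in for the mechanism a task goroutine uses to block on behalf of
// an application futex, epoll_wait, etc.
type waiter struct {
	wakeup chan struct{} // buffered with capacity 1
}

func newWaiter() *waiter {
	return &waiter{wakeup: make(chan struct{}, 1)}
}

// block parks the task goroutine until wake is called or the timeout expires.
// The channel receive is what hands the G back to the Go runtime scheduler,
// with the scheduling consequences described below.
func (w *waiter) block(timeout time.Duration) bool {
	t := time.NewTimer(timeout)
	defer t.Stop()
	select {
	case <-w.wakeup:
		return true // woken, e.g. by an application FUTEX_WAKE
	case <-t.C:
		return false // timed out
	}
}

// wake unblocks the task goroutine, typically called while handling another
// task's futex(FUTEX_WAKE), tgkill, etc. syscall.
func (w *waiter) wake() {
	select {
	case w.wakeup <- struct{}{}:
	default: // a wakeup is already pending
	}
}
```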

If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
when the task goroutine is unblocked (e.g. by an application `FUTEX_WAKE`,
signal delivery, or a timeout) via a send to the channel it is blocked on,
`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
this implies that every application thread block/unblock cycle results in a
migration of the thread between Ps, and therefore Ms, and therefore cores,
resulting in reduced application performance due to loss of CPU caches.
Furthermore, in most cases, the unblocking P cannot immediately switch to the
unblocked G (instead resuming execution of its current application thread after
completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often
requiring that another P steal the unblocked G before it can resume execution.

If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
G will remain locked to its M, avoiding the core migration described above;
however, wakeup latency is significantly increased since, as described in
"Background", the G still needs to be selected by the scheduler before it can
run, and the M that selects the G then needs to transfer its P to the locked M,
incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.

## Proposal

We propose to add a function, tentatively called `DedicateOSThread`, to the Go
`runtime` package, documented as follows:

```go
// DedicateOSThread wires the calling goroutine to its current operating system
// thread, and exempts it from counting against GOMAXPROCS. The calling
// goroutine will always execute in that thread, and no other goroutine will
// execute in it, until the calling goroutine has made as many calls to
// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
// without unlocking the thread, the thread will be terminated.
//
// DedicateOSThread should only be used by long-lived goroutines that usually
// block due to blocking system calls, rather than interaction with other
// goroutines.
func DedicateOSThread()
```
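
For context, a hypothetical usage sketch from a sentry task goroutine follows;
the `task` and `platform` types are illustrative only and are not gVisor's
actual API:

```go
package sketch

import "runtime"

// platform abstracts the blocking host syscall that switches to untrusted
// application code (hypothetical; gVisor's real interface differs).
type platform interface {
	SwitchToApp(t *task) (event int)
}

// task backs a single application thread.
type task struct {
	p platform
}

// run is the long-lived loop of a task goroutine. It blocks almost exclusively
// in host syscalls rather than on other goroutines, which is exactly the
// profile the proposed DedicateOSThread targets.
func (t *task) run() {
	runtime.DedicateOSThread() // proposed API; does not exist in Go today
	defer runtime.UndedicateOSThread()

	for {
		ev := t.p.SwitchToApp(t) // blocking switch to application code
		if t.handle(ev) {        // sentry-side syscall/fault handling
			return
		}
	}
}

func (t *task) handle(ev int) (done bool) {
	// Real syscall, fault, and signal handling would go here.
	return ev < 0
}
```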

Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
invoking G to an M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
Corollaries:

*   If `runtime.ready` observes that a readied G is locked to an M locked to a
    P,
    it immediately wakes the locked M without migrating the G to the readying P
    or waiting for a future call to `runtime.schedule` to select the readied G
    in `runtime.findrunnable`.

*   `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
    locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
    `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.

*   sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
    fruitless overhead from `tgkill` syscalls and signal delivery.

*   `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
    unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
    locked Ps skips the global run queue, work stealing, and possibly netpoll.

*   New goroutines created by goroutines with locked Ps are enqueued on the
    global run queue rather than the invoking P's local run queue.

While gVisor's use case does not strictly require that the association be
reversible (with `runtime.UndedicateOSThread`), such a feature is required to
allow reuse of locked Ms, which is likely to be critical for performance.

## Alternatives Considered

*   Make the runtime scale well with `GOMAXPROCS`. While we are concurrently
    investigating this problem, doing so would not address the issues of
    increased preemption cost or blocking overhead.

*   Make the runtime scale well with the number of Ms. It is unclear whether
    this is actually feasible, and it would not address blocking overhead.

*   Make P-locking part of `LockOSThread`'s behavior. This would likely
    introduce performance regressions in existing uses of `LockOSThread` that do
    not fit this usage pattern. In particular, since `DedicateOSThread`
    transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
    counted against `GOMAXPROCS`", it may need to wake another M to run a new P
    (that is counted against `GOMAXPROCS`), and the converse applies to
    `UndedicateOSThread`.

*   Rewrite the gVisor sentry in a language that does not force userspace
    scheduling. This is a last resort due to the amount of code involved.

## Related Issues

The proposed functionality is directly analogous to `spawn_blocking` in Rust
async runtimes
[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).

Outside of gVisor:

*   https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a
    use case for this feature in go-delve, where the goroutine that would use
    this feature spends much of its time blocked in `ptrace` syscalls.

*   This feature may improve performance in the use case described in
    https://github.com/golang/go/issues/18237, given the prominence of
    `syscall.Syscall` in the profile included in that bug report.