# `runtime.DedicateOSThread`
Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
this may be difficult to support in the Go runtime due to issues with GC.
## Summary
Allow goroutines to bind to kernel threads in a way that allows their scheduling
to be kernel-managed rather than runtime-managed.
## Objectives
* Reduce Go runtime overhead in the gVisor sentry (#2184).
* Minimize intrusiveness of changes to the Go runtime.
## Background
In Go, execution contexts are referred to as goroutines, which the runtime calls
Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the
runtime) on which Gs are executed, as well as a pool of "virtual processors"
(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually,
each M requires a P in order to execute Gs, limiting the number of concurrently
executing goroutines to `runtime.GOMAXPROCS()`.
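As a concrete illustration of that sizing relationship, the following standalone snippet (not sentry code) prints the current number of Ps and the typically much larger number of Gs:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// GOMAXPROCS(0) queries the current value without changing it. This is
	// the number of Ps, and hence an upper bound on the number of Ms
	// executing Go code concurrently (Ms blocked in syscalls are extra).
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))

	// The number of Gs is independent of GOMAXPROCS and is typically far
	// larger in a busy program such as the gVisor sentry.
	fmt.Println("Gs (goroutines):", runtime.NumGoroutine())
}
```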
The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
its current thread. It is primarily useful for interacting with OS or non-Go
library facilities that are per-thread. It does not reduce interactions with the
Go runtime scheduler: locked Ms relinquish their P when they become blocked, and
only continue execution after another M "chooses" their locked G to run and
donates their P to the locked M instead.
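For reference, a minimal sketch of the existing `runtime.LockOSThread` pattern, assuming Linux and the `golang.org/x/sys/unix` package (the per-thread attribute chosen here is purely illustrative):

```go
package main

import (
	"fmt"
	"runtime"

	"golang.org/x/sys/unix"
)

func main() {
	// Pin this goroutine to its current M so that a per-thread OS facility
	// (here, the thread's nice value) keeps applying to the thread this
	// goroutine actually runs on.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	tid := unix.Gettid()
	if err := unix.Setpriority(unix.PRIO_PROCESS, tid, 10); err != nil {
		fmt.Println("setpriority:", err)
	}
	// Without LockOSThread, the runtime could later migrate this goroutine
	// to a different M, and the attribute set above would no longer apply
	// to the executing thread. Note that LockOSThread does not exempt the
	// locked M from needing a P, which is the limitation this proposal
	// addresses.
}
```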
## Problems
### Context Switch Overhead
Most goroutines in the gVisor sentry are task goroutines, which back application
threads. Task goroutines spend large amounts of time blocked on syscalls that
execute untrusted application code. When invoking said syscall (which varies by
gVisor platform), the task goroutine may interact with the Go runtime in one of
three ways:
* It can invoke the syscall without informing the runtime. In this case, the
task goroutine will continue to hold its P during the syscall, limiting the
number of application threads that can run concurrently to
`runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler
is known to scale poorly with `GOMAXPROCS`; see #1942 and
https://github.com/golang/go/issues/28808. It also means that preemption of
application threads must be driven by sentry or runtime code, which is
strictly slower than kernel-driven preemption (since the sentry must invoke
another syscall to preempt the application thread).
* It can call `runtime.entersyscallblock` before invoking the syscall, and
`runtime.exitsyscall` after the syscall returns. In this case, the task
goroutine will release its P while the syscall is executing. This allows the
number of threads concurrently executing application code to exceed
`GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)`
syscall) and on syscall exit (to acquire a new P). It also drastically
increases the number of threads that concurrently interact with the runtime
scheduler, which is also problematic for performance (both in terms of CPU
utilization and in terms of context switch latency); see #205.
* It can call `runtime.entersyscall` before invoking the syscall, and
`runtime.exitsyscall` after the syscall returns. In this case, the task
goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to
steal it on behalf of another M after a 20us delay. This mitigates the
context switch latency problem when there are few task goroutines and the
interval between switches to application code (i.e. the interval between
application syscalls, page faults, or signal delivery) is short. (Cynically,
this means that it's most effective in microbenchmarks.) However, the delay
before a P is stolen can also be problematic for performance when there are
both many task goroutines switching to application code (lazily releasing
their Ps) *and* many task goroutines switching to sentry code (contending
for Ps), which is likely in larger heterogeneous workloads. (Rough
standard-library analogues of the first and third modes are sketched after
this list.)
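As a rough illustration (not sentry code), the first and third modes correspond approximately to the standard `RawSyscall` and `Syscall` wrappers: `RawSyscall` does not notify the runtime, while `Syscall` brackets the call with `runtime.entersyscall`/`exitsyscall`. The second mode (`entersyscallblock`) has no exported wrapper, which is part of what makes these trade-offs awkward for the sentry. Assuming Linux and `golang.org/x/sys/unix`:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Mode 1: the runtime is not informed of the syscall. The calling M
	// keeps its P for the duration of the call, so a call that blocks for
	// a long time ties up one of the GOMAXPROCS slots.
	r1, _, _ := unix.RawSyscall(unix.SYS_GETPID, 0, 0, 0)

	// Mode 3 (approximately): the call is bracketed by
	// runtime.entersyscall/exitsyscall, so the P is "lazily released" and
	// sysmon may hand it to another M if the call blocks for long enough.
	r2, _, _ := unix.Syscall(unix.SYS_GETPID, 0, 0, 0)

	fmt.Println("getpid via RawSyscall:", r1, "via Syscall:", r2)

	// Mode 2 (runtime.entersyscallblock, which releases the P eagerly) is
	// not reachable through exported wrappers at all.
}
```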
### Blocking Overhead
Task goroutines block on behalf of application syscalls like `futex` and
`epoll_wait` by receiving from a Go channel. (Future work may convert task
goroutine blocking to use the `syncevent` package to avoid overhead associated
with channels and `select`, but this does not change how blocking interacts with
the Go runtime scheduler.)
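The following standalone sketch models that pattern; the `waiter` type and its methods are invented for illustration and are not the sentry's actual blocking primitives:

```go
package main

import (
	"fmt"
	"time"
)

// waiter is a hypothetical stand-in for a task goroutine's wait state. It
// only models the "block by receiving from a channel" pattern described
// above.
type waiter struct {
	ch chan struct{}
}

func newWaiter() *waiter { return &waiter{ch: make(chan struct{}, 1)} }

// block waits until woken or the timeout expires, analogous to a task
// goroutine blocked in an application futex(FUTEX_WAIT) or epoll_wait.
func (w *waiter) block(timeout time.Duration) bool {
	select {
	case <-w.ch:
		return true
	case <-time.After(timeout):
		return false
	}
}

// wake is what another goroutine (e.g. one handling an application
// FUTEX_WAKE) calls; the channel send is what runtime.ready acts on.
func (w *waiter) wake() {
	select {
	case w.ch <- struct{}{}:
	default: // already woken
	}
}

func main() {
	w := newWaiter()
	go func() {
		time.Sleep(10 * time.Millisecond)
		w.wake()
	}()
	fmt.Println("woken:", w.block(time.Second))
}
```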
If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
when the task goroutine is unblocked (e.g. by an application `FUTEX_WAKE`,
signal delivery, or a timeout) by a send to the channel it is blocked on,
`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
this implies that every application thread block/unblock cycle results in a
migration of the thread between Ps, and therefore Ms, and therefore cores,
resulting in reduced application performance due to loss of CPU caches.
Furthermore, in most cases, the unblocking P cannot immediately switch to the
unblocked G (instead resuming execution of its current application thread after
completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often
requiring that another P steal the unblocked G before it can resume execution.
If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
G will remain locked to its M, avoiding the core migration described above;
however, wakeup latency is significantly increased since, as described in
"Background", the G still needs to be selected by the scheduler before it can
run, and the M that selects the G then needs to transfer its P to the locked M,
incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.
## Proposal
We propose to add a function, tentatively called `DedicateOSThread`, to the Go
`runtime` package, documented as follows:
```go
// DedicateOSThread wires the calling goroutine to its current operating system
// thread, and exempts it from counting against GOMAXPROCS. The calling
// goroutine will always execute in that thread, and no other goroutine will
// execute in it, until the calling goroutine has made as many calls to
// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
// without unlocking the thread, the thread will be terminated.
//
// DedicateOSThread should only be used by long-lived goroutines that usually
// block due to blocking system calls, rather than interaction with other
// goroutines.
func DedicateOSThread()
```
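A hypothetical usage sketch follows. Since `DedicateOSThread` and `UndedicateOSThread` do not exist in today's runtime, this cannot compile, and the `task` type and its methods are invented stand-ins for the sentry's real task and platform APIs:

```go
package example

import "runtime"

// task is a hypothetical stand-in for a sentry task; the real kernel.Task and
// platform APIs differ.
type task struct{}

// switchToApp stands in for the platform-specific switch to untrusted
// application code, returning when the application syscalls, faults, or
// receives a signal.
func (t *task) switchToApp() (ev int) { return }

// handle stands in for sentry-side handling of that event.
func (t *task) handle(ev int) {}

// taskGoroutine sketches how a task goroutine might use the proposed API.
func taskGoroutine(t *task) {
	runtime.DedicateOSThread()
	defer runtime.UndedicateOSThread()

	for {
		// Switch to untrusted application code. The goroutine blocks in a
		// platform syscall, but its dedicated M and P are not counted
		// against GOMAXPROCS and are never handed to other goroutines.
		ev := t.switchToApp()

		// Handle the resulting syscall, fault, or signal in sentry code,
		// still on the same dedicated thread.
		t.handle(ev)
	}
}
```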
Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
invoking G to an M), but additionally locks the invoking M to a P. Ps locked by
`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
Corollaries:
* If `runtime.ready` observes that a readied G is locked to an M locked to a P,
it immediately wakes the locked M without migrating the G to the readying P
or waiting for a future call to `runtime.schedule` to select the readied G
in `runtime.findrunnable` (see the sketch after this list).
* `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
`runtime.exitsyscall` skip re-acquisition of Ps if one is locked.
* sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
fruitless overhead from `tgkill` syscalls and signal delivery.
* `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
locked Ps skips the global run queue, work stealing, and possibly netpoll.
* New goroutines created by goroutines with locked Ps are enqueued on the
global run queue rather than the invoking P's local run queue.
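To make the first corollary concrete, here is an illustrative model of the proposed readying fast path. The types and the channel-based wakeup below are invented stand-ins for runtime internals, so this shows only the intended control flow, not a real runtime change:

```go
package example

// p, m, and g are simplified stand-ins for the runtime's internal
// structures; the channel plays the role of the M's park/unpark primitive.
type p struct{}

type m struct {
	lockedp *p            // non-nil only for Ms locked to a P by DedicateOSThread
	wake    chan struct{} // stand-in for the real M wakeup mechanism
}

type g struct {
	lockedm *m // non-nil if the G is locked to an M
}

// ready models the proposed fast path: a readied G that is locked to an M
// owning a dedicated P wakes that M directly, rather than being migrated to
// the readying P or waiting for runtime.findrunnable to select it.
func ready(gp *g) {
	if lm := gp.lockedm; lm != nil && lm.lockedp != nil {
		select {
		case lm.wake <- struct{}{}:
		default: // already being woken
		}
		return
	}
	// Otherwise, fall back to the existing behavior: enqueue the G and let
	// the scheduler pick it up (omitted).
}
```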
While gVisor's use case does not strictly require that the association be
reversible (with `runtime.UndedicateOSThread`), such a feature is required to
allow reuse of locked Ms, which is likely to be critical for performance.
## Alternatives Considered
* Make the runtime scale well with `GOMAXPROCS`. While we are also
concurrently investigating this problem, this would not address the issues
of increased preemption cost or blocking overhead.
* Make the runtime scale well with number of Ms. It is unclear if this is
actually feasible, and would not address blocking overhead.
* Make P-locking part of `LockOSThread`'s behavior. This would likely
introduce performance regressions in existing uses of `LockOSThread` that do
not fit this usage pattern. In particular, since `DedicateOSThread`
transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
counted against `GOMAXPROCS`", it may need to wake another M to run a new P
(that is counted against `GOMAXPROCS`), and the converse applies to
`UndedicateOSThread`.
* Rewrite the gVisor sentry in a language that does not force userspace
scheduling. This is a last resort due to the amount of code involved.
## Related Issues
The proposed functionality is directly analogous to `spawn_blocking` in Rust
async runtimes
[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).
Outside of gVisor:
* https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a
use case for this feature in go-delve, where the goroutine that would use
this feature spends much of its time blocked in `ptrace` syscalls.
* This feature may improve performance in the use case described in
https://github.com/golang/go/issues/18237, given the prominence of
`syscall.Syscall` in the profile included in that bug report.