Diffstat (limited to 'g3doc/proposals')
-rw-r--r-- | g3doc/proposals/BUILD | 16
-rw-r--r-- | g3doc/proposals/gsoc-2021-ideas.md | 146
-rw-r--r-- | g3doc/proposals/runtime_dedicate_os_thread.md | 188
3 files changed, 0 insertions, 350 deletions
diff --git a/g3doc/proposals/BUILD b/g3doc/proposals/BUILD
deleted file mode 100644
index 710283142..000000000
--- a/g3doc/proposals/BUILD
+++ /dev/null
@@ -1,16 +0,0 @@
-load("//website:defs.bzl", "doc")
-
-package(
-    default_visibility = ["//website:__pkg__"],
-    licenses = ["notice"],
-)
-
-doc(
-    name = "gsoc_2021",
-    src = "gsoc-2021-ideas.md",
-    category = "Project",
-    include_in_menu = False,
-    permalink = "/community/gsoc_2021/",
-    subcategory = "Community",
-    weight = "99",
-)
diff --git a/g3doc/proposals/gsoc-2021-ideas.md b/g3doc/proposals/gsoc-2021-ideas.md
deleted file mode 100644
index ecaf0dfe1..000000000
--- a/g3doc/proposals/gsoc-2021-ideas.md
+++ /dev/null
@@ -1,146 +0,0 @@
-# Project Ideas for Google Summer of Code 2021
-
-This is a collection of project ideas for
-[Google Summer of Code 2021][gsoc-2021-site]. These projects are intended to be
-relatively self-contained and should be good starting projects for new
-contributors to gVisor. We expect individual contributors to be able to make
-reasonable progress on these projects over the course of several weeks.
-Familiarity with Golang and knowledge about systems programming in Linux will be
-helpful.
-
-If you're interested in contributing to gVisor through Google Summer of Code
-2021, but would like to propose your own idea for a project, please see our
-[roadmap](../roadmap.md) for areas of development, and get in touch through our
-[mailing list][gvisor-mailing-list] or [chat][gvisor-chat]!
-
-## Implement the `setns` syscall
-
-Estimated complexity: *easy*
-
-This project involves implementing the [`setns`][man-setns] syscall. gVisor
-currently supports manipulation of namespaces through the `clone` and `unshare`
-syscalls. These two syscalls essentially implement the requisite logic for
-`setns`, but there is currently no way to obtain a file descriptor referring to
-a namespace in gVisor. As described in the `setns` man page, the two typical
-ways of obtaining such a file descriptor in Linux are by opening a file in
-`/proc/[pid]/ns`, or through the `pidfd_open` syscall.
-
-For gVisor, we recommend implementing the `/proc/[pid]/ns` mechanism first,
-which would involve implementing a trivial namespace file type in procfs.
-
-## Implement `fanotify`
-
-Estimated complexity: *medium*
-
-Implement [`fanotify`][man-fanotify] in gVisor, which is a filesystem event
-notification mechanism. gVisor currently supports `inotify`, which is a similar
-mechanism with slightly different capabilities, but which should serve as a good
-reference.
-
-The `fanotify` interface adds two new syscalls:
-
--   `fanotify_init` creates a new notification group, which is a collection of
-    filesystem objects watched by the kernel. The group is represented by a file
-    descriptor returned by this syscall. Events on the watched objects can be
-    retrieved by reading from this file descriptor.
-
--   `fanotify_mark` adds a filesystem object to a watch group, or modifies the
-    parameters of an existing watch.
-
-Unlike `inotify`, `fanotify` can set watches on filesystems and mount points,
-which will require some additional data tracking on the corresponding filesystem
-objects within the sentry.
-
-A well-designed implementation should reuse the notifications from `inotify` for
-files and directories (this is also how Linux implements these mechanisms), and
-should implement the necessary tracking and notifications for filesystems and
-mount points.
-
-## Implement `io_uring`
-
-Estimated complexity: *hard*
-
-`io_uring` is the latest asynchronous I/O API in Linux. This project will
-involve implementing the system interfaces required to support `io_uring` in
-gVisor. A successful implementation should have similar relative performance
-and scalability characteristics compared to synchronous I/O syscalls, as in
-Linux.
-
-The core of the `io_uring` interface is deceptively simple, involving only three
-new syscalls:
-
--   `io_uring_setup(2)` creates a new `io_uring` instance represented by a file
-    descriptor, including a set of request submission and completion queues
-    backed by shared memory ring buffers.
-
--   `io_uring_register(2)` optionally binds kernel resources such as files and
-    memory buffers to handles, which can then be passed to `io_uring`
-    operations. Pre-registering resources in this way moves the cost of looking
-    up and validating these resources to registration time rather than paying
-    the cost during the operation.
-
--   `io_uring_enter(2)` is the syscall used to submit queued operations and wait
-    for completions. This is the most complex part of the mechanism, requiring
-    the kernel to process queued requests from the submission queue, dispatching
-    the appropriate I/O operation based on the request arguments and blocking
-    for the requested number of operations to be completed before returning.
-
-An `io_uring` request is effectively an opcode specifying the I/O operation to
-perform, and corresponding arguments. The opcodes and arguments closely relate
-to the corresponding synchronous I/O syscall. In addition, there are some
-`io_uring`-specific arguments that specify things like how to process requests,
-how to interpret the arguments and communicate the status of the ring buffers.
-
-For a detailed description of the `io_uring` interface, see the
-[design doc][io-uring-doc] by the `io_uring` authors.
-
-Due to the complexity of the full `io_uring` mechanism and the numerous
-supported operations, it should be implemented in two stages:
-
-In the first stage, a simplified version of the `io_uring_setup` and
-`io_uring_enter` syscalls should be implemented, which will only support a
-minimal set of arguments and just one or two simple opcodes. This simplified
-implementation can be used to figure out how to integrate `io_uring` with
-gVisor's virtual filesystem and memory management subsystems, as well as
-benchmark the implementation to ensure it has the desired performance
-characteristics. The goal in this stage should be to implement the smallest
-subset of features required to perform a basic operation through `io_uring`.
-
-In the second stage, support can be added for all the I/O operations supported
-by Linux, as well as advanced `io_uring` features such as fixed files and
-buffers (via `io_uring_register`), polled I/O and kernel-side request polling.
-
-A single contributor can expect to make reasonable progress on the first stage
-within the scope of Google Summer of Code. The second stage, while not
-necessarily difficult, is likely to be very time-consuming. However, it also
-lends itself well to parallel development by multiple contributors.
-
-## Implement message queues
-
-Estimated complexity: *hard*
-
-Linux provides two alternate message queues:
-[System V message queues][man-sysvmq] and [POSIX message queues][man-posixmq].
-gVisor currently doesn't implement either.
-
-Both mechanisms add multiple syscalls for managing and using the message queues;
-see the relevant man pages above for their full description.
-
-The core of both mechanisms is very similar, so it may be possible to back both
-mechanisms with a common implementation in gVisor. Linux, however, has two
-distinct implementations.
-
-An individual contributor can reasonably implement a minimal version of one of
-these two mechanisms within the scope of Google Summer of Code. The System V
-queue may be slightly easier to implement, as gVisor already implements System V
-semaphores and shared memory regions, so the code for managing IPC objects and
-the registry already exists.
-
-[gsoc-2021-site]: https://summerofcode.withgoogle.com
-[gvisor-chat]: https://gitter.im/gvisor/community
-[gvisor-mailing-list]: https://groups.google.com/g/gvisor-dev
-[io-uring-doc]: https://kernel.dk/io_uring.pdf
-[man-fanotify]: https://man7.org/linux/man-pages/man7/fanotify.7.html
-[man-sysvmq]: https://man7.org/linux/man-pages/man7/sysvipc.7.html
-[man-posixmq]: https://man7.org/linux/man-pages/man7/mq_overview.7.html
-[man-setns]: https://man7.org/linux/man-pages/man2/setns.2.html
diff --git a/g3doc/proposals/runtime_dedicate_os_thread.md b/g3doc/proposals/runtime_dedicate_os_thread.md
deleted file mode 100644
index dc70055b0..000000000
--- a/g3doc/proposals/runtime_dedicate_os_thread.md
+++ /dev/null
@@ -1,188 +0,0 @@
-# `runtime.DedicateOSThread`
-
-Status as of 2020-09-18: Deprioritized; initial studies in #2180 suggest that
-this may be difficult to support in the Go runtime due to issues with GC.
-
-## Summary
-
-Allow goroutines to bind to kernel threads in a way that allows their scheduling
-to be kernel-managed rather than runtime-managed.
-
-## Objectives
-
-*   Reduce Go runtime overhead in the gVisor sentry (#2184).
-
-*   Minimize intrusiveness of changes to the Go runtime.
-
-## Background
-
-In Go, execution contexts are referred to as goroutines, which the runtime calls
-Gs. The Go runtime maintains a variably-sized pool of threads (called Ms by the
-runtime) on which Gs are executed, as well as a pool of "virtual processors"
-(called Ps by the runtime) of size equal to `runtime.GOMAXPROCS()`. Usually,
-each M requires a P in order to execute Gs, limiting the number of concurrently
-executing goroutines to `runtime.GOMAXPROCS()`.
-
-The `runtime.LockOSThread` function temporarily locks the invoking goroutine to
-its current thread. It is primarily useful for interacting with OS or non-Go
-library facilities that are per-thread. It does not reduce interactions with the
-Go runtime scheduler: locked Ms relinquish their P when they become blocked, and
-only continue execution after another M "chooses" their locked G to run and
-donates their P to the locked M instead.
-
-## Problems
-
-### Context Switch Overhead
-
-Most goroutines in the gVisor sentry are task goroutines, which back application
-threads. Task goroutines spend large amounts of time blocked on syscalls that
-execute untrusted application code. When invoking said syscall (which varies by
-gVisor platform), the task goroutine may interact with the Go runtime in one of
-three ways:
-
-*   It can invoke the syscall without informing the runtime. In this case, the
-    task goroutine will continue to hold its P during the syscall, limiting the
-    number of application threads that can run concurrently to
-    `runtime.GOMAXPROCS()`. This is problematic because the Go runtime scheduler
-    is known to scale poorly with `GOMAXPROCS`; see #1942 and
-    https://github.com/golang/go/issues/28808. It also means that preemption of
-    application threads must be driven by sentry or runtime code, which is
-    strictly slower than kernel-driven preemption (since the sentry must invoke
-    another syscall to preempt the application thread).
-
-*   It can call `runtime.entersyscallblock` before invoking the syscall, and
-    `runtime.exitsyscall` after the syscall returns. In this case, the task
-    goroutine will release its P while the syscall is executing. This allows the
-    number of threads concurrently executing application code to exceed
-    `GOMAXPROCS`. However, this incurs additional latency on syscall entry (to
-    hand off the released P to another M, often requiring a `futex(FUTEX_WAKE)`
-    syscall) and on syscall exit (to acquire a new P). It also drastically
-    increases the number of threads that concurrently interact with the runtime
-    scheduler, which is also problematic for performance (both in terms of CPU
-    utilization and in terms of context switch latency); see #205.
-
-*   It can call `runtime.entersyscall` before invoking the syscall, and
-    `runtime.exitsyscall` after the syscall returns. In this case, the task
-    goroutine "lazily releases" its P, allowing the runtime's "sysmon" thread to
-    steal it on behalf of another M after a 20us delay. This mitigates the
-    context switch latency problem when there are few task goroutines and the
-    interval between switches to application code (i.e. the interval between
-    application syscalls, page faults, or signal delivery) is short. (Cynically,
-    this means that it's most effective in microbenchmarks.) However, the delay
-    before a P is stolen can also be problematic for performance when there are
-    both many task goroutines switching to application code (lazily releasing
-    their Ps) *and* many task goroutines switching to sentry code (contending
-    for Ps), which is likely in larger heterogeneous workloads.
-
-### Blocking Overhead
-
-Task goroutines block on behalf of application syscalls like `futex` and
-`epoll_wait` by receiving from a Go channel. (Future work may convert task
-goroutine blocking to use the `syncevent` package to avoid overhead associated
-with channels and `select`, but this does not change how blocking interacts with
-the Go runtime scheduler.)
-
-If `runtime.LockOSThread()` is not in effect when a task goroutine blocks, then
-when the task goroutine is unblocked (by e.g. an application `FUTEX_WAKE`,
-signal delivery, or a timeout) by sending to the blocked channel,
-`runtime.ready` migrates the unblocked G to the unblocking P. In most cases,
-this implies that every application thread block/unblock cycle results in a
-migration of the thread between Ps, and therefore Ms, and therefore cores,
-resulting in reduced application performance due to loss of CPU caches.
-Furthermore, in most cases, the unblocking P cannot immediately switch to the
-unblocked G (instead resuming execution of its current application thread after
-completing the application's `futex(FUTEX_WAKE)`, `tgkill`, etc. syscall), often
-requiring that another P steal the unblocked G before it can resume execution.
-
-If `runtime.LockOSThread()` is in effect when a task goroutine blocks, then the
-G will remain locked to its M, avoiding the core migration described above;
-however, wakeup latency is significantly increased since, as described in
-"Background", the G still needs to be selected by the scheduler before it can
-run, and the M that selects the G then needs to transfer its P to the locked M,
-incurring an additional `FUTEX_WAKE` syscall and round of kernel scheduling.
-
-## Proposal
-
-We propose to add a function, tentatively called `DedicateOSThread`, to the Go
-`runtime` package, documented as follows:
-
-```go
-// DedicateOSThread wires the calling goroutine to its current operating system
-// thread, and exempts it from counting against GOMAXPROCS. The calling
-// goroutine will always execute in that thread, and no other goroutine will
-// execute in it, until the calling goroutine has made as many calls to
-// UndedicateOSThread as to DedicateOSThread. If the calling goroutine exits
-// without unlocking the thread, the thread will be terminated.
-//
-// DedicateOSThread should only be used by long-lived goroutines that usually
-// block due to blocking system calls, rather than interaction with other
-// goroutines.
-func DedicateOSThread()
-```
-
-Mechanically, `DedicateOSThread` implies `LockOSThread` (i.e. it locks the
-invoking G to an M), but additionally locks the invoking M to a P. Ps locked by
-`DedicateOSThread` are not counted against `GOMAXPROCS`; that is, the actual
-number of Ps in the system (`len(runtime.allp)`) is `GOMAXPROCS` plus the number
-of bound Ps (plus some slack to avoid frequent changes to `runtime.allp`).
-Corollaries:
-
-*   If `runtime.ready` observes that a readied G is locked to an M locked to a P,
-    it immediately wakes the locked M without migrating the G to the readying P
-    or waiting for a future call to `runtime.schedule` to select the readied G
-    in `runtime.findrunnable`.
-
-*   `runtime.stoplockedm` and `runtime.reentersyscall` skip the release of
-    locked Ps; the latter also skips sysmon wakeup. `runtime.stoplockedm` and
-    `runtime.exitsyscall` skip re-acquisition of Ps if one is locked.
-
-*   sysmon does not attempt to preempt Gs that are locked to Ps, avoiding
-    fruitless overhead from `tgkill` syscalls and signal delivery.
-
-*   `runtime.findrunnable`'s work stealing skips locked Ps (suggesting that
-    unlocked Ps be tracked in a separate array). `runtime.findrunnable` on
-    locked Ps skips the global run queue, work stealing, and possibly netpoll.
-
-*   New goroutines created by goroutines with locked Ps are enqueued on the
-    global run queue rather than the invoking P's local run queue.
-
-While gVisor's use case does not strictly require that the association is
-reversible (with `runtime.UndedicateOSThread`), such a feature is required to
-allow reuse of locked Ms, which is likely to be critical for performance.
-
-## Alternatives Considered
-
-*   Make the runtime scale well with `GOMAXPROCS`. While we are also
-    concurrently investigating this problem, this would not address the issues
-    of increased preemption cost or blocking overhead.
-
-*   Make the runtime scale well with number of Ms. It is unclear if this is
-    actually feasible, and would not address blocking overhead.
-
-*   Make P-locking part of `LockOSThread`'s behavior. This would likely
-    introduce performance regressions in existing uses of `LockOSThread` that do
-    not fit this usage pattern. In particular, since `DedicateOSThread`
-    transitions the invoker's P from "counted against `GOMAXPROCS`" to "not
-    counted against `GOMAXPROCS`", it may need to wake another M to run a new P
-    (that is counted against `GOMAXPROCS`), and the converse applies to
-    `UndedicateOSThread`.
-
-*   Rewrite the gVisor sentry in a language that does not force userspace
-    scheduling. This is a last resort due to the amount of code involved.
-
-## Related Issues
-
-The proposed functionality is directly analogous to `spawn_blocking` in Rust
-async runtimes
-[`async_std`](https://docs.rs/async-std/1.8.0/async_std/task/fn.spawn_blocking.html)
-and [`tokio`](https://docs.rs/tokio/0.3.5/tokio/task/fn.spawn_blocking.html).
-
-Outside of gVisor:
-
-*   https://github.com/golang/go/issues/21827#issuecomment-595152452 describes a
-    use case for this feature in go-delve, where the goroutine that would use
-    this feature spends much of its time blocked in `ptrace` syscalls.
-
-*   This feature may improve performance in the use case described in
-    https://github.com/golang/go/issues/18237, given the prominence of
-    syscall.Syscall in the profile given in that bug report.