Diffstat (limited to 'g3doc/architecture_guide')
-rw-r--r-- | g3doc/architecture_guide/README.md      |   4
-rw-r--r-- | g3doc/architecture_guide/performance.md |  64
-rw-r--r-- | g3doc/architecture_guide/platforms.md   |  20
-rw-r--r-- | g3doc/architecture_guide/resources.md   |   9
-rw-r--r-- | g3doc/architecture_guide/security.md    | 120
5 files changed, 119 insertions, 98 deletions
diff --git a/g3doc/architecture_guide/README.md b/g3doc/architecture_guide/README.md
index ce4c4ae69..1364a5358 100644
--- a/g3doc/architecture_guide/README.md
+++ b/g3doc/architecture_guide/README.md
@@ -3,8 +3,8 @@
 gVisor provides a virtualized environment in order to sandbox untrusted
 containers. The system interfaces normally implemented by the host kernel are
 moved into a distinct, per-sandbox user space kernel in order to minimize the
-risk of an exploit. gVisor does not introduce large fixed overheads however,
-and still retains a process-like model with respect to resource utilization.
+risk of an exploit. gVisor does not introduce large fixed overheads however, and
+still retains a process-like model with respect to resource utilization.
 
 ## How is this different?
 
diff --git a/g3doc/architecture_guide/performance.md b/g3doc/architecture_guide/performance.md
index 2f83c0d20..3862d78ee 100644
--- a/g3doc/architecture_guide/performance.md
+++ b/g3doc/architecture_guide/performance.md
@@ -67,7 +67,9 @@ accesses. Page faults and other Operating System (OS) mechanisms are translated
 through the Sentry, but once mappings are installed and available to the
 application, there is no additional overhead.
 
-{% include graph.html id="sysbench-memory" url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory --runtime=runc --runtime=runsc" %}
+{% include graph.html id="sysbench-memory"
+url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory
+--runtime=runc --runtime=runsc" %}
 
 The above figure demonstrates the memory transfer rate as measured by
 `sysbench`.
@@ -83,7 +85,8 @@ For many use cases, fixed memory overheads are a primary concern. This may be
 because sandboxed containers handle a low volume of requests, and it is
 therefore important to achieve high densities for efficiency.
 
-{% include graph.html id="density" url="/performance/density.csv" title="perf.py density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}
+{% include graph.html id="density" url="/performance/density.csv" title="perf.py
+density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}
 
 The above figure demonstrates these costs based on three sample applications.
 This test is the result of running many instances of a container (50, or 5 in
@@ -106,7 +109,8 @@ gVisor does not perform emulation or otherwise interfere with the raw execution
 of CPU instructions by the application. Therefore, there is no runtime cost
 imposed for CPU operations.
 
-{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv" title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}
+{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv"
+title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}
 
 The above figure demonstrates the `sysbench` measurement of CPU events per
 second. Events per second is based on a CPU-bound loop that calculates all prime
@@ -117,7 +121,8 @@ This has important consequences for classes of workloads that are often
 CPU-bound, such as data processing or machine learning. In these cases, `runsc`
 will similarly impose minimal runtime overhead.
 
-{% include graph.html id="tensorflow" url="/performance/tensorflow.csv" title="perf.py tensorflow --runtime=runc --runtime=runsc" %}
+{% include graph.html id="tensorflow" url="/performance/tensorflow.csv"
+title="perf.py tensorflow --runtime=runc --runtime=runsc" %}
 
 For example, the above figure shows a sample TensorFlow workload, the
 [convolutional neural network example][cnn]. The time indicated includes the
@@ -125,13 +130,16 @@ full start-up and run time for the workload, which trains a model.
 
 ## System calls
 
-Some **structural costs** of gVisor are heavily influenced by the [platform
-choice](../platforms/), which implements system call interception. Today, gVisor
-supports a variety of platforms. These platforms present distinct performance,
-compatibility and security trade-offs. For example, the KVM platform has low
-overhead system call interception but runs poorly with nested virtualization.
+Some **structural costs** of gVisor are heavily influenced by the
+[platform choice](../platforms/), which implements system call interception.
+Today, gVisor supports a variety of platforms. These platforms present distinct
+performance, compatibility and security trade-offs. For example, the KVM
+platform has low overhead system call interception but runs poorly with nested
+virtualization.
 
-{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100" log="true" %}
+{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py
+syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100"
+log="true" %}
 
 The above figure demonstrates the time required for a raw system call on various
 platforms. The test is implemented by a custom binary which performs a large
@@ -142,7 +150,8 @@ number of system calls and calculates the average time required.
 tend to be high-performance data stores and static network services. In general,
 the impact of system call interception will be lower the more work an
 application does.
 
-{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py redis --runtime=runc --runtime=runsc" %}
+{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py
+redis --runtime=runc --runtime=runsc" %}
 
 For example, `redis` is an application that performs relatively little work in
 userspace: in general it reads from a connected socket, reads or modifies some
@@ -162,7 +171,8 @@ For many use cases, the ability to spin-up containers quickly and efficiently is
 important. A sandbox may be short-lived and perform minimal user work (e.g. a
 function invocation).
 
-{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py startup --runtime=runc --runtime=runsc" %}
+{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py
+startup --runtime=runc --runtime=runsc" %}
 
 The above figure indicates how total time required to start a container through
 [Docker][docker]. This benchmark uses three different applications. First, an
@@ -173,26 +183,29 @@ similarly loads a number of modules and binds an HTTP server.
 
 > Note: most of the time overhead above is associated Docker itself. This is
 > evident with the empty `runc` benchmark. To avoid these costs with `runsc`,
-> you may also consider using `runsc do` mode or invoking the [OCI
-> runtime](../../user_guide/quick_start/oci/) directly.
+> you may also consider using `runsc do` mode or invoking the
+> [OCI runtime](../../user_guide/quick_start/oci/) directly.
 
 ## Network
 
-Networking is mostly bound by **implementation costs**, and gVisor's network stack
-is improving quickly.
+Networking is mostly bound by **implementation costs**, and gVisor's network
+stack is improving quickly.
 
 While typically not an important metric in practice for common sandbox use
 cases, nevertheless `iperf` is a common microbenchmark used to measure raw
 throughput.
 
-{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py iperf --runtime=runc --runtime=runsc" %}
+{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py
+iperf --runtime=runc --runtime=runsc" %}
 
 The above figure shows the result of an `iperf` test between two instances. For
 the upload case, the specified runtime is used for the `iperf` client, and in
 the download case, the specified runtime is the server. A native runtime is
 always used for the other endpoint in the test.
 
-{% include graph.html id="applications" metric="requests_per_second" url="/performance/applications.csv" title="perf.py http.(node|ruby) --connections=25 --runtime=runc --runtime=runsc" %}
+{% include graph.html id="applications" metric="requests_per_second"
+url="/performance/applications.csv" title="perf.py http.(node|ruby)
+--connections=25 --runtime=runc --runtime=runsc" %}
 
 The above figure shows the result of simple `node` and `ruby` web services that
 render a template upon receiving a request. Because these synthetic benchmarks
@@ -213,20 +226,26 @@ through the [Gofer](../) as a result of our [security model](../security/), but
 in most cases are dominated by **implementation costs**, due to an internal
 [Virtual File System][vfs] (VFS) implementation that needs improvement.
 
-{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio --engine=sync --runtime=runc --runtime=runsc" log="true" %}
+{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio
+--engine=sync --runtime=runc --runtime=runsc" log="true" %}
 
 The above figures demonstrate the results of `fio` for reads and writes to and
 from the disk. In this case, the disk quickly becomes the bottleneck and
 dominates other costs.
 
-{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv" title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc" log="true" %}
+{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv"
+title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc"
+log="true" %}
 
 The above figure shows the raw I/O performance of using a `tmpfs` mount which is
 sandbox-internal in the case of `runsc`. Generally these operations are
 similarly bound to the cost of copying around data in-memory, and we don't see
 the cost of VFS operations.
 
-{% include graph.html id="httpd100k" metric="transfer_rate" url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1 --connections=5 --connections=10 --connections=25 --runtime=runc --runtime=runsc" %}
+{% include graph.html id="httpd100k" metric="transfer_rate"
+url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1
+--connections=5 --connections=10 --connections=25 --runtime=runc
+--runtime=runsc" %}
 
 The high costs of VFS operations can manifest in benchmarks that execute many
 such operations in the hot path for serving requests, for example. The above
@@ -239,7 +258,8 @@ internal serialization points (since all requests are reading the same file).
 Note that some of some of network stack performance issues also impact this
 benchmark.
 
-{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py media.ffmpeg --runtime=runc --runtime=runsc" %}
+{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py
+media.ffmpeg --runtime=runc --runtime=runsc" %}
 
 For benchmarks that are bound by raw disk I/O and a mix of compute, file system
 operations are less of an issue. The above figure shows the total time required
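The "custom binary" behind the syscall figure in the hunks above is easy to approximate. The following Go sketch is illustrative only; it is not the binary that `perf.py` actually invokes, and absolute numbers will vary by machine and platform:

```go
// syscallbench times a large number of cheap system calls and reports the
// average latency per call, mirroring the approach described in
// performance.md. Illustrative sketch only.
package main

import (
	"fmt"
	"syscall"
	"time"
)

func main() {
	const n = 1000000
	start := time.Now()
	for i := 0; i < n; i++ {
		syscall.Getpid() // a trivial, side-effect-free system call
	}
	elapsed := time.Since(start)
	fmt.Printf("%d syscalls in %v (avg %v/call)\n", n, elapsed, elapsed/n)
}
```

Building this once and running the same binary under `runc` and `runsc` (with the ptrace or KVM platform) should reproduce the shape, if not the exact values, of the syscall comparison.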
diff --git a/g3doc/architecture_guide/platforms.md b/g3doc/architecture_guide/platforms.md
index 1f79971d1..6e63da8ce 100644
--- a/g3doc/architecture_guide/platforms.md
+++ b/g3doc/architecture_guide/platforms.md
@@ -6,12 +6,12 @@ be run. Each sandbox has its own isolated instance of:
 
-* The **Sentry**, A user-space kernel that runs the container and intercepts
-  and responds to system calls made by the application.
+*   The **Sentry**, A user-space kernel that runs the container and intercepts
+    and responds to system calls made by the application.
 
 Each container running in the sandbox has its own isolated instance of:
 
-* A **Gofer** which provides file system access to the container.
+*   A **Gofer** which provides file system access to the container.
 
 ![gVisor architecture diagram](Sentry-Gofer.png "gVisor architecture diagram")
 
@@ -20,9 +20,9 @@
 The entrypoint to running a sandboxed container is the `runsc` executable.
 `runsc` implements the [Open Container Initiative (OCI)][oci] runtime
 specification. This means that OCI compatible _filesystem bundles_ can be run by
-`runsc`. Filesystem bundles are comprised of a `config.json` file containing
-container configuration, and a root filesystem for the container. Please see
-the [OCI runtime spec][runtime-spec] for more information on filesystem bundles.
+`runsc`. Filesystem bundles are comprised of a `config.json` file containing
+container configuration, and a root filesystem for the container. Please see the
+[OCI runtime spec][runtime-spec] for more information on filesystem bundles.
 
 `runsc` implements multiple commands that perform various functions such as
 starting, stopping, listing, and querying the status of containers.
@@ -31,8 +31,8 @@ starting, stopping, listing, and querying the status of containers.
 The Sentry is the largest component of gVisor. It can be thought of as a
 userspace OS kernel. The Sentry implements all the kernel functionality needed
 by the untrusted application. It implements all of the supported system calls,
-signal delivery, memory management and page faulting logic, the threading
-model, and more.
+signal delivery, memory management and page faulting logic, the threading model,
+and more.
 
 When the untrusted application makes a system call, the currently used platform
 redirects the call to the Sentry, which will do the necessary work to service
@@ -43,8 +43,8 @@ application to directly control the system calls it makes.
 
 The Sentry aims to present an equivalent environment to (upstream) Linux v4.4.
 
-File system operations that extend beyond the sandbox (not internal /proc
-files, pipes, etc) are sent to the Gofer, described below.
+File system operations that extend beyond the sandbox (not internal /proc files,
+pipes, etc) are sent to the Gofer, described below.
 
 ## Platforms
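A rough mental model of the platform role described in the platforms.md hunks above: run the application until it traps, then report why control returned. The interface below is invented for illustration; gVisor's real API in `pkg/sentry/platform` is considerably richer:

```go
// Package platform sketches the conceptual contract of a gVisor platform:
// run untrusted code until it traps, then hand the trap reason back to the
// Sentry. Names here are invented, not gVisor's actual interface.
package platform

// TrapReason says why control returned from the application to the Sentry.
type TrapReason int

const (
	TrapSyscall   TrapReason = iota // application issued a system call
	TrapPageFault                   // application touched an unmapped page
	TrapSignal                      // application received a signal
)

// Regs is a stand-in for saved architecture-specific register state.
type Regs struct {
	PC, SP uintptr
}

// Context runs application code on behalf of the Sentry.
type Context interface {
	// Switch resumes the application and blocks until it traps, returning
	// why it stopped. The Sentry then services the syscall or fault and
	// calls Switch again.
	Switch(regs *Regs) (TrapReason, error)
}
```

Whether `Switch` is realized with ptrace stops or KVM VM exits is exactly the performance and compatibility trade-off discussed in performance.md.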
diff --git a/g3doc/architecture_guide/resources.md b/g3doc/architecture_guide/resources.md
index 4580bf9f4..894f995ae 100644
--- a/g3doc/architecture_guide/resources.md
+++ b/g3doc/architecture_guide/resources.md
@@ -40,8 +40,8 @@ Sentry via [SCM_RIGHTS][scmrights][^1].
 These files may be read from and written to through standard system calls, and
 also mapped into the associated application's address space. This allows the
 same host memory to be shared across multiple sandboxes, although this mechanism
-does not preclude the use of side-channels (see the [security
-model](../security/)).
+does not preclude the use of side-channels (see the
+[security model](../security/)).
 
 Note that some file systems exist only within the context of the sandbox. For
 example, in many cases a `tmpfs` mount will be available at `/tmp` or
@@ -138,5 +138,6 @@ to purge internal caches.
 [scmrights]: http://man7.org/linux/man-pages/man7/unix.7.html
 [madvise]: http://man7.org/linux/man-pages/man2/madvise.2.html
 [exec]: https://docs.docker.com/engine/reference/commandline/exec/
-
-[^1]: Unless host networking is enabled, the Sentry is not able to create or open host file descriptors itself, it can only receive them in this way from the Gofer.
+[^1]: Unless host networking is enabled, the Sentry is not able to create or
+    open host file descriptors itself, it can only receive them in this way
+    from the Gofer.
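The footnote above refers to SCM_RIGHTS file descriptor passing over a Unix domain socket, which can be sketched in a few lines of Go. This is illustrative only; `recvFD` is a hypothetical helper, not gVisor's actual Gofer transport code:

```go
// Package fdchan sketches SCM_RIGHTS file-descriptor passing, the mechanism
// resources.md describes for Gofer-to-Sentry file handoff. Illustrative only.
package fdchan

import (
	"os"
	"syscall"
)

// recvFD receives a single file descriptor over a connected Unix domain
// socket. The single byte of regular data carries the ancillary message.
func recvFD(sock int) (*os.File, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // space for one int32 fd
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return nil, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return nil, err
	}
	if len(msgs) == 0 {
		return nil, syscall.EINVAL // no control message received
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return nil, err
	}
	return os.NewFile(uintptr(fds[0]), "from-gofer"), nil
}
```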
diff --git a/g3doc/architecture_guide/security.md b/g3doc/architecture_guide/security.md
index afafe5c05..f78586291 100644
--- a/g3doc/architecture_guide/security.md
+++ b/g3doc/architecture_guide/security.md
@@ -27,14 +27,14 @@ namespaces.
 
 Although the System API is exposed to applications by design, bugs and race
 conditions within the kernel or hypervisor may occasionally be exploitable via
-the API. This is common in part due to the fact that most kernels and hypervisors
-are written in [C][clang], which is well-suited to interfacing with hardware but
-often prone to security issues. In order to exploit these issues, a typical attack
-might involve some combination of the following:
+the API. This is common in part due to the fact that most kernels and
+hypervisors are written in [C][clang], which is well-suited to interfacing with
+hardware but often prone to security issues. In order to exploit these issues, a
+typical attack might involve some combination of the following:
 
-1. Opening or creating some combination of files, sockets or other descriptors.
-1. Passing crafted, malicious arguments, structures or packets.
-1. Racing with multiple threads in order to hit specific code paths.
+1.  Opening or creating some combination of files, sockets or other descriptors.
+1.  Passing crafted, malicious arguments, structures or packets.
+1.  Racing with multiple threads in order to hit specific code paths.
 
 For example, for the [Dirty Cow][dirtycow] privilege escalation bug, an
 application would open a specific file in `/proc` or use a specific `ptrace`
@@ -74,8 +74,8 @@ hyperthread.
 
 The above categories in no way represent an exhaustive list of exploits, as we
 focus only on running untrusted code from within the operating system or
-hypervisor. We do not consider other ways that a more generic adversary
-may interact with a system, such as inserting a portable storage device with a
+hypervisor. We do not consider other ways that a more generic adversary may
+interact with a system, such as inserting a portable storage device with a
 malicious filesystem image, using a combination of crafted keyboard or touch
 inputs, or saturating a network device with ill-formed packets.
@@ -100,30 +100,30 @@ The first principle is similar to the security basis for a Virtual Machine
 (VM). With a VM, an application’s interactions with the host are replaced by
 interactions with a guest operating system and a set of virtualized hardware
 devices. These hardware devices are then implemented via the host System API by
-a Virtual Machine Monitor (VMM). The Sentry similarly prevents direct interactions
-by providing its own implementation of the System API that the application
-must interact with. Applications are not able to to directly craft
-arguments or flags for the host System API, or interact directly with host
-primitives.
+a Virtual Machine Monitor (VMM). The Sentry similarly prevents direct
+interactions by providing its own implementation of the System API that the
+application must interact with. Applications are not able to to directly craft
+specific arguments or flags for the host System API, or interact directly with
+host primitives.
 
 For both the Sentry and a VMM, it’s worth noting that while direct interactions
 are not possible, indirect interactions are still possible. For example, a read
 on a host-backed file in the Sentry may ultimately result in a host read system
-call (made by the Sentry, not by passing through arguments from the application),
-similar to how a read on a block device in a VM may result in the VMM issuing
-a corresponding host read system call from a backing file.
-
-An important distinction from a VM is that the Sentry implements a System API based
-directly on host System API primitives instead of relying on virtualized hardware
-and a guest operating system. This selects a distinct set of trade-offs, largely
-in the performance, efficiency and compatibility domains. Since transitions in
-and out of the sandbox are relatively expensive, a guest operating system will
-typically take ownership of resources. For example, in the above case, the
-guest operating system may read the block device data in a local page cache,
-to avoid subsequent reads. This may lead to better performance but lower
-efficiency, since memory may be wasted or duplicated. The Sentry opts instead
-to defer to the host for many operations during runtime, for improved efficiency
-but lower performance in some use cases.
+call (made by the Sentry, not by passing through arguments from the
+application), similar to how a read on a block device in a VM may result in the
+VMM issuing a corresponding host read system call from a backing file.
+
+An important distinction from a VM is that the Sentry implements a System API
+based directly on host System API primitives instead of relying on virtualized
+hardware and a guest operating system. This selects a distinct set of
+trade-offs, largely in the performance, efficiency and compatibility domains.
+Since transitions in and out of the sandbox are relatively expensive, a guest
+operating system will typically take ownership of resources. For example, in the
+above case, the guest operating system may read the block device data in a local
+page cache, to avoid subsequent reads. This may lead to better performance but
+lower efficiency, since memory may be wasted or duplicated. The Sentry opts
+instead to defer to the host for many operations during runtime, for improved
+efficiency but lower performance in some use cases.
 
 ### What can a sandbox do?
 
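A toy illustration of the principle in the hunk above, and of the first two operations in the list in the hunk that follows: the Sentry services an application read by issuing its own host `read` on a descriptor it received from the Gofer, with arguments it constructed and validated itself. Hypothetical code, not taken from the Sentry:

```go
// Toy illustration of the security.md point that application syscall
// arguments never pass through to the host: the Sentry re-issues the read
// against a host fd it already holds, into a buffer it owns and clamped
// itself. Hypothetical, not Sentry code.
package main

import (
	"fmt"
	"syscall"
)

// sentryRead services an application read request of length n against a
// host file descriptor previously received from the Gofer.
func sentryRead(hostFD int, n int) ([]byte, error) {
	const maxIO = 1 << 20 // the Sentry clamps sizes itself
	if n > maxIO {
		n = maxIO
	}
	buf := make([]byte, n) // Sentry-owned buffer, not application memory
	m, err := syscall.Read(hostFD, buf)
	if err != nil {
		return nil, err
	}
	return buf[:m], nil // copied out to the application afterwards
}

func main() {
	data, err := sentryRead(0, 16) // e.g. read up to 16 bytes from stdin
	fmt.Println(len(data), err)
}
```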
@@ -140,15 +140,15 @@ filesystem attributes) and not underlying host system resources.
 
 While the sandbox virtualizes many operations for the application, we limit the
 sandbox's own interactions with the host to the following high-level operations:
 
-1. Communicate with a Gofer process via a connected socket. The sandbox may
-   receive new file descriptors from the Gofer process, corresponding to opened
-   files. These files can then be read from and written to by the sandbox.
-1. Make a minimal set of host system calls. The calls do not include the
-   creation of new sockets (unless host networking mode is enabled) or opening
-   files. The calls include duplication and closing of file descriptors,
-   synchronization, timers and signal management.
-1. Read and write packets to a virtual ethernet device. This is not required if
-   host networking is enabled (or networking is disabled).
+1.  Communicate with a Gofer process via a connected socket. The sandbox may
+    receive new file descriptors from the Gofer process, corresponding to opened
+    files. These files can then be read from and written to by the sandbox.
+1.  Make a minimal set of host system calls. The calls do not include the
+    creation of new sockets (unless host networking mode is enabled) or opening
+    files. The calls include duplication and closing of file descriptors,
+    synchronization, timers and signal management.
+1.  Read and write packets to a virtual ethernet device. This is not required if
+    host networking is enabled (or networking is disabled).
 
 ### System ABI, Side Channels and Other Vectors
 
@@ -173,32 +173,32 @@ less likely to exploit or override these controls through other means.
 
 For gVisor development, there are several engineering principles that are
 employed in order to ensure that the system meets its design goals.
 
-1. No system call is passed through directly to the host. Every supported call
-   has an independent implementation in the Sentry, that is unlikely to suffer
-   from identical vulnerabilities that may appear in the host. This has the
-   consequence that all kernel features used by applications require an
-   implementation within the Sentry.
-1. Only common, universal functionality is implemented. Some filesystems,
-   network devices or modules may expose specialized functionality to user
-   space applications via mechanisms such as extended attributes, raw sockets
-   or ioctls. Since the Sentry is responsible for implementing the full system
-   call surface, we do not implement or pass through these specialized APIs.
-1. The host surface exposed to the Sentry is minimized. While the system call
-   surface is not trivial, it is explicitly enumerated and controlled. The
-   Sentry is not permitted to open new files, create new sockets or do many
-   other interesting things on the host.
+1.  No system call is passed through directly to the host. Every supported call
+    has an independent implementation in the Sentry, that is unlikely to suffer
+    from identical vulnerabilities that may appear in the host. This has the
+    consequence that all kernel features used by applications require an
+    implementation within the Sentry.
+1.  Only common, universal functionality is implemented. Some filesystems,
+    network devices or modules may expose specialized functionality to user
+    space applications via mechanisms such as extended attributes, raw sockets
+    or ioctls. Since the Sentry is responsible for implementing the full system
+    call surface, we do not implement or pass through these specialized APIs.
+1.  The host surface exposed to the Sentry is minimized. While the system call
+    surface is not trivial, it is explicitly enumerated and controlled. The
+    Sentry is not permitted to open new files, create new sockets or do many
+    other interesting things on the host.
 
 Additionally, we have practical restrictions that are imposed on the project to
 minimize the risk of Sentry exploitability. For example:
 
-1. Unsafe code is carefully controlled. All unsafe code is isolated in files
-   that end with "unsafe.go", in order to facilitate validation and auditing.
-   No file without the unsafe suffix may import the unsafe package.
-1. No CGo is allowed. The Sentry must be a pure Go binary.
-1. External imports are not generally allowed within the core packages. Only
-   limited external imports are used within the setup code. The code available
-   inside the Sentry is carefully controlled, to ensure that the above rules
-   are effective.
+1.  Unsafe code is carefully controlled. All unsafe code is isolated in files
+    that end with "unsafe.go", in order to facilitate validation and auditing.
+    No file without the unsafe suffix may import the unsafe package.
+1.  No CGo is allowed. The Sentry must be a pure Go binary.
+1.  External imports are not generally allowed within the core packages. Only
+    limited external imports are used within the setup code. The code available
+    inside the Sentry is carefully controlled, to ensure that the above rules
+    are effective.
 
 Finally, we recognize that security is a process, and that vigilance is
 critical. Beyond our security disclosure process, the Sentry is fuzzed
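The unsafe-isolation rule in the final hunk is mechanically checkable. A minimal checker might look like the sketch below; gVisor enforces the rule through its own build tooling, not with this program:

```go
// unsafecheck flags Go files that import "unsafe" without carrying the
// unsafe.go suffix, the rule described in security.md. A sketch only.
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".go") {
			return err
		}
		if strings.HasSuffix(path, "unsafe.go") {
			return nil // files named *unsafe.go may import unsafe
		}
		f, perr := parser.ParseFile(token.NewFileSet(), path, nil, parser.ImportsOnly)
		if perr != nil {
			return nil // skip files that do not parse; fine for a sketch
		}
		for _, imp := range f.Imports {
			if imp.Path.Value == `"unsafe"` {
				fmt.Printf("%s: imports unsafe outside an unsafe.go file\n", path)
			}
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```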