author Ian Lewis <ianmlewis@gmail.com> 2020-08-17 21:44:31 -0400
committer Ian Lewis <ianmlewis@gmail.com> 2020-08-17 21:44:31 -0400
commit ac324f646ee3cb7955b0b45a7453aeb9671cbdf1 (patch)
tree 0cbc5018e8807421d701d190dc20525726c7ca76 /g3doc/architecture_guide
parent 352ae1022ce19de28fc72e034cc469872ad79d06 (diff)
parent 6d0c5803d557d453f15ac6f683697eeb46dab680 (diff)
Merge branch 'master' into ip-forwarding
- Merges aleksej-paschenko's with HEAD
- Adds vfs2 support for ip_forward
Diffstat (limited to 'g3doc/architecture_guide')
-rw-r--r-- g3doc/architecture_guide/BUILD 50
-rw-r--r-- g3doc/architecture_guide/performance.md 277
-rw-r--r-- g3doc/architecture_guide/platforms.md 61
-rw-r--r-- g3doc/architecture_guide/platforms.png bin 0 -> 21384 bytes
-rw-r--r-- g3doc/architecture_guide/platforms.svg 334
-rw-r--r-- g3doc/architecture_guide/resources.md 144
-rw-r--r-- g3doc/architecture_guide/resources.png bin 0 -> 16621 bytes
-rw-r--r-- g3doc/architecture_guide/resources.svg 208
-rw-r--r-- g3doc/architecture_guide/security.md 255
-rw-r--r-- g3doc/architecture_guide/security.png bin 0 -> 16932 bytes
-rw-r--r-- g3doc/architecture_guide/security.svg 153
11 files changed, 1482 insertions, 0 deletions
diff --git a/g3doc/architecture_guide/BUILD b/g3doc/architecture_guide/BUILD
new file mode 100644
index 000000000..404f627a4
--- /dev/null
+++ b/g3doc/architecture_guide/BUILD
@@ -0,0 +1,50 @@
+load("//website:defs.bzl", "doc")
+
+package(
+ default_visibility = ["//website:__pkg__"],
+ licenses = ["notice"],
+)
+
+doc(
+ name = "platforms",
+ src = "platforms.md",
+ category = "Architecture Guide",
+ data = [
+ "platforms.png",
+ "platforms.svg",
+ ],
+ permalink = "/docs/architecture_guide/platforms/",
+ weight = "40",
+)
+
+doc(
+ name = "resources",
+ src = "resources.md",
+ category = "Architecture Guide",
+ data = [
+ "resources.png",
+ "resources.svg",
+ ],
+ permalink = "/docs/architecture_guide/resources/",
+ weight = "30",
+)
+
+doc(
+ name = "security",
+ src = "security.md",
+ category = "Architecture Guide",
+ data = [
+ "security.png",
+ "security.svg",
+ ],
+ permalink = "/docs/architecture_guide/security/",
+ weight = "10",
+)
+
+doc(
+ name = "performance",
+ src = "performance.md",
+ category = "Architecture Guide",
+ permalink = "/docs/architecture_guide/performance/",
+ weight = "20",
+)
diff --git a/g3doc/architecture_guide/performance.md b/g3doc/architecture_guide/performance.md
new file mode 100644
index 000000000..39dbb0045
--- /dev/null
+++ b/g3doc/architecture_guide/performance.md
@@ -0,0 +1,277 @@
+# Performance Guide
+
+[TOC]
+
+gVisor is designed to provide a secure, virtualized environment while preserving
+key benefits of containerization, such as small fixed overheads and a dynamic
+resource footprint. For containerized infrastructure, this can provide a
+turn-key solution for sandboxing untrusted workloads: there are no changes to
+the fundamental resource model.
+
+gVisor imposes runtime costs over native containers. These costs come in two
+forms: additional cycles and memory usage, which may manifest as increased
+latency, reduced throughput or density, or not at all. In general, these costs
+come from two different sources.
+
+First, the existence of the [Sentry](../README.md#sentry) means that additional
+memory will be required, and application system calls must traverse additional
+layers of software. The design emphasizes
+[security](/docs/architecture_guide/security/) and therefore we chose to use a
+language for the Sentry that provides benefits in this domain but may not yet
+offer the raw performance of other choices. Costs imposed by these design
+choices are **structural costs**.
+
+Second, as gVisor is an independent implementation of the system call surface,
+many of the subsystems or specific calls are not as optimized as more mature
+implementations. A good example here is the network stack, which is continuing
+to evolve but does not support all the advanced recovery mechanisms offered by
+other stacks and is less CPU efficient. This is an **implementation cost** and
+is distinct from **structural costs**. Improvements here are ongoing and driven
+by the workloads that matter to gVisor users and contributors.
+
+This page provides a guide for understanding baseline performance, and calls out
+distinct **structural costs** and **implementation costs**, highlighting where
+improvements are possible and not possible.
+
+While we include a variety of workloads here, it’s worth emphasizing that gVisor
+may not be an appropriate solution for every workload, for reasons other than
+performance. For example, a sandbox may provide minimal benefit for a trusted
+database, since _user data would already be inside the sandbox_ and there is no
+need for an attacker to break out in the first place.
+
+## Methodology
+
+All data below was generated using the [benchmark tools][benchmark-tools]
+repository, and the machines under test are uniform [Google Compute Engine][gce]
+Virtual Machines (VMs) with the following specifications:
+
+ Machine type: n1-standard-4 (broadwell)
+ Image: Debian GNU/Linux 9 (stretch) 4.19.0-0
+ BootDisk: 2048GB SSD persistent disk
+
+Throughout this document, `runsc` is used to indicate the runtime provided by
+gVisor. When relevant, we use the name `runsc-platform` to describe a specific
+[platform choice](/docs/architecture_guide/platforms/).
+
+**Except where specified, all tests below are conducted with the `ptrace`
+platform. The `ptrace` platform works everywhere and does not require hardware
+virtualization or kernel modifications but suffers from the highest structural
+costs by far. This platform is used to provide a clear understanding of the
+performance model, but in no way represents an ideal scenario. In the future,
+this guide will be extended to bare metal environments and include additional
+platforms.**
+
+## Memory access
+
+gVisor does not introduce any additional costs with respect to raw memory
+accesses. Page faults and other Operating System (OS) mechanisms are translated
+through the Sentry, but once mappings are installed and available to the
+application, there is no additional overhead.
+
+{% include graph.html id="sysbench-memory"
+url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory
+--runtime=runc --runtime=runsc" %}
+
+The above figure demonstrates the memory transfer rate as measured by
+`sysbench`.
+
+## Memory usage
+
+The Sentry provides an additional layer of indirection, and it requires memory
+in order to store state associated with the application. This memory generally
+consists of a fixed component, plus an amount that varies with the usage of
+operating system resources (e.g. how many sockets or files are opened).
+
+For many use cases, fixed memory overheads are a primary concern. This may be
+because sandboxed containers handle a low volume of requests, and it is
+therefore important to achieve high densities for efficiency.
+
+{% include graph.html id="density" url="/performance/density.csv" title="perf.py
+density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}
+
+The above figure demonstrates these costs based on four sample applications.
+This test is the result of running many instances of a container (50, or 5 in
+the case of redis), measuring available memory on the host before and
+afterwards, and dividing the difference by the number of containers. This
+technique is used in preference to the `usage_in_bytes` value of the container
+cgroup because we found that some container runtimes, other than `runc` and
+`runsc`, do not use an individual container cgroup.
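
The arithmetic behind this measurement can be sketched in Go. This is a minimal illustration, not the code used by the benchmark tools: it assumes that "available memory" is read from the `MemAvailable` field of `/proc/meminfo`, and the function names `memAvailableKB` and `perContainerKB` are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// memAvailableKB extracts the MemAvailable value (in kB) from text in the
// format of /proc/meminfo. Using MemAvailable is an assumption of this sketch.
func memAvailableKB(meminfo string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(meminfo))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("MemAvailable not found")
}

// perContainerKB divides the drop in available host memory by the number of
// containers that were started between the two samples.
func perContainerKB(beforeKB, afterKB uint64, containers int) (uint64, error) {
	if containers <= 0 {
		return 0, fmt.Errorf("container count must be positive, got %d", containers)
	}
	if afterKB > beforeKB {
		return 0, fmt.Errorf("available memory grew from %d to %d kB", beforeKB, afterKB)
	}
	return (beforeKB - afterKB) / uint64(containers), nil
}

func main() {
	// Made-up samples taken before and after starting 50 containers.
	before, _ := memAvailableKB("MemTotal: 16000000 kB\nMemAvailable: 12000000 kB\n")
	after, _ := memAvailableKB("MemTotal: 16000000 kB\nMemAvailable: 11000000 kB\n")
	overhead, err := perContainerKB(before, after, 50)
	if err != nil {
		panic(err)
	}
	fmt.Printf("per-container overhead: %d kB\n", overhead)
}
```

Because the figure is derived from host-wide memory, it is independent of how a given runtime organizes its cgroups, which is the property the paragraph above relies on.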
+
+The first application is an instance of `sleep`: a trivial application that does
+nothing. The second application is a synthetic `node` application which imports
+a number of modules and listens for requests. The third application is a similar
+synthetic `ruby` application which does the same. Finally, we include an
+instance of `redis` storing approximately 1GB of data. In all cases, the sandbox
+itself is responsible for a small, mostly fixed amount of memory overhead.
+
+## CPU performance
+
+gVisor does not perform emulation or otherwise interfere with the raw execution
+of CPU instructions by the application. Therefore, there is no runtime cost
+imposed for CPU operations.
+
+{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv"
+title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}
+
+The above figure demonstrates the `sysbench` measurement of CPU events per
+second. Events per second is based on a CPU-bound loop that calculates all prime
+numbers in a specified range. We note that `runsc` does not impose a performance
+penalty, as the code is executing natively in both cases.
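
For intuition, a CPU-bound loop of this kind can be sketched in Go as follows. This is an illustration in the spirit of the benchmark, not `sysbench`'s actual implementation; the name `countPrimes`, the range limit, and the event count are assumptions made for the example.

```go
package main

import (
	"fmt"
	"time"
)

// countPrimes counts the primes in [2, limit] by trial division. The loop
// performs no system calls, so it runs entirely as native user code.
func countPrimes(limit int) int {
	count := 0
	for c := 2; c <= limit; c++ {
		isPrime := true
		for d := 2; d*d <= c; d++ {
			if c%d == 0 {
				isPrime = false
				break
			}
		}
		if isPrime {
			count++
		}
	}
	return count
}

func main() {
	const limit = 10000 // hypothetical range per event
	const events = 200  // hypothetical number of events to time

	start := time.Now()
	for i := 0; i < events; i++ {
		countPrimes(limit)
	}
	elapsed := time.Since(start)
	fmt.Printf("%.1f events/s\n", float64(events)/elapsed.Seconds())
}
```

Since nothing in the loop crosses the sandbox boundary, there is no point at which the Sentry would be involved.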
+
+This has important consequences for classes of workloads that are often
+CPU-bound, such as data processing or machine learning. In these cases, `runsc`
+will similarly impose minimal runtime overhead.
+
+{% include graph.html id="tensorflow" url="/performance/tensorflow.csv"
+title="perf.py tensorflow --runtime=runc --runtime=runsc" %}
+
+For example, the above figure shows a sample TensorFlow workload, the
+[convolutional neural network example][cnn]. The time indicated includes the
+full start-up and run time for the workload, which trains a model.
+
+## System calls
+
+Some **structural costs** of gVisor are heavily influenced by the
+[platform choice](/docs/architecture_guide/platforms/), which implements system
+call interception. Today, gVisor supports a variety of platforms. These
+platforms present distinct performance, compatibility and security trade-offs.
+For example, the KVM platform has low overhead system call interception but runs
+poorly with nested virtualization.
+
+{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py
+syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100"
+log="true" %}
+
+The above figure demonstrates the time required for a raw system call on various
+platforms. The test is implemented by a custom binary which performs a large
+number of system calls and calculates the average time required.
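
A test of this shape can be sketched in Go. This is not the custom binary used for the figure; the choice of `getpid` as the system call and the name `averageSyscallTime` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// averageSyscallTime issues n getpid system calls and returns the mean
// wall-clock time per call. It returns 0 when n is not positive.
func averageSyscallTime(n int) time.Duration {
	if n <= 0 {
		return 0
	}
	start := time.Now()
	for i := 0; i < n; i++ {
		// getpid does almost no work in the kernel, so the measured time is
		// dominated by the cost of entering and leaving the kernel (or, in
		// a sandbox, of system call interception).
		syscall.Getpid()
	}
	return time.Since(start) / time.Duration(n)
}

func main() {
	fmt.Println("average time per getpid:", averageSyscallTime(1000000))
}
```

Running the same program under different runtimes gives a rough comparison of their interception costs.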
+
+This cost will principally impact applications that are system call bound, which
+tend to be high-performance data stores and static network services. In general,
+the impact of system call interception will be lower the more work an
+application does.
+
+{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py
+redis --runtime=runc --runtime=runsc" %}
+
+For example, `redis` is an application that performs relatively little work in
+userspace: in general it reads from a connected socket, reads or modifies some
+data, and writes a result back to the socket. The above figure shows the results
+of running a [comprehensive set of benchmarks][redis-benchmark]. We can see that
+small operations impose a large overhead, while larger operations, such as
+`LRANGE`, where more work is done in the application, have a smaller relative
+overhead.
+
+Some of these costs above are **structural costs**, and `redis` is likely to
+remain a challenging performance scenario. However, optimizing the
+[platform](/docs/architecture_guide/platforms/) will also have a dramatic
+impact.
+
+## Start-up time
+
+For many use cases, the ability to spin up containers quickly and efficiently is
+important. A sandbox may be short-lived and perform minimal user work (e.g. a
+function invocation).
+
+{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py
+startup --runtime=runc --runtime=runsc" %}
+
+The above figure indicates the total time required to start a container through
+[Docker][docker]. This benchmark uses three different applications. First, an
+Alpine Linux container that executes `true`. Second, a `node` application that
+loads a number of modules and binds an HTTP server. The time is measured by a
+successful request to the bound port. Finally, a `ruby` application that
+similarly loads a number of modules and binds an HTTP server.
+
+> Note: most of the time overhead above is associated with Docker itself. This is
+> evident with the empty `runc` benchmark. To avoid these costs with `runsc`,
+> you may also consider using `runsc do` mode or invoking the
+> [OCI runtime](../user_guide/quick_start/oci.md) directly.
+
+## Network
+
+Networking is mostly bound by **implementation costs**, and gVisor's network
+stack is improving quickly.
+
+While raw throughput is typically not an important metric in practice for
+common sandbox use cases, `iperf` is nevertheless a common microbenchmark used
+to measure it.
+
+{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py
+iperf --runtime=runc --runtime=runsc" %}
+
+The above figure shows the result of an `iperf` test between two instances. For
+the upload case, the specified runtime is used for the `iperf` client, and in
+the download case, the specified runtime is the server. A native runtime is
+always used for the other endpoint in the test.
+
+{% include graph.html id="applications" metric="requests_per_second"
+url="/performance/applications.csv" title="perf.py http.(node|ruby)
+--connections=25 --runtime=runc --runtime=runsc" %}
+
+The above figure shows the result of simple `node` and `ruby` web services that
+render a template upon receiving a request. Because these synthetic benchmarks
+do minimal work per request, much like the `redis` case, they suffer from high
+overheads. In practice, the more work an application does, the smaller the
+impact of **structural costs** becomes.
+
+## File system
+
+Some aspects of file system performance are also reflective of **implementation
+costs**, and this is an area where gVisor's implementation is improving quickly.
+
+In terms of raw disk I/O, gVisor does not introduce significant fundamental
+overhead. For general file operations, gVisor introduces a small fixed overhead
+for data that transitions across the sandbox boundary. This manifests as
+**structural costs** in some cases, since these operations must be routed
+through the [Gofer](../README.md#gofer) as a result of our
+[Security Model](/docs/architecture_guide/security/), but in most cases are
+dominated by **implementation costs**, due to an internal
+[Virtual File System][vfs] (VFS) implementation that needs improvement.
+
+{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio
+--engine=sync --runtime=runc --runtime=runsc" log="true" %}
+
+The above figure demonstrates the results of `fio` for reads and writes to and
+from the disk. In this case, the disk quickly becomes the bottleneck and
+dominates other costs.
+
+{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv"
+title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc"
+log="true" %}
+
+The above figure shows the raw I/O performance of using a `tmpfs` mount which is
+sandbox-internal in the case of `runsc`. Generally these operations are
+similarly bound to the cost of copying around data in-memory, and we don't see
+the cost of VFS operations.
+
+{% include graph.html id="httpd100k" metric="transfer_rate"
+url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1
+--connections=5 --connections=10 --connections=25 --runtime=runc
+--runtime=runsc" %}
+
+The high costs of VFS operations can manifest in benchmarks that execute many
+such operations in the hot path for serving requests, for example. The above
+figure shows the result of using gVisor to serve small pieces of static content
+with predictably poor results. This workload represents `apache` serving a
+single file sized 100k from the container image to a client running
+[ApacheBench][ab] with varying levels of concurrency. The high overhead comes
+principally from the VFS implementation that needs improvement, with several
+internal serialization points (since all requests are reading the same file).
+Note that some of the network stack performance issues also impact this
+benchmark.
+
+{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py
+media.ffmpeg --runtime=runc --runtime=runsc" %}
+
+For benchmarks that are bound by raw disk I/O and a mix of compute, file system
+operations are less of an issue. The above figure shows the total time required
+for an `ffmpeg` container to start, load and transcode a 27MB input video.
+
+[ab]: https://en.wikipedia.org/wiki/ApacheBench
+[benchmark-tools]: https://github.com/google/gvisor/tree/master/benchmarks
+[gce]: https://cloud.google.com/compute/
+[cnn]: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
+[docker]: https://docker.io
+[redis-benchmark]: https://redis.io/topics/benchmarks
+[vfs]: https://en.wikipedia.org/wiki/Virtual_file_system
diff --git a/g3doc/architecture_guide/platforms.md b/g3doc/architecture_guide/platforms.md
new file mode 100644
index 000000000..d112c9a28
--- /dev/null
+++ b/g3doc/architecture_guide/platforms.md
@@ -0,0 +1,61 @@
+# Platform Guide
+
+[TOC]
+
+gVisor requires a platform to implement interception of syscalls, basic context
+switching, and memory mapping functionality. Internally, gVisor uses an
+abstraction sensibly called [Platform][platform]. A simplified version of this
+interface looks like:
+
+```golang
+type Platform interface {
+ NewAddressSpace() (AddressSpace, error)
+ NewContext() Context
+}
+
+type Context interface {
+ Switch(as AddressSpace, ac arch.Context) (..., error)
+}
+
+type AddressSpace interface {
+ MapFile(addr usermem.Addr, f File, fr FileRange, at usermem.AccessType, ...) error
+ Unmap(addr usermem.Addr, length uint64)
+}
+```
+
+There are a number of different ways to implement this interface that come with
+various trade-offs, generally around performance and hardware requirements.
+
+## Implementations
+
+The choice of platform depends on the context in which `runsc` is executing. In
+general, virtualized environments may be limited to platforms that do not
+require hardware virtualization support (since the hardware is already in use):
+
+![Platforms](platforms.png "Platform examples.")
+
+### ptrace
+
+The ptrace platform uses [PTRACE_SYSEMU][ptrace] to execute user code without
+allowing it to execute host system calls. This platform can run anywhere that
+`ptrace` works (even VMs without nested virtualization), which is ubiquitous.
+
+Unfortunately, the ptrace platform has high context switch overhead, so system
+call-heavy applications may pay a [performance penalty](./performance.md).
+
+### KVM
+
+The KVM platform uses the kernel's [KVM][kvm] functionality to allow the Sentry
+to act as both guest OS and VMM. The KVM platform can run on bare-metal or in a
+VM with nested virtualization enabled. While there is no virtualized hardware
+layer -- the sandbox retains a process model -- gVisor leverages virtualization
+extensions available on modern processors in order to improve isolation and
+performance of address space switches.
+
+## Changing Platforms
+
+See [Changing Platforms](../user_guide/platforms.md).
+
+[kvm]: https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt
+[platform]: https://cs.opensource.google/gvisor/gvisor/+/release-20190304.1:pkg/sentry/platform/platform.go;l=33
+[ptrace]: http://man7.org/linux/man-pages/man2/ptrace.2.html
diff --git a/g3doc/architecture_guide/platforms.png b/g3doc/architecture_guide/platforms.png
new file mode 100644
index 000000000..005d56feb
--- /dev/null
+++ b/g3doc/architecture_guide/platforms.png
Binary files differ
diff --git a/g3doc/architecture_guide/platforms.svg b/g3doc/architecture_guide/platforms.svg
new file mode 100644
index 000000000..b0bac9ba7
--- /dev/null
+++ b/g3doc/architecture_guide/platforms.svg
@@ -0,0 +1,334 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:cc="http://creativecommons.org/ns#"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+ xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+ width="142.67763mm"
+ height="67.063133mm"
+ viewBox="0 0 142.67763 67.063134"
+ version="1.1"
+ id="svg8"
+ inkscape:export-filename="/home/ascannell/resources.png"
+ inkscape:export-xdpi="53.50127"
+ inkscape:export-ydpi="53.50127"
+ inkscape:version="0.92.4 (5da689c313, 2019-01-14)"
+ sodipodi:docname="platforms.svg">
+ <defs
+ id="defs2" />
+ <sodipodi:namedview
+ id="base"
+ pagecolor="#ffffff"
+ bordercolor="#666666"
+ borderopacity="1.0"
+ inkscape:pageopacity="0.0"
+ inkscape:pageshadow="2"
+ inkscape:zoom="0.98994949"
+ inkscape:cx="86.443612"
+ inkscape:cy="102.88104"
+ inkscape:document-units="mm"
+ inkscape:current-layer="layer1"
+ showgrid="false"
+ fit-margin-top="0"
+ fit-margin-left="0"
+ fit-margin-right="0"
+ fit-margin-bottom="0"
+ inkscape:window-width="1920"
+ inkscape:window-height="1005"
+ inkscape:window-x="0"
+ inkscape:window-y="0"
+ inkscape:window-maximized="1" />
+ <metadata
+ id="metadata5">
+ <rdf:RDF>
+ <cc:Work
+ rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type
+ rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+ <dc:title></dc:title>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g
+ inkscape:label="Layer 1"
+ inkscape:groupmode="layer"
+ id="layer1"
+ transform="translate(-36.081387,-98.953278)">
+ <rect
+ id="rect10"
+ width="33.408691"
+ height="33.408691"
+ x="36.081387"
+ y="120.06757"
+ style="fill:#44aa00;stroke-width:0.26458332" />
+ <rect
+ style="fill:#b3b3b3;stroke-width:0.23881446"
+ id="rect16"
+ width="142.45465"
+ height="10.423517"
+ x="36.08139"
+ y="155.5929" />
+ <rect
+ id="rect10-7"
+ width="30.52453"
+ height="18.976137"
+ x="37.416695"
+ y="121.65508"
+ style="fill:#ff8080;stroke-width:0.19060372" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="41.03727"
+ y="148.58765"
+ id="text65"><tspan
+ sodipodi:role="line"
+ id="tspan63"
+ x="41.03727"
+ y="148.58765"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="45.473087"
+ y="132.50232"
+ id="text123"><tspan
+ sodipodi:role="line"
+ id="tspan121"
+ x="45.473087"
+ y="132.50232"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:6.43922186px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.16098055"
+ x="97.768547"
+ y="163.15665"
+ id="text163"><tspan
+ sodipodi:role="line"
+ id="tspan161"
+ x="97.768547"
+ y="163.15665"
+ style="stroke-width:0.16098055">host</tspan></text>
+ <rect
+ style="fill:#e9afdd;stroke-width:0.39185274"
+ id="rect16-7"
+ width="72.9646"
+ height="54.79026"
+ x="105.79441"
+ y="98.953278" />
+ <rect
+ id="rect10-5"
+ width="33.408691"
+ height="33.408691"
+ x="108.24348"
+ y="100.53072"
+ style="fill:#44aa00;stroke-width:0.26458332" />
+ <rect
+ id="rect10-7-6"
+ width="30.52453"
+ height="20.045216"
+ x="109.57877"
+ y="102.11823"
+ style="fill:#ff8080;stroke-width:0.19589928" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="112.86765"
+ y="129.01863"
+ id="text65-2"><tspan
+ sodipodi:role="line"
+ id="tspan63-9"
+ x="112.86765"
+ y="129.01863"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="117.63519"
+ y="114.02371"
+ id="text123-1"><tspan
+ sodipodi:role="line"
+ id="tspan121-2"
+ x="117.63519"
+ y="114.02371"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <rect
+ id="rect10-7-7"
+ width="11.815663"
+ height="8.0126781"
+ x="54.538059"
+ y="143.27702"
+ style="fill:#aaccff;stroke-width:0.07705856" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:4.35074377px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.10876859"
+ x="55.931114"
+ y="148.90578"
+ id="text144"><tspan
+ sodipodi:role="line"
+ id="tspan142"
+ x="55.931114"
+ y="148.90578"
+ style="stroke-width:0.10876859">KVM</tspan></text>
+ <rect
+ id="rect10-6"
+ width="33.408691"
+ height="33.408691"
+ x="71.044685"
+ y="119.73112"
+ style="fill:#44aa00;stroke-width:0.26458332" />
+ <rect
+ id="rect10-7-0"
+ width="30.52453"
+ height="18.976137"
+ x="72.37999"
+ y="121.31865"
+ style="fill:#ff8080;stroke-width:0.19060372" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="76.000565"
+ y="148.25128"
+ id="text65-6"><tspan
+ sodipodi:role="line"
+ id="tspan63-2"
+ x="76.000565"
+ y="148.25128"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="80.436386"
+ y="132.16595"
+ id="text123-6"><tspan
+ sodipodi:role="line"
+ id="tspan121-1"
+ x="80.436386"
+ y="132.16595"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <rect
+ id="rect10-7-7-8"
+ width="11.815664"
+ height="8.0126781"
+ x="89.501358"
+ y="142.94067"
+ style="fill:#ffeeaa;stroke-width:0.07705856" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.39456654px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08486416"
+ x="89.92292"
+ y="147.89806"
+ id="text144-7"><tspan
+ sodipodi:role="line"
+ id="tspan142-9"
+ x="89.92292"
+ y="147.89806"
+ style="stroke-width:0.08486416">ptrace</tspan></text>
+ <rect
+ id="rect10-7-7-8-3"
+ width="11.815665"
+ height="8.0126781"
+ x="127.08897"
+ y="123.97878"
+ style="fill:#ffeeaa;stroke-width:0.07705856" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.39456654px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08486416"
+ x="127.51052"
+ y="128.9362"
+ id="text144-7-7"><tspan
+ sodipodi:role="line"
+ id="tspan142-9-5"
+ x="127.51052"
+ y="128.9362"
+ style="stroke-width:0.08486416">ptrace</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:5.45061255px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.13626531"
+ x="138.49318"
+ y="152.11841"
+ id="text229"><tspan
+ sodipodi:role="line"
+ id="tspan227"
+ x="138.49318"
+ y="152.11841"
+ style="stroke-width:0.13626531">VM</tspan></text>
+ <rect
+ style="fill:#b3b3b3;stroke-width:0.16518368"
+ id="rect16-9"
+ width="68.15374"
+ height="10.423517"
+ x="108.24348"
+ y="134.99774" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:6.17854786px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.15446369"
+ x="132.91473"
+ y="142.07658"
+ id="text248"><tspan
+ sodipodi:role="line"
+ id="tspan246"
+ x="132.91473"
+ y="142.07658"
+ style="stroke-width:0.15446369">guest</tspan></text>
+ <rect
+ id="rect10-5-2"
+ width="33.408691"
+ height="33.408691"
+ x="143.32402"
+ y="100.35877"
+ style="fill:#44aa00;stroke-width:0.26458332" />
+ <rect
+ id="rect10-7-6-2"
+ width="30.52453"
+ height="20.045216"
+ x="144.65933"
+ y="101.94627"
+ style="fill:#ff8080;stroke-width:0.19589929" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="147.94815"
+ y="128.84665"
+ id="text65-2-8"><tspan
+ sodipodi:role="line"
+ id="tspan63-9-9"
+ x="147.94815"
+ y="128.84665"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="152.71565"
+ y="113.85176"
+ id="text123-1-7"><tspan
+ sodipodi:role="line"
+ id="tspan121-2-3"
+ x="152.71565"
+ y="113.85176"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <rect
+ id="rect10-7-7-8-3-6"
+ width="11.815666"
+ height="8.0126781"
+ x="162.16933"
+ y="123.80682"
+ style="fill:#ffeeaa;stroke-width:0.07705856" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.39456654px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08486416"
+ x="162.59088"
+ y="128.76421"
+ id="text144-7-7-1"><tspan
+ sodipodi:role="line"
+ id="tspan142-9-5-2"
+ x="162.59088"
+ y="128.76421"
+ style="stroke-width:0.08486416">ptrace</tspan></text>
+ </g>
+</svg>
diff --git a/g3doc/architecture_guide/resources.md b/g3doc/architecture_guide/resources.md
new file mode 100644
index 000000000..1dec37bd1
--- /dev/null
+++ b/g3doc/architecture_guide/resources.md
@@ -0,0 +1,144 @@
+# Resource Model
+
+[TOC]
+
+The resource model for gVisor does not assume a fixed number of threads of
+execution (i.e. vCPUs) or amount of physical memory. Where possible, decisions
+about underlying physical resources are delegated to the host system, where
+optimizations can be made with global information. This delegation allows the
+sandbox to be highly dynamic in terms of resource usage: spanning a large number
+of cores and large amount of memory when busy, and yielding those resources back
+to the host when not.
+
+In other words, the shape of the sandbox should closely track the shape of the
+sandboxed process:
+
+![Resource model](resources.png "Workloads of different shapes.")
+
+## Processes
+
+Much like a Virtual Machine (VM), a gVisor sandbox appears as an opaque process
+on the system. Processes within the sandbox do not manifest as processes on the
+host system, and process-level interactions within the sandbox require entering
+the sandbox (e.g. via a [Docker exec][exec]).
+
+## Networking
+
+The sandbox attaches a network endpoint to the system, but runs its own network
+stack. All network resources, other than packets in flight on the host, exist
+only inside the sandbox, bound by relevant resource limits.
+
+You can interact with network endpoints exposed by the sandbox, just as you
+would any other container, but network introspection similarly requires entering
+the sandbox.
+
+## Files
+
+Files in the sandbox may be backed by different implementations. For host-native
+files (where a file descriptor is available), the Gofer may return a file
+descriptor to the Sentry via [SCM_RIGHTS][scmrights][^1].
+
+These files may be read from and written to through standard system calls, and
+also mapped into the associated application's address space. This allows the
+same host memory to be shared across multiple sandboxes, although this mechanism
+does not preclude the use of side-channels (see [Security Model](./security.md)).
+
+Note that some file systems exist only within the context of the sandbox. For
+example, in many cases a `tmpfs` mount will be available at `/tmp` or
+`/dev/shm`, which allocates memory directly from the sandbox memory file (see
+below). Ultimately, these will be accounted against relevant limits in a similar
+way to the host-native case.
+
+## Threads
+
+The Sentry models individual task threads with [goroutines][goroutine]. As a
+result, each task thread is a lightweight [green thread][greenthread], and may
+not correspond to an underlying host thread.
+
+However, application execution is modelled as a blocking system call with the
+Sentry. This means that additional host threads may be created, *depending on
+the number of active application threads*. In practice, a busy application will
+converge on the number of active threads, and the host will be able to make
+scheduling decisions about all application threads.
+
+## Time
+
+Time in the sandbox is provided by the Sentry, through its own [vDSO][vdso] and
+time-keeping implementation. This is distinct from the host time, and no state
+is shared with the host, although the time will be initialized with the host
+clock.
+
+The Sentry runs timers to note the passage of time, much like a kernel running
+on hardware (though the timers are software timers, in this case). These timers
+provide updates to the vDSO, the time returned through system calls, and the
+time recorded for usage or limit tracking (e.g. [RLIMIT_CPU][rlimit]).
+
+When all application threads are idle, the Sentry disables timers until an event
+occurs that wakes either the Sentry or an application thread, similar to a
+[tickless kernel][tickless]. This allows the Sentry to achieve near zero CPU
+usage for idle applications.
+
+## Memory
+
+The Sentry implements its own memory management, including demand-paging and a
+Sentry internal page cache for files that cannot be used natively. A single
+[memfd][memfd] backs all application memory.
+
+### Address spaces
+
+The creation of address spaces is platform-specific. For some platforms,
+additional "stub" processes may be created on the host in order to support
+additional address spaces. These stubs are subject to various limits applied at
+the sandbox level (e.g. PID limits).
+
+### Physical memory
+
+The host is able to manage physical memory using regular means (e.g. tracking
+working sets, reclaiming and swapping under pressure). The Sentry lazily
+populates host mappings for applications, and allows the host to demand-page
+those regions, which is critical for the functioning of those mechanisms.
+
+In order to avoid excessive overhead, the Sentry does not demand-page individual
+pages. Instead, it selects appropriate regions based on heuristics. There is a
+trade-off here: the Sentry is unable to trivially determine which pages are
+active and which are not. Even if pages were individually faulted, the host may
+select pages to be reclaimed or swapped without the Sentry's knowledge.
+
+Therefore, memory usage statistics within the sandbox (e.g. via `proc`) are
+approximations. The Sentry maintains an internal breakdown of memory usage, and
+can collect accurate information but only through a relatively expensive API
+call. In any case, it would likely be considered unwise to share precise
+information about how the host is managing memory with the sandbox.
+
+Finally, when an application marks a region of memory as no longer needed, for
+example via a call to [madvise][madvise], the Sentry *releases this memory back
+to the host*. There can be performance penalties for this, since it may be
+cheaper in many cases to retain the memory and use it to satisfy some other
+request. However, releasing it immediately to the host allows the host to more
+effectively multiplex resources and apply an efficient global policy.
+
+## Limits
+
+All Sentry threads and Sentry memory are subject to a container cgroup. However,
+application usage will not appear as anonymous memory usage, and will instead be
+accounted to the `memfd`. All anonymous memory will correspond to Sentry usage,
+and host memory charged to the container will work as standard.
+
+The cgroups can be monitored for standard signals: pressure indicators,
+threshold notifiers, etc. and can also be adjusted dynamically. Note that the
+Sentry itself may listen for pressure signals in its containing cgroup, in order
+to purge internal caches.
+
+[goroutine]: https://tour.golang.org/concurrency/1
+[greenthread]: https://en.wikipedia.org/wiki/Green_threads
+[scheduler]: https://morsmachine.dk/go-scheduler
+[vdso]: https://en.wikipedia.org/wiki/VDSO
+[rlimit]: http://man7.org/linux/man-pages/man2/getrlimit.2.html
+[tickless]: https://en.wikipedia.org/wiki/Tickless_kernel
+[memfd]: http://man7.org/linux/man-pages/man2/memfd_create.2.html
+[scmrights]: http://man7.org/linux/man-pages/man7/unix.7.html
+[madvise]: http://man7.org/linux/man-pages/man2/madvise.2.html
+[exec]: https://docs.docker.com/engine/reference/commandline/exec/
+[^1]: Unless host networking is enabled, the Sentry is not able to create or
+ open host file descriptors itself; it can only receive them in this way
+ from the Gofer.
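Note on the "Physical memory" section of resources.md above: the text says that when an application marks memory as no longer needed (for example via madvise), the Sentry releases it back to the host. The Go sketch below is not the Sentry's implementation (the document says application memory is backed by a memfd, and that mechanism is not shown in this diff). It only illustrates the analogous host primitive on Linux for a private anonymous mapping, where pages dropped with `MADV_DONTNEED` read back as zero on the next access. The function name `releaseDemo` is hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// releaseDemo maps one anonymous page, dirties it, and then advises the host
// kernel that the contents are no longer needed. It returns the first byte
// of the page before and after the madvise call. Linux-only sketch.
func releaseDemo() (byte, byte, error) {
	mem, err := syscall.Mmap(-1, 0, syscall.Getpagesize(),
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Munmap(mem)

	// Touching the page forces the host to back it with physical memory.
	mem[0] = 0xAB
	before := mem[0]

	// Tell the host the page contents can be dropped. The mapping stays
	// valid; the next access is demand-paged again by the host.
	if err := syscall.Madvise(mem, syscall.MADV_DONTNEED); err != nil {
		return before, 0, err
	}
	return before, mem[0], nil
}

func main() {
	before, after, err := releaseDemo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("before=%#x after=%#x\n", before, after)
}
```

The trade-off described in the document is visible here in miniature: the page must be faulted in again on the next access, but the host is free to reuse the physical memory in the meantime.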
diff --git a/g3doc/architecture_guide/resources.png b/g3doc/architecture_guide/resources.png
new file mode 100644
index 000000000..f715008ec
--- /dev/null
+++ b/g3doc/architecture_guide/resources.png
Binary files differ
diff --git a/g3doc/architecture_guide/resources.svg b/g3doc/architecture_guide/resources.svg
new file mode 100644
index 000000000..fd7805d90
--- /dev/null
+++ b/g3doc/architecture_guide/resources.svg
@@ -0,0 +1,208 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:cc="http://creativecommons.org/ns#"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+ xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+ width="108.24417mm"
+ height="47.513165mm"
+ viewBox="0 0 108.24417 47.513165"
+ version="1.1"
+ id="svg8"
+ inkscape:export-filename="/home/ascannell/resources.png"
+ inkscape:export-xdpi="53.50127"
+ inkscape:export-ydpi="53.50127"
+ inkscape:version="0.92.4 (5da689c313, 2019-01-14)"
+ sodipodi:docname="resources.svg">
+ <defs
+ id="defs2" />
+ <sodipodi:namedview
+ id="base"
+ pagecolor="#ffffff"
+ bordercolor="#666666"
+ borderopacity="1.0"
+ inkscape:pageopacity="0.0"
+ inkscape:pageshadow="2"
+ inkscape:zoom="0.98994949"
+ inkscape:cx="16.897058"
+ inkscape:cy="41.261746"
+ inkscape:document-units="mm"
+ inkscape:current-layer="layer1"
+ showgrid="false"
+ fit-margin-top="0"
+ fit-margin-left="0"
+ fit-margin-right="0"
+ fit-margin-bottom="0"
+ inkscape:window-width="1920"
+ inkscape:window-height="1005"
+ inkscape:window-x="0"
+ inkscape:window-y="0"
+ inkscape:window-maximized="1" />
+ <metadata
+ id="metadata5">
+ <rdf:RDF>
+ <cc:Work
+ rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type
+ rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+ <dc:title></dc:title>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g
+ inkscape:label="Layer 1"
+ inkscape:groupmode="layer"
+ id="layer1"
+ transform="translate(-36.081387,-118.50325)">
+ <rect
+ id="rect10"
+ width="33.408691"
+ height="33.408691"
+ x="36.081387"
+ y="120.06757"
+ style="fill:#44aa00;stroke-width:0.26458332" />
+ <circle
+ style="fill:#44aa00;stroke-width:0.21849461"
+ id="path12"
+ cx="87.958534"
+ cy="136.63828"
+ r="17.105247" />
+ <path
+ sodipodi:type="star"
+ style="fill:#44aa00;stroke-width:0.26458332"
+ id="path14"
+ sodipodi:sides="3"
+ sodipodi:cx="124.13387"
+ sodipodi:cy="141.81859"
+ sodipodi:r1="23.31534"
+ sodipodi:r2="11.65767"
+ sodipodi:arg1="0.52359878"
+ sodipodi:arg2="1.5707963"
+ inkscape:flatsided="false"
+ inkscape:rounded="0"
+ inkscape:randomized="0"
+ d="m 144.32555,153.47626 -20.19168,0 -20.19167,0 10.09583,-17.48651 10.09584,-17.4865 10.09584,17.4865 z"
+ inkscape:transform-center-x="1.8384776e-06"
+ inkscape:transform-center-y="-5.8288369" />
+ <rect
+ style="fill:#b3b3b3;stroke-width:0.20817307"
+ id="rect16"
+ width="108.24416"
+ height="10.423517"
+ x="36.08139"
+ y="155.5929" />
+ <path
+ sodipodi:type="star"
+ style="fill:#ff8080;stroke-width:0.20018946"
+ id="path14-3"
+ sodipodi:sides="3"
+ sodipodi:cx="124.13387"
+ sodipodi:cy="139.31911"
+ sodipodi:r1="17.640888"
+ sodipodi:r2="8.8204451"
+ sodipodi:arg1="0.52359878"
+ sodipodi:arg2="1.5707963"
+ inkscape:flatsided="false"
+ inkscape:rounded="0"
+ inkscape:randomized="0"
+ d="m 139.41133,148.13955 -15.27746,0 -15.27745,0 7.63872,-13.23067 7.63873,-13.23066 7.63873,13.23066 z"
+ inkscape:transform-center-x="3.9117172e-06"
+ inkscape:transform-center-y="-4.4102243" />
+ <circle
+ style="fill:#ff8080;stroke-width:0.18094084"
+ id="path12-6"
+ cx="87.93705"
+ cy="134.75125"
+ r="14.165282" />
+ <rect
+ id="rect10-7"
+ width="30.52453"
+ height="25.657875"
+ x="37.416695"
+ y="121.65508"
+ style="fill:#ff8080;stroke-width:0.22163473" />
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="47.387276"
+ y="151.7626"
+ id="text65"><tspan
+ sodipodi:role="line"
+ id="tspan63"
+ x="47.387276"
+ y="151.7626"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="82.156319"
+ y="151.71547"
+ id="text65-5"><tspan
+ sodipodi:role="line"
+ id="tspan63-3"
+ x="82.156319"
+ y="151.71547"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.40292525px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08507314"
+ x="118.66879"
+ y="151.71547"
+ id="text65-5-5"><tspan
+ sodipodi:role="line"
+ id="tspan63-3-6"
+ x="118.66879"
+ y="151.71547"
+ style="stroke-width:0.08507314">gVisor</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="45.473087"
+ y="136.20644"
+ id="text123"><tspan
+ sodipodi:role="line"
+ id="tspan121"
+ x="45.473087"
+ y="136.20644"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="80.153076"
+ y="136.00925"
+ id="text123-1"><tspan
+ sodipodi:role="line"
+ id="tspan121-2"
+ x="80.153076"
+ y="136.00925"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:3.33113885px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.08327847"
+ x="116.50173"
+ y="138.68195"
+ id="text123-1-7"><tspan
+ sodipodi:role="line"
+ id="tspan121-2-0"
+ x="116.50173"
+ y="138.68195"
+ style="stroke-width:0.08327847">workload</tspan></text>
+ <text
+ xml:space="preserve"
+ style="font-style:normal;font-weight:normal;font-size:6.43922186px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.16098055"
+ x="81.893562"
+ y="163.15665"
+ id="text163"><tspan
+ sodipodi:role="line"
+ id="tspan161"
+ x="81.893562"
+ y="163.15665"
+ style="stroke-width:0.16098055">host</tspan></text>
+ </g>
+</svg>
diff --git a/g3doc/architecture_guide/security.md b/g3doc/architecture_guide/security.md
new file mode 100644
index 000000000..b99b86332
--- /dev/null
+++ b/g3doc/architecture_guide/security.md
@@ -0,0 +1,255 @@
+# Security Model
+
+[TOC]
+
+gVisor was created in order to provide additional defense against the
+exploitation of kernel bugs by untrusted userspace code. In order to understand
+how gVisor achieves this goal, it is first necessary to understand the basic
+threat model.
+
+## Threats: The Anatomy of an Exploit
+
+An exploit takes advantage of a software or hardware bug in order to escalate
+privileges, gain access to privileged data, or disrupt services. All of the
+possible interactions that a malicious application can have with the rest of the
+system (attack vectors) define the attack surface. We categorize these attack
+vectors into several common classes.
+
+### System API
+
+An operating system or hypervisor exposes an abstract System API in the form of
+system calls and traps. This API may be documented and stable, as with Linux, or
+it may be abstracted behind a library, as with Windows (i.e. win32.dll or
+ntdll.dll). The System API includes all standard interfaces that application
+code uses to interact with the system. This includes high-level abstractions
+that are derived from low-level system calls, such as system files, sockets and
+namespaces.
+
+Although the System API is exposed to applications by design, bugs and race
+conditions within the kernel or hypervisor may occasionally be exploitable via
+the API. This is common in part due to the fact that most kernels and
+hypervisors are written in [C][clang], which is well-suited to interfacing with
+hardware but often prone to security issues. In order to exploit these issues, a
+typical attack might involve some combination of the following:
+
+1. Opening or creating some combination of files, sockets or other descriptors.
+1. Passing crafted, malicious arguments, structures or packets.
+1. Racing with multiple threads in order to hit specific code paths.
+
+For example, for the [Dirty Cow][dirtycow] privilege escalation bug, an
+application would open a specific file in `/proc` or use a specific `ptrace`
+system call, and use multiple threads in order to trigger a race condition when
+touching a fresh page of memory. The attacker then gains control over a page of
+memory belonging to the system. With additional privileges or access to
+privileged data in the kernel, an attacker will often be able to employ
+additional techniques to gain full access to the rest of the system.
+
+While bugs in the implementation of the System API are readily fixed, they are
+also the most common form of exploit. The exposure created by this class of
+exploit is what gVisor aims to minimize and control, described in detail below.
+
+### System ABI
+
+Hardware and software exploits occasionally exist in execution paths that are
+not part of an intended System API. In this case, exploits may be found as part
+of implicit actions the hardware or privileged system code takes in response to
+certain events, such as traps or interrupts. For example, the recent
+[POPSS][popss] flaw required only native code execution (no specific system call
+or file access). In that case, the Xen hypervisor was similarly vulnerable,
+highlighting that hypervisors are not immune to this vector.
+
+### Side Channels
+
+Hardware side channels may be exploitable by any code running on a system:
+native, sandboxed, or virtualized. However, many host-level mitigations against
+hardware side channels are still effective with a sandbox. For example, kernels
+built with retpoline protect against some speculative execution attacks
+(Spectre) and frame poisoning may protect against L1 terminal fault (L1TF)
+attacks. Hypervisors may introduce additional complications in this regard, as
+there is no mitigation against an application in a normally functioning Virtual
+Machine (VM) exploiting the L1TF vulnerability for another VM on the sibling
+hyperthread.
+
+### Other Vectors
+
+The above categories in no way represent an exhaustive list of exploits, as we
+focus only on running untrusted code from within the operating system or
+hypervisor. We do not consider other ways that a more generic adversary may
+interact with a system, such as inserting a portable storage device with a
+malicious filesystem image, using a combination of crafted keyboard or touch
+inputs, or saturating a network device with ill-formed packets.
+
+Furthermore, high-level systems may contain exploitable components. An attacker
+need not escalate privileges within a container if there’s an exploitable
+network-accessible service on the host or some other API path. *A sandbox is not
+a substitute for a secure architecture*.
+
+## Goals: Limiting Exposure
+
+![Threat model](security.png "Threat model.")
+
+gVisor’s primary design goal is to minimize the System API attack vector through
+multiple layers of defense, while still providing a process model. There are two
+primary security principles that inform this design. First, the application’s
+direct interactions with the host System API are intercepted by the Sentry,
+which implements the System API instead. Second, the System API accessible to
+the Sentry itself is minimized to a safer, restricted set. The first principle
+minimizes the possibility of direct exploitation of the host System API by
+applications, and the second principle minimizes indirect exploitability, which
+is the exploitation by an exploited or buggy Sentry (e.g. chaining an exploit).
+
+The first principle is similar to the security basis for a Virtual Machine (VM).
+With a VM, an application’s interactions with the host are replaced by
+interactions with a guest operating system and a set of virtualized hardware
+devices. These hardware devices are then implemented via the host System API by
+a Virtual Machine Monitor (VMM). The Sentry similarly prevents direct
+interactions by providing its own implementation of the System API that the
+application must interact with. Applications are not able to directly craft
+specific arguments or flags for the host System API, or interact directly with
+host primitives.
+
+For both the Sentry and a VMM, it’s worth noting that while direct interactions
+are not possible, indirect interactions are still possible. For example, a read
+on a host-backed file in the Sentry may ultimately result in a host read system
+call (made by the Sentry, not by passing through arguments from the
+application), similar to how a read on a block device in a VM may result in the
+VMM issuing a corresponding host read system call from a backing file.
+
+An important distinction from a VM is that the Sentry implements a System API
+based directly on host System API primitives instead of relying on virtualized
+hardware and a guest operating system. This selects a distinct set of
+trade-offs, largely in the performance, efficiency and compatibility domains.
+Since transitions in and out of the sandbox are relatively expensive, a guest
+operating system will typically take ownership of resources. For example, in the
+above case, the guest operating system may read the block device data in a local
+page cache, to avoid subsequent reads. This may lead to better performance but
+lower efficiency, since memory may be wasted or duplicated. The Sentry opts
+instead to defer to the host for many operations during runtime, for improved
+efficiency but lower performance in some use cases.
+
+### What can a sandbox do?
+
+An application in a gVisor sandbox is permitted to do most things a standard
+container can do: for example, applications can read and write files mapped
+within the container, make network connections, etc. As described above,
+gVisor's primary goal is to limit exposure to bugs and exploits while still
+allowing most applications to run. Even so, gVisor will limit some operations
+that might be permitted with a standard container. Even with appropriate
+capabilities, a user in a gVisor sandbox will only be able to manipulate
+virtualized system resources (e.g. the system time, kernel settings or
+filesystem attributes) and not underlying host system resources.
+
+While the sandbox virtualizes many operations for the application, we limit the
+sandbox's own interactions with the host to the following high-level operations:
+
+1. Communicate with a Gofer process via a connected socket. The sandbox may
+ receive new file descriptors from the Gofer process, corresponding to opened
+ files. These files can then be read from and written to by the sandbox.
+1. Make a minimal set of host system calls. The calls do not include the
+ creation of new sockets (unless host networking mode is enabled) or opening
+ files. The calls include duplication and closing of file descriptors,
+ synchronization, timers and signal management.
+1. Read and write packets to a virtual ethernet device. This is not required if
+ host networking is enabled (or networking is disabled).
+
+### System ABI, Side Channels and Other Vectors
+
+gVisor relies on the host operating system and the platform for defense against
+hardware-based attacks. Given the nature of these vulnerabilities, there is
+little defense that gVisor can provide (there’s no guarantee that additional
+hardware measures, such as virtualization, memory encryption, etc. would
+actually decrease the attack surface). Note that this is true even when using
+hardware virtualization for acceleration, as the host kernel or hypervisor is
+ultimately responsible for defending against attacks from within malicious
+guests.
+
+gVisor similarly relies on the host resource mechanisms (cgroups) for defense
+against resource exhaustion and denial of service attacks. Network policy
+controls should be applied at the container level to ensure appropriate network
+policy enforcement. Note that the sandbox itself is not capable of altering or
+configuring these mechanisms, and the sandbox itself should make an attacker
+less likely to exploit or override these controls through other means.
+
+## Principles: Defense-in-Depth
+
+For gVisor development, there are several engineering principles that are
+employed in order to ensure that the system meets its design goals.
+
+1. No system call is passed through directly to the host. Every supported call
+ has an independent implementation in the Sentry that is unlikely to suffer
+ from identical vulnerabilities that may appear in the host. This has the
+ consequence that all kernel features used by applications require an
+ implementation within the Sentry.
+1. Only common, universal functionality is implemented. Some filesystems,
+ network devices or modules may expose specialized functionality to user
+ space applications via mechanisms such as extended attributes, raw sockets
+ or ioctls. Since the Sentry is responsible for implementing the full system
+ call surface, we do not implement or pass through these specialized APIs.
+1. The host surface exposed to the Sentry is minimized. While the system call
+ surface is not trivial, it is explicitly enumerated and controlled. The
+ Sentry is not permitted to open new files, create new sockets or do many
+ other interesting things on the host.
+
+Additionally, we have practical restrictions that are imposed on the project to
+minimize the risk of Sentry exploitability. For example:
+
+1. Unsafe code is carefully controlled. All unsafe code is isolated in files
+ that end with "unsafe.go", in order to facilitate validation and auditing.
+ No file without the unsafe suffix may import the unsafe package.
+1. No CGo is allowed. The Sentry must be a pure Go binary.
+1. External imports are not generally allowed within the core packages. Only
+ limited external imports are used within the setup code. The code available
+ inside the Sentry is carefully controlled, to ensure that the above rules
+ are effective.
+
+Finally, we recognize that security is a process, and that vigilance is
+critical. Beyond our security disclosure process, the Sentry is fuzzed
+continuously to identify potential bugs and races proactively, and production
+crashes are recorded and triaged to similarly identify material issues.
+
+## FAQ
+
+### Is this more or less secure than a Virtual Machine?
+
+The security of a VM depends to a large extent on what is exposed from the host
+kernel and userspace support code. For example, device emulation code in the
+host kernel (e.g. APIC) or optimizations (e.g. vhost) can be more complex than a
+simple system call, and exploits carry the same risks. Similarly, the userspace
+support code is frequently unsandboxed, and exploits, while rare, may allow
+unfettered access to the system.
+
+Some platforms leverage the same virtualization hardware as VMs in order to
+provide better system call interception performance. However, gVisor does not
+implement any device emulation, and instead opts to use a sandboxed host System
+API directly. Both approaches significantly reduce the original attack surface.
+Ultimately, since gVisor is capable of using the same hardware mechanism, one
+should not assume that the mere use of virtualization hardware makes a system
+more or less secure, just as it would be a mistake to make the claim that the
+use of a unibody alone makes a car safe.
+
+### Does this stop hardware side channels?
+
+In general, gVisor does not provide protection against hardware side channels,
+although it may make exploits that rely on direct access to the host System API
+more difficult to use. To minimize exposure, you should follow relevant guidance
+from vendors and keep your host kernel and firmware up-to-date.
+
+### Is this just a ptrace sandbox?
+
+No: the term “ptrace sandbox” generally refers to software that uses the Linux
+ptrace facility to inspect and authorize system calls made by applications,
+enforcing a specific policy. These commonly suffer from two issues. First,
+vulnerable system calls may be authorized by the sandbox, as the application
+still has direct access to some System API. Second, it’s impossible to avoid
+time-of-check, time-of-use race conditions without disabling multi-threading.
+
+In gVisor, the platforms that use ptrace operate differently. The stubs that are
+traced are never allowed to continue execution into the host kernel and complete
+a call directly. Instead, all system calls are interpreted and handled by the
+Sentry itself, which reflects resulting register state back into the tracee before
+continuing execution in userspace. This is very similar to the mechanism used by
+User-Mode Linux (UML).
+
+[dirtycow]: https://en.wikipedia.org/wiki/Dirty_COW
+[clang]: https://en.wikipedia.org/wiki/C_(programming_language)
+[popss]: https://nvd.nist.gov/vuln/detail/CVE-2018-8897
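Note on the "Principles: Defense-in-Depth" section of security.md above: the document states that no file without the "unsafe.go" suffix may import the unsafe package. How the project actually enforces this is not shown in this diff. The Go sketch below is one way such a check could be written with the standard `go/parser` package, under that assumption. The function names `importsUnsafe` and `violatesUnsafeRule` are hypothetical.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// importsUnsafe reports whether the given Go source imports package unsafe.
// Only the import declarations are parsed.
func importsUnsafe(filename, src string) (bool, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, filename, src, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range f.Imports {
		path, err := strconv.Unquote(imp.Path.Value)
		if err != nil {
			return false, err
		}
		if path == "unsafe" {
			return true, nil
		}
	}
	return false, nil
}

// violatesUnsafeRule reports whether a file imports unsafe without carrying
// the "unsafe.go" filename suffix described in the document.
func violatesUnsafeRule(filename, src string) (bool, error) {
	uses, err := importsUnsafe(filename, src)
	if err != nil {
		return false, err
	}
	return uses && !strings.HasSuffix(filename, "unsafe.go"), nil
}

func main() {
	src := "package demo\n\nimport \"unsafe\"\n\nvar _ = unsafe.Sizeof(0)\n"
	for _, name := range []string{"mem.go", "mem_unsafe.go"} {
		bad, err := violatesUnsafeRule(name, src)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s violates rule: %v\n", name, bad)
	}
}
```

Keying the rule on the filename keeps the audit surface easy to enumerate: a reviewer only needs to list the files matching the suffix to find every use of unsafe.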
diff --git a/g3doc/architecture_guide/security.png b/g3doc/architecture_guide/security.png
new file mode 100644
index 000000000..c29befbf6
--- /dev/null
+++ b/g3doc/architecture_guide/security.png
Binary files differ
diff --git a/g3doc/architecture_guide/security.svg b/g3doc/architecture_guide/security.svg
new file mode 100644
index 000000000..0575e2dec
--- /dev/null
+++ b/g3doc/architecture_guide/security.svg
@@ -0,0 +1,153 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+ xmlns:dc="http://purl.org/dc/elements/1.1/"
+ xmlns:cc="http://creativecommons.org/ns#"
+ xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+ xmlns:svg="http://www.w3.org/2000/svg"
+ xmlns="http://www.w3.org/2000/svg"
+ xmlns:sodipodi="http://sodipodi.sourceforge.net/DTD/sodipodi-0.dtd"
+ xmlns:inkscape="http://www.inkscape.org/namespaces/inkscape"
+ width="92.963379mm"
+ height="107.18885mm"
+ viewBox="0 0 92.963379 107.18885"
+ version="1.1"
+ id="svg8"
+ inkscape:version="0.92.4 (5da689c313, 2019-01-14)"
+ sodipodi:docname="defense.svg">
+ <defs
+ id="defs2" />
+ <sodipodi:namedview
+ id="base"
+ pagecolor="#ffffff"
+ bordercolor="#666666"
+ borderopacity="1.0"
+ inkscape:pageopacity="0.0"
+ inkscape:pageshadow="2"
+ inkscape:zoom="0.98994949"
+ inkscape:cx="-242.99254"
+ inkscape:cy="136.90181"
+ inkscape:document-units="mm"
+ inkscape:current-layer="layer4"
+ showgrid="false"
+ inkscape:object-nodes="true"
+ inkscape:window-width="1920"
+ inkscape:window-height="1005"
+ inkscape:window-x="0"
+ inkscape:window-y="0"
+ inkscape:window-maximized="1"
+ fit-margin-top="0"
+ fit-margin-left="0"
+ fit-margin-right="0"
+ fit-margin-bottom="0" />
+ <metadata
+ id="metadata5">
+ <rdf:RDF>
+ <cc:Work
+ rdf:about="">
+ <dc:format>image/svg+xml</dc:format>
+ <dc:type
+ rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
+ <dc:title></dc:title>
+ </cc:Work>
+ </rdf:RDF>
+ </metadata>
+ <g
+ inkscape:groupmode="layer"
+ id="layer2"
+ inkscape:label="Layer 2"
+ transform="translate(-61.112559,-78.160466)">
+ <g
+ id="g4644"
+ style="fill:none;fill-opacity:0.34351148;stroke:#00a500;stroke-width:1;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:0.25572576"
+ transform="matrix(1,0,0,-1,2.138671,277.94235)">
+ <path
+ transform="scale(0.26458333)"
+ inkscape:connector-curvature="0"
+ style="opacity:1;fill:none;fill-opacity:0.34351148;stroke:#00a500;stroke-width:3.77952766;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:0.25572576"
+ d="M 398.57227,351.84766 224.7832,452.18359 398.57227,552.51953 572.35938,452.18359 Z"
+ id="path4638" />
+ <path
+ inkscape:connector-curvature="0"
+ style="opacity:1;fill:none;fill-opacity:0.34351148;stroke:#00a500;stroke-width:3.77952766;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:0.25572576"
+ d="M 572.35938,452.18359 398.57227,552.51953 V 753.19141 L 572.35938,652.85547 Z"
+ transform="scale(0.26458333)"
+ id="path4640" />
+ <path
+ id="path4642"
+ d="m 59.473888,119.64024 45.981172,26.54722 v 53.09443 L 59.473888,172.73467 Z"
+ style="opacity:1;fill:none;fill-opacity:0.34351148;stroke:#00a500;stroke-width:1;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:0.25572576"
+ inkscape:connector-curvature="0" />
+ </g>
+ </g>
+ <g
+ inkscape:groupmode="layer"
+ id="layer3"
+ inkscape:label="Layer 3"
+ transform="translate(-61.112559,-78.160466)">
+ <g
+ id="g4554"
+ transform="matrix(-0.39771468,0.69855937,-0.69855937,-0.39771468,366.58103,126.65261)">
+ <g
+ id="g4662"
+ transform="translate(59.46839,130.66062)">
+ <path
+ inkscape:connector-curvature="0"
+ id="path4548"
+ transform="scale(0.26458333)"
+ d="M 398.57227,351.84766 224.7832,452.18359 398.57227,552.51953 572.35938,452.18359 Z"
+ style="opacity:1;fill:#0066ff;fill-opacity:0.34509804;stroke:#00a5ff;stroke-width:4.70182848;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
+ <path
+ inkscape:connector-curvature="0"
+ id="path4550"
+ transform="scale(0.26458333)"
+ d="M 572.35938,452.18359 398.57227,552.51953 V 753.19141 L 572.35938,652.85547 Z"
+ style="opacity:1;fill:#0044aa;fill-opacity:0.34509804;stroke:#00a5ff;stroke-width:4.29276943;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1" />
+ <path
+ inkscape:connector-curvature="0"
+ style="opacity:1;fill:#5599ff;fill-opacity:0.34509804;stroke:#00a5ff;stroke-width:1.24402535;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+ d="m 59.473888,119.64024 45.981172,26.54722 v 53.09443 L 59.473888,172.73467 Z"
+ id="path4552" />
+ </g>
+ </g>
+ </g>
+ <g
+ inkscape:groupmode="layer"
+ id="layer4"
+ inkscape:label="Layer 4"
+ transform="translate(-61.112559,-78.160466)">
+ <path
+ style="fill:#e000ae;fill-opacity:1;stroke-width:0.12476727"
+ d="m 84.610811,107.36071 v 2.55773 2.55772 h 2.49535 2.49534 v -2.55772 -2.55773 h -2.49534 z m 40.674129,0 v 2.55773 2.55772 h 2.49535 2.49534 v -2.55772 -2.55773 h -2.49534 z m -35.558669,5.11545 v 2.55773 2.55773 h 2.49535 2.49534 v -2.55773 -2.55773 h -2.49534 z m 4.99069,5.11546 v 2.55773 2.55773 h -2.49534 -2.49535 v 2.49534 2.49535 h -2.55773 -2.55773 v 2.55773 2.55773 h -2.55773 -2.55773 v 10.16853 10.16853 h 2.55773 2.55773 v -7.67562 -7.67587 l 2.52654,0.0339 2.52654,0.0336 0.0327,5.08427 0.0327,5.08426 h 2.49388 2.49388 v 2.55919 2.5592 l 5.08427,-0.0327 5.084269,-0.0326 v -2.49534 -2.49535 l -5.084269,-0.0324 -5.08427,-0.0327 v -2.55626 -2.55651 h 12.726269 12.72626 v 2.55651 2.55626 l -5.05868,0.0327 -5.05893,0.0324 v 2.49535 2.49534 l 5.05893,0.0326 5.05868,0.0327 v -2.55919 -2.55919 h 2.49388 2.49413 l 0.0324,-5.08426 0.0327,-5.08427 2.52653,-0.0336 2.52654,-0.0339 v 7.67586 7.67563 h 2.55773 2.55773 v -10.16854 -10.16853 h -2.55773 -2.55773 v -2.55773 -2.55773 h -2.55773 -2.55773 v -2.49535 -2.49534 h -2.49535 -2.49534 v -2.55773 -2.55773 h -2.55773 -2.55773 v 2.55773 2.55773 h -7.6108 -7.610809 v -2.55773 -2.55773 h -2.55774 z m 25.452519,0 h 2.49535 2.49535 v -2.55773 -2.55773 h -2.49535 -2.49535 v 2.55773 z m -25.452519,10.10615 h 5.11546 5.115459 v 2.55773 2.55773 h -5.115459 -5.11546 v -2.55773 z m 15.221609,0 h 5.11546 5.11545 v 2.55773 2.55773 h -5.11545 -5.11546 v -2.55773 z"
+ id="path4732"
+ inkscape:connector-curvature="0" />
+ </g>
+ <g
+ inkscape:label="Layer 1"
+ inkscape:groupmode="layer"
+ id="layer1"
+ style="display:inline"
+ transform="translate(-61.112559,-78.160466)">
+ <g
+ transform="translate(-131.49557,42.495842)"
+ style="fill:#007200;fill-opacity:0.34351148;stroke:#00a500;stroke-width:1;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+ id="g4628">
+ <path
+ id="path4529"
+ d="m 239.09034,36.164616 -45.98169,26.547215 45.98169,26.547217 45.98117,-26.547217 z"
+ style="opacity:1;fill:#4aba19;fill-opacity:0.34509804;stroke:#00a500;stroke-width:1;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+ inkscape:connector-curvature="0" />
+ <path
+ id="path4531"
+ d="m 285.07151,62.711828 -45.98117,26.54722 v 53.094432 l 45.98117,-26.54722 z"
+ style="opacity:1;fill:#007900;fill-opacity:0.34351148;stroke:#00a500;stroke-width:1;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+ inkscape:connector-curvature="0" />
+ <path
+ inkscape:connector-curvature="0"
+ style="opacity:1;fill:#003d00;fill-opacity:0.34509804;stroke:#00a500;stroke-width:1;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-opacity:1"
+ d="m 193.10865,62.711831 45.98117,26.54722 v 53.094429 l -45.98117,-26.54722 z"
+ id="path4541" />
+ </g>
+ </g>
+</svg>