From 957e26a6f30d40e2bff042d76a327d0a2cfbabae Mon Sep 17 00:00:00 2001
From: Adin Scannell
Date: Mon, 18 Nov 2019 13:40:27 -0800
Subject: Move website to a simpler Jekyll-based template

This will allow us to merge the site into the main repository. This merge
allows the documentation to be kept up-to-date and synchronized with the main
project. Builds will be triggered on any update, removing the need for the
cron-based redeploy.
---
 website/content/docs/architecture_guide/Layers.png | Bin 0 -> 11044 bytes
 website/content/docs/architecture_guide/Layers.svg | 1 +
 .../architecture_guide/Machine-Virtualization.png | Bin 0 -> 13205 bytes
 .../architecture_guide/Machine-Virtualization.svg | 1 +
 .../architecture_guide/Rule-Based-Execution.png | Bin 0 -> 6780 bytes
 .../architecture_guide/Rule-Based-Execution.svg | 1 +
 .../docs/architecture_guide/Sentry-Gofer.png | Bin 0 -> 9064 bytes
 .../docs/architecture_guide/Sentry-Gofer.svg | 1 +
 website/content/docs/architecture_guide/index.md | 86 +++++++
 .../content/docs/architecture_guide/performance.md | 264 +++++++++++++++++++++
 .../content/docs/architecture_guide/platforms.md | 92 +++++++
 .../content/docs/architecture_guide/resource.md | 7 +
 .../content/docs/architecture_guide/security.md | 257 ++++++++++++++++++++
 website/content/docs/community/governance.md | 10 +
 website/content/docs/community/index.md | 42 ++++
 website/content/docs/includes/index.md | 3 +
 website/content/docs/index.md | 30 +++
 website/content/docs/tutorials/add-node-pool.png | Bin 0 -> 70208 bytes
 website/content/docs/tutorials/cni.md | 179 ++++++++++++++
 website/content/docs/tutorials/docker.md | 75 ++++++
 website/content/docs/tutorials/kubernetes.md | 241 +++++++++++++++++++
 .../content/docs/tutorials/node-pool-button.png | Bin 0 -> 13757 bytes
 website/content/docs/user_guide/FAQ.md | 118 +++++++++
 .../content/docs/user_guide/checkpoint_restore.md | 104 ++++++++
 .../docs/user_guide/compatibility/.gitignore | 1 +
 .../content/docs/user_guide/compatibility/index.md | 93 ++++++++
 website/content/docs/user_guide/debugging.md | 135 +++++++++++
 website/content/docs/user_guide/filesystem.md | 63 +++++
 website/content/docs/user_guide/install.md | 163 +++++++++++++
 website/content/docs/user_guide/networking.md | 89 +++++++
 website/content/docs/user_guide/platforms.md | 122 ++++++++++
 .../content/docs/user_guide/quick_start/docker.md | 98 ++++++++
 .../docs/user_guide/quick_start/kubernetes.md | 43 ++++
 website/content/docs/user_guide/quick_start/oci.md | 51 ++++
 34 files changed, 2370 insertions(+)
 create mode 100755 website/content/docs/architecture_guide/Layers.png
 create mode 100755 website/content/docs/architecture_guide/Layers.svg
 create mode 100755 website/content/docs/architecture_guide/Machine-Virtualization.png
 create mode 100755 website/content/docs/architecture_guide/Machine-Virtualization.svg
 create mode 100755 website/content/docs/architecture_guide/Rule-Based-Execution.png
 create mode 100755 website/content/docs/architecture_guide/Rule-Based-Execution.svg
 create mode 100755 website/content/docs/architecture_guide/Sentry-Gofer.png
 create mode 100755 website/content/docs/architecture_guide/Sentry-Gofer.svg
 create mode 100755 website/content/docs/architecture_guide/index.md
 create mode 100755 website/content/docs/architecture_guide/performance.md
 create mode 100755 website/content/docs/architecture_guide/platforms.md
 create mode 100755 website/content/docs/architecture_guide/resource.md
 create mode 100755 website/content/docs/architecture_guide/security.md
 create mode 100644
website/content/docs/community/governance.md create mode 100755 website/content/docs/community/index.md create mode 100755 website/content/docs/includes/index.md create mode 100755 website/content/docs/index.md create mode 100755 website/content/docs/tutorials/add-node-pool.png create mode 100644 website/content/docs/tutorials/cni.md create mode 100755 website/content/docs/tutorials/docker.md create mode 100755 website/content/docs/tutorials/kubernetes.md create mode 100755 website/content/docs/tutorials/node-pool-button.png create mode 100755 website/content/docs/user_guide/FAQ.md create mode 100755 website/content/docs/user_guide/checkpoint_restore.md create mode 100755 website/content/docs/user_guide/compatibility/.gitignore create mode 100755 website/content/docs/user_guide/compatibility/index.md create mode 100755 website/content/docs/user_guide/debugging.md create mode 100755 website/content/docs/user_guide/filesystem.md create mode 100755 website/content/docs/user_guide/install.md create mode 100755 website/content/docs/user_guide/networking.md create mode 100755 website/content/docs/user_guide/platforms.md create mode 100755 website/content/docs/user_guide/quick_start/docker.md create mode 100755 website/content/docs/user_guide/quick_start/kubernetes.md create mode 100755 website/content/docs/user_guide/quick_start/oci.md (limited to 'website/content/docs') diff --git a/website/content/docs/architecture_guide/Layers.png b/website/content/docs/architecture_guide/Layers.png new file mode 100755 index 000000000..308c6c451 Binary files /dev/null and b/website/content/docs/architecture_guide/Layers.png differ diff --git a/website/content/docs/architecture_guide/Layers.svg b/website/content/docs/architecture_guide/Layers.svg new file mode 100755 index 000000000..0a366f841 --- /dev/null +++ b/website/content/docs/architecture_guide/Layers.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/website/content/docs/architecture_guide/Machine-Virtualization.png b/website/content/docs/architecture_guide/Machine-Virtualization.png new file mode 100755 index 000000000..1ba2ed6b2 Binary files /dev/null and b/website/content/docs/architecture_guide/Machine-Virtualization.png differ diff --git a/website/content/docs/architecture_guide/Machine-Virtualization.svg b/website/content/docs/architecture_guide/Machine-Virtualization.svg new file mode 100755 index 000000000..5352da07b --- /dev/null +++ b/website/content/docs/architecture_guide/Machine-Virtualization.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/website/content/docs/architecture_guide/Rule-Based-Execution.png b/website/content/docs/architecture_guide/Rule-Based-Execution.png new file mode 100755 index 000000000..b42654a90 Binary files /dev/null and b/website/content/docs/architecture_guide/Rule-Based-Execution.png differ diff --git a/website/content/docs/architecture_guide/Rule-Based-Execution.svg b/website/content/docs/architecture_guide/Rule-Based-Execution.svg new file mode 100755 index 000000000..bd6717043 --- /dev/null +++ b/website/content/docs/architecture_guide/Rule-Based-Execution.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/website/content/docs/architecture_guide/Sentry-Gofer.png b/website/content/docs/architecture_guide/Sentry-Gofer.png new file mode 100755 index 000000000..ca2c27ef7 Binary files /dev/null and b/website/content/docs/architecture_guide/Sentry-Gofer.png differ diff --git a/website/content/docs/architecture_guide/Sentry-Gofer.svg 
b/website/content/docs/architecture_guide/Sentry-Gofer.svg
new file mode 100755
index 000000000..5c10750d2
--- /dev/null
+++ b/website/content/docs/architecture_guide/Sentry-Gofer.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/website/content/docs/architecture_guide/index.md b/website/content/docs/architecture_guide/index.md
new file mode 100755
index 000000000..7ea331605
--- /dev/null
+++ b/website/content/docs/architecture_guide/index.md
@@ -0,0 +1,86 @@
+---
+title: Overview
+permalink: /docs/architecture_guide/
+layout: docs
+category: Architecture Guide
+weight: 0
+---
+
+gVisor provides a virtualized environment in order to sandbox untrusted
+containers. The system interfaces normally implemented by the host kernel are
+moved into a distinct, per-sandbox user space kernel in order to minimize the
+risk of an exploit. gVisor does not, however, introduce large fixed overheads,
+and it still retains a process-like model with respect to resource utilization.
+
+## How is this different?
+
+Two other approaches are commonly taken to provide stronger isolation than
+native containers.
+
+**Machine-level virtualization**, such as [KVM][kvm] and [Xen][xen], exposes
+virtualized hardware to a guest kernel via a Virtual Machine Monitor (VMM). This
+virtualized hardware is generally enlightened (paravirtualized) and additional
+mechanisms can be used to improve the visibility between the guest and host
+(e.g. balloon drivers, paravirtualized spinlocks). Running containers in
+distinct virtual machines can provide great isolation, compatibility and
+performance (though nested virtualization may bring challenges in this area),
+but for containers it often requires additional proxies and agents, and may
+require a larger resource footprint and slower start-up times.
+
+![Machine-level virtualization](Machine-Virtualization.png "Machine-level virtualization")
+
+**Rule-based execution**, such as [seccomp][seccomp], [SELinux][selinux] and
+[AppArmor][apparmor], allows the specification of a fine-grained security policy
+for an application or container. These schemes typically rely on hooks
+implemented inside the host kernel to enforce the rules. If the surface can be
+made small enough (i.e. a sufficiently complete policy defined), then this is an
+excellent way to sandbox applications and maintain native performance. However,
+in practice it can be extremely difficult (if not impossible) to reliably define
+a policy for arbitrary, previously unknown applications, making this approach
+challenging to apply universally.
+
+![Rule-based execution](Rule-Based-Execution.png "Rule-based execution")
+
+Rule-based execution is often combined with additional layers for
+defense-in-depth.
+
+**gVisor** provides a third isolation mechanism, distinct from those above.
+
+gVisor intercepts application system calls and acts as the guest kernel, without
+the need for translation through virtualized hardware. gVisor may be thought of
+as either a merged guest kernel and VMM, or as seccomp on steroids. This
+architecture allows it to provide a flexible resource footprint (i.e. one based
+on threads and memory mappings, not fixed guest physical resources) while also
+lowering the fixed costs of virtualization. However, this comes at the price of
+reduced application compatibility and higher per-system call overhead.
+
+![gVisor](Layers.png "gVisor")
+
+On top of this, gVisor employs rule-based execution to provide defense-in-depth
+(details below).
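+
+As a quick illustration of this model, the Sentry can be observed acting as the
+guest kernel from inside a sandbox. The sketch below assumes that `runsc` has
+already been installed and registered with Docker as a runtime named `runsc`
+(see the [Docker Quick Start](../user_guide/quick_start/docker/)):
+
+```bash
+# The sandbox reports the Sentry's own kernel messages, not the host's.
+docker run --rm --runtime=runsc alpine dmesg
+
+# The kernel version reported is the one implemented by the Sentry.
+docker run --rm --runtime=runsc alpine uname -a
+```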
+
+gVisor's approach is similar to [User Mode Linux (UML)][uml], although UML
+virtualizes hardware internally and thus provides a fixed resource footprint.
+
+Each of the above approaches may excel in distinct scenarios. For example,
+machine-level virtualization will face challenges achieving high density, while
+gVisor may provide poor performance for system call heavy workloads.
+
+### Why Go?
+
+gVisor is written in [Go][golang] in order to avoid security pitfalls that can
+plague kernels. With Go, there are strong types, built-in bounds checks, no
+uninitialized variables, no use-after-free, no stack overflow, and a built-in
+race detector. (The use of Go has its challenges too, and isn't free.)
+
+### What about Gofers?
+
+
+
+[apparmor]: https://wiki.ubuntu.com/AppArmor
+[golang]: https://golang.org
+[kvm]: https://www.linux-kvm.org
+[seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
+[selinux]: https://selinuxproject.org
+[uml]: http://user-mode-linux.sourceforge.net/
+[xen]: https://www.xenproject.org
diff --git a/website/content/docs/architecture_guide/performance.md b/website/content/docs/architecture_guide/performance.md
new file mode 100755
index 000000000..382772425
--- /dev/null
+++ b/website/content/docs/architecture_guide/performance.md
@@ -0,0 +1,264 @@
+---
+title: Performance Guide
+permalink: /docs/architecture_guide/performance/
+layout: docs
+category: Architecture Guide
+weight: 20
+---
+
+gVisor is designed to provide a secure, virtualized environment while preserving
+key benefits of containerization, such as small fixed overheads and a dynamic
+resource footprint. For containerized infrastructure, this can provide a
+turn-key solution for sandboxing untrusted workloads: there are no changes to
+the fundamental resource model.
+
+gVisor imposes runtime costs over native containers. These costs come in two
+forms: additional cycles and memory usage, which may manifest as increased
+latency, reduced throughput or density, or not at all. In general, these costs
+come from two different sources.
+
+First, the existence of the [Sentry](../) means that additional memory will be
+required, and application system calls must traverse additional layers of
+software. The design emphasizes [security](../security/) and therefore we chose
+to use a language for the Sentry that provides benefits in this domain but may
+not yet offer the raw performance of other choices. Costs imposed by these
+design choices are **structural costs**.
+
+Second, as gVisor is an independent implementation of the system call surface,
+many of the subsystems or specific calls are not as optimized as more mature
+implementations. A good example here is the network stack, which is continuing
+to evolve but does not support all the advanced recovery mechanisms offered by
+other stacks and is less CPU efficient. This is an **implementation cost** and
+is distinct from **structural costs**. Improvements here are ongoing and driven
+by the workloads that matter to gVisor users and contributors.
+
+This page provides a guide for understanding baseline performance, and calls out
+distinct **structural costs** and **implementation costs**, highlighting where
+improvements are and are not possible.
+
+While we include a variety of workloads here, it’s worth emphasizing that gVisor
+may not be an appropriate solution for every workload, for reasons other than
+performance.
For example, a sandbox may provide minimal benefit for a trusted
+database, since _user data would already be inside the sandbox_ and there is no
+need for an attacker to break out in the first place.
+
+## Methodology
+
+All data below was generated using the [benchmark tools][benchmark-tools]
+repository, and the machines under test are uniform [Google Compute Engine][gce]
+Virtual Machines (VMs) with the following specifications:
+
+    Machine type: n1-standard-4 (broadwell)
+    Image: Debian GNU/Linux 9 (stretch) 4.19.0-0
+    BootDisk: 2048GB SSD persistent disk
+
+Throughout this document, `runsc` is used to indicate the runtime provided by
+gVisor. When relevant, we use the name `runsc-platform` to describe a specific
+[platform choice](../platforms/).
+
+**Except where specified, all tests below are conducted with the `ptrace`
+platform. The `ptrace` platform works everywhere and does not require hardware
+virtualization or kernel modifications but suffers from the highest structural
+costs by far. This platform is used to provide a clear understanding of the
+performance model, but in no way represents an ideal scenario. In the future,
+this guide will be extended to bare metal environments and include additional
+platforms.**
+
+## Memory access
+
+gVisor does not introduce any additional costs with respect to raw memory
+accesses. Page faults and other Operating System (OS) mechanisms are translated
+through the Sentry, but once mappings are installed and available to the
+application, there is no additional overhead.
+
+{% include graph.html id="sysbench-memory" url="/performance/sysbench-memory.csv" title="perf.py sysbench.memory --runtime=runc --runtime=runsc" %}
+
+The above figure demonstrates the memory transfer rate as measured by
+`sysbench`.
+
+## Memory usage
+
+The Sentry provides an additional layer of indirection, and it requires memory
+in order to store state associated with the application. This memory generally
+consists of a fixed component, plus an amount that varies with the usage of
+operating system resources (e.g. how many sockets or files are opened).
+
+For many use cases, fixed memory overheads are a primary concern. This may be
+because sandboxed containers handle a low volume of requests, and it is
+therefore important to achieve high densities for efficiency.
+
+{% include graph.html id="density" url="/performance/density.csv" title="perf.py density --runtime=runc --runtime=runsc" log="true" y_min="100000" %}
+
+The above figure demonstrates these costs based on three sample applications.
+This test is the result of running many instances of a container (50, or 5 in
+the case of redis) and calculating available memory on the host before and
+afterwards, and dividing the difference by the number of containers. This
+technique is used in preference to the `usage_in_bytes` value of the container
+cgroup because we found that some container runtimes, other than `runc` and
+`runsc`, do not use an individual container cgroup.
+
+The first application is an instance of `sleep`: a trivial application that does
+nothing. The second application is a synthetic `node` application which imports
+a number of modules and listens for requests. The third application is a similar
+synthetic `ruby` application which does the same. Finally, we include an
+instance of `redis` storing approximately 1GB of data. In all cases, the sandbox
+itself is responsible for a small, mostly fixed amount of memory overhead.
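+
+The measurement itself can be sketched as follows. This is a simplification of
+the benchmark rather than the exact harness, and the image name and container
+count are illustrative:
+
+```bash
+# Sample available memory, start N sandboxed containers, sample again, and
+# divide the difference by N to estimate the per-container overhead.
+N=50
+before=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
+for i in $(seq 1 "$N"); do
+  docker run --runtime=runsc -d --name "sleep-$i" alpine sleep 1000
+done
+sleep 10  # Allow the sandboxes to settle.
+after=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
+echo "Approximate per-container overhead: $(( (before - after) / N )) kB"
+```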
+
+## CPU performance
+
+gVisor does not perform emulation or otherwise interfere with the raw execution
+of CPU instructions by the application. Therefore, there is no runtime cost
+imposed for CPU operations.
+
+{% include graph.html id="sysbench-cpu" url="/performance/sysbench-cpu.csv" title="perf.py sysbench.cpu --runtime=runc --runtime=runsc" %}
+
+The above figure demonstrates the `sysbench` measurement of CPU events per
+second. Events per second is based on a CPU-bound loop that calculates all prime
+numbers in a specified range. We note that `runsc` does not impose a performance
+penalty, as the code is executing natively in both cases.
+
+This has important consequences for classes of workloads that are often
+CPU-bound, such as data processing or machine learning. In these cases, `runsc`
+will similarly impose minimal runtime overhead.
+
+{% include graph.html id="tensorflow" url="/performance/tensorflow.csv" title="perf.py tensorflow --runtime=runc --runtime=runsc" %}
+
+For example, the above figure shows a sample TensorFlow workload, the
+[convolutional neural network example][cnn]. The time indicated includes the
+full start-up and run time for the workload, which trains a model.
+
+## System calls
+
+Some **structural costs** of gVisor are heavily influenced by the [platform
+choice](../platforms/), which implements system call interception. Today, gVisor
+supports a variety of platforms. These platforms present distinct performance,
+compatibility and security trade-offs. For example, the KVM platform has low
+overhead system call interception but runs poorly with nested virtualization.
+
+{% include graph.html id="syscall" url="/performance/syscall.csv" title="perf.py syscall --runtime=runc --runtime=runsc-ptrace --runtime=runsc-kvm" y_min="100" log="true" %}
+
+The above figure demonstrates the time required for a raw system call on various
+platforms. The test is implemented by a custom binary which performs a large
+number of system calls and calculates the average time required.
+
+This cost will principally impact applications that are system call bound, which
+tend to be high-performance data stores and static network services. In general,
+the impact of system call interception will be lower the more work an
+application does.
+
+{% include graph.html id="redis" url="/performance/redis.csv" title="perf.py redis --runtime=runc --runtime=runsc" %}
+
+For example, `redis` is an application that performs relatively little work in
+userspace: in general it reads from a connected socket, reads or modifies some
+data, and writes a result back to the socket. The above figure shows the results
+of running a [comprehensive set of benchmarks][redis-benchmark]. We can see that
+small operations impose a large overhead, while larger operations, such as
+`LRANGE`, where more work is done in the application, have a smaller relative
+overhead.
+
+Some of these costs above are **structural costs**, and `redis` is likely to
+remain a challenging performance scenario. However, optimizing the
+[platform](../platforms/) will also have a dramatic impact.
+
+## Start-up time
+
+For many use cases, the ability to spin up containers quickly and efficiently is
+important. A sandbox may be short-lived and perform minimal user work (e.g. a
+function invocation).
+
+{% include graph.html id="startup" url="/performance/startup.csv" title="perf.py startup --runtime=runc --runtime=runsc" %}
+
+The above figure indicates the total time required to start a container through
+[Docker][docker].
This benchmark uses three different applications. First, an
+Alpine Linux container that executes `true`. Second, a `node` application that
+loads a number of modules and binds an HTTP server. The time is measured by a
+successful request to the bound port. Finally, a `ruby` application that
+similarly loads a number of modules and binds an HTTP server.
+
+> Note: most of the time overhead above is associated with Docker itself. This
+> is evident with the empty `runc` benchmark. To avoid these costs with `runsc`,
+> you may also consider using `runsc do` mode or invoking the [OCI
+> runtime](../../user_guide/quick_start/oci/) directly.
+
+## Network
+
+Networking is mostly bound by **implementation costs**, and gVisor's network stack
+is improving quickly.
+
+While typically not an important metric in practice for common sandbox use
+cases, `iperf` is nevertheless a common microbenchmark used to measure raw
+throughput.
+
+{% include graph.html id="iperf" url="/performance/iperf.csv" title="perf.py iperf --runtime=runc --runtime=runsc" %}
+
+The above figure shows the result of an `iperf` test between two instances. For
+the upload case, the specified runtime is used for the `iperf` client, and in
+the download case, the specified runtime is the server. A native runtime is
+always used for the other endpoint in the test.
+
+{% include graph.html id="applications" metric="requests_per_second" url="/performance/applications.csv" title="perf.py http.(node|ruby) --connections=25 --runtime=runc --runtime=runsc" %}
+
+The above figure shows the result of simple `node` and `ruby` web services that
+render a template upon receiving a request. Because these synthetic benchmarks
+do minimal work per request, much like the `redis` case, they suffer from high
+overheads. In practice, the more work an application does, the smaller the
+impact of **structural costs** becomes.
+
+## File system
+
+Some aspects of file system performance are also reflective of **implementation
+costs**, and an area where gVisor's implementation is improving quickly.
+
+In terms of raw disk I/O, gVisor does not introduce significant fundamental
+overhead. For general file operations, gVisor introduces a small fixed overhead
+for data that transitions across the sandbox boundary. This manifests as
+**structural costs** in some cases, since these operations must be routed
+through the [Gofer](../) as a result of our [security model](../security/), but
+in most cases are dominated by **implementation costs**, due to an internal
+[Virtual File System][vfs] (VFS) implementation that needs improvement.
+
+{% include graph.html id="fio-bw" url="/performance/fio.csv" title="perf.py fio --engine=sync --runtime=runc --runtime=runsc" log="true" %}
+
+The above figures demonstrate the results of `fio` for reads and writes to and
+from the disk. In this case, the disk quickly becomes the bottleneck and
+dominates other costs.
+
+{% include graph.html id="fio-tmpfs-bw" url="/performance/fio-tmpfs.csv" title="perf.py fio --engine=sync --runtime=runc --tmpfs=True --runtime=runsc" log="true" %}
+
+The above figure shows the raw I/O performance of using a `tmpfs` mount which is
+sandbox-internal in the case of `runsc`. Generally these operations are
+similarly bound to the cost of copying around data in-memory, and we don't see
+the cost of VFS operations.
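+
+A roughly comparable `tmpfs` measurement can be reproduced with stock `fio`.
+The flags below are standard `fio` options, but the exact benchmark harness
+differs:
+
+```bash
+docker run --rm --runtime=runsc --tmpfs /tmp alpine sh -c \
+  'apk add fio && fio --name=write --directory=/tmp --rw=write \
+   --bs=1m --size=512m --ioengine=sync'
+```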
+
+{% include graph.html id="httpd100k" metric="transfer_rate" url="/performance/httpd100k.csv" title="perf.py http.httpd --connections=1 --connections=5 --connections=10 --connections=25 --runtime=runc --runtime=runsc" %}
+
+The high costs of VFS operations can manifest in benchmarks that execute many
+such operations in the hot path for serving requests, for example. The above
+figure shows the result of using gVisor to serve small pieces of static content
+with predictably poor results. This workload represents `apache` serving a
+single file sized 100k from the container image to a client running
+[ApacheBench][ab] with varying levels of concurrency. The high overhead comes
+principally from the VFS implementation that needs improvement, with several
+internal serialization points (since all requests are reading the same file).
+Note that some of the network stack performance issues also impact this
+benchmark.
+
+{% include graph.html id="ffmpeg" url="/performance/ffmpeg.csv" title="perf.py media.ffmpeg --runtime=runc --runtime=runsc" %}
+
+For benchmarks that are bound by raw disk I/O and a mix of compute, file system
+operations are less of an issue. The above figure shows the total time required
+for an `ffmpeg` container to start, load and transcode a 27MB input video.
+
+[ab]: https://en.wikipedia.org/wiki/ApacheBench
+
+[benchmark-tools]: https://gvisor.googlesource.com/benchmark-tools
+
+[gce]: https://cloud.google.com/compute/
+
+[cnn]: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/convolutional_network.py
+
+[docker]: https://docker.io
+
+[redis-benchmark]: https://redis.io/topics/benchmarks
+
+[vfs]: https://en.wikipedia.org/wiki/Virtual_file_system
diff --git a/website/content/docs/architecture_guide/platforms.md b/website/content/docs/architecture_guide/platforms.md
new file mode 100755
index 000000000..3a3322fc0
--- /dev/null
+++ b/website/content/docs/architecture_guide/platforms.md
@@ -0,0 +1,92 @@
+---
+title: Platform Guide
+permalink: /docs/architecture_guide/platforms/
+layout: docs
+category: Architecture Guide
+weight: 50
+---
+
+A gVisor sandbox consists of multiple processes when running. These processes
+collectively comprise a shared environment in which one or more containers can
+be run.
+
+Each sandbox has its own isolated instance of:
+
+* The **Sentry**, a user-space kernel that runs the container and intercepts
+  and responds to system calls made by the application.
+
+Each container running in the sandbox has its own isolated instance of:
+
+* A **Gofer** which provides file system access to the container.
+
+![gVisor architecture diagram](../Sentry-Gofer.png "gVisor architecture diagram")
+
+## runsc
+
+The entrypoint to running a sandboxed container is the `runsc` executable.
+`runsc` implements the [Open Container Initiative (OCI)][oci] runtime
+specification. This means that OCI-compatible _filesystem bundles_ can be run by
+`runsc`. Filesystem bundles consist of a `config.json` file containing
+container configuration, and a root filesystem for the container. Please see
+the [OCI runtime spec][runtime-spec] for more information on filesystem bundles.
+`runsc` implements multiple commands that perform various functions such as
+starting, stopping, listing, and querying the status of containers.
+
+## Sentry
+
+The Sentry is the largest component of gVisor. It can be thought of as a
+userspace OS kernel. The Sentry implements all the kernel functionality needed
+by the untrusted application.
It implements all of the supported system calls, +signal delivery, memory management and page faulting logic, the threading +model, and more. + +When the untrusted application makes a system call, the currently used platform +redirects the call to the Sentry, which will do the necessary work to service +it. It is important to note that the Sentry will not simply pass through system +calls to the host kernel. As a userspace application, the Sentry will make some +host system calls to support its operation, but it will not allow the +application to directly control the system calls it makes. + +The Sentry aims to present an equivalent environment to (upstream) Linux v4.4. + +File system operations that extend beyond the sandbox (not internal /proc +files, pipes, etc) are sent to the Gofer, described below. + +## Platforms + +gVisor requires a platform to implement interception of syscalls, basic context +switching, and memory mapping functionality. + +### ptrace + +The ptrace platform uses `PTRACE_SYSEMU` to execute user code without allowing +it to execute host system calls. This platform can run anywhere that ptrace +works (even VMs without nested virtualization). + +### KVM (experimental) + +The KVM platform allows the Sentry to act as both guest OS and VMM, switching +back and forth between the two worlds seamlessly. The KVM platform can run on +bare-metal or in a VM with nested virtualization enabled. While there is no +virtualized hardware layer -- the sandbox retains a process model -- gVisor +leverages virtualization extensions available on modern processors in order to +improve isolation and performance of address space switches. + +## Gofer + +The Gofer is a normal host Linux process. The Gofer is started with each sandbox +and connected to the Sentry. The Sentry process is started in a restricted +seccomp container without access to file system resources. The Gofer provides +the Sentry access to file system resources via the 9P protocol and provides an +additional level of isolation. + +## Application + +The application (aka the untrusted application) is a normal Linux binary +provided to gVisor in an OCI runtime bundle. gVisor aims to provide an +environment equivalent to Linux v4.4, so applications should be able to run +unmodified. However, gVisor does not presently implement every system call, +/proc file, or /sys file so some incompatibilities may occur. + +[oci]: https://www.opencontainers.org +[runtime-spec]: https://github.com/opencontainers/runtime-spec diff --git a/website/content/docs/architecture_guide/resource.md b/website/content/docs/architecture_guide/resource.md new file mode 100755 index 000000000..ca0cee7c1 --- /dev/null +++ b/website/content/docs/architecture_guide/resource.md @@ -0,0 +1,7 @@ +--- +title: Resource Model +permalink: /docs/architecture_guide/resources/ +layout: docs +category: Architecture Guide +weight: 30 +--- diff --git a/website/content/docs/architecture_guide/security.md b/website/content/docs/architecture_guide/security.md new file mode 100755 index 000000000..56dfa28f0 --- /dev/null +++ b/website/content/docs/architecture_guide/security.md @@ -0,0 +1,257 @@ +--- +title: "Security Model" +permalink: /docs/architecture_guide/security/ +layout: docs +category: Architecture Guide +weight: 10 +--- + +gVisor was created in order to provide additional defense against the +exploitation of kernel bugs by untrusted userspace code. In order to understand +how gVisor achieves this goal, it is first necessary to understand the basic +threat model. 
+ +## Threats: The Anatomy of an Exploit + +An exploit takes advantage of a software or hardware bug in order to escalate +privileges, gain access to privileged data, or disrupt services. All of the +possible interactions that a malicious application can have with the rest of the +system (attack vectors) define the attack surface. We categorize these attack +vectors into several common classes. + +### System API + +An operating system or hypervisor exposes an abstract System API in the form of +system calls and traps. This API may be documented and stable, as with Linux, or +it may be abstracted behind a library, as with Windows (i.e. win32.dll or +ntdll.dll). The System API includes all standard interfaces that application +code uses to interact with the system. This includes high-level abstractions +that are derived from low-level system calls, such as system files, sockets and +namespaces. + +Although the System API is exposed to applications by design, bugs and race +conditions within the kernel or hypervisor may occasionally be exploitable via +the API. This is common in part due to the fact that most kernels and hypervisors +are written in [C][clang], which is well-suited to interfacing with hardware but +often prone to security issues. In order to exploit these issues, a typical attack +might involve some combination of the following: + +1. Opening or creating some combination of files, sockets or other descriptors. +1. Passing crafted, malicious arguments, structures or packets. +1. Racing with multiple threads in order to hit specific code paths. + +For example, for the [Dirty Cow][dirtycow] privilege escalation bug, an +application would open a specific file in `/proc` or use a specific `ptrace` +system call, and use multiple threads in order to trigger a race condition when +touching a fresh page of memory. The attacker then gains control over a page of +memory belonging to the system. With additional privileges or access to +privileged data in the kernel, an attacker will often be able to employ +additional techniques to gain full access to the rest of the system. + +While bugs in the implementation of the System API are readily fixed, they are +also the most common form of exploit. The exposure created by this class of +exploit is what gVisor aims to minimize and control, described in detail below. + +### System ABI + +Hardware and software exploits occasionally exist in execution paths that are +not part of an intended System API. In this case, exploits may be found as part +of implicit actions the hardware or privileged system code takes in response to +certain events, such as traps or interrupts. For example, the recent +[POPSS][popss] flaw required only native code execution (no specific system call +or file access). In that case, the Xen hypervisor was similarly vulnerable, +highlighting that hypervisors are not immune to this vector. + +### Side Channels + +Hardware side channels may be exploitable by any code running on a system: +native, sandboxed, or virtualized. However, many host-level mitigations against +hardware side channels are still effective with a sandbox. For example, kernels +built with retpoline protect against some speculative execution attacks +(Spectre) and frame poisoning may protect against L1 terminal fault (L1TF) +attacks. 
Hypervisors may introduce additional complications in this regard, as
+there is no mitigation against an application in a normally functioning Virtual
+Machine (VM) exploiting the L1TF vulnerability for another VM on the sibling
+hyperthread.
+
+### Other Vectors
+
+The above categories in no way represent an exhaustive list of exploits, as we
+focus only on running untrusted code from within the operating system or
+hypervisor. We do not consider other ways that a more generic adversary
+may interact with a system, such as inserting a portable storage device with a
+malicious filesystem image, using a combination of crafted keyboard or touch
+inputs, or saturating a network device with ill-formed packets.
+
+Furthermore, high-level systems may contain exploitable components. An attacker
+need not escalate privileges within a container if there’s an exploitable
+network-accessible service on the host or some other API path. *A sandbox is not
+a substitute for a secure architecture*.
+
+## Goals: Limiting Exposure
+
+gVisor’s primary design goal is to minimize the System API attack vector while
+still providing a process model. There are two primary security principles that
+inform this design. First, the application’s direct interactions with the host
+System API are intercepted by the Sentry, which implements the System API
+instead. Second, the System API accessible to the Sentry itself is minimized to
+a safer, restricted set. The first principle minimizes the possibility of direct
+exploitation of the host System API by applications, and the second principle
+minimizes indirect exploitability, which is the exploitation by an exploited or
+buggy Sentry (e.g. chaining an exploit).
+
+The first principle is similar to the security basis for a Virtual Machine (VM).
+With a VM, an application’s interactions with the host are replaced by
+interactions with a guest operating system and a set of virtualized hardware
+devices. These hardware devices are then implemented via the host System API by
+a Virtual Machine Monitor (VMM). The Sentry similarly prevents direct interactions
+by providing its own implementation of the System API that the application
+must interact with. Applications are not able to directly craft specific
+arguments or flags for the host System API, or interact directly with host
+primitives.
+
+For both the Sentry and a VMM, it’s worth noting that while direct interactions
+are not possible, indirect interactions are still possible. For example, a read
+on a host-backed file in the Sentry may ultimately result in a host read system
+call (made by the Sentry, not by passing through arguments from the application),
+similar to how a read on a block device in a VM may result in the VMM issuing
+a corresponding host read system call from a backing file.
+
+An important distinction from a VM is that the Sentry implements a System API based
+directly on host System API primitives instead of relying on virtualized hardware
+and a guest operating system. This selects a distinct set of trade-offs, largely
+in the performance, efficiency and compatibility domains. Since transitions in
+and out of the sandbox are relatively expensive, a guest operating system will
+typically take ownership of resources. For example, in the above case, the
+guest operating system may read the block device data in a local page cache,
+to avoid subsequent reads. This may lead to better performance but lower
+efficiency, since memory may be wasted or duplicated.
The Sentry opts instead
+to defer to the host for many operations during runtime, for improved efficiency
+but lower performance in some use cases.
+
+### What can a sandbox do?
+
+An application in a gVisor sandbox is permitted to do most things a standard
+container can do: for example, applications can read and write files mapped
+within the container, make network connections, etc. As described above,
+gVisor's primary goal is to limit exposure to bugs and exploits while still
+allowing most applications to run. Even so, gVisor will limit some operations
+that might be permitted with a standard container. Even with appropriate
+capabilities, a user in a gVisor sandbox will only be able to manipulate
+virtualized system resources (e.g. the system time, kernel settings or
+filesystem attributes) and not underlying host system resources.
+
+While the sandbox virtualizes many operations for the application, we limit the
+sandbox's own interactions with the host to the following high-level operations:
+
+1. Communicate with a Gofer process via a connected socket. The sandbox may
+   receive new file descriptors from the Gofer process, corresponding to opened
+   files. These files can then be read from and written to by the sandbox.
+1. Make a minimal set of host system calls. The calls do not include the
+   creation of new sockets (unless host networking mode is enabled) or opening
+   files. The calls include duplication and closing of file descriptors,
+   synchronization, timers and signal management.
+1. Read and write packets to a virtual ethernet device. This is not required if
+   host networking is enabled (or networking is disabled).
+
+### System ABI, Side Channels and Other Vectors
+
+gVisor relies on the host operating system and the platform for defense against
+hardware-based attacks. Given the nature of these vulnerabilities, there is
+little defense that gVisor can provide (there’s no guarantee that additional
+hardware measures, such as virtualization, memory encryption, etc. would
+actually decrease the attack surface). Note that this is true even when using
+hardware virtualization for acceleration, as the host kernel or hypervisor is
+ultimately responsible for defending against attacks from within malicious
+guests.
+
+gVisor similarly relies on the host resource mechanisms (cgroups) for defense
+against resource exhaustion and denial of service attacks. Network policy
+controls should be applied at the container level to ensure appropriate network
+policy enforcement. Note that the sandbox itself is not capable of altering or
+configuring these mechanisms, and should make it more difficult for an attacker
+to exploit or override these controls through other means.
+
+## Principles: Defense-in-Depth
+
+For gVisor development, there are several engineering principles that are
+employed in order to ensure that the system meets its design goals.
+
+1. No system call is passed through directly to the host. Every supported call
+   has an independent implementation in the Sentry that is unlikely to suffer
+   from identical vulnerabilities that may appear in the host. This has the
+   consequence that all kernel features used by applications require an
+   implementation within the Sentry.
+1. Only common, universal functionality is implemented. Some filesystems,
+   network devices or modules may expose specialized functionality to user
+   space applications via mechanisms such as extended attributes, raw sockets
+   or ioctls.
Since the Sentry is responsible for implementing the full system + call surface, we do not implement or pass through these specialized APIs. +1. The host surface exposed to the Sentry is minimized. While the system call + surface is not trivial, it is explicitly enumerated and controlled. The + Sentry is not permitted to open new files, create new sockets or do many + other interesting things on the host. + +Additionally, we have practical restrictions that are imposed on the project to +minimize the risk of Sentry exploitability. For example: + +1. Unsafe code is carefully controlled. All unsafe code is isolated in files + that end with "unsafe.go", in order to facilitate validation and auditing. + No file without the unsafe suffix may import the unsafe package. +1. No CGo is allowed. The Sentry must be a pure Go binary. +1. External imports are not generally allowed within the core packages. Only + limited external imports are used within the setup code. The code available + inside the Sentry is carefully controlled, to ensure that the above rules + are effective. + +Finally, we recognize that security is a process, and that vigilance is +critical. Beyond our security disclosure process, the Sentry is fuzzed +continuously to identify potential bugs and races proactively, and production +crashes are recorded and triaged to similarly identify material issues. + +## FAQ + +### Is this more or less secure than a Virtual Machine? + +The security of a VM depends to a large extent on what is exposed from the host +kernel and user space support code. For example, device emulation code in the +host kernel (e.g. APIC) or optimizations (e.g. vhost) can be more complex than a +simple system call, and exploits carry the same risks. Similarly, the user space +support code is frequently unsandboxed, and exploits, while rare, may allow +unfettered access to the system. + +Some platforms leverage the same virtualization hardware as VMs in order to +provide better system call interception performance. However, gVisor does not +implement any device emulation, and instead opts to use a sandboxed host System +API directly. Both approaches significantly reduce the original attack surface. +Ultimately, since gVisor is capable of using the same hardware mechanism, one +should not assume that the mere use of virtualization hardware makes a system +more or less secure, just as it would be a mistake to make the claim that the +use of a unibody alone makes a car safe. + +### Does this stop hardware side channels? + +In general, gVisor does not provide protection against hardware side channels, +although it may make exploits that rely on direct access to the host System API +more difficult to use. To minimize exposure, you should follow relevant guidance +from vendors and keep your host kernel and firmware up-to-date. + +### Is this just a ptrace sandbox? + +No: the term “ptrace sandbox” generally refers to software that uses the Linux +ptrace facility to inspect and authorize system calls made by applications, +enforcing a specific policy. These commonly suffer from two issues. First, +vulnerable system calls may be authorized by the sandbox, as the application +still has direct access to some System API. Second, it’s impossible to avoid +time-of-check, time-of-use race conditions without disabling multi-threading. + +In gVisor, the platforms that use ptrace operate differently. The stubs that are +traced are never allowed to continue execution into the host kernel and complete +a call directly. 
Instead, all system calls are interpreted and handled by the
+Sentry itself, which reflects the resulting register state back into the tracee
+before continuing execution in user space. This is very similar to the mechanism
+used by User-Mode Linux (UML).
+
+[dirtycow]: https://en.wikipedia.org/wiki/Dirty_COW
+[clang]: https://en.wikipedia.org/wiki/C_(programming_language)
+[popss]: https://nvd.nist.gov/vuln/detail/CVE-2018-8897
diff --git a/website/content/docs/community/governance.md b/website/content/docs/community/governance.md
new file mode 100644
index 000000000..31db503ac
--- /dev/null
+++ b/website/content/docs/community/governance.md
@@ -0,0 +1,10 @@
+---
+layout: docs
+permalink: /docs/community/governance/
+noedit: true
+category: Project
+display: Governance
+weight: 20
+---
+
+{% include GOVERNANCE.md %}
diff --git a/website/content/docs/community/index.md b/website/content/docs/community/index.md
new file mode 100755
index 000000000..8971647d1
--- /dev/null
+++ b/website/content/docs/community/index.md
@@ -0,0 +1,42 @@
+---
+title: Contributing
+layout: docs
+category: Project
+weight: 20
+permalink: /docs/community/
+---
+
+Contributions are accepted through our [GitHub][github] and [Google
+Source][googlesource] repositories. Individual projects have their own
+[contribution process][contributing].
+
+## Community
+
+The authoritative document for community resources and organization is the
+[community repository][community], which contains the project's [governance
+model][governance] and [code of conduct][codeofconduct]. Individual repositories
+have their own guidelines and processes for contributing. See the [canonical
+list of repositories][repositories] for more information.
+
+The project maintains two mailing lists:
+
+* [gvisor-users][gvisor-users] for announcements and general discussion.
+* [gvisor-dev][gvisor-dev] for development and contribution.
+
+We also have a [chat room hosted on Gitter][gitter-chat].
+
+The community calendar shows upcoming public meetings and opportunities to
+collaborate.
+
+
+
+[community]: https://gvisor.googlesource.com/community
+[contributing]: https://github.com/google/gvisor/blob/master/CONTRIBUTING.md
+[github]: https://github.com/google/gvisor
+[gitter-chat]: https://gitter.im/gvisor/community
+[governance]: https://gvisor.googlesource.com/community/+/refs/heads/master/README.md
+[googlesource]: https://gvisor.googlesource.com/
+[gvisor-dev]: https://groups.google.com/forum/#!forum/gvisor-dev
+[gvisor-users]: https://groups.google.com/forum/#!forum/gvisor-users
+[codeofconduct]: https://gvisor.googlesource.com/community/+/refs/heads/master/CODE_OF_CONDUCT.md
+[repositories]: https://gvisor.googlesource.com/?format=HTML
diff --git a/website/content/docs/includes/index.md b/website/content/docs/includes/index.md
new file mode 100755
index 000000000..ca03031f1
--- /dev/null
+++ b/website/content/docs/includes/index.md
@@ -0,0 +1,3 @@
+---
+headless: true
+---
diff --git a/website/content/docs/index.md b/website/content/docs/index.md
new file mode 100755
index 000000000..ad33ce142
--- /dev/null
+++ b/website/content/docs/index.md
@@ -0,0 +1,30 @@
+---
+title: Documentation
+layout: docs
+---
+
+gVisor is a user-space kernel, written in Go, that implements a substantial
+portion of the [Linux system call interface][linux]. It provides an additional
+layer of isolation between running applications and the host operating system.
+
+gVisor includes an [Open Container Initiative (OCI)][oci] runtime called `runsc`
+that makes it easy to work with existing container tooling. The `runsc` runtime
+integrates with Docker and Kubernetes, making it simple to run sandboxed
+containers.
+
+gVisor takes a distinct approach to container sandboxing and makes a different
+set of technical trade-offs compared to existing sandbox technologies, thus
+providing new tools and ideas for the container security landscape.
+
+gVisor can be used with Docker, Kubernetes, or directly using `runsc`. Use the
+links below to see detailed instructions for each of them:
+
+* [Docker](./user_guide/quick_start/docker/): The quickest and easiest way to
+  get started.
+* [Kubernetes](./user_guide/quick_start/kubernetes/): Isolate Pods in your K8s
+  cluster with gVisor.
+* [OCI Quick Start](./user_guide/quick_start/oci/): Expert mode. Customize
+  gVisor for your environment.
+
+[linux]: https://en.wikipedia.org/wiki/Linux_kernel_interfaces
+[oci]: https://www.opencontainers.org
diff --git a/website/content/docs/tutorials/add-node-pool.png b/website/content/docs/tutorials/add-node-pool.png
new file mode 100755
index 000000000..e4560359b
Binary files /dev/null and b/website/content/docs/tutorials/add-node-pool.png differ
diff --git a/website/content/docs/tutorials/cni.md b/website/content/docs/tutorials/cni.md
new file mode 100644
index 000000000..28d58c946
--- /dev/null
+++ b/website/content/docs/tutorials/cni.md
@@ -0,0 +1,179 @@
+---
+title: "Using CNI"
+permalink: /docs/tutorials/cni/
+layout: docs
+category: User Guide
+subcategory: Tutorials
+weight: 12
+---
+
+This tutorial will show you how to set up networking for a gVisor sandbox using
+the [Container Networking Interface (CNI)](https://github.com/containernetworking/cni).
+
+## Install CNI Plugins
+
+First, you will need to install the CNI plugins. CNI plugins are used to set up
+a network namespace that `runsc` can use with the sandbox.
+
+Start by creating the directories for CNI plugin binaries:
+
+```
+sudo mkdir -p /opt/cni/bin
+```
+
+Download the CNI plugins:
+
+```
+wget https://github.com/containernetworking/plugins/releases/download/v0.8.3/cni-plugins-linux-amd64-v0.8.3.tgz
+```
+
+Next, unpack the plugins into the CNI binary directory:
+
+```
+sudo tar -xvf cni-plugins-linux-amd64-v0.8.3.tgz -C /opt/cni/bin/
+```
+
+## Configure CNI Plugins
+
+This section will show you how to configure CNI plugins. This tutorial will use
+the "bridge" and "loopback" plugins, which will create the necessary bridge and
+loopback devices in our network namespace. However, you should be able to use
+any CNI-compatible plugin to set up networking for gVisor sandboxes.
+
+The bridge plugin configuration specifies the IP address subnet range for IP
+addresses that will be assigned to sandboxes as well as the network routing
+configuration. This tutorial will assign IP addresses from the `10.22.0.0/16`
+range and allow all outbound traffic; however, you can modify this configuration
+to suit your use case.
+
+Create the bridge and loopback plugin configurations:
+
+```
+sudo mkdir -p /etc/cni/net.d
+
+sudo sh -c 'cat > /etc/cni/net.d/10-bridge.conf << EOF
+{
+  "cniVersion": "0.4.0",
+  "name": "mynet",
+  "type": "bridge",
+  "bridge": "cni0",
+  "isGateway": true,
+  "ipMasq": true,
+  "ipam": {
+    "type": "host-local",
+    "subnet": "10.22.0.0/16",
+    "routes": [
+      { "dst": "0.0.0.0/0" }
+    ]
+  }
+}
+EOF'
+
+sudo sh -c 'cat > /etc/cni/net.d/99-loopback.conf << EOF
+{
+  "cniVersion": "0.4.0",
+  "name": "lo",
+  "type": "loopback"
+}
+EOF'
+```
+
+## Create a Network Namespace
+
+For each gVisor sandbox you will create a network namespace and configure it
+using CNI. First, create a random network namespace name and then create
+the namespace.
+
+The network namespace path will then be `/var/run/netns/${CNI_CONTAINERID}`.
+
+```
+export CNI_PATH=/opt/cni/bin
+export CNI_CONTAINERID=$(printf '%x%x%x%x' $RANDOM $RANDOM $RANDOM $RANDOM)
+export CNI_COMMAND=ADD
+export CNI_NETNS=/var/run/netns/${CNI_CONTAINERID}
+
+sudo ip netns add ${CNI_CONTAINERID}
+```
+
+Next, run the bridge and loopback plugins to apply the configuration that was
+created earlier to the namespace. Each plugin outputs some JSON indicating the
+results of executing the plugin. For example, the bridge plugin's response
+includes the IP address assigned to the ethernet device created in the network
+namespace. Take note of the IP address for use later.
+
+```
+export CNI_IFNAME="eth0"
+sudo -E /opt/cni/bin/bridge < /etc/cni/net.d/10-bridge.conf
+export CNI_IFNAME="lo"
+sudo -E /opt/cni/bin/loopback < /etc/cni/net.d/99-loopback.conf
+```
+
+Get the IP address assigned to our sandbox:
+
+```
+POD_IP=$(sudo ip netns exec ${CNI_CONTAINERID} ip -4 addr show eth0 | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
+```
+
+## Create the OCI Bundle
+
+Now that our network namespace is created and configured, we can create the OCI
+bundle for our container. As part of the bundle's `config.json` we will specify
+that the container use the network namespace that we created.
+
+The container will run a simple python webserver that we will be able to
+connect to via the IP address assigned to it via the bridge CNI plugin.
+
+Create the bundle and root filesystem directories:
+
+```
+sudo mkdir -p bundle
+cd bundle
+sudo mkdir rootfs
+sudo docker export $(docker create python) | sudo tar --same-owner -pxf - -C rootfs
+sudo mkdir -p rootfs/var/www/html
+sudo sh -c 'echo "Hello World!" > rootfs/var/www/html/index.html'
+```
+
+Next, create the `config.json` specifying the network namespace.
+
+```
+sudo /usr/local/bin/runsc spec
+sudo sed -i 's;"sh";"python", "-m", "http.server";' config.json
+sudo sed -i "s;\"cwd\": \"/\";\"cwd\": \"/var/www/html\";" config.json
+sudo sed -i "s;\"type\": \"network\";\"type\": \"network\",\n\t\t\t\t\"path\": \"/var/run/netns/${CNI_CONTAINERID}\";" config.json
+```
+
+## Run the Container
+
+Now we can run and connect to the webserver. Run the container in gVisor. Use
+the same ID used for the network namespace to be consistent:
+
+```
+sudo runsc run -detach ${CNI_CONTAINERID}
+```
+
+Connect to the server via the sandbox's IP address:
+
+```
+curl http://${POD_IP}:8000/
+```
+
+You should see the server returning `Hello World!`.
+
+## Cleanup
+
+After you are finished running the container, you can clean up the network
+namespace.
+
+```
+sudo runsc kill ${CNI_CONTAINERID}
+sudo runsc delete ${CNI_CONTAINERID}
+
+export CNI_COMMAND=DEL
+
+export CNI_IFNAME="lo"
+sudo -E /opt/cni/bin/loopback < /etc/cni/net.d/99-loopback.conf
+export CNI_IFNAME="eth0"
+sudo -E /opt/cni/bin/bridge < /etc/cni/net.d/10-bridge.conf
+
+sudo ip netns delete ${CNI_CONTAINERID}
+```
diff --git a/website/content/docs/tutorials/docker.md b/website/content/docs/tutorials/docker.md
new file mode 100755
index 000000000..ddccbccd6
--- /dev/null
+++ b/website/content/docs/tutorials/docker.md
@@ -0,0 +1,75 @@
+---
+title: "WordPress with Docker"
+permalink: /docs/tutorials/docker/
+layout: docs
+category: User Guide
+subcategory: Tutorials
+weight: 25
+---
+
+This page shows you how to deploy a sample [WordPress][wordpress] site using
+[Docker][docker].
+
+### Before you begin
+
+[Follow these instructions][docker-install] to install runsc with Docker.
+This document assumes that the runtime name chosen is `runsc`.
+
+### Running WordPress
+
+Now, let's deploy a WordPress site using Docker. A WordPress site requires
+two containers: a web server in the frontend and a MySQL database in the
+backend.
+
+First, let's define a few environment variables that are shared between both
+containers:
+
+```bash
+export MYSQL_PASSWORD=${YOUR_SECRET_PASSWORD_HERE?}
+export MYSQL_DB=wordpress
+export MYSQL_USER=wordpress
+```
+
+Next, let's start the database container running MySQL and wait until the
+database is initialized:
+
+```bash
+docker run --runtime=runsc --name mysql -d \
+    -e MYSQL_RANDOM_ROOT_PASSWORD=1 \
+    -e MYSQL_PASSWORD="${MYSQL_PASSWORD}" \
+    -e MYSQL_DATABASE="${MYSQL_DB}" \
+    -e MYSQL_USER="${MYSQL_USER}" \
+    mysql:5.7
+
+# Wait until this message appears in the log.
+docker logs mysql |& grep 'port: 3306 MySQL Community Server (GPL)'
+```
+
+Once the database is running, you can start the WordPress frontend. We use the
+`--link` option to connect the frontend to the database, and expose WordPress
+on port 8080 of the localhost:
+
+```bash
+docker run --runtime=runsc --name wordpress -d \
+    --link mysql:mysql \
+    -p 8080:80 \
+    -e WORDPRESS_DB_HOST=mysql \
+    -e WORDPRESS_DB_USER="${MYSQL_USER}" \
+    -e WORDPRESS_DB_PASSWORD="${MYSQL_PASSWORD}" \
+    -e WORDPRESS_DB_NAME="${MYSQL_DB}" \
+    -e WORDPRESS_TABLE_PREFIX=wp_ \
+    wordpress
+```
+
+Now, you can access the WordPress website by pointing your favorite browser to
+`http://localhost:8080/`.
+
+Congratulations! You have just deployed a WordPress site using Docker.
+
+### What's next
+
+[Learn how to deploy WordPress with Kubernetes][wordpress-k8s].
+
+[docker]: https://www.docker.com/
+[docker-install]: /docs/user_guide/quick_start/docker/
+[wordpress]: https://wordpress.com/
+[wordpress-k8s]: /docs/tutorials/kubernetes/
diff --git a/website/content/docs/tutorials/kubernetes.md b/website/content/docs/tutorials/kubernetes.md
new file mode 100755
index 000000000..a5383cede
--- /dev/null
+++ b/website/content/docs/tutorials/kubernetes.md
@@ -0,0 +1,241 @@
+---
+title: "WordPress with Kubernetes"
+permalink: /docs/tutorials/kubernetes/
+layout: docs
+category: User Guide
+subcategory: Tutorials
+weight: 28
+---
+
+This page shows you how to deploy a sample [WordPress][wordpress] site using
+[GKE Sandbox][gke-sandbox].
+
+### Before you begin
+
+Take the following steps to enable the Kubernetes Engine API:
+
+1. Visit the [Kubernetes Engine page][project-selector] in the Google Cloud
+   Platform Console.
+1. Create or select a project.
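+
+If you prefer the command line, the Kubernetes Engine API can also be enabled
+with the `gcloud` CLI. This is a sketch, assuming the
+[Cloud SDK](https://cloud.google.com/sdk) is installed and a project is already
+selected:
+
+```bash
+gcloud services enable container.googleapis.com
+```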
+
+### Creating a node pool with gVisor enabled
+
+Create a node pool inside your cluster with the option `--sandbox type=gvisor`
+added to the command, as shown below:
+
+```bash
+gcloud beta container node-pools create sandbox-pool --cluster=${CLUSTER_NAME} --image-type=cos_containerd --sandbox type=gvisor
+```
+
+If you prefer to use the console, select your cluster and click the **ADD NODE
+POOL** button:
+
+![+ ADD NODE POOL](/docs/tutorials/node-pool-button.png)
+
+Then set the **Image type** to **Containerd** and select the **Enable sandbox
+with gVisor** option. Select other options as you like:
+
+![+ NODE POOL](/docs/tutorials/add-node-pool.png)
+
+### Check that gVisor is enabled
+
+The `gvisor` RuntimeClass is instantiated during node creation. You can check
+for the existence of the `gvisor` RuntimeClass using the following command:
+
+```bash
+kubectl get runtimeclasses
+```
+
+### WordPress deployment
+
+Now, let's deploy a WordPress site using GKE Sandbox. A WordPress site
+requires two pods: a web server in the frontend and a MySQL database in the
+backend. Both applications use PersistentVolumes to store the site data. In
+addition, they use a Kubernetes Secret to share the MySQL password between
+them.
+
+First, let's download the deployment configuration files to add the runtime
+class annotation to them:
+
+```bash
+curl -LO https://k8s.io/examples/application/wordpress/wordpress-deployment.yaml
+curl -LO https://k8s.io/examples/application/wordpress/mysql-deployment.yaml
+```
+
+Add **spec.template.spec.runtimeClassName** set to **gvisor** to both files,
+as shown below:
+
+**wordpress-deployment.yaml:**
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: wordpress
+  labels:
+    app: wordpress
+spec:
+  ports:
+    - port: 80
+  selector:
+    app: wordpress
+    tier: frontend
+  type: LoadBalancer
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: wp-pv-claim
+  labels:
+    app: wordpress
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 20Gi
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: wordpress
+  labels:
+    app: wordpress
+spec:
+  selector:
+    matchLabels:
+      app: wordpress
+      tier: frontend
+  strategy:
+    type: Recreate
+  template:
+    metadata:
+      labels:
+        app: wordpress
+        tier: frontend
+    spec:
+      runtimeClassName: gvisor # ADD THIS LINE
+      containers:
+      - image: wordpress:4.8-apache
+        name: wordpress
+        env:
+        - name: WORDPRESS_DB_HOST
+          value: wordpress-mysql
+        - name: WORDPRESS_DB_PASSWORD
+          valueFrom:
+            secretKeyRef:
+              name: mysql-pass
+              key: password
+        ports:
+        - containerPort: 80
+          name: wordpress
+        volumeMounts:
+        - name: wordpress-persistent-storage
+          mountPath: /var/www/html
+      volumes:
+      - name: wordpress-persistent-storage
+        persistentVolumeClaim:
+          claimName: wp-pv-claim
+```
+
+**mysql-deployment.yaml:**
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: wordpress-mysql
+  labels:
+    app: wordpress
+spec:
+  ports:
+    - port: 3306
+  selector:
+    app: wordpress
+    tier: mysql
+  clusterIP: None
+---
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: mysql-pv-claim
+  labels:
+    app: wordpress
+spec:
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 20Gi
+---
+apiVersion: apps/v1
+kind: Deployment
+metadata:
+  name: wordpress-mysql
+  labels:
+    app: wordpress
+spec:
+  selector:
+    matchLabels:
+      app: wordpress
+      tier: mysql
+  strategy:
+    type: Recreate
+  template:
+    metadata:
+      labels:
+        app: wordpress
+        tier: mysql
+    spec:
+      runtimeClassName: gvisor # ADD THIS LINE
+      containers:
+      - image: mysql:5.6
+        name: mysql
+        env:
+        - name: MYSQL_ROOT_PASSWORD
+          valueFrom:
+            secretKeyRef:
+              name: mysql-pass
+              key: password
+        ports:
+        - containerPort: 3306
+          name: mysql
+        volumeMounts:
+        - name: mysql-persistent-storage
+          mountPath: /var/lib/mysql
+      volumes:
+      - name: mysql-persistent-storage
+        persistentVolumeClaim:
+          claimName: mysql-pv-claim
+```
+
+Note that apart from `runtimeClassName: gvisor`, nothing else about the
+Deployment is changed.
+
+You are now ready to deploy the entire application. Just create a secret to
+store MySQL's password and *apply* both deployments:
+
+```bash
+kubectl create secret generic mysql-pass --from-literal=password=${YOUR_SECRET_PASSWORD_HERE?}
+kubectl apply -f mysql-deployment.yaml
+kubectl apply -f wordpress-deployment.yaml
+```
+
+Wait for the deployments to be ready and for an external IP to be assigned to
+the WordPress service:
+
+```bash
+watch kubectl get service wordpress
+```
+
+Now, copy the service `EXTERNAL-IP` from above into your favorite browser to
+view and configure your new WordPress site.
+
+Congratulations! You have just deployed a WordPress site using GKE Sandbox.
+
+### What's next
+
+To learn more about GKE Sandbox and how to run your deployment securely, take
+a look at the [documentation][gke-sandbox-docs].
+
+[gke-sandbox-docs]: https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods
+[gke-sandbox]: https://cloud.google.com/kubernetes-engine/sandbox/
+[project-selector]: https://console.cloud.google.com/projectselector/kubernetes
+[wordpress]: https://wordpress.com/
diff --git a/website/content/docs/tutorials/node-pool-button.png b/website/content/docs/tutorials/node-pool-button.png
new file mode 100755
index 000000000..bee0c11dc
Binary files /dev/null and b/website/content/docs/tutorials/node-pool-button.png differ
diff --git a/website/content/docs/user_guide/FAQ.md b/website/content/docs/user_guide/FAQ.md
new file mode 100755
index 000000000..951192495
--- /dev/null
+++ b/website/content/docs/user_guide/FAQ.md
@@ -0,0 +1,118 @@
+---
+title: "FAQ"
+permalink: /docs/user_guide/FAQ/
+layout: docs
+category: User Guide
+weight: 90
+---
+
+### What operating systems are supported? {#supported-os}
+
+Today, gVisor requires Linux.
+
+### What CPU architectures are supported? {#supported-cpus}
+
+gVisor currently supports [x86_64/AMD64](https://en.wikipedia.org/wiki/X86-64)
+compatible processors.
+
+### Do I need to modify my Linux application to use gVisor? {#modify-app}
+
+No. gVisor is capable of running unmodified Linux binaries.
+
+### What binary formats does gVisor support? {#supported-binaries}
+
+gVisor supports Linux
+[ELF](https://en.wikipedia.org/wiki/Executable_and_Linkable_Format) binaries.
+Binaries run in gVisor should be built for the
+[AMD64](https://en.wikipedia.org/wiki/X86-64) CPU architecture.
+
+### Can I run Docker images using gVisor? {#docker-images}
+
+Yes. Please see the [Docker Quick Start][docker].
+
+### Can I run Kubernetes pods using gVisor? {#k8s-pods}
+
+Yes. Please see the [Kubernetes Quick Start][k8s].
+
+### What's the security model? {#security-model}
+
+See the [Security Model][security-model].
+
+## Troubleshooting
+
+### My container runs fine with `runc` but fails with `runsc` {#app-compatibility}
+
+If you’re having problems running a container with `runsc`, it’s most likely
+due to a compatibility issue or a missing feature in gVisor. See
+[Debugging][debugging].
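+
+A quick way to confirm that the failure is runtime-specific is to run the same
+image under both runtimes and compare (a sketch; `my-image` is a placeholder
+for your image name):
+
+```bash
+# Runs under the default runc runtime.
+docker run --rm --runtime=runc my-image
+
+# Runs under gVisor; if only this command fails, collect the debug logs.
+docker run --rm --runtime=runsc my-image
+```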
+
+### When I run my container, docker fails with: `open /run/containerd/...//log.json: no such file or directory` {#memfd-create}
+
+You are using an older version of Linux which doesn't support `memfd_create`.
+
+This is tracked in [bug #268](https://gvisor.dev/issue/268).
+
+### When I run my container, docker fails with: `flag provided but not defined: -console` {#old-docker}
+
+You're using an old version of Docker. See [Docker Quick Start][docker].
+
+### I can’t see a file copied with: `docker cp` {#fs-cache}
+
+For performance reasons, gVisor caches directory contents, and therefore it may
+not realize a new file was copied to a given directory. To invalidate the cache
+and force a refresh, create a file under the directory in question and list the
+contents again.
+
+As a workaround, the shared root filesystem can be enabled. See
+[Filesystem][filesystem].
+
+This bug is tracked in [bug #4](https://gvisor.dev/issue/4).
+
+Note that `kubectl cp` works because it does the copy by exec'ing inside the
+sandbox, and thus gVisor's internal cache is made aware of the new files and
+directories.
+
+### I'm getting an error like: `panic: unable to attach: operation not permitted` or `fork/exec /proc/self/exe: invalid argument: unknown` {#runsc-perms}
+
+Make sure that the permissions and owner are correct on the `runsc` binary.
+
+```bash
+sudo chown root:root /usr/local/bin/runsc
+sudo chmod 0755 /usr/local/bin/runsc
+```
+
+### I'm getting an error like `mount submount "/etc/hostname": creating mount with source ".../hostname": input/output error: unknown.` {#memlock}
+
+There is a bug in Linux kernel versions 5.1 to 5.3.15, 5.4.2, and 5.5. Upgrade
+to a newer kernel or add the following to
+`/lib/systemd/system/containerd.service` as a workaround.
+
+```
+LimitMEMLOCK=infinity
+```
+
+Then run `systemctl daemon-reload && systemctl restart containerd` to restart
+containerd.
+
+See [issue #1765](https://gvisor.dev/issue/1765) for more details.
+
+### My container cannot resolve another container's name when using Docker user defined bridge {#docker-bridge}
+
+This is normally indicated by errors like `bad address 'container-name'` when
+trying to communicate with another container in the same network.
+
+Docker's user-defined bridge uses an embedded DNS server bound to the loopback
+interface on address 127.0.0.10. This requires access to the host network in
+order to communicate with the DNS server. The runsc network is isolated from
+the host and cannot access the DNS server on the host network without breaking
+the sandbox isolation. There are a few different workarounds you can try:
+
+* Use the default bridge network with `--link` to connect containers. The
+  default bridge doesn't use embedded DNS (see the example after this list).
+* Use the [`--network=host`][host-net] option in runsc; however, beware that it
+  will use the host network stack and is less secure.
+* Use IPs instead of container names.
+* Use [Kubernetes][k8s]. Container name lookup works fine in Kubernetes.
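+
+For example, the first workaround, connecting containers over the default
+bridge with `--link`, might look like this (a sketch; the container names and
+images are hypothetical):
+
+```bash
+# Start a server container on the default bridge network.
+docker run --runtime=runsc --name backend -d redis
+
+# Link a client to it; "backend" resolves via an /etc/hosts entry rather than
+# the embedded DNS server, so no host network access is needed.
+docker run --runtime=runsc --rm --link backend:backend alpine ping -c 1 backend
+```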
+
+[security-model]: /docs/architecture_guide/security/
+[host-net]: /docs/user_guide/networking/#network-passthrough
+[debugging]: /docs/user_guide/debugging/
+[filesystem]: /docs/user_guide/filesystem/
+[docker]: /docs/user_guide/quick_start/docker/
+[k8s]: /docs/user_guide/quick_start/kubernetes/
diff --git a/website/content/docs/user_guide/checkpoint_restore.md b/website/content/docs/user_guide/checkpoint_restore.md
new file mode 100755
index 000000000..c7179b550
--- /dev/null
+++ b/website/content/docs/user_guide/checkpoint_restore.md
@@ -0,0 +1,104 @@
+---
+title: "Checkpoint/Restore"
+permalink: /docs/user_guide/checkpoint_restore/
+layout: docs
+category: User Guide
+weight: 60
+---
+
+gVisor has the ability to checkpoint a process, save its current state in a
+state file, and restore into a new container using the state file.
+
+## How to use checkpoint/restore
+
+Checkpoint/restore functionality is currently available via raw `runsc`
+commands. To use the checkpoint command, first run a container.
+
+```bash
+runsc run <container-id>
+```
+
+To checkpoint the container, the `--image-path` flag must be provided. This is
+the directory path within which the checkpoint state-file will be created. The
+file will be called `checkpoint.img` and necessary directories will be created
+if they do not yet exist.
+
+> Note: Two checkpoints cannot be saved to the same directory; every image-path
+> provided must be unique.
+
+```bash
+runsc checkpoint --image-path=<path> <container-id>
+```
+
+There is also an optional `--leave-running` flag that allows the container to
+continue to run after the checkpoint has been made. (By default, containers
+stop their processes after committing a checkpoint.)
+
+> Note: All top-level runsc flags needed when calling `run` must be provided
+> to `checkpoint` if `--leave-running` is used.
+
+> Note: `--leave-running` functions by causing an immediate restore, so the
+> container, although it will keep its given container ID, may have a different
+> process ID.
+
+```bash
+runsc checkpoint --image-path=<path> --leave-running <container-id>
+```
+
+To restore, provide the image path to the `checkpoint.img` file created during
+the checkpoint. Because containers stop by default after checkpointing, restore
+needs to happen in a new container (`restore` is a command which parallels
+`start`).
+
+```bash
+runsc create <container-id>
+
+runsc restore --image-path=<path> <container-id>
+```
+
+## How to use checkpoint/restore in Docker:
+
+Currently checkpoint/restore through `runsc` is not entirely compatible with
+Docker, although there has been progress made from both gVisor and Docker to
+enable compatibility. Here, we document the ideal workflow.
+
+Run a container:
+
+```bash
+docker run [options] --runtime=runsc <image>
+```
+
+Checkpoint a container:
+
+```bash
+docker checkpoint create <container> <checkpoint_name>
+```
+
+Create a new container into which to restore:
+
+```bash
+docker create [options] --runtime=runsc <image>
+```
+
+Restore a container:
+
+```bash
+docker start --checkpoint <checkpoint_name> --checkpoint-dir=<directory> <container>
+```
+
+### Issues Preventing Compatibility with Docker
+
+- **[Moby #37360][leave-running]:** Docker version 18.03.0-ce and earlier hangs
+  when checkpointing and does not create the checkpoint. To successfully use
+  this feature, install a custom version of docker-ce from the moby repository.
+  This issue is caused by an improper implementation of the `--leave-running`
+  flag. This issue is fixed in newer releases.
+- **Docker does not support restoration into new containers:** Docker currently
+  expects the container which created the checkpoint to be the same container
+  used to restore, which is not possible in runsc. When Docker supports
+  container migration and therefore restoration into new containers, this will
+  be the flow.
+- **[Moby #37344][checkpoint-dir]:** Docker does not currently support the
+  `--checkpoint-dir` flag but this will be required when restoring from a
+  checkpoint made in another container.
+
+[leave-running]: https://github.com/moby/moby/pull/37360
+[checkpoint-dir]: https://github.com/moby/moby/issues/37344
diff --git a/website/content/docs/user_guide/compatibility/.gitignore b/website/content/docs/user_guide/compatibility/.gitignore
new file mode 100755
index 000000000..a08e1f35e
--- /dev/null
+++ b/website/content/docs/user_guide/compatibility/.gitignore
@@ -0,0 +1 @@
+linux
diff --git a/website/content/docs/user_guide/compatibility/index.md b/website/content/docs/user_guide/compatibility/index.md
new file mode 100755
index 000000000..374a0992b
--- /dev/null
+++ b/website/content/docs/user_guide/compatibility/index.md
@@ -0,0 +1,93 @@
+---
+title: Applications
+layout: docs
+category: Compatibility
+weight: 0
+permalink: /docs/user_guide/compatibility/
+---
+
+gVisor implements a large portion of the Linux surface, and while we strive to
+make it broadly compatible, there are (and always will be) unimplemented
+features and bugs. The only real way to know if an application will work is to
+try it. If you find a container that doesn’t work and there is no known issue,
+please [file a bug][bug] indicating the full command you used to run the
+image. You can view open issues related to compatibility [here][issues].
+
+If you're able to provide the [debug logs](../debugging/), the problem is
+likely to be fixed much faster.
+
+## What works?
+
+The following applications/images have been tested:
+
+* elasticsearch
+* golang
+* httpd
+* java8
+* jenkins
+* mariadb
+* memcached
+* mongo
+* mysql
+* nginx
+* node
+* php
+* postgres
+* prometheus
+* python
+* redis
+* registry
+* tomcat
+* wordpress
+
+## Utilities
+
+Most common utilities work. Note that:
+
+* Some tools, such as `tcpdump` and old versions of `ping`, require explicitly
+  enabling raw sockets via the unsafe `--net-raw` runsc flag.
+* Different Docker images can behave differently. For example, Alpine Linux and
+  Ubuntu have different `ip` binaries.
+
+Specific tools include:
+
+| Tool | Status |
+| --- | --- |
+| apt-get | Working |
+| bundle | Working |
+| cat | Working |
+| curl | Working |
+| dd | Working |
+| df | Working |
+| dig | Working |
+| drill | Working |
+| env | Working |
+| find | Working |
+| gdb | Working |
+| gosu | Working |
+| grep | Working (unless stdin is a pipe and stdout is /dev/null) |
+| ifconfig | Works partially, like ip. Full support [in progress](https://gvisor.dev/issue/578) |
+| ip | Some subcommands work (e.g. addr, route). Full support [in progress](https://gvisor.dev/issue/578) |
+| less | Working |
+| ls | Working |
+| lsof | Working |
+| mount | Works in readonly mode. gVisor doesn't currently support creating new mounts at runtime |
+| nc | Working |
+| nmap | Not working |
+| netstat | [In progress](https://gvisor.dev/issue/2112) |
+| nslookup | Working |
+| ping | Working |
+| ps | Working |
+| route | Working |
+| ss | [In progress](https://gvisor.dev/issue/2114) |
+| sshd | Partially working. Job control [in progress](https://gvisor.dev/issue/154) |
+| strace | Working |
+| tar | Working |
+| tcpdump | [In progress](https://gvisor.dev/issue/173) |
+| top | Working |
+| uptime | Working |
+| vim | Working |
+| wget | Working |
+
+[bug]: https://github.com/google/gvisor/issues/new?title=Compatibility%20Issue:
+[issues]: https://github.com/google/gvisor/issues?q=is%3Aissue+is%3Aopen+label%3A%22area%3A+compatibility%22
diff --git a/website/content/docs/user_guide/debugging.md b/website/content/docs/user_guide/debugging.md
new file mode 100755
index 000000000..9353ac907
--- /dev/null
+++ b/website/content/docs/user_guide/debugging.md
@@ -0,0 +1,135 @@
+---
+title: "Debugging"
+permalink: /docs/user_guide/debugging/
+layout: docs
+category: User Guide
+weight: 70
+---
+
+To enable debug and system call logging, add the `runtimeArgs` below to your
+[Docker](../quick_start/docker/) configuration (`/etc/docker/daemon.json`):
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--debug-log=/tmp/runsc/",
+                "--debug",
+                "--strace"
+            ]
+       }
+    }
+}
+```
+
+> Note: the last `/` in `--debug-log` is needed for it to be interpreted as a
+> directory. Each `runsc` command executed will then create a separate log
+> file; otherwise, log messages from all commands are appended to the same
+> file.
+
+You may also want to pass `--log-packets` to troubleshoot network problems.
+Then restart the Docker daemon:
+
+```bash
+sudo systemctl restart docker
+```
+
+Run your container again, and inspect the files under `/tmp/runsc`. The log
+file ending with `.boot` will contain the strace logs from your application,
+which can be useful for identifying missing or broken system calls in gVisor.
+If you are having problems starting the container, the log file ending with
+`.create` may contain the reason for the failure.
+
+## Stack traces
+
+The command `runsc debug --stacks` collects stack traces while the sandbox is
+running, which can be useful to troubleshoot issues or simply to learn more
+about gVisor. It connects to the sandbox process, collects a stack dump, and
+writes it to the console. For example:
+
+```bash
+docker run --runtime=runsc --rm -d alpine sh -c "while true; do echo running; sleep 1; done"
+63254c6ab3a6989623fa1fb53616951eed31ac605a2637bb9ddba5d8d404b35b

+sudo runsc --root /var/run/docker/runtime-runsc/moby debug --stacks 63254c6ab3a6989623fa1fb53616951eed31ac605a2637bb9ddba5d8d404b35b
+```
+
+> Note: the `--root` path is provided by Docker and is normally set to
+> `/var/run/docker/runtime-[runtime-name]/moby`. If in doubt, `--root` is
+> logged in the `runsc` logs.
+
+## Debugger
+
+You can debug gVisor like any other Go program. If you're running with Docker,
+you'll need to find the sandbox PID and attach the debugger as root. Here is an
+example:
+
+```bash
+# Get a runsc with debug symbols (download nightly or build with symbols).
+bazel build -c dbg //runsc:runsc
+
+# Start the container you want to debug.
+docker run --runtime=runsc --rm --name=test -d alpine sleep 1000
+
+# Find the sandbox PID.
+docker inspect test | grep Pid | head -n 1
+
+# Attach your favorite debugger.
+sudo dlv attach <pid>
+
+# Set a breakpoint and resume.
+break mm.MemoryManager.MMap
+continue
+```
+
+## Profiling
+
+`runsc` integrates with Go profiling tools and gives you easy commands to
+profile CPU and heap usage.
+First, you need to enable `--profile` in the command line options before
+starting the container:
+
+```json
+{
+    "runtimes": {
+        "runsc-prof": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--profile"
+            ]
+       }
+    }
+}
+```
+
+> Note: Enabling profiling loosens the seccomp protection added to the sandbox,
+> and should not be run in production under normal circumstances.
+
+Then restart Docker to refresh the runtime options. While the container is
+running, execute `runsc debug` to collect profile information and save it to a
+file. The following options are available:
+
+* **--profile-heap:** Generates a heap profile and writes it to the specified
+  file.
+* **--profile-cpu:** Enables the CPU profiler, waits for `--duration` seconds,
+  and writes the CPU profile to the specified file.
+
+For example:
+
+```bash
+docker run --runtime=runsc-prof --rm -d alpine sh -c "while true; do echo running; sleep 1; done"
+63254c6ab3a6989623fa1fb53616951eed31ac605a2637bb9ddba5d8d404b35b
+
+sudo runsc --root /var/run/docker/runtime-runsc-prof/moby debug --profile-heap=/tmp/heap.prof 63254c6ab3a6989623fa1fb53616951eed31ac605a2637bb9ddba5d8d404b35b
+sudo runsc --root /var/run/docker/runtime-runsc-prof/moby debug --profile-cpu=/tmp/cpu.prof --duration=30s 63254c6ab3a6989623fa1fb53616951eed31ac605a2637bb9ddba5d8d404b35b
+```
+
+The resulting files can be opened using `go tool pprof` or [pprof][]. The
+examples below create an image file (`.svg`) from the heap profile and write
+the top CPU-consuming functions to the console:
+
+```bash
+go tool pprof -svg /usr/local/bin/runsc /tmp/heap.prof
+go tool pprof -top /usr/local/bin/runsc /tmp/cpu.prof
+```
+
+[pprof]: https://github.com/google/pprof/blob/master/doc/README.md
diff --git a/website/content/docs/user_guide/filesystem.md b/website/content/docs/user_guide/filesystem.md
new file mode 100755
index 000000000..a320b95f3
--- /dev/null
+++ b/website/content/docs/user_guide/filesystem.md
@@ -0,0 +1,63 @@
+---
+title: "Filesystem"
+permalink: /docs/user_guide/filesystem/
+layout: docs
+category: User Guide
+weight: 40
+---
+
+gVisor accesses the filesystem through a file proxy, called the Gofer. The
+Gofer runs as a separate process that is isolated from the sandbox. Gofer
+instances communicate with their respective Sentry using the 9P protocol. For
+a more detailed explanation see
+[Overview > Gofer](../../architecture_guide/#gofer).
+
+## Sandbox overlay
+
+To isolate the host filesystem from the sandbox, you can set a writable tmpfs
+overlay on top of the entire filesystem. All modifications are made to the
+overlay, keeping the host filesystem unmodified.
+
+> Note: All created and modified files are stored in memory inside the sandbox.
+
+To use the tmpfs overlay, add the following `runtimeArgs` to your Docker
+configuration (`/etc/docker/daemon.json`) and restart the Docker daemon:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--overlay"
+            ]
+       }
+    }
+}
+```
+
+## Shared root filesystem
+
+The root filesystem is where the image is extracted and is not generally
+modified from outside the sandbox. This allows for some optimizations, like
+skipping checks to determine if a directory has changed since the last time it
+was cached; as a consequence, updates made from outside the sandbox may be
+missed. If you need to `docker cp` files inside the root filesystem, you may
+want to enable shared mode. Just be aware that filesystem access will be
+slower due to the extra checks that are required.
+
+> Note: External mounts are always shared.
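+
+For illustration, the cache staleness that shared mode avoids can be
+reproduced like this (a sketch; the container name and file paths are
+hypothetical):
+
+```bash
+# Start a long-running container with the default (non-shared) root filesystem.
+docker run --runtime=runsc --name=fs-test -d alpine sleep 1000
+
+# Copy a file into the container's root filesystem from the host.
+docker cp /etc/hostname fs-test:/root/
+
+# The copied file may be missing from the listing until the directory cache
+# is invalidated, e.g. by creating a file in the same directory.
+docker exec fs-test ls /root
+docker exec fs-test touch /root/invalidate
+docker exec fs-test ls /root
+```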
+
+To set the root filesystem to shared mode, add the following `runtimeArgs` to
+your Docker configuration (`/etc/docker/daemon.json`) and restart the Docker
+daemon:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--file-access=shared"
+            ]
+       }
+    }
+}
+```
diff --git a/website/content/docs/user_guide/install.md b/website/content/docs/user_guide/install.md
new file mode 100755
index 000000000..c5d4891bb
--- /dev/null
+++ b/website/content/docs/user_guide/install.md
@@ -0,0 +1,163 @@
+---
+title: "Installation"
+permalink: /docs/user_guide/install/
+layout: docs
+category: User Guide
+weight: 10
+---
+
+{% include required_linux.html %}
+
+## Versions
+
+The `runsc` binaries and repositories are available in multiple versions and
+release channels. You should pick the version you'd like to install. For
+experimentation, the nightly release is recommended. For production use, the
+latest release is recommended.
+
+After selecting an appropriate release channel from the options below, proceed
+to the preferred installation mechanism: manual or from an `apt` repository.
+
+### HEAD
+
+Binaries are available for every commit on the `master` branch, and are
+available at the following URL:
+
+  `https://storage.googleapis.com/gvisor/releases/master/latest/runsc`
+
+Checksums for the release binary are at:
+
+  `https://storage.googleapis.com/gvisor/releases/master/latest/runsc.sha512`
+
+For `apt` installation, use `master` as `${DIST}` below.
+
+### Nightly
+
+Nightly releases are built most nights from the master branch, and are
+available at the following URL:
+
+  `https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc`
+
+Checksums for the release binary are at:
+
+  `https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc.sha512`
+
+Specific nightly releases can be found at:
+
+  `https://storage.googleapis.com/gvisor/releases/nightly/${yyyy-mm-dd}/runsc`
+
+Note that a release may not be available every day.
+
+For `apt` installation, use `nightly` as `${DIST}` below.
+
+### Latest release
+
+The latest official release is available at the following URL:
+
+  `https://storage.googleapis.com/gvisor/releases/release/latest`
+
+For `apt` installation, use `release` as `${DIST}` below.
+
+### Specific release
+
+A given release is available at the following URL:
+
+  `https://storage.googleapis.com/gvisor/releases/release/${yyyymmdd}`
+
+See the [releases][releases] page for information about specific releases.
+
+For `apt` installation of a specific release, which may include point updates,
+use the date of the release, e.g. `${yyyymmdd}`, as `${DIST}` below.
+
+> Note: only newer releases may be available as `apt` repositories.
+
+### Point release
+
+A given point release is available at the following URL:
+
+  `https://storage.googleapis.com/gvisor/releases/release/${yyyymmdd}.${rc}`
+
+Note that `apt` installation of a specific point release is not supported.
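+
+A point release can still be installed manually from the URL above, following
+the same procedure as the direct installation described later on this page (a
+sketch; substitute real values for the placeholders, and it assumes the
+release publishes a matching `runsc.sha512` checksum, as the nightly channel
+does):
+
+```bash
+URL=https://storage.googleapis.com/gvisor/releases/release/${yyyymmdd}.${rc}
+wget ${URL}/runsc ${URL}/runsc.sha512
+sha512sum -c runsc.sha512
+```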
+
+## Install from an `apt` repository
+
+First, appropriate dependencies must be installed to allow `apt` to install
+packages via HTTPS:
+
+```bash
+sudo apt-get update && \
+sudo apt-get install -y \
+    apt-transport-https \
+    ca-certificates \
+    curl \
+    gnupg-agent \
+    software-properties-common
+```
+
+Next, the key used to sign archives should be added to your `apt` keychain:
+
+```bash
+curl -fsSL https://gvisor.dev/archive.key | sudo apt-key add -
+```
+
+Based on the release type, you will need to substitute `${DIST}` below, using
+one of:
+
+* `master`: For HEAD.
+* `nightly`: For nightly releases.
+* `release`: For the latest release.
+* `${yyyymmdd}`: For a specific release (see above).
+
+The repository for the release you wish to install should be added:
+
+```bash
+sudo add-apt-repository "deb https://storage.googleapis.com/gvisor/releases ${DIST} main"
+```
+
+For example, to install the latest official release, you can use:
+
+```bash
+sudo add-apt-repository "deb https://storage.googleapis.com/gvisor/releases release main"
+```
+
+Now the `runsc` package can be installed:
+
+```bash
+sudo apt-get update && sudo apt-get install -y runsc
+```
+
+If you have Docker installed, it will be automatically configured.
+
+## Install directly
+
+The binary URLs provided above can be used to install directly. For example,
+the latest nightly binary can be downloaded, validated, and placed in an
+appropriate location by running:
+
+```bash
+(
+  set -e
+  URL=https://storage.googleapis.com/gvisor/releases/nightly/latest
+  wget ${URL}/runsc
+  wget ${URL}/runsc.sha512
+  sha512sum -c runsc.sha512
+  rm -f runsc.sha512
+  sudo mv runsc /usr/local/bin
+  sudo chown root:root /usr/local/bin/runsc
+  sudo chmod 0755 /usr/local/bin/runsc
+)
+```
+
+**It is important to copy this binary to a location that is accessible to all
+users, and ensure it is executable by all users**, since `runsc` executes
+itself as user `nobody` to avoid unnecessary privileges. The `/usr/local/bin`
+directory is a good place to put the `runsc` binary.
+
+After installation, the `runsc` binary comes with an `install` command that
+can optionally configure Docker automatically:
+
+```bash
+runsc install
+```
+
+[releases]: https://github.com/google/gvisor/releases
diff --git a/website/content/docs/user_guide/networking.md b/website/content/docs/user_guide/networking.md
new file mode 100755
index 000000000..0971a38ff
--- /dev/null
+++ b/website/content/docs/user_guide/networking.md
@@ -0,0 +1,89 @@
+---
+title: "Networking"
+permalink: /docs/user_guide/networking/
+layout: docs
+category: User Guide
+weight: 50
+---
+
+gVisor implements its own network stack called [netstack][netstack]. All
+aspects of the network stack are handled inside the Sentry — including TCP
+connection state, control messages, and packet assembly — keeping it isolated
+from the host network stack. Data link layer packets are written directly to
+the virtual device inside the network namespace set up by Docker or
+Kubernetes.
+
+The IP address and routes configured for the device are transferred inside the
+sandbox. The loopback device runs exclusively inside the sandbox and does not
+use the host. You can inspect them by running:
+
+```bash
+docker run --rm --runtime=runsc alpine ip addr
+```
+
+## Network passthrough
+
+For high-performance networking applications, you may choose to disable the
+user space network stack and instead use the host network stack, including the
+loopback. Note that this mode decreases the isolation from the host.
+
+Add the following `runtimeArgs` to your Docker configuration
+(`/etc/docker/daemon.json`) and restart the Docker daemon:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--network=host"
+            ]
+       }
+    }
+}
+```
+
+## Disabling external networking
+
+To completely isolate the host and network from the sandbox, external
+networking can be disabled. The sandbox will still contain a loopback provided
+by netstack.
+
+Add the following `runtimeArgs` to your Docker configuration
+(`/etc/docker/daemon.json`) and restart the Docker daemon:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--network=none"
+            ]
+       }
+    }
+}
```
+
+### Disable GSO {#gso}
+
+If your Linux kernel is older than 4.14.17, you can disable Generic
+Segmentation Offload (GSO) to run on any kernel newer than 3.17. Add the
+`--gso=false` flag to your Docker runtime configuration
+(`/etc/docker/daemon.json`) and restart the Docker daemon:
+
+> Note: Network performance, especially for large payloads, will be greatly reduced.
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--gso=false"
+            ]
+       }
+    }
+}
+```
+
+[netstack]: https://github.com/google/netstack
diff --git a/website/content/docs/user_guide/platforms.md b/website/content/docs/user_guide/platforms.md
new file mode 100755
index 000000000..b32386bc9
--- /dev/null
+++ b/website/content/docs/user_guide/platforms.md
@@ -0,0 +1,122 @@
+---
+title: "Platforms (KVM)"
+permalink: /docs/user_guide/platforms/
+layout: docs
+category: User Guide
+weight: 30
+---
+
+This document will help you set up your system to use a different gVisor
+platform.
+
+## What is a Platform?
+
+gVisor requires a *platform* to implement interception of syscalls, basic
+context switching, and memory mapping functionality. These are described in
+more depth in the [Platform Design](../../architecture_guide/platforms/).
+
+## Selecting a Platform
+
+The platform is selected by the `--platform` command line flag passed to
+`runsc`. By default, the ptrace platform is selected. To select a different
+platform, modify your Docker configuration (`/etc/docker/daemon.json`) to
+pass this argument:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--platform=kvm"
+            ]
+       }
+    }
+}
+```
+
+You must restart the Docker daemon after making changes to this file;
+typically this is done via `systemd`:
+
+```bash
+sudo systemctl restart docker
+```
+
+## Example: Using the KVM Platform
+
+The KVM platform is currently experimental; however, it provides several
+benefits over the default ptrace platform.
+
+### Prerequisites
+
+You will also need to have KVM installed on your system. If you are running a
+Debian-based system such as Debian or Ubuntu, you can usually do this by
+installing the `qemu-kvm` package.
+
+```bash
+sudo apt-get install qemu-kvm
+```
+
+If you are using a virtual machine, you will need to make sure that nested
+virtualization is configured. Here are links to documents on how to set up
+nested virtualization in several popular environments:
+
+* Google Cloud: [Enabling Nested Virtualization for VM Instances][nested-gcp]
+* Microsoft Azure: [How to enable nested virtualization in an Azure VM][nested-azure]
+* VirtualBox: [Nested Virtualization][nested-virtualbox]
+* KVM: [Nested Guests][nested-kvm]
+
+***Note: nested virtualization will have poor performance and is historically
+a cause of security issues (e.g.
+[CVE-2018-12904](https://nvd.nist.gov/vuln/detail/CVE-2018-12904)). It is not
+recommended for production.***
+
+### Configuring Docker
+
+Per above, you will need to configure Docker to use `runsc` with the KVM
+platform. You will remember from the Docker Quick Start that you configured
+Docker to use `runsc` as the runtime. Docker allows you to add multiple
+runtimes to the Docker configuration.
+
+Add a new entry for the KVM platform to your Docker configuration
+(`/etc/docker/daemon.json`) in order to provide the `--platform=kvm` runtime
+argument.
+
+In the end, the file should look something like:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc"
+        },
+        "runsc-kvm": {
+            "path": "/usr/local/bin/runsc",
+            "runtimeArgs": [
+                "--platform=kvm"
+            ]
+        }
+    }
+}
+```
+
+You must restart the Docker daemon after making changes to this file;
+typically this is done via `systemd`:
+
+```bash
+sudo systemctl restart docker
+```
+
+## Running a container
+
+Now run your container using the `runsc-kvm` runtime. This will run the
+container using the KVM platform:
+
+```bash
+docker run --runtime=runsc-kvm --rm hello-world
+```
+
+[nested-azure]: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/nested-virtualization
+[nested-gcp]: https://cloud.google.com/compute/docs/instances/enable-nested-virtualization-vm-instances
+[nested-virtualbox]: https://www.virtualbox.org/manual/UserManual.html#nested-virt
+[nested-kvm]: https://www.linux-kvm.org/page/Nested_Guests
diff --git a/website/content/docs/user_guide/quick_start/docker.md b/website/content/docs/user_guide/quick_start/docker.md
new file mode 100755
index 000000000..4afeb3e2f
--- /dev/null
+++ b/website/content/docs/user_guide/quick_start/docker.md
@@ -0,0 +1,98 @@
+---
+title: "Docker Quick Start"
+permalink: /docs/user_guide/quick_start/docker/
+layout: docs
+category: User Guide
+subcategory: Quick Start
+weight: 15
+---
+
+> Note: This guide requires Docker version 17.09.0 or greater. Refer to the
+> [Docker documentation][docker] for how to install it.
+
+This guide will help you quickly get started running Docker containers using
+gVisor.
+
+First, follow the [Installation guide][install].
+
+If you use the `apt` repository or the `automated` install, then you can skip
+the next section and proceed straight to running a container.
+
+## Configuring Docker
+
+First, you will need to configure Docker to use `runsc` by adding a runtime
+entry to your Docker configuration (`/etc/docker/daemon.json`). You may have
+to create this file if it does not exist. Some Docker versions also require
+you to [specify the `storage-driver` field][storage-driver].
+
+In the end, the file should look something like:
+
+```json
+{
+    "runtimes": {
+        "runsc": {
+            "path": "/usr/local/bin/runsc"
+        }
+    }
+}
+```
+
+You must restart the Docker daemon after making changes to this file;
+typically this is done via `systemd`:
+
+```bash
+sudo systemctl restart docker
+```
+
+## Running a container
+
+Now run your container using the `runsc` runtime:
+
+```bash
+docker run --runtime=runsc --rm hello-world
+```
+
+You can also run a terminal to explore the container.
+
+```bash
+docker run --runtime=runsc --rm -it ubuntu /bin/bash
+```
+
+Many Docker options are compatible with gVisor; try them out.
+Here is an example:
+
+```bash
+docker run --runtime=runsc --rm --link backend:database -v ~/bin:/tools:ro -p 8080:80 --cpus=0.5 -it busybox telnet towel.blinkenlights.nl
+```
+
+## Verify the runtime
+
+You can verify that you are running in gVisor using the `dmesg` command.
+
+```text
+$ docker run --runtime=runsc -it ubuntu dmesg
+[    0.000000] Starting gVisor...
+[    0.354495] Daemonizing children...
+[    0.564053] Constructing home...
+[    0.976710] Preparing for the zombie uprising...
+[    1.299083] Creating process schedule...
+[    1.479987] Committing treasure map to memory...
+[    1.704109] Searching for socket adapter...
+[    1.748935] Generating random numbers by fair dice roll...
+[    2.059747] Digging up root...
+[    2.259327] Checking naughty and nice process list...
+[    2.610538] Rewriting operating system in Javascript...
+[    2.613217] Ready!
+```
+
+Note that this is easily replicated by an attacker, so applications should
+never use `dmesg` to verify the runtime in a security-sensitive context.
+
+Next, look at the different options available for gVisor:
+[platform][platforms], [network][networking], [filesystem][filesystem].
+
+[docker]: https://docs.docker.com/install/
+[storage-driver]: https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-storage-driver
+[install]: /docs/user_guide/install/
+[filesystem]: /docs/user_guide/filesystem/
+[networking]: /docs/user_guide/networking/
+[platforms]: /docs/user_guide/platforms/
diff --git a/website/content/docs/user_guide/quick_start/kubernetes.md b/website/content/docs/user_guide/quick_start/kubernetes.md
new file mode 100755
index 000000000..689305082
--- /dev/null
+++ b/website/content/docs/user_guide/quick_start/kubernetes.md
@@ -0,0 +1,43 @@
+---
+title: "Kubernetes"
+permalink: /docs/user_guide/quick_start/kubernetes/
+layout: docs
+category: User Guide
+subcategory: Quick Start
+weight: 17
+---
+
+gVisor can be used to run Kubernetes pods and has several integration points
+with Kubernetes.
+
+## Using Minikube
+
+gVisor can run sandboxed containers in a Kubernetes cluster with Minikube.
+After the gVisor addon is enabled, pods with
+`io.kubernetes.cri.untrusted-workload` set to true will execute with `runsc`.
+Follow [these instructions][minikube] to enable the gVisor addon.
+
+## Using Containerd
+
+You can also set up Kubernetes nodes to run pods in gVisor using the
+[containerd][containerd] CRI runtime and the `gvisor-containerd-shim`. You can
+use either the `io.kubernetes.cri.untrusted-workload` annotation or
+[RuntimeClass][runtimeclass] to run Pods with `runsc`. You can find
+instructions [here][gvisor-containerd-shim].
+
+## Using GKE Sandbox
+
+[GKE Sandbox][gke-sandbox] is available in [Google Kubernetes Engine][gke]. You
+just need to deploy a node pool with gVisor enabled in your cluster, and it
+will run pods annotated with `runtimeClassName: gvisor` inside a gVisor
+sandbox for you. [Here][wordpress-quick] is a quick example showing how to
+deploy a WordPress site. You can view the full documentation
+[here][gke-sandbox-docs].
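+
+For reference, such an annotated pod might look like this (a minimal sketch,
+not a production manifest; the pod name and image are placeholders):
+
+```bash
+kubectl apply -f - <<EOF
+apiVersion: v1
+kind: Pod
+metadata:
+  name: nginx-gvisor  # hypothetical name
+spec:
+  runtimeClassName: gvisor
+  containers:
+  - name: nginx
+    image: nginx
+EOF
+```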
+
+[containerd]: https://containerd.io/
+[minikube]: https://github.com/kubernetes/minikube/blob/master/deploy/addons/gvisor/README.md
+[gke]: https://cloud.google.com/kubernetes-engine/
+[gke-sandbox]: https://cloud.google.com/kubernetes-engine/sandbox/
+[gke-sandbox-docs]: https://cloud.google.com/kubernetes-engine/docs/how-to/sandbox-pods
+[gvisor-containerd-shim]: https://github.com/google/gvisor-containerd-shim
+[runtimeclass]: https://kubernetes.io/docs/concepts/containers/runtime-class/
+[wordpress-quick]: /docs/tutorials/kubernetes/
diff --git a/website/content/docs/user_guide/quick_start/oci.md b/website/content/docs/user_guide/quick_start/oci.md
new file mode 100755
index 000000000..62e49e409
--- /dev/null
+++ b/website/content/docs/user_guide/quick_start/oci.md
@@ -0,0 +1,51 @@
+---
+title: "OCI Quick Start"
+permalink: /docs/user_guide/quick_start/oci/
+layout: docs
+category: User Guide
+subcategory: Quick Start
+weight: 19
+---
+
+This guide will quickly get you started running your first gVisor sandbox
+container using the runtime directly with the default platform.
+
+First, follow the [Installation guide][install].
+
+## Run an OCI compatible container
+
+Now we will create an [OCI][oci] container bundle to run our container.
+First, we will create a root directory for our bundle:
+
+```bash
+mkdir bundle
+cd bundle
+```
+
+Create a root file system for the container. We will use the Docker
+hello-world image as the basis for our container.
+
+```bash
+mkdir rootfs
+docker export $(docker create hello-world) | tar -xf - -C rootfs
+```
+
+Next, create a specification file called `config.json` that contains our
+container specification. We will update the default command it runs to
+`/hello` in the `hello-world` container.
+
+```bash
+runsc spec
+sed -i 's;"sh";"/hello";' config.json
+```
+
+Finally, run the container:
+
+```bash
+sudo runsc run hello
+```
+
+Next, try [using CNI to set up networking](../../../tutorials/cni/) or
+[running gVisor using Docker](../docker/).
+
+[oci]: https://opencontainers.org/
+[install]: /docs/user_guide/install/
-- 
cgit v1.2.3