# gVisor

gVisor is a user-space kernel, written in Go, that implements a substantial
portion of the Linux system surface. It includes an [Open Container Initiative
(OCI)][oci] runtime called `runsc` that provides an isolation boundary between
the application and the host kernel. The `runsc` runtime integrates with Docker
and Kubernetes, making it simple to run sandboxed containers.

gVisor takes a distinct approach to container sandboxing and makes a different
set of technical trade-offs compared to existing sandbox technologies, thus
providing new tools and ideas for the container security landscape.

### Why does gVisor exist?

Containers are not a [**sandbox**][sandbox]. While containers have
revolutionized how we develop, package, and deploy applications, running
untrusted or potentially malicious code without additional isolation is not a
good idea. The efficiency and performance gains from using a single, shared
kernel also mean that container escape is possible with a single vulnerability.

gVisor is a user-space kernel for containers. It limits the host kernel surface
accessible to the application while still giving the application access to all
the features it expects. Unlike most kernels, gVisor does not assume or require
a fixed set of physical resources; instead, it leverages existing host kernel
functionality and runs as a normal user-space process. In other words, gVisor
implements Linux by way of Linux.

gVisor should not be confused with technologies and tools to harden containers
against external threats, provide additional integrity checks, or limit the
scope of access for a service. One should always be careful about what data is
made available to a container.

### How is gVisor different from other container isolation mechanisms?

Two other approaches are commonly taken to provide stronger isolation than
native containers.

![Machine-level virtualization](g3doc/Machine-Virtualization.png "Machine-level
virtualization")

**Machine-level virtualization**, such as [KVM][kvm] and [Xen][xen], exposes
virtualized hardware to a guest kernel via a Virtual Machine Monitor (VMM). This
virtualized hardware is generally enlightened (paravirtualized) and additional
mechanisms can be used to improve the visibility between the guest and host
(e.g. balloon drivers, paravirtualized spinlocks). Running containers in
distinct virtual machines can provide great isolation, compatibility and
performance (though nested virtualization may bring challenges in this area),
but for containers it often requires additional proxies and agents, and may
require a larger resource footprint and slower start-up times.

![Rule-based execution](g3doc/Rule-Based-Execution.png "Rule-based execution")

**Rule-based execution**, such as [seccomp][seccomp], [SELinux][selinux] and
[AppArmor][apparmor], allows the specification of a fine-grained security policy
for an application or container. These schemes typically rely on hooks
implemented inside the host kernel to enforce the rules. If the surface can be
made small enough (i.e. a sufficiently complete policy defined), then this is an
excellent way to sandbox applications and maintain native performance. However,
in practice it can be extremely difficult (if not impossible) to reliably define
a policy for arbitrary, previously unknown applications, making this approach
challenging to apply universally.

Rule-based execution is often combined with additional layers for
defense-in-depth.

![gVisor](g3doc/Layers.png "gVisor")

**gVisor** provides a third isolation mechanism, distinct from those mentioned
above.

gVisor intercepts application system calls and acts as the guest kernel, without
the need for translation through virtualized hardware. gVisor may be thought of
as either a merged guest kernel and VMM, or as seccomp on steroids. This
architecture allows it to provide a flexible resource footprint (i.e. one based
on threads and memory mappings, not fixed guest physical resources) while also
lowering the fixed costs of virtualization. However, this comes at the price of
reduced application compatibility and higher per-system call overhead.

On top of this, gVisor employs rule-based execution to provide defense-in-depth
(details below).

gVisor's approach is similar to [User Mode Linux (UML)][uml], although UML
virtualizes hardware internally and thus provides a fixed resource footprint.

Each of the above approaches may excel in distinct scenarios. For example,
machine-level virtualization will face challenges achieving high density, while
gVisor may provide poor performance for system call heavy workloads.

### Why Go?

gVisor was written in Go in order to avoid security pitfalls that can plague
kernels. With Go, there are strong types, built-in bounds checks, no
uninitialized variables, no use-after-free, no stack overflow, and a built-in
race detector. (The use of Go has its challenges too, and isn't free.)

## Architecture

gVisor intercepts all system calls made by the application, and does the
necessary work to service them. Importantly, gVisor does not simply redirect
application system calls through to the host kernel. Instead, gVisor implements
most kernel primitives (signals, file systems, futexes, pipes, mm, etc.) and has
complete system call handlers built on top of these primitives.

Since gVisor is itself a user-space application, it will make some host system
calls to support its operation, but much like a VMM, it will not allow the
application to directly control the system calls it makes.

### File System Access

![Sentry](g3doc/Sentry-Gofer.png)

In order to provide defense-in-depth and limit the host system surface, the
gVisor container runtime is normally split into two separate processes. First,
the *Sentry* process includes the kernel and is responsible for executing user
code and handling system calls. Second, file system operations that extend
beyond the sandbox (not internal proc or tmp files, pipes, etc.) are sent to a
proxy, called a *Gofer*, via a 9P connection.

The Gofer acts as a file system proxy by opening host files on behalf of the
application, and passing them to the Sentry process, which has no host file
access itself. Furthermore, the Sentry runs in an empty user namespace, and the
system calls made by gVisor to the host are restricted using seccomp filters in
order to provide defense-in-depth.

### Network Access

The Sentry implements its own network stack (also written in Go) called
[netstack][netstack]. All aspects of the network stack are handled inside the
Sentry, including TCP connection state, control messages, and packet assembly,
keeping it isolated from the host network stack. Data link layer packets are
written directly to the virtual device inside the network namespace set up by
Docker or Kubernetes.

A network passthrough mode is also supported, but comes at the cost of reduced
isolation (see below).

### Platforms

The Sentry requires a *platform* to implement basic context switching and memory
mapping functionality. Today, gVisor supports two platforms:

* The **Ptrace** platform uses SYSEMU functionality to execute user code without
  executing host system calls. This platform can run anywhere that `ptrace`
  works (even VMs without nested virtualization).

* The **KVM** platform allows the Sentry to act as both guest OS and VMM,
  switching back and forth between the two worlds seamlessly. The KVM platform
  can run on bare metal or in a VM with nested virtualization enabled. While
  there is no virtualized hardware layer (the sandbox retains a process model),
  gVisor leverages virtualization extensions available on modern processors in
  order to improve isolation and performance of address space switches.

### Performance

Several factors influence performance. The platform choice has the largest
direct impact, and its effect varies depending on the specific workload. There
is no single best platform: Ptrace works universally, including on VM instances,
but applications may perform at a fraction of their original levels. Beyond the
platform choice, passthrough modes may be useful for improving performance at
the cost of some isolation.

## Installation

These instructions will get you up and running with sandboxed containers using
gVisor and Docker.

### Requirements

* [git][git]
* [Bazel][bazel]
* [Docker version 17.09.0 or greater][docker]

### Getting the source

Clone the gVisor repo:

```
git clone https://gvisor.googlesource.com/gvisor gvisor
cd gvisor
```

### Building

Build and install the `runsc` binary.

It is important to copy this binary to a location that is accessible to all
users, since `runsc` executes itself as user `nobody` to avoid unnecessary
privileges. The `/usr/local/bin` directory is a good choice.

```
bazel build runsc
sudo cp ./bazel-bin/runsc/linux_amd64_pure_stripped/runsc /usr/local/bin
```

### Configuring Docker

Next, configure Docker to use `runsc` by adding a runtime entry to your Docker
configuration (`/etc/docker/daemon.json`). You may have to create this file if
it does not exist. Note that some Docker versions also require you to [specify
the `storage-driver` field][docker-storage-driver].

In the end, the file should look something like:

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
        }
    }
}
```
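
Invalid JSON will prevent the Docker daemon from starting, so it is worth
validating the file before restarting. A minimal sketch of such a check,
assuming `python3` is available on the host (any JSON parser works):

```shell
# Parse the file; a non-zero exit status (and an error message) means the
# JSON is malformed.
python3 -m json.tool /etc/docker/daemon.json > /dev/null \
    && echo "daemon.json is valid JSON"
```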

You must restart the Docker daemon after making changes to this file. This is
typically done via:

```
sudo systemctl restart docker
```
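
After the restart, you can confirm that the daemon picked up the new runtime
(the exact output format varies across Docker versions):

```shell
# The list of registered runtimes should now include "runsc".
docker info 2>/dev/null | grep -i runtimes
```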

Now run your container in `runsc`:

```
docker run --runtime=runsc hello-world
```

Terminal support works too:

```
docker run --runtime=runsc -it ubuntu /bin/bash
```

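A quick way to check that a container is really sandboxed is to compare the
kernel identity reported inside it with the host's; under gVisor the two differ,
since the Sentry, not the host kernel, answers `uname` (the exact version string
it reports is not specified here):

```shell
# Inside the sandbox, uname reports gVisor's emulated kernel, not the host's.
docker run --rm --runtime=runsc ubuntu uname -a
# On the host, for comparison:
uname -a
```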
### Kubernetes Support (Experimental)

gVisor can run sandboxed containers in a Kubernetes cluster with cri-o, although
this is not recommended for production environments yet. Follow [these
instructions][cri-o-k8s] to run [cri-o][cri-o] on a node in a Kubernetes
cluster. Build `runsc` and put it on the node, and set it as the
`runtime_untrusted_workload` in `/etc/crio/crio.conf`.

Any Pod without the `io.kubernetes.cri-o.TrustedSandbox` annotation (or with the
annotation set to `false`) will be run with `runsc`.

Currently, gVisor only supports Pods with a single container (not counting the
ever-present pause container). Support for multiple containers within a single
Pod is coming soon.

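The relevant entry in `/etc/crio/crio.conf` looks something like the following
(the `runsc` path shown is an assumption; use wherever you installed the
binary):

```
runtime_untrusted_workload = "/usr/local/bin/runsc"
```
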
## Advanced Usage

### Testing

The gVisor test suite can be run with Bazel:

```
bazel test ...
```

### Debugging

To enable debug and system call logging, add the `runtimeArgs` below to your
Docker configuration (`/etc/docker/daemon.json`):

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": [
                "--debug-log-dir=/tmp/runsc",
                "--debug",
                "--strace"
            ]
        }
    }
}
```

You may also want to pass `--log-packets` to troubleshoot network problems. Then
restart the Docker daemon:

```
sudo systemctl restart docker
```

Run your container again, and inspect the files under `/tmp/runsc`. The log file
with name `boot` will contain the strace logs from your application, which can
be useful for identifying missing or broken system calls in gVisor.

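For example, to look for system calls the sandbox rejected, something like the
following can be used (the exact log file names and message wording are
generated by `runsc`, so treat this pattern as an assumption):

```shell
# List the per-container log files, then search the boot (Sentry) logs for
# messages about system calls the sandbox did not handle.
ls /tmp/runsc/
grep -i "unsupported" /tmp/runsc/*boot* || echo "no unsupported syscalls logged"
```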
### Enabling network passthrough

For high-performance networking applications, you may choose to disable the
user-space network stack and instead use the host network stack. Note that this
mode decreases the isolation from the host.

Add the following `runtimeArgs` to your Docker configuration
(`/etc/docker/daemon.json`) and restart the Docker daemon:

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": [
                "--network=host"
            ]
        }
    }
}
```

### Selecting a different platform

Depending on hardware and performance characteristics, you may choose to use a
different platform. The Ptrace platform is the default, but the KVM platform may
be selected by passing the `--platform` flag to `runsc` in your Docker
configuration (`/etc/docker/daemon.json`):

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc",
            "runtimeArgs": [
                "--platform=kvm"
            ]
        }
    }
}
```

Then restart the Docker daemon.

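The KVM platform requires access to `/dev/kvm` on the host (bare metal, or a VM
with nested virtualization enabled). A quick check:

```shell
# /dev/kvm exists only when KVM virtualization is available on this machine.
if [ -e /dev/kvm ]; then
  echo "KVM is available"
else
  echo "KVM is not available; use the Ptrace platform"
fi
```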
## FAQ & Known Issues

### What works?

The following applications/images have been tested:

* golang
* httpd
* java8
* jenkins
* mariadb
* memcached
* mongo
* mysql
* node
* php
* prometheus
* python
* redis
* registry
* tomcat
* wordpress

### What doesn't work yet?

The following applications have been tested and may not yet work:

* elasticsearch: Requires unimplemented socket ioctls. See [bug
  #2](https://github.com/google/gvisor/issues/2).
* nginx: Requires `ioctl(FIOASYNC)`, but see workaround in [bug
  #1](https://github.com/google/gvisor/issues/1).
* postgres: Requires SysV shared memory support. See [bug
  #3](https://github.com/google/gvisor/issues/3).

### Will my container work with gVisor?

gVisor implements a large portion of the Linux surface, and while we strive to
make it broadly compatible, there are (and always will be) unimplemented
features and bugs. The only real way to know whether a container will work is to
try it. If you find a container that doesn't work and there is no known issue,
please [file a bug][bug] indicating the full command you used to run the image.
Providing the debug logs is also helpful.

### My container runs fine with *runc* but fails with *runsc*.

If you're having problems running a container with `runsc`, it's most likely due
to a compatibility issue or a missing feature in gVisor. See **Debugging**
above.

### I can't see a file copied with `docker cp` or `kubectl cp`.

For performance reasons, gVisor caches directory contents, and therefore it may
not realize a new file was copied to a given directory. To invalidate the cache
and force a refresh, create a file under the directory in question and list the
contents again.

This bug is tracked in [bug #4](https://github.com/google/gvisor/issues/4).

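As a concrete illustration (the container name `my-app` and the paths are
hypothetical), after a `docker cp` the copied file may only appear once the
directory cache has been invalidated:

```shell
# Copy a file in, then force gVisor to re-read the directory by creating a
# file inside it before listing the contents.
docker cp ./config.txt my-app:/data/
docker exec my-app sh -c 'touch /data/.refresh && ls /data'
```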
## Technical details

We plan to release a full paper with technical details and will include it here
when available.

## Community

Join the [gvisor-users mailing list][gvisor-discuss-list] to discuss all things
gVisor.

Sensitive security-related questions and comments can be sent to the private
[gvisor-security mailing list][gvisor-security-list].

## Contributing

See [Contributing.md](CONTRIBUTING.md).

## Disclaimer

This is not an official Google product (experimental or otherwise), it is just
code that happens to be owned by Google.

[apparmor]: https://wiki.ubuntu.com/AppArmor
[bazel]: https://bazel.build
[bug]: https://github.com/google/gvisor/issues
[cri-o]: https://github.com/kubernetes-incubator/cri-o
[cri-o-k8s]: https://github.com/kubernetes-incubator/cri-o/blob/master/kubernetes.md
[docker]: https://www.docker.com
[docker-storage-driver]: https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-storage-driver
[git]: https://git-scm.com
[gvisor-discuss-list]: https://groups.google.com/forum/#!forum/gvisor-users
[gvisor-security-list]: https://groups.google.com/forum/#!forum/gvisor-security
[kvm]: https://www.linux-kvm.org
[netstack]: https://github.com/google/netstack
[oci]: https://www.opencontainers.org
[sandbox]: https://en.wikipedia.org/wiki/Sandbox_(computer_security)
[seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
[selinux]: https://selinuxproject.org
[uml]: http://user-mode-linux.sourceforge.net/
[xen]: https://www.xenproject.org