From 22f1890a9beab11d8cfdceba3a4d66f8bbbb468c Mon Sep 17 00:00:00 2001 From: Ian Lewis Date: Fri, 29 Mar 2019 22:40:11 -0400 Subject: Initial commit --- content/docs/architecture_guide/Layers.png | Bin 0 -> 11044 bytes content/docs/architecture_guide/Layers.svg | 1 + .../architecture_guide/Machine-Virtualization.png | Bin 0 -> 13205 bytes .../architecture_guide/Machine-Virtualization.svg | 1 + .../architecture_guide/Rule-Based-Execution.png | Bin 0 -> 6780 bytes .../architecture_guide/Rule-Based-Execution.svg | 1 + content/docs/architecture_guide/Sentry-Gofer.png | Bin 0 -> 9064 bytes content/docs/architecture_guide/Sentry-Gofer.svg | 1 + content/docs/architecture_guide/_index.md | 80 ++++++++ content/docs/architecture_guide/overview.md | 88 ++++++++ content/docs/architecture_guide/performance.md | 39 ++++ content/docs/architecture_guide/security.md | 221 +++++++++++++++++++++ 12 files changed, 432 insertions(+) create mode 100644 content/docs/architecture_guide/Layers.png create mode 100644 content/docs/architecture_guide/Layers.svg create mode 100644 content/docs/architecture_guide/Machine-Virtualization.png create mode 100644 content/docs/architecture_guide/Machine-Virtualization.svg create mode 100644 content/docs/architecture_guide/Rule-Based-Execution.png create mode 100644 content/docs/architecture_guide/Rule-Based-Execution.svg create mode 100644 content/docs/architecture_guide/Sentry-Gofer.png create mode 100644 content/docs/architecture_guide/Sentry-Gofer.svg create mode 100644 content/docs/architecture_guide/_index.md create mode 100644 content/docs/architecture_guide/overview.md create mode 100644 content/docs/architecture_guide/performance.md create mode 100644 content/docs/architecture_guide/security.md (limited to 'content/docs/architecture_guide') diff --git a/content/docs/architecture_guide/Layers.png b/content/docs/architecture_guide/Layers.png new file mode 100644 index 000000000..308c6c451 Binary files /dev/null and 
b/content/docs/architecture_guide/Layers.png differ diff --git a/content/docs/architecture_guide/Layers.svg b/content/docs/architecture_guide/Layers.svg new file mode 100644 index 000000000..0a366f841 --- /dev/null +++ b/content/docs/architecture_guide/Layers.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/content/docs/architecture_guide/Machine-Virtualization.png b/content/docs/architecture_guide/Machine-Virtualization.png new file mode 100644 index 000000000..1ba2ed6b2 Binary files /dev/null and b/content/docs/architecture_guide/Machine-Virtualization.png differ diff --git a/content/docs/architecture_guide/Machine-Virtualization.svg b/content/docs/architecture_guide/Machine-Virtualization.svg new file mode 100644 index 000000000..5352da07b --- /dev/null +++ b/content/docs/architecture_guide/Machine-Virtualization.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/content/docs/architecture_guide/Rule-Based-Execution.png b/content/docs/architecture_guide/Rule-Based-Execution.png new file mode 100644 index 000000000..b42654a90 Binary files /dev/null and b/content/docs/architecture_guide/Rule-Based-Execution.png differ diff --git a/content/docs/architecture_guide/Rule-Based-Execution.svg b/content/docs/architecture_guide/Rule-Based-Execution.svg new file mode 100644 index 000000000..bd6717043 --- /dev/null +++ b/content/docs/architecture_guide/Rule-Based-Execution.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/content/docs/architecture_guide/Sentry-Gofer.png b/content/docs/architecture_guide/Sentry-Gofer.png new file mode 100644 index 000000000..ca2c27ef7 Binary files /dev/null and b/content/docs/architecture_guide/Sentry-Gofer.png differ diff --git a/content/docs/architecture_guide/Sentry-Gofer.svg b/content/docs/architecture_guide/Sentry-Gofer.svg new file mode 100644 index 000000000..5c10750d2 --- /dev/null +++ b/content/docs/architecture_guide/Sentry-Gofer.svg @@ -0,0 +1 @@ + \ No newline at end of file diff --git 
a/content/docs/architecture_guide/_index.md b/content/docs/architecture_guide/_index.md new file mode 100644 index 000000000..c1c16a79c --- /dev/null +++ b/content/docs/architecture_guide/_index.md @@ -0,0 +1,80 @@ ++++ +title = "Architecture Guide" +weight = 20 ++++ +gVisor provides a fully-virtualized environment in order to sandbox untrusted +containers. The system interfaces normally implemented by the host kernel are +moved into a distinct, per-sandbox user space kernel in order to minimize the +risk of an exploit. gVisor does not introduce large fixed overheads however, +and still retains a process-like model with respect to resource utilization. + +## How is this different? + +Two other approaches are commonly taken to provide stronger isolation than +native containers. + +**Machine-level virtualization**, such as [KVM][kvm] and [Xen][xen], exposes +virtualized hardware to a guest kernel via a Virtual Machine Monitor (VMM). This +virtualized hardware is generally enlightened (paravirtualized) and additional +mechanisms can be used to improve the visibility between the guest and host +(e.g. balloon drivers, paravirtualized spinlocks). Running containers in +distinct virtual machines can provide great isolation, compatibility and +performance (though nested virtualization may bring challenges in this area), +but for containers it often requires additional proxies and agents, and may +require a larger resource footprint and slower start-up times. + +![Machine-level virtualization](Machine-Virtualization.png "Machine-level virtualization") + +**Rule-based execution**, such as [seccomp][seccomp], [SELinux][selinux] and +[AppArmor][apparmor], allows the specification of a fine-grained security policy +for an application or container. These schemes typically rely on hooks +implemented inside the host kernel to enforce the rules. If the surface can be +made small enough (i.e. 
a sufficiently complete policy defined), then this is an +excellent way to sandbox applications and maintain native performance. However, +in practice it can be extremely difficult (if not impossible) to reliably define +a policy for arbitrary, previously unknown applications, making this approach +challenging to apply universally. + +![Rule-based execution](Rule-Based-Execution.png "Rule-based execution") + +Rule-based execution is often combined with additional layers for +defense-in-depth. + +**gVisor** provides a third isolation mechanism, distinct from those above. + +gVisor intercepts application system calls and acts as the guest kernel, without +the need for translation through virtualized hardware. gVisor may be thought of +as either a merged guest kernel and VMM, or as seccomp on steroids. This +architecture allows it to provide a flexible resource footprint (i.e. one based +on threads and memory mappings, not fixed guest physical resources) while also +lowering the fixed costs of virtualization. However, this comes at the price of +reduced application compatibility and higher per-system call overhead. + +![gVisor](Layers.png "gVisor") + +On top of this, gVisor employs rule-based execution to provide defense-in-depth +(details below). + +gVisor's approach is similar to [User Mode Linux (UML)][uml], although UML +virtualizes hardware internally and thus provides a fixed resource footprint. + +Each of the above approaches may excel in distinct scenarios. For example, +machine-level virtualization will face challenges achieving high density, while +gVisor may provide poor performance for system call heavy workloads. + +### Why Go? + +gVisor is written in [Go][golang] in order to avoid security pitfalls that can +plague kernels. With Go, there are strong types, built-in bounds checks, no +uninitialized variables, no use-after-free, no stack overflow, and a built-in +race detector. (The use of Go has its challenges too, and isn't free.) 
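Two of the properties listed above, built-in bounds checks and the absence of uninitialized variables, can be illustrated with a small standalone program. This is a minimal sketch and not code from gVisor; the names `readAt` and `counters` are hypothetical and exist only for this example.

```go
package main

import "fmt"

// readAt returns buf[i]. An out-of-range index triggers a runtime panic
// (Go's built-in bounds check), which is recovered here and reported as an
// error instead of reading adjacent memory.
func readAt(buf []byte, i int) (b byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return buf[i], nil
}

// counters is declared below without an initializer; Go zero-fills every
// field, so there is no uninitialized state to leak.
type counters struct {
	hits  int
	label string
	next  *counters
}

func main() {
	buf := []byte{1, 2, 3}
	if _, err := readAt(buf, 7); err != nil {
		fmt.Println("out-of-range read rejected:", err)
	}

	var c counters
	fmt.Println(c.hits == 0, c.label == "", c.next == nil)
}
```

In a C kernel the equivalent out-of-range read would typically succeed silently, which is the class of bug the language choice is meant to rule out.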
+ +[apparmor]: https://wiki.ubuntu.com/AppArmor +[golang]: https://golang.org +[kvm]: https://www.linux-kvm.org +[oci]: https://www.opencontainers.org +[sandbox]: https://en.wikipedia.org/wiki/Sandbox_(computer_security) +[seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt +[selinux]: https://selinuxproject.org +[uml]: http://user-mode-linux.sourceforge.net/ +[xen]: https://www.xenproject.org diff --git a/content/docs/architecture_guide/overview.md b/content/docs/architecture_guide/overview.md new file mode 100644 index 000000000..dc963dc70 --- /dev/null +++ b/content/docs/architecture_guide/overview.md @@ -0,0 +1,88 @@ ++++ +title = "Overview & Platforms" +weight = 10 ++++ +A gVisor sandbox consists of multiple processes when running. These processes +collectively comprise a shared environment in which one or more containers can +be run. + +Each sandbox has its own isolated instance of: + +* The **Sentry**, a user-space kernel that runs the container and intercepts + and responds to system calls made by the application. + +Each container running in the sandbox has its own isolated instance of: + +* A **Gofer** which provides file system access to the container. + +![gVisor architecture diagram](../Sentry-Gofer.png "gVisor architecture diagram") + +## runsc + +The entrypoint to running a sandboxed container is the `runsc` executable. +`runsc` implements the [Open Container Initiative (OCI)][oci] runtime +specification. This means that OCI compatible _filesystem bundles_ can be run by +`runsc`. Filesystem bundles consist of a `config.json` file containing +container configuration, and a root filesystem for the container. Please see +the [OCI runtime spec][runtime-spec] for more information on filesystem bundles. +`runsc` implements multiple commands that perform various functions such as +starting, stopping, listing, and querying the status of containers. + +## The Sentry + +The Sentry is the largest component of gVisor.
It can be thought of as a +userspace OS kernel. The Sentry implements all the kernel functionality needed +by the untrusted application. It implements all of the supported system calls, +signal delivery, memory management and page faulting logic, the threading +model, and more. + +When the untrusted application makes a system call, the currently used platform +redirects to the Sentry, which will do the necessary work to service the system +call. It is important to note that the Sentry will not simply pass through +system calls to the host kernel. As a userspace application, the Sentry will +make some host system calls to support its operation, but it will not allow the +application to directly control the system calls it makes. + +The Sentry aims to present an equivalent environment to (upstream) Linux v4.4. + +I/O operations that extend beyond the sandbox (not internal /proc files, pipes, +etc) are sent to the Gofer, described below. + +## Platforms + +gVisor requires a platform to implement interruption of syscalls, basic context +switching, and memory mapping functionality. + +### ptrace + +The ptrace platform uses `PTRACE_SYSEMU` to execute user code without executing +host system calls. This platform can run anywhere that ptrace works (even VMs +without nested virtualization). + +### KVM (experimental) + +The KVM platform allows the Sentry to act as both guest OS and VMM, switching +back and forth between the two worlds seamlessly. The KVM platform can run on +bare-metal or on a VM with nested virtualization enabled. While there is no +virtualized hardware layer -- the sandbox retains a process model -- gVisor +leverages virtualization extensions available on modern processors in order to +improve isolation and performance of address space switches. + +## Gofer + +The Gofer is a normal host Linux process. The Gofer is started with each sandbox +and connected to the Sentry. 
The Sentry process is started in a restricted +seccomp container without access to file system resources. The Gofer provides +access to file system resources to the Sentry via the 9P protocol and provides +an additional level of isolation. + +## Application + +The application (aka, the untrusted application) is a normal Linux binary +provided to gVisor in an OCI runtime bundle. gVisor aims to provide an +environment equivalent to Linux v4.4, so applications should be able to run +unmodified. However, gVisor does not presently implement every system call, +/proc file, or /sys file, so some incompatibilities may occur. + +[oci]: https://www.opencontainers.org +[runtime-spec]: https://github.com/opencontainers/runtime-spec diff --git a/content/docs/architecture_guide/performance.md b/content/docs/architecture_guide/performance.md new file mode 100644 index 000000000..f2e928b84 --- /dev/null +++ b/content/docs/architecture_guide/performance.md @@ -0,0 +1,39 @@ ++++ +title = "Performance" +weight = 30 ++++ +gVisor is designed to provide a secure, virtualized environment while preserving +key benefits of containerization such as small fixed overheads and a dynamic +resource footprint. For containerized infrastructure, this can provide an “easy +button” for sandboxing untrusted workloads: there are no changes to the +fundamental resource model. + +However, there are clear trade-offs in this approach. gVisor does not fully +implement the system call surface provided by an upstream Linux kernel. We are +always working to improve this support, and current limitations are described +in [Compatibility](../../user_guide/compatibility). + +gVisor also imposes runtime costs over native containers. These costs come in +two forms: additional cycles and memory usage, and they come from two different +sources. First, the existence of the Sentry itself means that additional memory +will be required, and application system calls generally traverse additional +layers.
We place an emphasis on [Security](../security/) and therefore chose to +use a language for the Sentry that provides lots of benefits in this domain, but +may not offer the raw performance of other choices. Costs imposed by this design +are structural costs. + +Second, as gVisor is a fresh implementation of the system call surface, many of +the subsystems or specific calls are not as optimized as more mature +implementations. A good example here is the network stack, which is continuing +to evolve but does not support all the advanced recovery mechanisms offered by +other stacks and is less CPU efficient. This is an implementation cost and should +not be confused with structural costs. Improvements here are ongoing and largely +driven by the workloads that matter to gVisor contributors and users. + +## Structural Costs + +The structural costs of gVisor are heavily influenced by the platform choice, +which implements system call interception. Today, gVisor supports a variety of +platforms. These platforms present distinct performance, compatibility and +security trade-offs. For example, the KVM platform has low-overhead system call +interception but runs poorly with nested virtualization. diff --git a/content/docs/architecture_guide/security.md b/content/docs/architecture_guide/security.md new file mode 100644 index 000000000..935301fc7 --- /dev/null +++ b/content/docs/architecture_guide/security.md @@ -0,0 +1,221 @@ ++++ +title = "Security Model" +weight = 20 ++++ +gVisor was created in order to provide additional defense against the +exploitation of kernel bugs when running untrusted code. In order to understand +how gVisor achieves this goal, it is first necessary to understand the basic +threat model. + +## Threats: the Anatomy of an Exploit + +An exploit takes advantage of a software or hardware bug in order to escalate +privileges, gain access to privileged data, or disrupt services.
All of the +possible interactions that a malicious application can have with the rest of the +system (attack vectors) define the attack surface. We categorize these attack +vectors into several common classes. + +### System API + +An operating system or hypervisor exposes an abstract System API in the form of +system calls and traps. This API may be documented and stable, as with Linux, or +it may be hidden behind a library, as with Windows (i.e. win32.dll or +ntdll.dll). The System API includes all standard interfaces that application +code uses to interact with the system. This includes high-level abstractions +that are derived from low-level system calls, such as system files, sockets and +namespaces. + +Although the System API is exposed to applications by design, bugs and race +conditions within the kernel or hypervisor may occasionally be exploitable via +the API. A typical exploit might perform some combination of the following: + + 1. Opening or creating some combination of files, sockets or other descriptors. + 1. Passing crafted, malicious arguments, structures or packets. + 1. Racing with multiple threads in order to hit specific code paths. + +For example, for the “Dirty Cow” privilege escalation bug (CVE-2016-5195), an +application would open a specific file in proc or use a specific ptrace system +call, and use multiple threads in order to trigger a race condition when +touching a fresh page of memory. The attacker then gains control over a page of +memory belonging to the system. With additional privileges or access to +privileged data in the kernel, an attacker will often be able to employ +additional techniques to gain full access to the rest of the system. + +While bugs in the implementation of the System API are readily fixed, they are +also the most common form of exploit. The exposure created by this class of +exploit is what gVisor aims to minimize and control, described in detail below. 
+ +### System ABI + +Hardware and software exploits occasionally exist in execution paths that are +not part of an intended System API. In this case, exploits may be found as part +of implicit actions the hardware or privileged system code takes in response to +certain events, such as traps or interrupts. For example, the recent “POPSS” +flaw (CVE-2018-8897) required only native code execution (no specific system +call or file access). In that case, the Xen hypervisor was similarly vulnerable, +highlighting that hypervisors are not immune to this vector. + +### Side Channels + +Hardware side channels may be exploitable by any code running on a system: +native, sandboxed, or virtualized. However, many host-level mitigations against +hardware side channels are still effective with a sandbox. For example, kernels +built with retpoline protect against some speculative execution attacks +(Spectre) and frame poisoning may protect against L1 terminal fault (L1TF) +attacks. Hypervisors may introduce additional complications in this regard, as +there is no mitigation against an application in a normally functioning Virtual +Machine (VM) exploiting the L1TF vulnerability for another VM on the sibling +hyperthread. + +### What’s missing? + +These categories in no way represent an exhaustive list of exploits, as we focus +only on running untrusted code from within the operating system or hypervisor. +We do not consider the many other ways that a more generic adversary may +interact with a system, such as inserting a portable storage device with a +malicious filesystem image, using a combination of crafted keyboard or touch +inputs, or saturating a network device with ill-formed ICMP packets. + +Furthermore, high-level systems may contain exploitable components. An attacker +need not escalate privileges within a container if there’s an exploitable +network-accessible service on the host or some other API path. A sandbox is not +a substitute for a secure architecture. 
+ +## Goals: Limiting Exposure + +gVisor’s primary design goal is to minimize the System API attack vector while +still providing a process model. There are two primary security principles that +inform this design. First, the application’s direct interactions with the host +System API are intercepted by the Sentry, which implements the System API +instead. Second, the System API accessible to the Sentry itself is minimized to +a safer, restricted set. The first principle minimizes the possibility of direct +exploitation of the host System API by applications, and the second principle +minimizes indirect exploitability, that is, exploitation of the host by an +exploited or buggy Sentry (e.g. chaining an exploit). + +The first principle is similar to the security basis for a Virtual Machine (VM). +With a VM, an application’s interactions with the host are replaced by +interactions with a guest operating system and a set of virtualized hardware +devices. These hardware devices are then implemented via the host System API by +a Virtual Machine Monitor (VMM). For both the Sentry and a VMM, it’s worth +noting that while direct interactions are minimized, indirect interactions are +still possible. For example, a read on a host-backed file in the Sentry will +ultimately result in a host read system call (made by the Sentry, not by passing +through arguments from the application), similarly to how a read on a block +device in a VMM will often result in a host read system call from the backing +file. The same applies for a write on a socket, or a write on a tap device. + +The key difference here is that the Sentry implements a second System API +directly instead of relying on virtualized hardware and a guest operating +system. This selects a distinct set of trade-offs, largely in the performance +and compatibility domains. Since sandbox transitions of the nature described +above are generally expensive, a guest operating system will typically take +ownership of resources.
For example, in the above case, the guest operating +system may read the block device data into a local page cache, to avoid subsequent +reads. This may lead to better performance but lower efficiency, since memory +may be wasted or duplicated. The Sentry opts instead to defer to the host for +many operations during runtime, for improved efficiency but lower performance in +some use cases. + +gVisor relies on the host operating system and the platform for defense against +hardware-based attacks. Given the nature of these vulnerabilities, there is +little defense that gVisor can provide (there’s no guarantee that additional +hardware measures, such as virtualization, memory encryption, etc. would +actually decrease the attack surface). Note that this is true even when using +hardware virtualization for acceleration, as the host kernel or hypervisor is +ultimately responsible for defending against attacks from within malicious +guests. + +### What can a sandbox do? + +We allow a sandbox to do the following. + + 1. Communicate with a Gofer process via a connected socket. The sandbox may + receive new file descriptors from the Gofer process, corresponding to opened + files. + 1. Make a minimal set of host system calls. The calls do not include the + creation of new sockets (unless host networking mode is enabled) or opening + files. The calls include duplication and closing of file descriptors, + synchronization, timers and signal management. + 1. Read and write packets to a virtual ethernet device. This is not required if + host networking is enabled. + +## Principles: Defense-in-Depth + +For gVisor development, there are several engineering principles that are +employed in order to ensure that the system meets its design goals. + + 1. No system call is passed through directly to the host. Every supported call + has a distinct implementation in the Sentry that is unlikely to suffer from + identical vulnerabilities that may appear in the host.
This has the + consequence that all kernel features used by applications require an + implementation within the Sentry. + 1. Only common, universal functionality is implemented. Some filesystems, + network devices or modules may expose specialized functionality to user + space applications via mechanisms such as extended attributes, raw sockets + or ioctls. Since the Sentry is responsible for implementing the full system + call surface, we do not implement or pass through these specialized APIs. + 1. The host surface exposed to the Sentry is minimized. While the system call + surface is not trivial, it is explicitly enumerated and controlled. The + Sentry is not permitted to open new files, create new sockets or do many + other interesting things on the host. + +Additionally, we have practical restrictions that are imposed on the project to +minimize the risk of Sentry exploitability. For example: + + 1. Unsafe code is carefully controlled. All unsafe code is isolated in files + that end with “_unsafe.go”, in order to facilitate validation and auditing. + No file without the unsafe suffix may import the unsafe package. + 1. No CGo is allowed. The Sentry must be a pure Go binary. + 1. External imports are not generally allowed within the core packages. Only + limited external imports are used within the setup code. The code available + inside the Sentry is carefully controlled, to ensure that the above rules + are effective. + +Finally, we recognize that security is a process, and that vigilance is +critical. Beyond our security disclosure process, the Sentry is fuzzed +continuously to identify potential bugs and races proactively, and production +crashes are recorded and triaged to similarly identify material issues. + +## FAQ + +### Is this more or less secure than a Virtual Machine? + +The security of a VM depends to a large extent on what is exposed from the host +kernel and user space support code. For example, device emulation code in the +host kernel (e.g. 
APIC) or optimizations (e.g. vhost) can be more complex than a +simple system call, and exploits carry the same risks. Similarly, the user space +support code is frequently unsandboxed and exploits, while rare, may allow +unfettered access to the system. + +Some platforms leverage the same virtualization hardware as VMs in order to +provide better system call interception performance. However, gVisor does not +implement any device emulation, and instead opts to use a sandboxed host System +API directly. Both approaches significantly reduce the original attack surface. +Ultimately, since gVisor uses the same hardware mechanism, one should not assume +that the mere use of virtualization hardware makes a system more or less secure, +just as it would be a mistake to make the claim that the use of an engine makes +a car safe. + +### Does this stop hardware side channels? + +In general, gVisor does not provide protection against hardware side channels, +although it may make exploits that rely on direct access to the host System API +more difficult to use. To minimize exposure, you should follow relevant guidance +from vendors and keep your host kernel and firmware up-to-date. + +### Is this just a ptrace sandbox? + +No: the term “ptrace sandbox” generally refers to software that uses ptrace in +order to inspect and authorize system calls made by applications, enforcing a +specific policy. These commonly suffer from two issues. First, vulnerable system +calls may be authorized by the sandbox, as the application still has direct +access to some System API. Second, it’s impossible to avoid time-of-check, +time-of-use race conditions without disabling multi-threading. + +In gVisor, the platforms that use ptrace operate differently. The stubs that are +traced are never allowed to continue execution into the host kernel and complete +a call directly.
Instead, all system calls are interpreted and handled by the +Sentry itself, which reflects the resulting register state back into the tracee before +continuing execution in user space. This is very similar to the mechanism used +by User-Mode Linux (UML). -- cgit v1.2.3
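The interception model described in the ptrace FAQ entry above (the call is never forwarded to the host; it is handled by the sandbox's own implementation and the result is written back into the stopped thread's registers) can be sketched as a toy dispatcher. This is an illustration under stated assumptions, not gVisor's implementation: the `regs` type, the `table` map and the `dispatch` function are all hypothetical, and a real platform would obtain and restore registers through ptrace rather than a plain struct.

```go
package main

import "fmt"

// regs is a hypothetical, heavily simplified view of a stopped tracee.
type regs struct {
	sysno uint64 // system call number the application requested
	ret   int64  // value the application will see when it resumes
}

// enosys is the Linux ENOSYS errno value.
const enosys = 38

// handler is the sandbox's own implementation of one system call.
type handler func(r *regs) int64

// table lists the supported calls. Anything absent is rejected rather than
// passed through to the host.
var table = map[uint64]handler{
	// 39 is getpid on x86-64 Linux; the sandbox answers with its own
	// (assumed) internal PID instead of asking the host.
	39: func(r *regs) int64 { return 1 },
}

// dispatch services the call entirely inside the sandbox and reflects the
// result back into the register state.
func dispatch(r *regs) {
	h, ok := table[r.sysno]
	if !ok {
		r.ret = -enosys
		return
	}
	r.ret = h(r)
}

func main() {
	known := &regs{sysno: 39}
	dispatch(known)
	unknown := &regs{sysno: 9999}
	dispatch(unknown)
	fmt.Println(known.ret, unknown.ret)
}
```

The point of the table is that no code path forwards the application's arguments to a host system call; an unsupported number fails closed.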