Diffstat (limited to 'website/blog')
-rw-r--r-- | website/blog/2019-11-18-security-basics.md | 308 | ||||
-rw-r--r-- | website/blog/2020-04-02-networking-security.md | 183 | ||||
-rw-r--r-- | website/blog/2020-09-18-containing-a-real-vulnerability.md | 223 | ||||
-rw-r--r-- | website/blog/BUILD | 47 | ||||
-rw-r--r-- | website/blog/index.html | 22 |
5 files changed, 0 insertions, 783 deletions
diff --git a/website/blog/2019-11-18-security-basics.md b/website/blog/2019-11-18-security-basics.md deleted file mode 100644 index b6cf57a77..000000000 --- a/website/blog/2019-11-18-security-basics.md +++ /dev/null @@ -1,308 +0,0 @@ -# gVisor Security Basics - Part 1 - -This blog is a space for engineers and community members to share perspectives -and deep dives on technology and design within the gVisor project. Though our -logo suggests we're in the business of space exploration (or perhaps fighting -sea monsters), we're actually in the business of sandboxing Linux containers. -When we created gVisor, we had three specific goals in mind; _container-native -security_, _resource efficiency_, and _platform portability_. To put it simply, -gVisor provides _efficient defense-in-depth for containers anywhere_. - -This post addresses gVisor's _container-native security_, specifically how -gVisor provides strong isolation between an application and the host OS. Future -posts will address _resource efficiency_ (how gVisor preserves container -benefits like fast starts, smaller snapshots, and less memory overhead than VMs) -and _platform portability_ (run gVisor wherever Linux OCI containers run). -Delivering on each of these goals requires careful security considerations and a -robust design. - -## What does "sandbox" mean? - -gVisor allows the execution of untrusted containers, preventing them from -adversely affecting the host. This means that the untrusted container is -prevented from attacking or spying on either the host kernel or any other peer -userspace processes on the host. - -For example, if you are a cloud container hosting service, running containers -from different customers on the same virtual machine means that compromises -expose customer data. Properly configured, gVisor can provide sufficient -isolation to allow different customers to run containers on the same host. 
There -are many aspects to the proper configuration, including limiting file and -network access, which we will discuss in future posts. - -## The cost of compromise - -gVisor was designed around the premise that any security boundary could -potentially be compromised with enough time and resources. We tried to optimize -for a solution that was as costly and time-consuming for an attacker as -possible, at every layer. - -Consequently, gVisor was built through a combination of intentional design -principles and specific technology choices that work together to provide the -security isolation needed for running hostile containers on a host. We'll dig -into it in the next section! - -# Design Principles - -gVisor was designed with some common -[secure design](https://en.wikipedia.org/wiki/Secure_by_design) principles in -mind: Defense-in-Depth, Principle of Least-Privilege, Attack Surface Reduction -and Secure-by-Default[^1]. - -In general, Design Principles outline good engineering practices, but in the -case of security, they also can be thought of as a set of tactics. In a -real-life castle, there is no single defensive feature. Rather, there are many -in combination: redundant walls, scattered draw bridges, small bottle-neck -entrances, moats, etc. - -A simplified version of the design is below -([more detailed version](/docs/))[^2]: - -![Figure 1](/assets/images/2019-11-18-security-basics-figure1.png "Simplified design of gVisor.") - -In order to discuss design principles, the following components are important to -know: - -* runsc - binary that packages the Sentry, platform, and Gofer(s) that run - containers. runsc is the drop-in binary for running gVisor in Docker and - Kubernetes. -* Untrusted Application - container running in the sandbox. Untrusted - application/container are used interchangeably in this article. -* Platform Syscall Switcher - intercepts syscalls from the application and - passes them to the Sentry with no further handling. 
-* Sentry - The "application kernel" in userspace that serves the untrusted - application. Each application instance has its own Sentry. The Sentry - handles syscalls, routes I/O to gofers, and manages memory and CPU, all in - userspace. The Sentry is allowed to make limited, filtered syscalls to the - host OS. -* Gofer - a process that specifically handles different types of I/O for the - Sentry (usually disk I/O). Gofers are also allowed to make filtered syscalls - to the Host OS. -* Host OS - the actual OS on which gVisor containers are running, always some - flavor of Linux (sorry, Windows/MacOS users). - -It is important to emphasize what is being protected from the untrusted -application in this diagram: the host OS and other userspace applications. - -In this post, we are only discussing security-related features of gVisor, and -you might ask, "What about performance, compatibility and stability?" We will -cover these considerations in future posts. - -## Defense-in-Depth - -For gVisor, Defense-in-Depth means each component of the software stack trusts -the other components as little as possible. - -It may seem strange that we would want our own software components to distrust -each other. But by limiting the trust between small, discrete components, each -component is forced to defend itself against potentially malicious input. And -when you stack these components on top of each other, you can ensure that -multiple security barriers must be overcome by an attacker. - -And this leads us to how Defense-in-Depth is applied to gVisor: no single -vulnerability should compromise the host. - -In the "Attacker's Advantage / Defender's Dilemma," the defender must succeed -all the time while the attacker only needs to succeed once. 
Defense in Depth -inverts this principle: once the attacker successfully compromises any given -software component, they are immediately faced with needing to compromise a -subsequent, distinct layer in order to move laterally or acquire more privilege. - -For example, the untrusted container is isolated from the Sentry. The Sentry is -isolated from host I/O operations by serving those requests in separate -processes called Gofers. And both the untrusted container and its associated -Gofers are isolated from the host process that is running the sandbox. - -An additional benefit is that this generally leads to more robust and stable -software, forcing interfaces to be strictly defined and tested to ensure all -inputs are properly parsed and bounds checked. - -## Least-Privilege - -The principle of Least-Privilege means that each software component has only -the permissions it needs to function, and no more. - -Least-Privilege is applied throughout gVisor. Each component and, more -importantly, each interface between the components, is designed so that only the -minimum level of permission is required for it to perform its function. -Specifically, the closer you are to the untrusted application, the less -privilege you have. - -![Figure 2](/assets/images/2019-11-18-security-basics-figure2.png "runsc components and their privileges.") - -This is evident in how runsc (the drop-in gVisor binary for Docker/Kubernetes) -constructs the sandbox. The Sentry has the least privilege possible (it can't -even open a file!). Gofers are only allowed file access, so even if one were -compromised, the host network would be unavailable to it. Only the runsc binary itself -has full access to the host OS, and even runsc's access to the host OS is often -limited through capabilities / chroot / namespacing. - -Designing a system with Defense-in-Depth and Least-Privilege in mind encourages -small, separate, single-purpose components, each with very restricted -privileges. 
- -## Attack Surface Reduction - -There are no bugs in unwritten code. In other words, gVisor supports a feature -if and only if it is needed to run host Linux containers. - -### Host Application/Sentry Interface: - -There are a lot of things gVisor does not need to do. For example, it does not -need to support arbitrary device drivers, nor does it need to support video -playback. By not implementing what will not be used, we avoid introducing -potential bugs in our code. - -That is not to say gVisor has limited functionality! Quite the opposite: we -analyzed what is actually needed to run Linux containers and today the Sentry -supports 237 syscalls[^3]<sup>,</sup>[^4], along with a range of critical -/proc and /dev files. However, gVisor does not support every syscall in the -Linux kernel. There are about 350 syscalls[^5] within the 5.3.11 version of the -Linux kernel, many of which do not apply to Linux containers that typically host -cloud-like workloads. For example, we don't support old versions of epoll -(epoll_ctl_old, epoll_wait_old), because they are deprecated in Linux and no -supported workloads use them. - -Furthermore, exploiting a vulnerability in the implemented syscalls (or -Sentry code in general) only gains control of the Sentry. More on -this in a later post. - -### Sentry/Host OS Interface: - -The Sentry's interactions with the Host OS are restricted in many ways. For -instance, no syscall is "passed-through" from the untrusted application to the -host OS. All syscalls are intercepted and interpreted. In the case where the -Sentry needs to call the Host OS, we severely limit the syscalls that the Sentry -itself is allowed to make to the host kernel[^6]. - -For example, there are many file-system-based attacks in which manipulation of -files or their paths can lead to compromise of the host[^7]. As a result, the -Sentry does not allow any syscall that creates or opens a file descriptor. 
All -file descriptors must be donated to the sandbox. By disallowing the opening or creation -of file descriptors, we eliminate entire categories of these file-based attacks. - -This does not affect functionality though. For example, during startup, runsc -will donate FDs to the Sentry that allow for mapping STDIN/STDOUT/STDERR to the -sandboxed application. Also the Gofer may donate an FD to the Sentry, allowing -for direct access to some files. And most files will be remotely accessed -through the Gofers, in which case no FDs are donated to the Sentry. - -The Sentry itself is only allowed access to specific -[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go). -Without networking, the Sentry needs 53 host syscalls in order to function, and -with networking, it uses an additional 15[^8]. By limiting the whitelist to only -these needed syscalls, we radically reduce the amount of host OS attack surface. -If any attempt is made to call something outside the whitelist, it is -immediately blocked and the sandbox is killed by the Host OS. - -### Sentry/Gofer Interface: - -The Sentry communicates with the Gofer through a local unix domain socket (UDS) -via a version of the 9P protocol[^9]. The UDS file descriptor is passed to the -sandbox during initialization and all communication between the Sentry and Gofer -happens via 9P. We will go more into how Gofers work in future posts. - -### End Result - -So, of the 350 syscalls in the Linux kernel, the Sentry needs to implement only -237 of them to support containers. At most, the Sentry only needs to call 68 of -the host Linux syscalls (53 without networking, plus 15 with it). In other words, with gVisor, applications get the vast -(and growing) majority of the functionality of Linux containers for only 68 possible -syscalls to the Host OS. 350 syscalls to 68 is attack surface reduction. - -![Figure 3](/assets/images/2019-11-18-security-basics-figure3.png "Reduction of Attack Surface of the Syscall Table. 
Note that the Sentry's Syscall Emulation Layer keeps the Containerized Process from ever calling the Host OS.") - -## Secure-by-default - -The default choice for a user should be safe. If users need to run a less secure -configuration of the sandbox for the sake of performance or application -compatibility, they must make the choice explicitly. - -An example of this might be a networking application that is performance -sensitive. Rather than using the safer, Go-based Netstack in the Sentry, the -untrusted container can use the host Linux networking stack directly. -However, this means the untrusted container will be directly interacting with -the host, without the safety benefits of the sandbox. It also means that an -attack could directly compromise the host through this path. - -These less secure configurations are **not** the default. In fact, the user must -take action to change the configuration and run in a less secure mode. -Additionally, these actions make it very obvious that a less secure -configuration is being used. - -This can be as simple as forcing a default runtime flag option to the secure -option. gVisor does this by always using its internal netstack by default. -However, for certain performance-sensitive applications, we allow the usage of -the host OS networking stack, but it requires the user to actively set a -flag[^10]. - -# Technology Choices - -Technology choices for gVisor mainly involve things that will give us a security -boundary. - -At a higher level, boundaries in software can describe a great many -things: boundaries between threads, boundaries between -processes, boundaries between CPU privilege levels, and more. - -Security boundaries are interfaces that are designed and built so that entire -classes of bugs/vulnerabilities are eliminated. - -For example, the Sentry and Gofers are implemented using Go. Go was chosen for a -number of the features it provided. 
Go is a fast, statically-typed, compiled -language that has efficient multi-threading support, garbage collection and a -constrained set of "unsafe" operations. - -Using these features enabled safe array and pointer handling. This means entire -classes of vulnerabilities were eliminated, such as buffer overflows and -use-after-free. - -Another example is our use of very strict syscall switching to ensure that the -Sentry is always the first software component that parses and interprets the -calls being made by the untrusted container. Here is an instance where different -platforms use different solutions, but all of them share this common trait, -whether it is through the use of ptrace "a la PTRACE_ATTACH"[^11] or KVM's -ring0[^12]. - -Finally, one of the most restrictive choices was to use seccomp to restrict the -Sentry from being able to open or create a file descriptor on the host. All file -I/O is required to go through Gofers. Preventing the opening or creation of file -descriptors eliminates whole categories of bugs around file permissions -[like this one](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)[^13]. - -# To be continued - Part 2 - -In part 2 of this blog post, we will explore gVisor from an attacker's point of -view. We will use it as an opportunity to examine the specific strengths and -weaknesses of each gVisor component. - -We will also use it to introduce Google's Vulnerability Reward Program[^14], and -other ways the community can contribute to help make gVisor safe, fast and -stable. 
-<br> -<br> - --------------------------------------------------------------------------------- - -[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design) -[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/) -[^3]: [https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64_amd64.go](https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/syscalls.go) - -<!-- mdformat off(mdformat formats this into multiple lines) --> -[^4]: Internally that is, it doesn't call to the Host OS to implement them, in fact that is explicitly disallowed, more on that in the future. -<!-- mdformat on --> - -[^5]: [https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345](https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345) -[^6]: [https://github.com/google/gvisor/tree/master/runsc/boot/filter](https://github.com/google/gvisor/tree/master/runsc/boot/filter) -[^7]: [https://en.wikipedia.org/wiki/Dirty_COW](https://en.wikipedia.org/wiki/Dirty_COW) -[^8]: [https://github.com/google/gvisor/blob/master/runsc/boot/config.go](https://github.com/google/gvisor/blob/master/runsc/boot/config.go) - -<!-- mdformat off(mdformat breaks this url by escaping the parenthesis) --> -[^9]: [https://en.wikipedia.org/wiki/9P_(protocol)](https://en.wikipedia.org/wiki/9P_(protocol)) -<!-- mdformat on --> - -[^10]: [https://gvisor.dev/docs/user_guide/networking/#network-passthrough](https://gvisor.dev/docs/user_guide/networking/#network-passthrough) -[^11]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390) -[^12]: 
[https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182) -[^13]: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557) -[^14]: [https://www.google.com/about/appsecurity/reward-program/index.html](https://www.google.com/about/appsecurity/reward-program/index.html) diff --git a/website/blog/2020-04-02-networking-security.md b/website/blog/2020-04-02-networking-security.md deleted file mode 100644 index f3ce02d11..000000000 --- a/website/blog/2020-04-02-networking-security.md +++ /dev/null @@ -1,183 +0,0 @@ -# gVisor Networking Security - -In our -[first blog post](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/), -we covered some secure design principles and how they guided the architecture of -gVisor as a whole. In this post, we will cover how these principles guided the -networking architecture of gVisor, and the tradeoffs involved. In particular, we -will cover how these principles culminated in two networking modes, how they -work, and the properties of each. - -## gVisor's security architecture in the context of networking - -Linux networking is complicated. The TCP protocol is over 40 years old, and has -been repeatedly extended over the years to keep up with the rapid pace of -network infrastructure improvements, all while maintaining compatibility. On top -of that, Linux networking has a fairly large API surface. Linux supports -[over 150 options](https://github.com/google/gvisor/blob/960f6a975b7e44c0efe8fd38c66b02017c4fe137/pkg/sentry/strace/socket.go#L476-L644) -for the most common socket types alone. In fact, the net subsystem is one of the -largest and fastest growing in Linux at approximately 1.1 million lines of code. 
-For comparison, that is several times the size of the entire gVisor codebase. - -At the same time, networking is increasingly important. The cloud era is -arguably about making everything a network service, and in order to make that -work, the interconnect performance is critical. Adding networking support to -gVisor was difficult, not just due to the inherent complexity, but also because -it has the potential to significantly weaken gVisor's security model. - -As outlined in the previous blog post, gVisor's -[secure design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#design-principles) -are: - -1. Defense in Depth: each component of the software stack trusts each other - component as little as possible. -1. Least Privilege: each software component has only the permissions it needs - to function, and no more. -1. Attack Surface Reduction: limit the surface area of the host exposed to the - sandbox. -1. Secure by Default: the default choice for a user should be safe. - -gVisor manifests these principles as a multi-layered system. An application -running in the sandbox interacts with the Sentry, a userspace kernel, which -mediates all interactions with the Host OS and beyond. The Sentry is written in -pure Go with minimal unsafe code, making it less vulnerable to buffer overflows -and related memory bugs that can lead to a variety of compromises including code -injection. It emulates Linux using only a minimal and audited set of Host OS -syscalls that limit the Host OS's attack surface exposed to the Sentry itself. -The syscall restrictions are enforced by running the Sentry with seccomp -filters, which enforce that the Sentry can only use the expected set of -syscalls. The Sentry runs as an unprivileged user and in namespaces, which, -along with the seccomp filters, ensure that the Sentry is run with the Least -Privilege required. - -gVisor's multi-layered design provides Defense in Depth. 
The Sentry, which does -not trust the application because it may attack the Sentry and try to bypass it, -is the first layer. The sandbox that the Sentry runs in is the second layer. If -the Sentry were compromised, the attacker would still be in a highly restrictive -sandbox which they must also break out of in order to compromise the Host OS. - -To enable networking functionality while preserving gVisor's security -properties, we implemented a -[userspace network stack](https://github.com/google/gvisor/tree/master/pkg/tcpip) -in the Sentry, which we creatively named Netstack. Netstack is also written in -Go, not only to avoid unsafe code in the network stack itself, but also to avoid -a complicated and unsafe Foreign Function Interface. Having its own integrated -network stack allows the Sentry to implement networking operations using up to -three Host OS syscalls to read and write packets. These syscalls allow a very -minimal set of operations which are already allowed (either through the same or -a similar syscall). Moreover, because packets typically come from off-host (e.g. -the internet), the Host OS's packet processing code has received a lot of -scrutiny, hopefully resulting in a high degree of hardening. - -![Figure 1](/assets/images/2020-04-02-networking-security-figure1.png "Network and gVisor.") - -## Writing a network stack - -Netstack was written from scratch specifically for gVisor. Because Netstack was -designed and implemented to be modular, flexible and self-contained, there are -now several more projects using Netstack in creative and exciting ways. As we -discussed, a custom network stack has enabled a variety of security-related -goals which would not have been possible any other way. This came at a cost -though. Network stacks are complex and writing a new one comes with many -challenges, mostly related to application compatibility and performance. 
- -Compatibility issues typically come in two forms: missing features, and features -with behavior that differs from Linux (usually due to bugs). Both of these are -inevitable in an implementation of a complex system spanning many quickly -evolving and ambiguous standards. However, we have invested heavily in this -area, and the vast majority of applications have no issues using Netstack. For -example, -[we now support setting 34 different socket options](https://github.com/google/gvisor/blob/815df2959a76e4a19f5882e40402b9bbca9e70be/pkg/sentry/socket/netstack/netstack.go#L830-L1764) -versus -[only 7 in our initial git commit](https://github.com/google/gvisor/blob/d02b74a5dcfed4bfc8f2f8e545bca4d2afabb296/pkg/sentry/socket/epsocket/epsocket.go#L445-L702). -We are continuing to make good progress in this area. - -Performance issues typically come from TCP behavior and packet processing speed. -To improve our TCP behavior, we are working on implementing the full set of TCP -RFCs. There are many RFCs which are significant to performance (e.g. -[RACK](https://tools.ietf.org/id/draft-ietf-tcpm-rack-03.html) and -[BBR](https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00)) -that we have yet to implement. This mostly affects TCP performance with -non-ideal network conditions (e.g. cross continent connections). Faster packet -processing mostly improves TCP performance when network conditions are very good -(e.g. within a datacenter). Our primary strategy here is to reduce interactions -with the Go runtime, specifically the garbage collector (GC) and scheduler. We -are currently optimizing buffer management to reduce the amount of garbage, -which will lower the GC cost. To reduce scheduler interactions, we are -re-architecting the TCP implementation to use fewer goroutines. Performance -today is good enough for most applications and we are making steady -improvements. 
For example, since May of 2019, we have improved the Netstack -runsc -[iperf3 download benchmark](https://github.com/google/gvisor/tree/master/test/benchmarks/network) -score by roughly 15% and upload score by around 10,000X. Current numbers are -about 17 Gbps download and about 8 Gbps upload versus about 42 Gbps and 43 Gbps -for native (Linux) respectively. - -## An alternative - -We also offer an alternative network mode: passthrough. This name can be -misleading as syscalls are never passed through from the app to the Host OS. -Instead, the passthrough mode implements networking in gVisor using the Host -OS's network stack. (This mode is called -[hostinet](https://github.com/google/gvisor/tree/master/pkg/sentry/socket/hostinet) -in the codebase.) Passthrough mode can improve performance for some use cases as -the Host OS's network stack has had an enormous number of person-years poured -into making it highly performant. However, there is a rather large downside to -using passthrough mode: it weakens gVisor's security model by increasing the -Host OS's Attack Surface. This is because using the Host OS's network stack -requires the Sentry to use the Host OS's -[Berkeley socket interface](https://en.wikipedia.org/wiki/Berkeley_sockets). The -Berkeley socket interface is a much larger API surface than the packet interface -that our network stack uses. When passthrough mode is in use, the Sentry is -allowed to use -[15 additional syscalls](https://github.com/google/gvisor/blob/b1576e533223e98ebe4bd1b82b04e3dcda8c4bf1/runsc/boot/filter/config.go#L312-L517). -Further, this set of syscalls includes some that allow the Sentry to create file -descriptors, something that -[we don't normally allow](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#sentry-host-os-interface) -as it opens up classes of file-based attacks. 
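The trade-off between the two network modes can be sketched as a small allowlist model. This is only an illustration under stated assumptions, not gVisor's filter code: the names `allowed_syscalls`, `is_allowed`, and `fd_creating_exposure`, the mode strings, and the short syscall lists are all hypothetical stand-ins (the real rules are in the filter configuration linked above). The sketch shows two things the text claims: passthrough mode strictly widens the set of host syscalls, and the extra syscalls include ones that create file descriptors.

```python
# Illustrative model only: the syscall names below are a small, hypothetical
# subset, not gVisor's real seccomp allowlist.

# Host syscalls the Sentry may use in every mode (hypothetical subset).
BASE_SYSCALLS = frozenset({"read", "write", "futex", "mmap", "epoll_pwait"})

# Extra host syscalls needed only for passthrough (hostinet) networking
# (hypothetical subset). Some of these create new file descriptors.
HOSTINET_SYSCALLS = frozenset({"socket", "accept4", "bind", "connect", "listen"})

# Syscalls in this model that hand back a new file descriptor.
FD_CREATING = frozenset({"socket", "accept4", "open", "openat"})


def allowed_syscalls(network_mode="netstack"):
    """Return the host syscalls permitted for the given network mode.

    The default is the safer Netstack mode; passthrough must be requested
    explicitly, mirroring the secure-by-default principle.
    """
    if network_mode == "netstack":
        return BASE_SYSCALLS
    if network_mode == "passthrough":
        return BASE_SYSCALLS | HOSTINET_SYSCALLS
    raise ValueError(f"unknown network mode: {network_mode!r}")


def is_allowed(syscall, network_mode="netstack"):
    """Check one syscall against the allowlist; anything else is denied."""
    return syscall in allowed_syscalls(network_mode)


def fd_creating_exposure(network_mode="netstack"):
    """List the allowed syscalls that can create file descriptors."""
    return sorted(allowed_syscalls(network_mode) & FD_CREATING)
```

In this model the default mode exposes no descriptor-creating syscalls, while the passthrough mode exposes `socket` and `accept4`. That is the kind of widening the passthrough discussion warns about.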
- -There are some networking features that we can't implement on top of syscalls -that we feel are safe (most notably those behind -[ioctl](http://man7.org/linux/man-pages/man2/ioctl.2.html)) and therefore are -not supported. Because of this, we actually support fewer networking features in -passthrough mode than we do in Netstack, reducing application compatibility. -That's right: using our networking stack provides better overall application -compatibility than using our passthrough mode. - -That said, gVisor with passthrough networking still provides a high level of -isolation. Applications cannot specify host syscall arguments directly, and the -Sentry's seccomp policy restricts its syscall use significantly more than a -general purpose seccomp policy. - -## Secure by Default - -The goal of the Secure by Default principle is to make it easy to securely -sandbox containers. Of course, disabling network access entirely is the most -secure option, but that is not practical for most applications. To make gVisor -Secure by Default, we have made Netstack the default networking mode in gVisor -as we believe that it provides significantly better isolation. For this reason -we strongly caution users against changing the default unless Netstack flat out -won't work for them. The passthrough mode option is still provided, but we want -users to make an informed decision when selecting it. - -Another way in which gVisor makes it easy to securely sandbox containers is by -allowing applications to run unmodified, with no special configuration needed. -In order to do this, gVisor needs to support all of the features and syscalls -that applications use. Neither seccomp nor gVisor's passthrough mode can do this -as applications commonly use syscalls which are too dangerous to be included in -a secure policy. Even if this dream isn't fully realized today, gVisor's -architecture with Netstack makes this possible. 
- -## Give Netstack a Try - -If you haven't already, try running a workload in gVisor with Netstack. You can -find instructions on how to get started in our -[Quick Start](/docs/user_guide/quick_start/docker/). We want to hear about both -your successes and any issues you encounter. We welcome your contributions, -whether that be verbal feedback or code contributions, via our -[Gitter channel](https://gitter.im/gvisor/community), -[email list](https://groups.google.com/forum/#!forum/gvisor-users), -[issue tracker](https://gvisor.dev/issue/new), and -[GitHub repository](https://github.com/google/gvisor). Feel free to express -interest in an [open issue](https://gvisor.dev/issue/), or reach out if you -aren't sure where to start. diff --git a/website/blog/2020-09-18-containing-a-real-vulnerability.md b/website/blog/2020-09-18-containing-a-real-vulnerability.md deleted file mode 100644 index c1b06a996..000000000 --- a/website/blog/2020-09-18-containing-a-real-vulnerability.md +++ /dev/null @@ -1,223 +0,0 @@ -# Containing a Real Vulnerability - -In the previous two posts we talked about gVisor's -[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/) -as well as how those are applied in the -[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/). -Recently, a new container escape vulnerability -([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386)) -was announced that ties these topics together well. gVisor is -[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific -issue, but it provides an interesting case study to continue our exploration of -gVisor's security. While gVisor is not immune to vulnerabilities, -[we take several steps](https://gvisor.dev/security/) to minimize the impact and -remediate if a vulnerability is found. - -## Escaping the Container - -First, let’s describe how the discovered vulnerability works. 
There are numerous -ways one can send and receive bytes over the network with Linux. One of the most -performant ways is to use a ring buffer, which is a memory region shared by the -application and the kernel. These rings are created by calling -[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with -[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for -receiving and -[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for -sending packets. - -The vulnerability is in the code that reads packets when `PACKET_RX_RING` is -enabled. There is another option -([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that -asks the kernel to leave some space in the ring buffer before each packet for -anything the application needs, e.g. control structures. When a packet is -received, the kernel calculates where to copy the packet to, taking the amount -reserved before each packet into consideration. If the amount reserved is large, -the kernel performed an incorrect calculation which could cause an overflow -leading to an out-of-bounds write of up to 10 bytes, controlled by the attacker. -The data in the write is easily controlled using the loopback to send a crafted -packet and receiving it using a `PACKET_RX_RING` with a carefully selected -`PACKET_RESERVE` size. - -```c -static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, - struct packet_type *pt, struct net_device *orig_dev) -{ -// ... - if (sk->sk_type == SOCK_DGRAM) { - macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 + - po->tp_reserve; - } else { - unsigned int maclen = skb_network_offset(skb); - // tp_reserve is unsigned int, netoff is unsigned short. Addition can overflow netoff - netoff = TPACKET_ALIGN(po->tp_hdrlen + - (maclen < 16 ? 
16 : maclen)) + - po->tp_reserve; - if (po->has_vnet_hdr) { - netoff += sizeof(struct virtio_net_hdr); - do_vnet = true; - } - // Attacker controls netoff and can make macoff be smaller than sizeof(struct virtio_net_hdr) - macoff = netoff - maclen; - } -// ... - // "macoff - sizeof(struct virtio_net_hdr)" can be negative, resulting in a pointer before h.raw - if (do_vnet && - virtio_net_hdr_from_skb(skb, h.raw + macoff - - sizeof(struct virtio_net_hdr), - vio_le(), true, 0)) { -// ... -``` - -The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html) -capability is required to create the socket above. However, in order to support -common debugging tools like `ping` and `tcpdump`, Docker containers, including -those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be -able to trigger this vulnerability to elevate privileges and escape the -container. - -Next, we are going to explore why this vulnerability doesn’t work in gVisor, and -how gVisor could prevent the escape even if a similar vulnerability existed -inside gVisor’s kernel. - -## Default Protections - -gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets -which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature -to support in a sandbox environment. While it allows great customizations for -essential tools like `ping`, it may allow packets to be written to the network -without any validation. In general, allowing an untrusted application to write -crafted packets to the network is a questionable idea and a historical source of -vulnerabilities. With that in mind, if `CAP_NET_RAW` is enabled by default, it -would not be _secure by default_ to run untrusted applications. - -After multiple discussions when raw sockets were first implemented, we decided -to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the -application**. 
Instead, enabling raw sockets in gVisor requires the admin to set the -`--net-raw` flag on runsc when configuring the runtime, in addition to requiring -the `CAP_NET_RAW` capability in the application. This comes at the cost that -some tools may not work out of the box, but as part of our -[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default) -principle, we felt that it was important for the “less secure” configuration to -be explicit. - -Since this bug was due to an overflow in the specific Linux implementation of -the packet ring, gVisor's raw socket implementation is not affected. However, if -there were a similar vulnerability in gVisor, containers would not be allowed to -exploit it by default. - -As an alternative way to implement this same constraint, Kubernetes allows -[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) -to be configured to customize requests. Cloud providers can use this to -implement more stringent policies. For example, GKE implements an admission -controller for gVisor that -[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities) -unless it has been explicitly set in the pod spec. - -## Isolated Kernel - -gVisor has its own application kernel, called the Sentry, that is distinct from -the host kernel. Just like what you would expect from a kernel, gVisor has a -memory management subsystem, virtual file system, and a full network stack. The -host network is only used as a transport to carry packets in and out of the -sandbox[^1]. The loopback interface, which is used in the exploit, stays -completely inside the sandbox, never reaching the host. - -Therefore, even if the Sentry were vulnerable to the attack, there would be two -factors that would prevent a container escape from happening.
First, the -vulnerability would be limited to the Sentry, and the attacker would compromise -only the application kernel, bound by a restricted set of -[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in -depth below. Second, the Sentry is a distinct implementation of the API, written -in Go, which provides bounds checking that would have likely prevented access -past the bounds of the shared region (e.g. see -[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor) -or -[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor), -which have similar shared regions). - -Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit -of isolation and a pod can run multiple containers. In other words, each pod is -a gVisor instance, and each container is a set of processes running inside -gVisor, isolated via Sentry-internal namespaces like regular containers inside a -pod. If there were a vulnerability in gVisor, the privilege escalation would -allow a container inside the pod to break out to other **containers inside the -same pod**, but the container still **cannot break out of the pod**. - -## Defense in Depth - -gVisor follows a -[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf) -that the system should have two layers of protection, and those layers should -require different compromises to be broken. We apply this principle by assuming -that the Sentry (first layer of defense) -[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth). 
-In order to protect the host kernel from a compromised Sentry, we wrap it in -many security and isolation features to ensure only the minimal set of -functionality from the host kernel is exposed. - -![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.") - -First, the sandbox runs inside a cgroup that can limit and throttle host -resources being used. Second, the sandbox joins empty namespaces, including user -and mount, to further isolate itself from the host. Next, it changes the process -root to a read-only directory that contains only `/proc` and nothing else. Then, it -executes with the unprivileged user/group -[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all -capabilities stripped. Last and most importantly, a seccomp filter is added to -tightly restrict which parts of the Linux syscall surface gVisor is allowed -to access. The allowed host surface is a far smaller set of syscalls than the -Sentry implements for applications to use. The filter not only restricts which -syscalls can be called but also checks that the arguments to these syscalls are -within the expected set. Dangerous syscalls like <code>execve(2)</code>, -<code>open(2)</code>, and <code>socket(2)</code> are prohibited, so an -attacker isn’t able to execute binaries or acquire new resources on the host. - -If there were a vulnerability in gVisor that allowed an attacker to execute code -inside the Sentry, the attacker would still have extremely limited privileges on -the host. In fact, a compromised Sentry is much more restricted than a -non-compromised regular container. For CVE-2020-14386 in particular, the attack -would be blocked by more than one security layer: non-privileged user, no -capabilities, and seccomp filters. - -Although the surface is drastically reduced, there is still a chance that there -is a vulnerability in one of the allowed syscalls.
That’s why it’s important to -keep the surface small and carefully consider what syscalls are allowed. You can -find the full set of allowed syscalls -[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/). - -Another possible attack vector is resources that are present in the Sentry, like -open file descriptors. The Sentry has file descriptors that an attacker could -potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC -endpoint that allows external communication with the Sentry, and a Netstack -endpoint that connects the sandbox to the network. The Netstack endpoint in -particular is a concern because it gives direct access to the network. It’s an -`AF_PACKET` socket that allows arbitrary L2 packets to be written to the -network. In the normal case, Netstack assembles packets that go out to the -network, giving the container control over only the payload. But if the Sentry is -compromised, an attacker can write crafted packets to the network. In many ways -this is similar to anyone sending random packets over the internet, but this is -still a place where the exposed host kernel surface is larger than we would like -it to be. - -## Conclusion - -Security comes with many tradeoffs that are often hard to make, such as the -decision to disable raw sockets by default. However, these tradeoffs have served -us well and have paid off over time. CVE-2020-14386 offers -great insight into how multiple layers of protection can be effective against -such an attack. - -We cannot guarantee that a container escape will never happen in gVisor, but we -do our best to make it as hard as we possibly can. - -If you have not tried gVisor yet, it’s easier than you think. Just follow the -steps [here](https://gvisor.dev/docs/user_guide/install/).
-<br> -<br> - --------------------------------------------------------------------------------- - -[^1]: Those packets are eventually handled by the host, as it needs to route - them to local containers or send them out the NIC. The packet will be - handled by many switches, routers, proxies, servers, etc. along the way, - which may be subject to their own vulnerabilities. diff --git a/website/blog/BUILD b/website/blog/BUILD deleted file mode 100644 index 865e403da..000000000 --- a/website/blog/BUILD +++ /dev/null @@ -1,47 +0,0 @@ -load("//website:defs.bzl", "doc", "docs") - -package( - default_visibility = ["//website:__pkg__"], - licenses = ["notice"], -) - -exports_files(["index.html"]) - -doc( - name = "security_basics", - src = "2019-11-18-security-basics.md", - authors = [ - "jsprad", - "zkoopmans", - ], - layout = "post", - permalink = "/blog/2019/11/18/gvisor-security-basics-part-1/", -) - -doc( - name = "networking_security", - src = "2020-04-02-networking-security.md", - authors = [ - "igudger", - ], - layout = "post", - permalink = "/blog/2020/04/02/gvisor-networking-security/", -) - -doc( - name = "containing_a_real_vulnerability", - src = "2020-09-18-containing-a-real-vulnerability.md", - authors = [ - "fvoznika", - ], - layout = "post", - permalink = "/blog/2020/09/18/containing-a-real-vulnerability/", -) - -docs( - name = "posts", - deps = [ - ":" + rule - for rule in existing_rules() - ], -) diff --git a/website/blog/index.html b/website/blog/index.html deleted file mode 100644 index 5c67c95fc..000000000 --- a/website/blog/index.html +++ /dev/null @@ -1,22 +0,0 @@ ---- -title: Blog -layout: blog -feed: true -pagination: - enabled: true ---- - -{% for post in paginator.posts %} -<div> - <h2><a href="{{ post.url }}">{{ post.title }}</a></h2> - <div class="blog-meta"> - {% include byline.html authors=post.authors date=post.date %} - </div> - <p>{{ post.excerpt | strip_html }}</p> - <p><a href="{{ post.url }}">Full Post »</a></p> -</div> -{% endfor %} - 
-{% if paginator.total_pages > 1 %} -{% include paginator.html %} -{% endif %}