summaryrefslogtreecommitdiffhomepage
path: root/website/blog
diff options
context:
space:
mode:
Diffstat (limited to 'website/blog')
-rw-r--r--website/blog/2019-11-18-security-basics.md310
-rw-r--r--website/blog/2020-04-02-networking-security.md183
-rw-r--r--website/blog/2020-09-18-containing-a-real-vulnerability.md226
-rw-r--r--website/blog/2020-10-22-platform-portability.md120
-rw-r--r--website/blog/BUILD58
-rw-r--r--website/blog/README.md62
-rw-r--r--website/blog/index.html27
7 files changed, 0 insertions, 986 deletions
diff --git a/website/blog/2019-11-18-security-basics.md b/website/blog/2019-11-18-security-basics.md
deleted file mode 100644
index 938605cc2..000000000
--- a/website/blog/2019-11-18-security-basics.md
+++ /dev/null
@@ -1,310 +0,0 @@
-# gVisor Security Basics - Part 1
-
-This blog is a space for engineers and community members to share perspectives
-and deep dives on technology and design within the gVisor project. Though our
-logo suggests we're in the business of space exploration (or perhaps fighting
-sea monsters), we're actually in the business of sandboxing Linux containers.
-When we created gVisor, we had three specific goals in mind; _container-native
-security_, _resource efficiency_, and _platform portability_. To put it simply,
-gVisor provides _efficient defense-in-depth for containers anywhere_.
-
-This post addresses gVisor's _container-native security_, specifically how
-gVisor provides strong isolation between an application and the host OS. Future
-posts will address _resource efficiency_ (how gVisor preserves container
-benefits like fast starts, smaller snapshots, and less memory overhead than VMs)
-and _platform portability_ (run gVisor wherever Linux OCI containers run).
-Delivering on each of these goals requires careful security considerations and a
-robust design.
-
-## What does "sandbox" mean?
-
-gVisor allows the execution of untrusted containers, preventing them from
-adversely affecting the host. This means that the untrusted container is
-prevented from attacking or spying on either the host kernel or any other peer
-userspace processes on the host.
-
-For example, if you are a cloud container hosting service, running containers
-from different customers on the same virtual machine means that compromises
-expose customer data. Properly configured, gVisor can provide sufficient
-isolation to allow different customers to run containers on the same host. There
-are many aspects to the proper configuration, including limiting file and
-network access, which we will discuss in future posts.
-
-## The cost of compromise
-
-gVisor was designed around the premise that any security boundary could
-potentially be compromised with enough time and resources. We tried to optimize
-for a solution that was as costly and time-consuming for an attacker as
-possible, at every layer.
-
-Consequently, gVisor was built through a combination of intentional design
-principles and specific technology choices that work together to provide the
-security isolation needed for running hostile containers on a host. We'll dig
-into it in the next section!
-
-# Design Principles
-
-gVisor was designed with some common
-[secure design](https://en.wikipedia.org/wiki/Secure_by_design) principles in
-mind: Defense-in-Depth, Principle of Least-Privilege, Attack Surface Reduction
-and Secure-by-Default[^1].
-
-In general, Design Principles outline good engineering practices, but in the
-case of security, they also can be thought of as a set of tactics. In a
-real-life castle, there is no single defensive feature. Rather, there are many
-in combination: redundant walls, scattered draw bridges, small bottle-neck
-entrances, moats, etc.
-
-A simplified version of the design is below
-([more detailed version](/docs/))[^2]:
-
-![Figure 1](/assets/images/2019-11-18-security-basics-figure1.png "Simplified design of gVisor.")
-
-In order to discuss design principles, the following components are important to
-know:
-
-* runsc - binary that packages the Sentry, platform, and Gofer(s) that run
- containers. runsc is the drop-in binary for running gVisor in Docker and
- Kubernetes.
-* Untrusted Application - container running in the sandbox. Untrusted
- application/container are used interchangeably in this article.
-* Platform Syscall Switcher - intercepts syscalls from the application and
- passes them to the Sentry with no further handling.
-* Sentry - The "application kernel" in userspace that serves the untrusted
- application. Each application instance has its own Sentry. The Sentry
- handles syscalls, routes I/O to gofers, and manages memory and CPU, all in
- userspace. The Sentry is allowed to make limited, filtered syscalls to the
- host OS.
-* Gofer - a process that specifically handles different types of I/O for the
- Sentry (usually disk I/O). Gofers are also allowed to make filtered syscalls
- to the Host OS.
-* Host OS - the actual OS on which gVisor containers are running, always some
- flavor of Linux (sorry, Windows/MacOS users).
-
-It is important to emphasize what is being protected from the untrusted
-application in this diagram: the host OS and other userspace applications.
-
-In this post, we are only discussing security-related features of gVisor, and
-you might ask, "What about performance, compatibility and stability?" We will
-cover these considerations in future posts.
-
-## Defense-in-Depth
-
-For gVisor, Defense-in-Depth means each component of the software stack trusts
-the other components as little as possible.
-
-It may seem strange that we would want our own software components to distrust
-each other. But by limiting the trust between small, discrete components, each
-component is forced to defend itself against potentially malicious input. And
-when you stack these components on top of each other, you can ensure that
-multiple security barriers must be overcome by an attacker.
-
-And this leads us to how Defense-in-Depth is applied to gVisor: no single
-vulnerability should compromise the host.
-
-In the "Attacker's Advantage / Defender's Dilemma," the defender must succeed
-all the time while the attacker only needs to succeed once. Defense in Depth
-inverts this principle: once the attacker successfully compromises any given
-software component, they are immediately faced with needing to compromise a
-subsequent, distinct layer in order to move laterally or acquire more privilege.
-
-For example, the untrusted container is isolated from the Sentry. The Sentry is
-isolated from host I/O operations by serving those requests in separate
-processes called Gofers. And both the untrusted container and its associated
-Gofers are isolated from the host process that is running the sandbox.
-
-An additional benefit is that this generally leads to more robust and stable
-software, forcing interfaces to be strictly defined and tested to ensure all
-inputs are properly parsed and bounds checked.
-
-## Least-Privilege
-
-The principle of Least-Privilege implies that each software component has only
-the permissions it needs to function, and no more.
-
-Least-Privilege is applied throughout gVisor. Each component and more
-importantly, each interface between the components, is designed so that only the
-minimum level of permission is required for it to perform its function.
-Specifically, the closer you are to the untrusted application, the less
-privilege you have.
-
-![Figure 2](/assets/images/2019-11-18-security-basics-figure2.png "runsc components and their privileges.")
-
-This is evident in how runsc (the drop in gVisor binary for Docker/Kubernetes)
-constructs the sandbox. The Sentry has the least privilege possible (it can't
-even open a file!). Gofers are only allowed file access, so even if it were
-compromised, the host network would be unavailable. Only the runsc binary itself
-has full access to the host OS, and even runsc's access to the host OS is often
-limited through capabilities / chroot / namespacing.
-
-Designing a system with Defense-in-Depth and Least-Privilege in mind encourages
-small, separate, single-purpose components, each with very restricted
-privileges.
-
-## Attack Surface Reduction
-
-There are no bugs in unwritten code. In other words, gVisor supports a feature
-if and only if it is needed to run host Linux containers.
-
-### Host Application/Sentry Interface:
-
-There are a lot of things gVisor does not need to do. For example, it does not
-need to support arbitrary device drivers, nor does it need to support video
-playback. By not implementing what will not be used, we avoid introducing
-potential bugs in our code.
-
-That is not to say gVisor has limited functionality! Quite the opposite, we
-analyzed what is actually needed to run Linux containers and today the Sentry
-supports 237 syscalls[^3]<sup>,</sup>[^4], along with the range of critical
-/proc and /dev files. However, gVisor does not support every syscall in the
-Linux kernel. There are about 350 syscalls[^5] within the 5.3.11 version of the
-Linux kernel, many of which do not apply to Linux containers that typically host
-cloud-like workloads. For example, we don't support old versions of epoll
-(epoll_ctl_old, epoll_wait_old), because they are deprecated in Linux and no
-supported workloads use them.
-
-Furthermore, any exploited vulnerabilities in the implemented syscalls (or
-Sentry code in general) only apply to gaining control of the Sentry. More on
-this in a later post.
-
-### Sentry/Host OS Interface:
-
-The Sentry's interactions with the Host OS are restricted in many ways. For
-instance, no syscall is "passed-through" from the untrusted application to the
-host OS. All syscalls are intercepted and interpreted. In the case where the
-Sentry needs to call the Host OS, we severely limit the syscalls that the Sentry
-itself is allowed to make to the host kernel[^6].
-
-For example, there are many file-system based attacks, where manipulation of
-files or their paths, can lead to compromise of the host[^7]. As a result, the
-Sentry does not allow any syscall that creates or opens a file descriptor. All
-file descriptors must be donated to the sandbox. By disallowing open or creation
-of file descriptors, we eliminate entire categories of these file-based attacks.
-
-This does not affect functionality though. For example, during startup, runsc
-will donate FDs the Sentry that allow for mapping STDIN/STDOUT/STDERR to the
-sandboxed application. Also the Gofer may donate an FD to the Sentry, allowing
-for direct access to some files. And most files will be remotely accessed
-through the Gofers, in which case no FDs are donated to the Sentry.
-
-The Sentry itself is only allowed access to specific
-[allowlisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go).
-Without networking, the Sentry needs 53 host syscalls in order to function, and
-with networking, it uses an additional 15[^8]. By limiting the allowlist to only
-these needed syscalls, we radically reduce the amount of host OS attack surface.
-If any attempts are made to call something outside the allowlist, it is
-immediately blocked and the sandbox is killed by the Host OS.
-
-### Sentry/Gofer Interface:
-
-The Sentry communicates with the Gofer through a local unix domain socket (UDS)
-via a version of the 9P protocol[^9]. The UDS file descriptor is passed to the
-sandbox during initialization and all communication between the Sentry and Gofer
-happens via 9P. We will go more into how Gofers work in future posts.
-
-### End Result
-
-So, of the 350 syscalls in the Linux kernel, the Sentry needs to implement only
-237 of them to support containers. At most, the Sentry only needs to call 68 of
-the host Linux syscalls. In other words, with gVisor, applications get the vast
-majority (and growing) functionality of Linux containers for only 68 possible
-syscalls to the Host OS. 350 syscalls to 68 is attack surface reduction.
-
-![Figure 3](/assets/images/2019-11-18-security-basics-figure3.png "Reduction of Attack Surface of the Syscall Table. Note that the Senty's Syscall Emulation Layer keeps the Containerized Process from ever calling the Host OS.")
-
-## Secure-by-default
-
-The default choice for a user should be safe. If users need to run a less secure
-configuration of the sandbox for the sake of performance or application
-compatibility, they must make the choice explicitly.
-
-An example of this might be a networking application that is performance
-sensitive. Instead of using the safer, Go-based Netstack in the Sentry, the
-untrusted container can instead use the host Linux networking stack directly.
-However, this means the untrusted container will be directly interacting with
-the host, without the safety benefits of the sandbox. It also means that an
-attack could directly compromise the host through his path.
-
-These less secure configurations are **not** the default. In fact, the user must
-take action to change the configuration and run in a less secure mode.
-Additionally, these actions make it very obvious that a less secure
-configuration is being used.
-
-This can be as simple as forcing a default runtime flag option to the secure
-option. gVisor does this by always using its internal netstack by default.
-However, for certain performance sensitive applications, we allow the usage of
-the host OS networking stack, but it requires the user to actively set a
-flag[^10].
-
-# Technology Choices
-
-Technology choices for gVisor mainly involve things that will give us a security
-boundary.
-
-At a higher level, boundaries in software might be describing a great many
-things. It may be discussing the boundaries between threads, boundaries between
-processes, boundaries between CPU privilege levels, and more.
-
-Security boundaries are interfaces that are designed and built so that entire
-classes of bugs/vulnerabilities are eliminated.
-
-For example, the Sentry and Gofers are implemented using Go. Go was chosen for a
-number of the features it provided. Go is a fast, statically-typed, compiled
-language that has efficient multi-threading support, garbage collection and a
-constrained set of "unsafe" operations.
-
-Using these features enabled safe array and pointer handling. This means entire
-classes of vulnerabilities were eliminated, such as buffer overflows and
-use-after-free.
-
-Another example is our use of very strict syscall switching to ensure that the
-Sentry is always the first software component that parses and interprets the
-calls being made by the untrusted container. Here is an instance where different
-platforms use different solutions, but all of them share this common trait,
-whether it is through the use of ptrace "a la PTRACE_ATTACH"[^11] or kvm's
-ring0[^12].
-
-Finally, one of the most restrictive choices was to use seccomp, to restrict the
-Sentry from being able to open or create a file descriptor on the host. All file
-I/O is required to go through Gofers. Preventing the opening or creation of file
-descriptions eliminates whole categories of bugs around file permissions
-[like this one](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)[^13].
-
-# To be continued - Part 2
-
-In part 2 of this blog post, we will explore gVisor from an attacker's point of
-view. We will use it as an opportunity to examine the specific strengths and
-weaknesses of each gVisor component.
-
-We will also use it to introduce Google's Vulnerability Reward Program[^14], and
-other ways the community can contribute to help make gVisor safe, fast and
-stable.
-<br>
-<br>
-**Updated (2021-07-14):** this post was updated to use more inclusive language.
-<br>
-
---------------------------------------------------------------------------------
-
-[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design)
-[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/)
-[^3]: [https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64_amd64.go](https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/syscalls.go)
-
-<!-- mdformat off(mdformat formats this into multiple lines) -->
-[^4]: Internally that is, it doesn't call to the Host OS to implement them, in fact that is explicitly disallowed, more on that in the future.
-<!-- mdformat on -->
-
-[^5]: [https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345](https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345)
-[^6]: [https://github.com/google/gvisor/tree/master/runsc/boot/filter](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
-[^7]: [https://en.wikipedia.org/wiki/Dirty_COW](https://en.wikipedia.org/wiki/Dirty_COW)
-[^8]: [https://github.com/google/gvisor/blob/master/runsc/boot/config.go](https://github.com/google/gvisor/blob/master/runsc/boot/config.go)
-
-<!-- mdformat off(mdformat breaks this url by escaping the parenthesis) -->
-[^9]: [https://en.wikipedia.org/wiki/9P_(protocol)](https://en.wikipedia.org/wiki/9P_(protocol))
-<!-- mdformat on -->
-
-[^10]: [https://gvisor.dev/docs/user_guide/networking/#network-passthrough](https://gvisor.dev/docs/user_guide/networking/#network-passthrough)
-[^11]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390)
-[^12]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182)
-[^13]: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)
-[^14]: [https://www.google.com/about/appsecurity/reward-program/index.html](https://www.google.com/about/appsecurity/reward-program/index.html)
diff --git a/website/blog/2020-04-02-networking-security.md b/website/blog/2020-04-02-networking-security.md
deleted file mode 100644
index f3ce02d11..000000000
--- a/website/blog/2020-04-02-networking-security.md
+++ /dev/null
@@ -1,183 +0,0 @@
-# gVisor Networking Security
-
-In our
-[first blog post](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/),
-we covered some secure design principles and how they guided the architecture of
-gVisor as a whole. In this post, we will cover how these principles guided the
-networking architecture of gVisor, and the tradeoffs involved. In particular, we
-will cover how these principles culminated in two networking modes, how they
-work, and the properties of each.
-
-## gVisor's security architecture in the context of networking
-
-Linux networking is complicated. The TCP protocol is over 40 years old, and has
-been repeatedly extended over the years to keep up with the rapid pace of
-network infrastructure improvements, all while maintaining compatibility. On top
-of that, Linux networking has a fairly large API surface. Linux supports
-[over 150 options](https://github.com/google/gvisor/blob/960f6a975b7e44c0efe8fd38c66b02017c4fe137/pkg/sentry/strace/socket.go#L476-L644)
-for the most common socket types alone. In fact, the net subsystem is one of the
-largest and fastest growing in Linux at approximately 1.1 million lines of code.
-For comparison, that is several times the size of the entire gVisor codebase.
-
-At the same time, networking is increasingly important. The cloud era is
-arguably about making everything a network service, and in order to make that
-work, the interconnect performance is critical. Adding networking support to
-gVisor was difficult, not just due to the inherent complexity, but also because
-it has the potential to significantly weaken gVisor's security model.
-
-As outlined in the previous blog post, gVisor's
-[secure design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#design-principles)
-are:
-
-1. Defense in Depth: each component of the software stack trusts each other
- component as little as possible.
-1. Least Privilege: each software component has only the permissions it needs
- to function, and no more.
-1. Attack Surface Reduction: limit the surface area of the host exposed to the
- sandbox.
-1. Secure by Default: the default choice for a user should be safe.
-
-gVisor manifests these principles as a multi-layered system. An application
-running in the sandbox interacts with the Sentry, a userspace kernel, which
-mediates all interactions with the Host OS and beyond. The Sentry is written in
-pure Go with minimal unsafe code, making it less vulnerable to buffer overflows
-and related memory bugs that can lead to a variety of compromises including code
-injection. It emulates Linux using only a minimal and audited set of Host OS
-syscalls that limit the Host OS's attack surface exposed to the Sentry itself.
-The syscall restrictions are enforced by running the Sentry with seccomp
-filters, which enforce that the Sentry can only use the expected set of
-syscalls. The Sentry runs as an unprivileged user and in namespaces, which,
-along with the seccomp filters, ensure that the Sentry is run with the Least
-Privilege required.
-
-gVisor's multi-layered design provides Defense in Depth. The Sentry, which does
-not trust the application because it may attack the Sentry and try to bypass it,
-is the first layer. The sandbox that the Sentry runs in is the second layer. If
-the Sentry were compromised, the attacker would still be in a highly restrictive
-sandbox which they must also break out of in order to compromise the Host OS.
-
-To enable networking functionality while preserving gVisor's security
-properties, we implemented a
-[userspace network stack](https://github.com/google/gvisor/tree/master/pkg/tcpip)
-in the Sentry, which we creatively named Netstack. Netstack is also written in
-Go, not only to avoid unsafe code in the network stack itself, but also to avoid
-a complicated and unsafe Foreign Function Interface. Having its own integrated
-network stack allows the Sentry to implement networking operations using up to
-three Host OS syscalls to read and write packets. These syscalls allow a very
-minimal set of operations which are already allowed (either through the same or
-a similar syscall). Moreover, because packets typically come from off-host (e.g.
-the internet), the Host OS's packet processing code has received a lot of
-scrutiny, hopefully resulting in a high degree of hardening.
-
-![Figure 1](/assets/images/2020-04-02-networking-security-figure1.png "Network and gVisor.")
-
-## Writing a network stack
-
-Netstack was written from scratch specifically for gVisor. Because Netstack was
-designed and implemented to be modular, flexible and self-contained, there are
-now several more projects using Netstack in creative and exciting ways. As we
-discussed, a custom network stack has enabled a variety of security-related
-goals which would not have been possible any other way. This came at a cost
-though. Network stacks are complex and writing a new one comes with many
-challenges, mostly related to application compatibility and performance.
-
-Compatibility issues typically come in two forms: missing features, and features
-with behavior that differs from Linux (usually due to bugs). Both of these are
-inevitable in an implementation of a complex system spanning many quickly
-evolving and ambiguous standards. However, we have invested heavily in this
-area, and the vast majority of applications have no issues using Netstack. For
-example,
-[we now support setting 34 different socket options](https://github.com/google/gvisor/blob/815df2959a76e4a19f5882e40402b9bbca9e70be/pkg/sentry/socket/netstack/netstack.go#L830-L1764)
-versus
-[only 7 in our initial git commit](https://github.com/google/gvisor/blob/d02b74a5dcfed4bfc8f2f8e545bca4d2afabb296/pkg/sentry/socket/epsocket/epsocket.go#L445-L702).
-We are continuing to make good progress in this area.
-
-Performance issues typically come from TCP behavior and packet processing speed.
-To improve our TCP behavior, we are working on implementing the full set of TCP
-RFCs. There are many RFCs which are significant to performance (e.g.
-[RACK](https://tools.ietf.org/id/draft-ietf-tcpm-rack-03.html) and
-[BBR](https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00))
-that we have yet to implement. This mostly affects TCP performance with
-non-ideal network conditions (e.g. cross continent connections). Faster packet
-processing mostly improves TCP performance when network conditions are very good
-(e.g. within a datacenter). Our primary strategy here is to reduce interactions
-with the Go runtime, specifically the garbage collector (GC) and scheduler. We
-are currently optimizing buffer management to reduce the amount of garbage,
-which will lower the GC cost. To reduce scheduler interactions, we are
-re-architecting the TCP implementation to use fewer goroutines. Performance
-today is good enough for most applications and we are making steady
-improvements. For example, since May of 2019, we have improved the Netstack
-runsc
-[iperf3 download benchmark](https://github.com/google/gvisor/tree/master/test/benchmarks/network)
-score by roughly 15% and upload score by around 10,000X. Current numbers are
-about 17 Gbps download and about 8 Gbps upload versus about 42 Gbps and 43 Gbps
-for native (Linux) respectively.
-
-## An alternative
-
-We also offer an alternative network mode: passthrough. This name can be
-misleading as syscalls are never passed through from the app to the Host OS.
-Instead, the passthrough mode implements networking in gVisor using the Host
-OS's network stack. (This mode is called
-[hostinet](https://github.com/google/gvisor/tree/master/pkg/sentry/socket/hostinet)
-in the codebase.) Passthrough mode can improve performance for some use cases as
-the Host OS's network stack has had an enormous number of person-years poured
-into making it highly performant. However, there is a rather large downside to
-using passthrough mode: it weakens gVisor's security model by increasing the
-Host OS's Attack Surface. This is because using the Host OS's network stack
-requires the Sentry to use the Host OS's
-[Berkeley socket interface](https://en.wikipedia.org/wiki/Berkeley_sockets). The
-Berkeley socket interface is a much larger API surface than the packet interface
-that our network stack uses. When passthrough mode is in use, the Sentry is
-allowed to use
-[15 additional syscalls](https://github.com/google/gvisor/blob/b1576e533223e98ebe4bd1b82b04e3dcda8c4bf1/runsc/boot/filter/config.go#L312-L517).
-Further, this set of syscalls includes some that allow the Sentry to create file
-descriptors, something that
-[we don't normally allow](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#sentry-host-os-interface)
-as it opens up classes of file-based attacks.
-
-There are some networking features that we can't implement on top of syscalls
-that we feel are safe (most notably those behind
-[ioctl](http://man7.org/linux/man-pages/man2/ioctl.2.html)) and therefore are
-not supported. Because of this, we actually support fewer networking features in
-passthrough mode than we do in Netstack, reducing application compatibility.
-That's right: using our networking stack provides better overall application
-compatibility than using our passthrough mode.
-
-That said, gVisor with passthrough networking still provides a high level of
-isolation. Applications cannot specify host syscall arguments directly, and the
-sentry's seccomp policy restricts its syscall use significantly more than a
-general purpose seccomp policy.
-
-## Secure by Default
-
-The goal of the Secure by Default principle is to make it easy to securely
-sandbox containers. Of course, disabling network access entirely is the most
-secure option, but that is not practical for most applications. To make gVisor
-Secure by Default, we have made Netstack the default networking mode in gVisor
-as we believe that it provides significantly better isolation. For this reason
-we strongly caution users from changing the default unless Netstack flat out
-won't work for them. The passthrough mode option is still provided, but we want
-users to make an informed decision when selecting it.
-
-Another way in which gVisor makes it easy to securely sandbox containers is by
-allowing applications to run unmodified, with no special configuration needed.
-In order to do this, gVisor needs to support all of the features and syscalls
-that applications use. Neither seccomp nor gVisor's passthrough mode can do this
-as applications commonly use syscalls which are too dangerous to be included in
-a secure policy. Even if this dream isn't fully realized today, gVisor's
-architecture with Netstack makes this possible.
-
-## Give Netstack a Try
-
-If you haven't already, try running a workload in gVisor with Netstack. You can
-find instructions on how to get started in our
-[Quick Start](/docs/user_guide/quick_start/docker/). We want to hear about both
-your successes and any issues you encounter. We welcome your contributions,
-whether that be verbal feedback or code contributions, via our
-[Gitter channel](https://gitter.im/gvisor/community),
-[email list](https://groups.google.com/forum/#!forum/gvisor-users),
-[issue tracker](https://gvisor.dev/issue/new), and
-[Github repository](https://github.com/google/gvisor). Feel free to express
-interest in an [open issue](https://gvisor.dev/issue/), or reach out if you
-aren't sure where to start.
diff --git a/website/blog/2020-09-18-containing-a-real-vulnerability.md b/website/blog/2020-09-18-containing-a-real-vulnerability.md
deleted file mode 100644
index 8a6f7bbf1..000000000
--- a/website/blog/2020-09-18-containing-a-real-vulnerability.md
+++ /dev/null
@@ -1,226 +0,0 @@
-# Containing a Real Vulnerability
-
-In the previous two posts we talked about gVisor's
-[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
-as well as how those are applied in the
-[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
-Recently, a new container escape vulnerability
-([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
-was announced that ties these topics well together. gVisor is
-[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
-issue, but it provides an interesting case study to continue our exploration of
-gVisor's security. While gVisor is not immune to vulnerabilities,
-[we take several steps](https://gvisor.dev/security/) to minimize the impact and
-remediate if a vulnerability is found.
-
-## Escaping the Container
-
-First, let’s describe how the discovered vulnerability works. There are numerous
-ways one can send and receive bytes over the network with Linux. One of the most
-performant ways is to use a ring buffer, which is a memory region shared by the
-application and the kernel. These rings are created by calling
-[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
-[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
-receiving and
-[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
-sending packets.
-
-The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
-enabled. There is another option
-([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
-asks the kernel to leave some space in the ring buffer before each packet for
-anything the application needs, e.g. control structures. When a packet is
-received, the kernel calculates where to copy the packet to, taking the amount
-reserved before each packet into consideration. If the amount reserved is large,
-the kernel performed an incorrect calculation which could cause an overflow
-leading to an out-of-bounds write of up to 10 bytes, controlled by the attacker.
-The data in the write is easily controlled using the loopback to send a crafted
-packet and receiving it using a `PACKET_RX_RING` with a carefully selected
-`PACKET_RESERVE` size.
-
-```c
-static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
- struct packet_type *pt, struct net_device *orig_dev)
-{
-// ...
- if (sk->sk_type == SOCK_DGRAM) {
- macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
- po->tp_reserve;
- } else {
- unsigned int maclen = skb_network_offset(skb);
- // tp_reserve is unsigned int, netoff is unsigned short.
- // Addition can overflow netoff
- netoff = TPACKET_ALIGN(po->tp_hdrlen +
- (maclen < 16 ? 16 : maclen)) +
- po->tp_reserve;
- if (po->has_vnet_hdr) {
- netoff += sizeof(struct virtio_net_hdr);
- do_vnet = true;
- }
- // Attacker controls netoff and can make macoff be smaller
- // than sizeof(struct virtio_net_hdr)
- macoff = netoff - maclen;
- }
-// ...
- // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
- // resulting in a pointer before h.raw
- if (do_vnet &&
- virtio_net_hdr_from_skb(skb, h.raw + macoff -
- sizeof(struct virtio_net_hdr),
- vio_le(), true, 0)) {
-// ...
-```
-
-The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
-capability is required to create the socket above. However, in order to support
-common debugging tools like `ping` and `tcpdump`, Docker containers, including
-those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
-able to trigger this vulnerability to elevate privileges and escape the
-container.
-
-Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
-how gVisor could prevent the escape even if a similar vulnerability existed
-inside gVisor’s kernel.
-
-## Default Protections
-
-gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets
-which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
-to support in a sandbox environment. While it allows great customizations for
-essential tools like `ping`, it may allow packets to be written to the network
-without any validation. In general, allowing an untrusted application to write
-crafted packets to the network is a questionable idea and a historical source of
-vulnerabilities. With that in mind, if `CAP_NET_RAW` is enabled by default, it
-would not be _secure by default_ to run untrusted applications.
-
-After multiple discussions when raw sockets were first implemented, we decided
-to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
-application**. Instead, enabling raw sockets in gVisor requires the admin to set
-`--net-raw` flag to runsc when configuring the runtime, in addition to requiring
-the `CAP_NET_RAW` capability in the application. It comes at the expense that
-some tools may not work out of the box, but as part of our
-[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
-principle, we felt that it was important for the “less secure” configuration to
-be explicit.
-
-Since this bug was due to an overflow in the specific Linux implementation of
-the packet ring, gVisor's raw socket implementation is not affected. However, if
-there were a vulnerability in gVisor, containers would not be allowed to exploit
-it by default.
-
-As an alternative way to implement this same constraint, Kubernetes allows
-[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
-to be configured to customize requests. Cloud providers can use this to
-implement more stringent policies. For example, GKE implements an admission
-controller for gVisor that
-[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
-unless it has been explicitly set in the pod spec.
-
-## Isolated Kernel
-
-gVisor has its own application kernel, called the Sentry, that is distinct from
-the host kernel. Just like what you would expect from a kernel, gVisor has a
-memory management subsystem, virtual file system, and a full network stack. The
-host network is only used as a transport to carry packets in and out the
-sandbox[^1]. The loopback interface which is used in the exploit stays
-completely inside the sandbox, never reaching the host.
-
-Therefore, even if the Sentry was vulnerable to the attack, there would be two
-factors that would prevent a container escape from happening. First, the
-vulnerability would be limited to the Sentry, and the attacker would compromise
-only the application kernel, bound by a restricted set of
-[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in
-depth below. Second, the Sentry is a distinct implementation of the API, written
-in Go, which provides bounds checking that would have likely prevented access
-past the bounds of the shared region (e.g. see
-[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
-or
-[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
-which have similar shared regions).
-
-Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
-of isolation and a pod can run multiple containers. In other words, each pod is
-a gVisor instance, and each container is a set of processes running inside
-gVisor, isolated via Sentry-internal namespaces like regular containers inside a
-pod. If there were a vulnerability in gVisor, the privilege escalation would
-allow a container inside the pod to break out to other **containers inside the
-same pod**, but the container still **cannot break out of the pod**.
-
-## Defense in Depth
-
-gVisor follows a
-[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
-that the system should have two layers of protection, and those layers should
-require different compromises to be broken. We apply this principle by assuming
-that the Sentry (first layer of defense)
-[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
-In order to protect the host kernel from a compromised Sentry, we wrap it around
-many security and isolations features to ensure only the minimal set of
-functionality from the host kernel is exposed.
-
-![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")
-
-First, the sandbox runs inside a cgroup that can limit and throttle host
-resources being used. Second, the sandbox joins empty namespaces, including user
-and mount, to further isolate from the host. Next, it changes the process root
-to a read-only directory that contains only `/proc` and nothing else. Then, it
-executes with the unprivileged user/group
-[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
-capabilities stripped. Last and most importantly, a seccomp filter is added to
-tightly restrict what parts of the Linux syscall surface that gVisor is allowed
-to access. The allowed host surface is a far smaller set of syscalls than the
-Sentry implements for applications to use. Not only restricting the syscall
-being called, but also checking that arguments to these syscalls are within the
-expected set. Dangerous syscalls like <code>execve(2)</code>,
-<code>open(2)</code>, and <code>socket(2)</code> are prohibited, thus an
-attacker isn’t able to execute binaries or acquire new resources on the host.
-
-if there were a vulnerability in gVisor that allowed an attacker to execute code
-inside the Sentry, the attacker still has extremely limited privileges on the
-host. In fact, a compromised Sentry is much more restricted than a
-non-compromised regular container. For CVE-2020-14386 in particular, the attack
-would be blocked by more than one security layer: non-privileged user, no
-capability, and seccomp filters.
-
-Although the surface is drastically reduced, there is still a chance that there
-is a vulnerability in one of the allowed syscalls. That’s why it’s important to
-keep the surface small and carefully consider what syscalls are allowed. You can
-find the full set of allowed syscalls
-[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).
-
-Another possible attack vector is resources that are present in the Sentry, like
-open file descriptors. The Sentry has file descriptors that an attacker could
-potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
-endpoint that allows external communication with the Sentry, and a Netstack
-endpoint that connects the sandbox to the network. The Netstack endpoint in
-particular is a concern because it gives direct access to the network. It’s an
-`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
-network. In the normal case, Netstack assembles packets that go out the network,
-giving the container control over only the payload. But if the Sentry is
-compromised, an attacker can craft packets to the network. In many ways this is
-similar to anyone sending random packets over the internet, but still this is a
-place where the host kernel surface exposed is larger than we would like it to
-be.
-
-## Conclusion
-
-Security comes with many tradeoffs that are often hard to make, such as the
-decision to disable raw sockets by default. However, these tradeoffs have served
-us well, and we've found them to have paid off over time. CVE-2020-14386 offers
-great insight into how multiple layers of protection can be effective against
-such an attack.
-
-We cannot guarantee that a container escape will never happen in gVisor, but we
-do our best to make it as hard as we possibly can.
-
-If you have not tried gVisor yet, it’s easier than you think. Just follow the
-steps [here](https://gvisor.dev/docs/user_guide/install/).
-<br>
-<br>
-
---------------------------------------------------------------------------------
-
-[^1]: Those packets are eventually handled by the host, as it needs to route
- them to local containers or send them out the NIC. The packet will be
- handled by many switches, routers, proxies, servers, etc. along the way,
- which may be subject to their own vulnerabilities.
diff --git a/website/blog/2020-10-22-platform-portability.md b/website/blog/2020-10-22-platform-portability.md
deleted file mode 100644
index 4d82940f9..000000000
--- a/website/blog/2020-10-22-platform-portability.md
+++ /dev/null
@@ -1,120 +0,0 @@
-# Platform Portability
-
-Hardware virtualization is often seen as a requirement to provide an additional
-isolation layer for untrusted applications. However, hardware virtualization
-requires expensive bare-metal machines or cloud instances to run safely with
-good performance, increasing cost and complexity for Cloud users. gVisor,
-however, takes a more flexible approach.
-
-One of the pillars of gVisor's architecture is portability, allowing it to run
-anywhere that runs Linux. Modern Cloud-Native applications run in containers in
-many different places, from bare metal to virtual machines, and can't always
-rely on nested virtualization. It is important for gVisor to be able to support
-the environments where you run containers.
-
-gVisor achieves portability through an abstraction called a _Platform_.
-Platforms can have many implementations, and each implementation can cover
-different environments, making use of available software or hardware features.
-
-## Background
-
-Before we can understand how gVisor achieves portability using platforms, we
-should take a step back and understand how applications interact with their
-host.
-
-Container sandboxes can provide an isolation layer between the host and
-application by virtualizing one of the layers below it, including the hardware
-or operating system. Many sandboxes virtualize the hardware layer by running
-applications in virtual machines. gVisor takes a different approach by
-virtualizing the OS layer.
-
-When an application is run in a normal situation the host operating system loads
-the application into user memory and schedules it for execution. The operating
-system scheduler eventually schedules the application to a CPU and begins
-executing it. It then handles the application's requests, such as for memory and
-the lifecycle of the application. gVisor virtualizes these interactions, such as
-system calls, and context switching that happen between an application and OS.
-
-[System calls](https://en.wikipedia.org/wiki/System_call) allow applications to
-ask the OS to perform some task for it. System calls look like a normal function
-call in most programming languages though works a bit differently under the
-hood. When an application system call is encountered some special processing
-takes place to do a
-[context switch](https://en.wikipedia.org/wiki/Context_switch) into kernel mode
-and begin executing code in the kernel before returning a result to the
-application. Context switching may happen in other situations as well. For
-example, to respond to an interrupt.
-
-## The Platform Interface
-
-gVisor provides a sandbox which implements the Linux OS interface, intercepting
-OS interactions such as system calls and implements them in the sandbox kernel.
-
-It does this to limit interactions with the host, and protect the host from an
-untrusted application running in the sandbox. The Platform is the bottom layer
-of gVisor which provides the environment necessary for gVisor to control and
-manage applications. In general, the Platform must:
-
-1. Provide the ability to create and manage memory address spaces.
-2. Provide execution contexts for running applications in those memory address
- spaces.
-3. Provide the ability to change execution context and return control to gVisor
- at specific times (e.g. system call, page fault)
-
-This interface is conceptually simple, but very powerful. Since the Platform
-interface only requires these three capabilities, it gives gVisor enough control
-for it to act as the application's OS, while still allowing the use of very
-different isolation technologies under the hood. You can learn more about the
-Platform interface in the
-[Platform Guide](https://gvisor.dev/docs/architecture_guide/platforms/).
-
-## Implementations of the Platform Interface
-
-While gVisor can make use of technologies like hardware virtualization, it
-doesn't necessarily rely on any one technology to provide a similar level of
-isolation. The flexibility of the Platform interface allows for implementations
-that use technologies other than hardware virtualization. This allows gVisor to
-run in VMs without nested virtualization, for example. By providing an
-abstraction for the underlying platform, each implementation can make various
-tradeoffs regarding performance or hardware requirements.
-
-Currently gVisor provides two gVisor Platform implementations; the Ptrace
-Platform, and the KVM Platform, each using very different methods to implement
-the Platform interface.
-
-![gVisor Platforms](../../../../../docs/architecture_guide/platforms/platforms.png "Platforms")
-
-The Ptrace Platform uses
-[PTRACE\_SYSEMU](http://man7.org/linux/man-pages/man2/ptrace.2.html) to trap
-syscalls, and uses the host for memory mapping and context switching. This
-platform can run anywhere that ptrace is available, which includes most Linux
-systems, VMs or otherwise.
-
-The KVM Platform uses virtualization, but in an unconventional way. gVisor runs
-in a virtual machine but as both guest OS and VMM, and presents no virtualized
-hardware layer. This provides a simpler interface that can avoid hardware
-initialization for fast start up, while taking advantage of hardware
-virtualization support to improve memory isolation and performance of context
-switching.
-
-The flexibility of the Platform interface allows for a lot of room to improve
-the existing KVM and ptrace platforms, as well as the ability to utilize new
-methods for improving gVisor's performance or portability in future Platform
-implementations.
-
-## Portability
-
-Through the Platform interface, gVisor is able to support bare metal, virtual
-machines, and Cloud environments while still providing a highly secure sandbox
-for running untrusted applications. This is especially important for Cloud and
-Kubernetes users because it allows gVisor to run anywhere that Kubernetes can
-run and provide similar experiences in multi-region, hybrid, multi-platform
-environments.
-
-Give gVisor's open source platforms a try. Using a Platform is as easy as
-providing the `--platform` flag to `runsc`. See the documentation on
-[changing platforms](https://gvisor.dev/docs/user_guide/platforms/) for how to
-use different platforms with Docker. We would love to hear about your experience
-so come chat with us in our
-[Gitter channel](https://gitter.im/gvisor/community), or send us an
-[issue on Github](https://gvisor.dev/issue) if you run into any problems.
diff --git a/website/blog/BUILD b/website/blog/BUILD
deleted file mode 100644
index 17beb721f..000000000
--- a/website/blog/BUILD
+++ /dev/null
@@ -1,58 +0,0 @@
-load("//website:defs.bzl", "doc", "docs")
-
-package(
- default_visibility = ["//website:__pkg__"],
- licenses = ["notice"],
-)
-
-exports_files(["index.html"])
-
-doc(
- name = "security_basics",
- src = "2019-11-18-security-basics.md",
- authors = [
- "jsprad",
- "zkoopmans",
- ],
- layout = "post",
- permalink = "/blog/2019/11/18/gvisor-security-basics-part-1/",
-)
-
-doc(
- name = "networking_security",
- src = "2020-04-02-networking-security.md",
- authors = [
- "igudger",
- ],
- layout = "post",
- permalink = "/blog/2020/04/02/gvisor-networking-security/",
-)
-
-doc(
- name = "containing_a_real_vulnerability",
- src = "2020-09-18-containing-a-real-vulnerability.md",
- authors = [
- "fvoznika",
- ],
- layout = "post",
- permalink = "/blog/2020/09/18/containing-a-real-vulnerability/",
-)
-
-doc(
- name = "platform_portability",
- src = "2020-10-22-platform-portability.md",
- authors = [
- "ianlewis",
- "mpratt",
- ],
- layout = "post",
- permalink = "/blog/2020/10/22/platform-portability/",
-)
-
-docs(
- name = "posts",
- deps = [
- ":" + rule
- for rule in existing_rules()
- ],
-)
diff --git a/website/blog/README.md b/website/blog/README.md
deleted file mode 100644
index e1d685288..000000000
--- a/website/blog/README.md
+++ /dev/null
@@ -1,62 +0,0 @@
-# gVisor blog
-
-The gVisor blog is owned and run by the gVisor team.
-
-## Contact
-
-Reach out to us on [gitter](https://gitter.im/gvisor/community) or the
-[mailing list](https://groups.google.com/forum/#!forum/gvisor-users) if you
-would like to write a blog post.
-
-## Submit a Post
-
-Anyone can write a blog post and submit it for review. Purely commercial content
-or vendor pitches are not allowed. Please refer to the
-[blog guidelines](#blog-guidelines) for more guidance about content is that
-allowed.
-
-To submit a blog post, follow the steps below.
-
-1. [Sign the Contributor License Agreements](https://gvisor.dev/contributing/)
- if you have not yet done so.
-1. Familiarize yourself with the Markdown format for the
- [existing blog posts](https://github.com/google/gvisor/tree/master/website/blog).
-1. Write your blog post in a text editor of your choice.
-1. (Optional) If you need help with markdown, check out
- [StakEdit](https://stackedit.io/app#) or read
- [Jekyll's formatting reference](https://jekyllrb.com/docs/posts/#creating-posts)
- for more information.
-1. Click **Add file** > **Create new file**.
-1. Paste your content into the editor and save it. Name the file in the
- following way: *[BLOG] Your proposed title* , but don’t put the date in the
- file name. The blog reviewers will work with you on the final file name, and
- the date on which the blog will be published.
-1. When you save the file, GitHub will walk you through the pull request (PR)
- process.
-1. Send us a message on [gitter](https://gitter.im/gvisor/community) with a
- link to your recently created PR.
-1. A reviewer will be assigned to the pull request. They check your submission,
- and work with you on feedback and final details. When the pull request is
- approved, the blog will be scheduled for publication.
-
-### Blog Guidelines {#blog-guidelines}
-
-#### Suitable content:
-
-- **Original content only**
-- gVisor features or project updates
-- Tutorials and demos
-- Use cases
-- Content that is specific to a vendor or platform about gVisor installation
- and use
-
-#### Unsuitable Content:
-
-- Blogs with no content relevant to gVisor
-- Vendor pitches
-
-## Review Process
-
-Each blog post should be approved by at least one person on the team. Once all
-of the review comments have been addressed and approved, a member of the team
-will schedule publication of the blog post.
diff --git a/website/blog/index.html b/website/blog/index.html
deleted file mode 100644
index 272917fc4..000000000
--- a/website/blog/index.html
+++ /dev/null
@@ -1,27 +0,0 @@
----
-title: Blog
-layout: blog
-feed: true
-pagination:
- enabled: true
----
-
-{% for post in paginator.posts %}
-<div>
- <h2><a href="{{ post.url }}">{{ post.title }}</a></h2>
- <div class="blog-meta">
- {% include byline.html authors=post.authors date=post.date %}
- </div>
- <p>{{ post.excerpt | strip_html }}</p>
- <p><a href="{{ post.url }}">Full Post &raquo;</a></p>
-</div>
-{% endfor %}
-
-{% if paginator.total_pages > 1 %}
-{% include paginator.html %}
-{% endif %}
-
-<hr>
-
-If you would like to contribute to the gVisor blog check out the
-<a href="https://github.com/google/gvisor/tree/master/website/blog">instructions</a>.