Diffstat (limited to 'website/blog')
-rw-r--r-- website/blog/2019-11-18-security-basics.md | 6
-rw-r--r-- website/blog/2020-09-18-containing-a-real-vulnerability.md | 226
-rw-r--r-- website/blog/2020-10-22-platform-portability.md | 120
-rw-r--r-- website/blog/BUILD | 21
4 files changed, 371 insertions, 2 deletions
diff --git a/website/blog/2019-11-18-security-basics.md b/website/blog/2019-11-18-security-basics.md
index 76bbabc13..b6cf57a77 100644
--- a/website/blog/2019-11-18-security-basics.md
+++ b/website/blog/2019-11-18-security-basics.md
@@ -188,7 +188,7 @@ for direct access to some files. And most files will be remotely accessed
through the Gofers, in which case no FDs are donated to the Sentry.
The Sentry itself is only allowed access to specific
-[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/boot/config.go).
+[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go).
Without networking, the Sentry needs 53 host syscalls in order to function, and
with networking, it uses an additional 15[^8]. By limiting the whitelist to only
these needed syscalls, we radically reduce the amount of host OS attack surface.
@@ -279,8 +279,10 @@ weaknesses of each gVisor component.
We will also use it to introduce Google's Vulnerability Reward Program[^14], and
other ways the community can contribute to help make gVisor safe, fast and
stable.
+<br>
+<br>
-## Notes
+--------------------------------------------------------------------------------
[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design)
[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/)
diff --git a/website/blog/2020-09-18-containing-a-real-vulnerability.md b/website/blog/2020-09-18-containing-a-real-vulnerability.md
new file mode 100644
index 000000000..8a6f7bbf1
--- /dev/null
+++ b/website/blog/2020-09-18-containing-a-real-vulnerability.md
@@ -0,0 +1,226 @@
+# Containing a Real Vulnerability
+
+In the previous two posts we talked about gVisor's
+[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
+as well as how those are applied in the
+[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
+Recently, a new container escape vulnerability
+([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
+was announced that ties these topics together well. gVisor is
+[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
+issue, but it provides an interesting case study to continue our exploration of
+gVisor's security. While gVisor is not immune to vulnerabilities,
+[we take several steps](https://gvisor.dev/security/) to minimize the impact and
+remediate if a vulnerability is found.
+
+## Escaping the Container
+
+First, let’s describe how the discovered vulnerability works. There are numerous
+ways one can send and receive bytes over the network with Linux. One of the most
+performant ways is to use a ring buffer, which is a memory region shared by the
+application and the kernel. These rings are created by calling
+[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
+[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
+receiving and
+[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
+sending packets.
+
+The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
+enabled. There is another option
+([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
+asks the kernel to leave some space in the ring buffer before each packet for
+anything the application needs, e.g. control structures. When a packet is
+received, the kernel calculates where to copy the packet to, taking the amount
+reserved before each packet into consideration. If the amount reserved was
+large, the kernel performed an incorrect calculation that could overflow,
+leading to an out-of-bounds write of up to 10 bytes controlled by the attacker.
+The data in the write is easily controlled using the loopback to send a crafted
+packet and receiving it using a `PACKET_RX_RING` with a carefully selected
+`PACKET_RESERVE` size.
+
+```c
+static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
+                       struct packet_type *pt, struct net_device *orig_dev)
+{
+// ...
+    if (sk->sk_type == SOCK_DGRAM) {
+        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
+                          po->tp_reserve;
+    } else {
+        unsigned int maclen = skb_network_offset(skb);
+        // tp_reserve is unsigned int, netoff is unsigned short.
+        // Addition can overflow netoff
+        netoff = TPACKET_ALIGN(po->tp_hdrlen +
+                               (maclen < 16 ? 16 : maclen)) +
+                 po->tp_reserve;
+        if (po->has_vnet_hdr) {
+            netoff += sizeof(struct virtio_net_hdr);
+            do_vnet = true;
+        }
+        // Attacker controls netoff and can make macoff be smaller
+        // than sizeof(struct virtio_net_hdr)
+        macoff = netoff - maclen;
+    }
+// ...
+    // "macoff - sizeof(struct virtio_net_hdr)" can be negative,
+    // resulting in a pointer before h.raw
+    if (do_vnet &&
+        virtio_net_hdr_from_skb(skb, h.raw + macoff -
+                                sizeof(struct virtio_net_hdr),
+                                vio_le(), true, 0)) {
+// ...
+```
+
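The truncation described in the comments above can be reproduced outside the kernel. The following is a minimal sketch in Go, not kernel code: `vnetHdrStart` and `tpacketAlign` are hypothetical helpers, the header and MAC lengths are illustrative values, and the reserve size is simply one that makes the 16-bit sum wrap. It assumes an alignment of 16 (`TPACKET_ALIGNMENT`) and a 10-byte `struct virtio_net_hdr`.

```go
package main

import "fmt"

const (
	tpacketAlignment = 16 // TPACKET_ALIGNMENT in the Linux UAPI headers
	vnetHdrLen       = 10 // sizeof(struct virtio_net_hdr)
)

// tpacketAlign rounds x up to the next multiple of tpacketAlignment,
// like the kernel's TPACKET_ALIGN macro.
func tpacketAlign(x uint32) uint32 {
	return (x + tpacketAlignment - 1) &^ (tpacketAlignment - 1)
}

// vnetHdrStart mirrors the non-SOCK_DGRAM branch of the excerpt with
// has_vnet_hdr set. It returns the offset, relative to the start of the
// ring frame (h.raw), at which the virtio_net_hdr would be written.
func vnetHdrStart(hdrLen, macLen, reserve uint32) int {
	m := macLen
	if m < 16 {
		m = 16
	}
	// netoff is an unsigned short in the kernel, so the 32-bit sum is
	// silently truncated to 16 bits here.
	netoff := uint16(tpacketAlign(hdrLen+m) + reserve)
	netoff += vnetHdrLen
	macoff := netoff - uint16(macLen)
	return int(macoff) - vnetHdrLen
}

func main() {
	// hdrLen and macLen are illustrative values, not read from a real socket.
	const hdrLen, macLen = 32, 14
	fmt.Println("reserve=0:    ", vnetHdrStart(hdrLen, macLen, 0))
	fmt.Println("reserve=65492:", vnetHdrStart(hdrLen, macLen, 65492))
}
```

With no reserve the header lands inside the frame. With a reserve of 65492 the sum wraps to a small value and the computed start is 10 bytes before the frame, which matches the size of the out-of-bounds write described above.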
+The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
+capability is required to create the socket above. However, in order to support
+common debugging tools like `ping` and `tcpdump`, Docker containers, including
+those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
+able to trigger this vulnerability to elevate privileges and escape the
+container.
+
+Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
+how gVisor could prevent the escape even if a similar vulnerability existed
+inside gVisor’s kernel.
+
+## Default Protections
+
+gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets
+which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
+to support in a sandbox environment. While it allows great customizations for
+essential tools like `ping`, it may allow packets to be written to the network
+without any validation. In general, allowing an untrusted application to write
+crafted packets to the network is a questionable idea and a historical source of
+vulnerabilities. With that in mind, if `CAP_NET_RAW` is enabled by default, it
+would not be _secure by default_ to run untrusted applications.
+
+After multiple discussions when raw sockets were first implemented, we decided
+to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
+application**. Instead, enabling raw sockets in gVisor requires the admin to set
+the `--net-raw` flag on runsc when configuring the runtime, in addition to
+granting the `CAP_NET_RAW` capability to the application. The cost is that some
+tools may not work out of the box, but as part of our
+[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
+principle, we felt that it was important for the “less secure” configuration to
+be explicit.
+
+Since this bug was due to an overflow in the specific Linux implementation of
+the packet ring, gVisor's raw socket implementation is not affected. However, if
+there were a vulnerability in gVisor, containers would not be allowed to exploit
+it by default.
+
+As an alternative way to implement this same constraint, Kubernetes allows
+[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
+to be configured to customize requests. Cloud providers can use this to
+implement more stringent policies. For example, GKE implements an admission
+controller for gVisor that
+[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
+unless it has been explicitly set in the pod spec.
+
+## Isolated Kernel
+
+gVisor has its own application kernel, called the Sentry, that is distinct from
+the host kernel. Just like what you would expect from a kernel, gVisor has a
+memory management subsystem, virtual file system, and a full network stack. The
+host network is only used as a transport to carry packets in and out of the
+sandbox[^1]. The loopback interface, which is used in the exploit, stays
+completely inside the sandbox, never reaching the host.
+
+Therefore, even if the Sentry were vulnerable to the attack, there would be two
+factors that would prevent a container escape from happening. First, the
+vulnerability would be limited to the Sentry, and the attacker would compromise
+only the application kernel, bound by a restricted set of
+[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in
+depth below. Second, the Sentry is a distinct implementation of the API, written
+in Go, which provides bounds checking that would have likely prevented access
+past the bounds of the shared region (e.g. see
+[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
+or
+[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
+which have similar shared regions).
+
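To illustrate the second factor, here is a minimal sketch of how Go's run-time bounds checks behave when code computes a bad offset into a shared region. This is not Sentry code: `writeAt` is a hypothetical helper, and the region is an ordinary byte slice standing in for a shared buffer.

```go
package main

import "fmt"

// writeAt copies src into region starting at offset off. Every indexed store
// is bounds-checked by the Go runtime, so an index outside the slice panics
// instead of writing to adjacent memory. The deferred recover turns that
// panic into an error.
func writeAt(region []byte, off int, src []byte) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("write rejected: %v", r)
		}
	}()
	for i, b := range src {
		region[off+i] = b
	}
	return nil
}

func main() {
	region := make([]byte, 64)
	hdr := make([]byte, 10)
	fmt.Println(writeAt(region, 34, hdr))  // in bounds
	fmt.Println(writeAt(region, -10, hdr)) // starts before the region
}
```

The same miscalculated offset that produces a silent out-of-bounds write in C becomes a recoverable error here. A bounds check does not fix the underlying logic bug, but it keeps the damage from spreading beyond the slice.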
+Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
+of isolation and a pod can run multiple containers. In other words, each pod is
+a gVisor instance, and each container is a set of processes running inside
+gVisor, isolated via Sentry-internal namespaces like regular containers inside a
+pod. If there were a vulnerability in gVisor, the privilege escalation would
+allow a container inside the pod to break out to other **containers inside the
+same pod**, but the container still **cannot break out of the pod**.
+
+## Defense in Depth
+
+gVisor follows a
+[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
+that the system should have two layers of protection, and those layers should
+require different compromises to be broken. We apply this principle by assuming
+that the Sentry (first layer of defense)
+[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
+In order to protect the host kernel from a compromised Sentry, we wrap the
+Sentry in many security and isolation features to ensure that only a minimal
+set of functionality from the host kernel is exposed.
+
+![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")
+
+First, the sandbox runs inside a cgroup that can limit and throttle host
+resources being used. Second, the sandbox joins empty namespaces, including user
+and mount, to further isolate from the host. Next, it changes the process root
+to a read-only directory that contains only `/proc` and nothing else. Then, it
+executes with the unprivileged user/group
+[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
+capabilities stripped. Last and most importantly, a seccomp filter is added to
+tightly restrict which parts of the Linux syscall surface gVisor is allowed to
+access. The allowed host surface is a far smaller set of syscalls than the
+Sentry implements for applications to use. The filter not only restricts which
+syscalls can be called, but also checks that the arguments to these syscalls
+are within the expected set. Dangerous syscalls like <code>execve(2)</code>,
+<code>open(2)</code>, and <code>socket(2)</code> are prohibited, so an
+attacker isn’t able to execute binaries or acquire new resources on the host.
+
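The idea of an allowlist that checks both the syscall and its arguments can be modeled in a few lines. The sketch below is a toy model only: it is neither a real seccomp-bpf program nor runsc's filter code, and `allowlist`, `argCheck`, and `examplePolicy` are hypothetical names with a made-up policy.

```go
package main

import "fmt"

// argCheck reports whether a syscall's arguments are acceptable.
// A nil argCheck means any arguments are allowed.
type argCheck func(args [6]uint64) bool

// allowlist maps a syscall name to its argument check. Anything that is not
// listed is denied.
type allowlist map[string]argCheck

// permits reports whether the named syscall may run with the given arguments.
func (a allowlist) permits(name string, args [6]uint64) bool {
	check, ok := a[name]
	if !ok {
		return false
	}
	return check == nil || check(args)
}

// examplePolicy is a made-up policy for illustration only; it is not the
// Sentry's real filter.
func examplePolicy() allowlist {
	const tcgets = 0x5401 // TCGETS ioctl request number on Linux x86-64
	return allowlist{
		"read":  nil,
		"write": nil,
		"ioctl": func(args [6]uint64) bool { return args[1] == tcgets },
	}
}

func main() {
	p := examplePolicy()
	fmt.Println(p.permits("read", [6]uint64{3}))
	fmt.Println(p.permits("ioctl", [6]uint64{3, 0x5401}))
	fmt.Println(p.permits("ioctl", [6]uint64{3, 0x5412}))
	fmt.Println(p.permits("execve", [6]uint64{}))
}
```

The default-deny lookup is the important design choice: a syscall that nobody thought about is rejected, and a permitted syscall with an unexpected argument is rejected as well.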
+If there were a vulnerability in gVisor that allowed an attacker to execute code
+inside the Sentry, the attacker would still have extremely limited privileges on
+the host. In fact, a compromised Sentry is much more restricted than a
+non-compromised regular container. For CVE-2020-14386 in particular, the attack
+would be blocked by more than one security layer: the non-privileged user, the
+lack of capabilities, and the seccomp filters.
+
+Although the surface is drastically reduced, there is still a chance that there
+is a vulnerability in one of the allowed syscalls. That’s why it’s important to
+keep the surface small and carefully consider what syscalls are allowed. You can
+find the full set of allowed syscalls
+[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).
+
+Another possible attack vector is resources that are present in the Sentry, like
+open file descriptors. The Sentry has file descriptors that an attacker could
+potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
+endpoint that allows external communication with the Sentry, and a Netstack
+endpoint that connects the sandbox to the network. The Netstack endpoint in
+particular is a concern because it gives direct access to the network. It’s an
+`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
+network. In the normal case, Netstack assembles packets that go out to the
+network, giving the container control over only the payload. But if the Sentry
+is compromised, an attacker can write crafted packets to the network. In many
+ways this is similar to anyone sending random packets over the internet, but it
+is still a place where the host kernel surface exposed is larger than we would
+like it to be.
+
+## Conclusion
+
+Security comes with many tradeoffs that are often hard to make, such as the
+decision to disable raw sockets by default. However, these tradeoffs have served
+us well, and we've found them to have paid off over time. CVE-2020-14386 offers
+great insight into how multiple layers of protection can be effective against
+such an attack.
+
+We cannot guarantee that a container escape will never happen in gVisor, but we
+do our best to make it as hard as we possibly can.
+
+If you have not tried gVisor yet, it’s easier than you think. Just follow the
+steps [here](https://gvisor.dev/docs/user_guide/install/).
+<br>
+<br>
+
+--------------------------------------------------------------------------------
+
+[^1]: Those packets are eventually handled by the host, as it needs to route
+ them to local containers or send them out the NIC. The packet will be
+ handled by many switches, routers, proxies, servers, etc. along the way,
+ which may be subject to their own vulnerabilities.
diff --git a/website/blog/2020-10-22-platform-portability.md b/website/blog/2020-10-22-platform-portability.md
new file mode 100644
index 000000000..4d82940f9
--- /dev/null
+++ b/website/blog/2020-10-22-platform-portability.md
@@ -0,0 +1,120 @@
+# Platform Portability
+
+Hardware virtualization is often seen as a requirement to provide an additional
+isolation layer for untrusted applications. However, hardware virtualization
+requires expensive bare-metal machines or cloud instances to run safely with
+good performance, increasing cost and complexity for Cloud users. gVisor,
+however, takes a more flexible approach.
+
+One of the pillars of gVisor's architecture is portability, allowing it to run
+anywhere that runs Linux. Modern Cloud-Native applications run in containers in
+many different places, from bare metal to virtual machines, and can't always
+rely on nested virtualization. It is important for gVisor to be able to support
+the environments where you run containers.
+
+gVisor achieves portability through an abstraction called a _Platform_.
+Platforms can have many implementations, and each implementation can cover
+different environments, making use of available software or hardware features.
+
+## Background
+
+Before we can understand how gVisor achieves portability using platforms, we
+should take a step back and understand how applications interact with their
+host.
+
+Container sandboxes can provide an isolation layer between the host and
+application by virtualizing one of the layers below it, including the hardware
+or operating system. Many sandboxes virtualize the hardware layer by running
+applications in virtual machines. gVisor takes a different approach by
+virtualizing the OS layer.
+
+Normally, when an application is run, the host operating system loads the
+application into user memory and schedules it for execution. The operating
+system scheduler eventually schedules the application to a CPU and begins
+executing it. It then handles the application's requests, such as requests for
+memory, and manages the lifecycle of the application. gVisor virtualizes these
+interactions, such as system calls and context switching, that happen between
+an application and the OS.
+
+[System calls](https://en.wikipedia.org/wiki/System_call) allow applications to
+ask the OS to perform some task for them. A system call looks like a normal
+function call in most programming languages, though it works a bit differently
+under the hood. When an application makes a system call, some special
+processing takes place to do a
+[context switch](https://en.wikipedia.org/wiki/Context_switch) into kernel mode
+and begin executing code in the kernel before returning a result to the
+application. Context switching may happen in other situations as well, for
+example to respond to an interrupt.
+
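As a small illustration of a system call reading like an ordinary function call, the sketch below asks for the process ID in two ways using Go's standard library. `pids` is a hypothetical helper written for this example.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pids returns the process ID obtained through the portable os package and
// through the lower-level syscall package. On Linux both end up asking the
// kernel via getpid(2); at the call site they look like plain function calls.
func pids() (viaOS, viaSyscall int) {
	return os.Getpid(), syscall.Getpid()
}

func main() {
	a, b := pids()
	fmt.Println(a == b)
}
```

Nothing in the source reveals the transition into kernel mode. That hidden transition is the boundary a sandbox has to intercept.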
+## The Platform Interface
+
+gVisor provides a sandbox that implements the Linux OS interface, intercepting
+OS interactions such as system calls and implementing them in the sandbox kernel.
+
+It does this to limit interactions with the host, and protect the host from an
+untrusted application running in the sandbox. The Platform is the bottom layer
+of gVisor which provides the environment necessary for gVisor to control and
+manage applications. In general, the Platform must:
+
+1. Provide the ability to create and manage memory address spaces.
+2. Provide execution contexts for running applications in those memory address
+ spaces.
+3. Provide the ability to change execution context and return control to gVisor
+ at specific times (e.g. system call, page fault).
+
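The three requirements above can be sketched as a small set of Go interfaces. This is an illustration under stated assumptions, not gVisor's actual Platform API: every type and method name below is hypothetical, and `fakePlatform` simply replays a scripted list of events so that the control loop can be run.

```go
package main

import "fmt"

// Event is the reason control returned from the application to the sandbox.
type Event int

const (
	Syscall Event = iota
	PageFault
	Exit
)

// AddressSpace covers requirement 1: a memory address space the sandbox manages.
type AddressSpace interface {
	Map(addr uintptr)
	Mapped(addr uintptr) bool
}

// Context covers requirements 2 and 3: it runs application code in an address
// space and hands control back when something needs the sandbox's attention.
type Context interface {
	Switch(as AddressSpace) Event
}

// Platform creates the two objects above.
type Platform interface {
	NewAddressSpace() AddressSpace
	NewContext() Context
}

// fakeSpace records mapped addresses in a map.
type fakeSpace map[uintptr]bool

func (s fakeSpace) Map(addr uintptr)         { s[addr] = true }
func (s fakeSpace) Mapped(addr uintptr) bool { return s[addr] }

// fakeContext returns scripted events one at a time, then Exit.
type fakeContext struct{ script []Event }

func (c *fakeContext) Switch(AddressSpace) Event {
	if len(c.script) == 0 {
		return Exit
	}
	e := c.script[0]
	c.script = c.script[1:]
	return e
}

// fakePlatform is a stand-in implementation used only for this sketch.
type fakePlatform struct{ script []Event }

func (p fakePlatform) NewAddressSpace() AddressSpace { return fakeSpace{} }
func (p fakePlatform) NewContext() Context           { return &fakeContext{script: p.script} }

// run is the sandbox-side loop: switch into the application, then handle
// whatever event brought control back.
func run(p Platform) (syscalls, faults int) {
	as, ctx := p.NewAddressSpace(), p.NewContext()
	for {
		switch ctx.Switch(as) {
		case Syscall:
			syscalls++ // a real kernel would emulate the syscall here
		case PageFault:
			faults++
			as.Map(0x1000) // placeholder address
		case Exit:
			return
		}
	}
}

func main() {
	fmt.Println(run(fakePlatform{script: []Event{PageFault, Syscall, Syscall}}))
}
```

Because `run` depends only on the interfaces, a different implementation can be swapped in without touching the loop, which is the property that makes different isolation technologies interchangeable.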
+This interface is conceptually simple, but very powerful. Since the Platform
+interface only requires these three capabilities, it gives gVisor enough control
+for it to act as the application's OS, while still allowing the use of very
+different isolation technologies under the hood. You can learn more about the
+Platform interface in the
+[Platform Guide](https://gvisor.dev/docs/architecture_guide/platforms/).
+
+## Implementations of the Platform Interface
+
+While gVisor can make use of technologies like hardware virtualization, it
+doesn't necessarily rely on any one technology to provide a similar level of
+isolation. The flexibility of the Platform interface allows for implementations
+that use technologies other than hardware virtualization. This allows gVisor to
+run in VMs without nested virtualization, for example. By providing an
+abstraction for the underlying platform, each implementation can make various
+tradeoffs regarding performance or hardware requirements.
+
+Currently, gVisor provides two Platform implementations, the Ptrace Platform
+and the KVM Platform, each using very different methods to implement the
+Platform interface.
+
+![gVisor Platforms](../../../../../docs/architecture_guide/platforms/platforms.png "Platforms")
+
+The Ptrace Platform uses
+[PTRACE\_SYSEMU](http://man7.org/linux/man-pages/man2/ptrace.2.html) to trap
+syscalls, and uses the host for memory mapping and context switching. This
+platform can run anywhere that ptrace is available, which includes most Linux
+systems, VMs or otherwise.
+
+The KVM Platform uses virtualization, but in an unconventional way. gVisor runs
+in a virtual machine but acts as both guest OS and VMM, and presents no
+virtualized hardware layer. This provides a simpler interface that can avoid
+hardware initialization for fast start up, while taking advantage of hardware
+virtualization support to improve memory isolation and performance of context
+switching.
+
+The flexibility of the Platform interface allows for a lot of room to improve
+the existing KVM and ptrace platforms, as well as the ability to utilize new
+methods for improving gVisor's performance or portability in future Platform
+implementations.
+
+## Portability
+
+Through the Platform interface, gVisor is able to support bare metal, virtual
+machines, and Cloud environments while still providing a highly secure sandbox
+for running untrusted applications. This is especially important for Cloud and
+Kubernetes users because it allows gVisor to run anywhere that Kubernetes can
+run and provide similar experiences in multi-region, hybrid, multi-platform
+environments.
+
+Give gVisor's open source platforms a try. Using a Platform is as easy as
+providing the `--platform` flag to `runsc`. See the documentation on
+[changing platforms](https://gvisor.dev/docs/user_guide/platforms/) for how to
+use different platforms with Docker. We would love to hear about your
+experience, so come chat with us in our
+[Gitter channel](https://gitter.im/gvisor/community), or send us an
+[issue on GitHub](https://gvisor.dev/issue) if you run into any problems.
diff --git a/website/blog/BUILD b/website/blog/BUILD
index 01c1f5a6e..17beb721f 100644
--- a/website/blog/BUILD
+++ b/website/blog/BUILD
@@ -28,6 +28,27 @@ doc(
permalink = "/blog/2020/04/02/gvisor-networking-security/",
)
+doc(
+ name = "containing_a_real_vulnerability",
+ src = "2020-09-18-containing-a-real-vulnerability.md",
+ authors = [
+ "fvoznika",
+ ],
+ layout = "post",
+ permalink = "/blog/2020/09/18/containing-a-real-vulnerability/",
+)
+
+doc(
+ name = "platform_portability",
+ src = "2020-10-22-platform-portability.md",
+ authors = [
+ "ianlewis",
+ "mpratt",
+ ],
+ layout = "post",
+ permalink = "/blog/2020/10/22/platform-portability/",
+)
+
docs(
name = "posts",
deps = [