Diffstat (limited to 'website/blog/2020-09-18-containing-a-real-vulnerability.md')
-rw-r--r-- website/blog/2020-09-18-containing-a-real-vulnerability.md | 224
1 files changed, 0 insertions, 224 deletions
diff --git a/website/blog/2020-09-18-containing-a-real-vulnerability.md b/website/blog/2020-09-18-containing-a-real-vulnerability.md
deleted file mode 100644
index b71ef63d9..000000000
--- a/website/blog/2020-09-18-containing-a-real-vulnerability.md
+++ /dev/null
@@ -1,224 +0,0 @@
-# Containing a Real Vulnerability
-
-In the previous two posts we talked about gVisor's
-[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/)
-as well as how those are applied in the
-[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/).
-Recently, a new container escape vulnerability
-([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386))
-was announced that ties these topics together well. gVisor is
-[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific
-issue, but it provides an interesting case study to continue our exploration of
-gVisor's security. While gVisor is not immune to vulnerabilities,
-[we take several steps](https://gvisor.dev/security/) to minimize the impact and
-remediate if a vulnerability is found.
-
-## Escaping the Container
-
-First, let’s describe how the discovered vulnerability works. There are numerous
-ways one can send and receive bytes over the network with Linux. One of the most
-performant ways is to use a ring buffer, which is a memory region shared by the
-application and the kernel. These rings are created by calling
-[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with
-[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
-receiving and
-[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for
-sending packets.
-
-The vulnerability is in the code that reads packets when `PACKET_RX_RING` is
-enabled. There is another option
-([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that
-asks the kernel to leave some space in the ring buffer before each packet for
-anything the application needs, e.g. control structures. When a packet is
-received, the kernel calculates where to copy the packet to, taking the amount
-reserved before each packet into consideration. If the amount reserved is large,
-the kernel performs an incorrect calculation that can overflow, leading to an
-out-of-bounds write of up to 10 bytes, controlled by the attacker.
-The data in the write is easily controlled using the loopback to send a crafted
-packet and receiving it using a `PACKET_RX_RING` with a carefully selected
-`PACKET_RESERVE` size.
-
-```c
-static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
-                       struct packet_type *pt, struct net_device *orig_dev)
-{
-// ...
-    if (sk->sk_type == SOCK_DGRAM) {
-        macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 +
-                          po->tp_reserve;
-    } else {
-        unsigned int maclen = skb_network_offset(skb);
-        // tp_reserve is unsigned int, netoff is unsigned short. Addition can overflow netoff.
-        netoff = TPACKET_ALIGN(po->tp_hdrlen +
-                               (maclen < 16 ? 16 : maclen)) +
-                 po->tp_reserve;
-        if (po->has_vnet_hdr) {
-            netoff += sizeof(struct virtio_net_hdr);
-            do_vnet = true;
-        }
-        // Attacker controls netoff and can make macoff be smaller than sizeof(struct virtio_net_hdr).
-        macoff = netoff - maclen;
-    }
-// ...
-    // "macoff - sizeof(struct virtio_net_hdr)" can be negative, resulting in a pointer before h.raw.
-    if (do_vnet &&
-        virtio_net_hdr_from_skb(skb, h.raw + macoff -
-                                sizeof(struct virtio_net_hdr),
-                                vio_le(), true, 0)) {
-// ...
-```
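-
-To make the arithmetic in the excerpt concrete, here is a minimal user-space
-sketch of the same calculation. It is not kernel code: `vnet_write_offset` is a
-hypothetical helper, the `tp_hdrlen` value used with it is purely illustrative,
-and `VNET_HDR_LEN` assumes the 10-byte `struct virtio_net_hdr`. It only models
-how storing the sum in an `unsigned short` lets a large `tp_reserve` wrap
-`netoff` around.
-
-```c
-/* Same alignment macro shape as the kernel's TPACKET_ALIGN (16-byte units). */
-#define TPACKET_ALIGNMENT 16u
-#define TPACKET_ALIGN(x) \
-    (((x) + TPACKET_ALIGNMENT - 1) & ~(TPACKET_ALIGNMENT - 1))
-
-/* Assumption: sizeof(struct virtio_net_hdr) is 10 bytes. */
-#define VNET_HDR_LEN 10u
-
-/*
- * Hypothetical helper: returns the offset, relative to the start of the ring
- * frame (h.raw), at which the virtio header would be written on the
- * non-SOCK_DGRAM path with has_vnet_hdr set. A negative result means the
- * write would start before the frame.
- */
-long vnet_write_offset(unsigned int tp_hdrlen, unsigned int maclen,
-                       unsigned int tp_reserve)
-{
-    unsigned short netoff, macoff;
-
-    /* The sum is computed as unsigned int, then truncated to 16 bits. */
-    netoff = (unsigned short)(TPACKET_ALIGN(tp_hdrlen +
-                                            (maclen < 16 ? 16 : maclen)) +
-                              tp_reserve);
-    netoff = (unsigned short)(netoff + VNET_HDR_LEN);
-    macoff = (unsigned short)(netoff - maclen);
-
-    return (long)macoff - (long)VNET_HDR_LEN;
-}
-```
-
-With these assumptions, `vnet_write_offset(68, 14, 0)` returns 82, well inside
-the frame, while `vnet_write_offset(68, 14, 65444)` returns -10: `netoff` wraps
-to 14, `macoff` becomes 0, and the header write would begin 10 bytes before
-the frame.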
-
-The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html)
-capability is required to create the socket above. However, in order to support
-common debugging tools like `ping` and `tcpdump`, Docker containers, including
-those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be
-able to trigger this vulnerability to elevate privileges and escape the
-container.
-
-Next, we are going to explore why this vulnerability doesn’t work in gVisor, and
-how gVisor could prevent the escape even if a similar vulnerability existed
-inside gVisor’s kernel.
-
-## Default Protections
-
-gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets,
-which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature
-to support in a sandbox environment. While they allow great customization for
-essential tools like `ping`, they may allow packets to be written to the network
-without any validation. In general, allowing an untrusted application to write
-crafted packets to the network is a questionable idea and a historical source of
-vulnerabilities. With that in mind, if `CAP_NET_RAW` were enabled by default, it
-would not be _secure by default_ to run untrusted applications.
-
-After multiple discussions when raw sockets were first implemented, we decided
-to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the
-application**. Instead, enabling raw sockets in gVisor requires the admin to set
-the `--net-raw` flag on runsc when configuring the runtime, in addition to
-requiring the `CAP_NET_RAW` capability in the application. This comes at the cost
-that some tools may not work out of the box, but as part of our
-[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default)
-principle, we felt that it was important for the “less secure” configuration to
-be explicit.
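-
-As an illustration, with Docker the flag can be passed through the runtime
-entry in `/etc/docker/daemon.json`. This is a sketch rather than a recommended
-configuration: the runtime name `runsc` and the install path
-`/usr/local/bin/runsc` are assumptions that depend on how gVisor was installed.
-
-```json
-{
-    "runtimes": {
-        "runsc": {
-            "path": "/usr/local/bin/runsc",
-            "runtimeArgs": [
-                "--net-raw"
-            ]
-        }
-    }
-}
-```
-
-Omitting `--net-raw` from `runtimeArgs` keeps the default, in which raw sockets
-stay disabled even for containers that hold `CAP_NET_RAW`.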
-
-Since this bug was due to an overflow in the specific Linux implementation of
-the packet ring, gVisor's raw socket implementation is not affected. However, if
-there were a vulnerability in gVisor, containers would not be allowed to exploit
-it by default.
-
-As an alternative way to implement this same constraint, Kubernetes allows
-[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/)
-to be configured to customize requests. Cloud providers can use this to
-implement more stringent policies. For example, GKE implements an admission
-controller for gVisor that
-[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities)
-unless it has been explicitly set in the pod spec.
-
-## Isolated Kernel
-
-gVisor has its own application kernel, called the Sentry, that is distinct from
-the host kernel. Just as you would expect from a kernel, gVisor has a
-memory management subsystem, virtual file system, and a full network stack. The
-host network is only used as a transport to carry packets in and out of the
-sandbox[^1]. The loopback interface, which is used in the exploit, stays
-completely inside the sandbox, never reaching the host.
-
-Therefore, even if the Sentry were vulnerable to the attack, there would be two
-factors that would prevent a container escape from happening. First, the
-vulnerability would be limited to the Sentry, and the attacker would compromise
-only the application kernel, bound by a restricted set of
-[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in
-depth below. Second, the Sentry is a distinct implementation of the API, written
-in Go, which provides bounds checking that would likely have prevented access
-past the bounds of the shared region (e.g. see
-[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor)
-or
-[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor),
-which have similar shared regions).
-
-Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit
-of isolation and a pod can run multiple containers. In other words, each pod is
-a gVisor instance, and each container is a set of processes running inside
-gVisor, isolated via Sentry-internal namespaces like regular containers inside a
-pod. If there were a vulnerability in gVisor, the privilege escalation would
-allow a container inside the pod to break out to other **containers inside the
-same pod**, but the container still **could not break out of the pod**.
-
-## Defense in Depth
-
-gVisor follows a
-[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf)
-that the system should have two layers of protection, and those layers should
-require different compromises to be broken. We apply this principle by assuming
-that the Sentry (first layer of defense)
-[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth).
-In order to protect the host kernel from a compromised Sentry, we wrap the Sentry
-in many security and isolation features to ensure only the minimal set of
-functionality from the host kernel is exposed.
-
-![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.")
-
-First, the sandbox runs inside a cgroup that can limit and throttle host
-resources being used. Second, the sandbox joins empty namespaces, including user
-and mount, to further isolate from the host. Next, it changes the process root
-to a read-only directory that contains only `/proc` and nothing else. Then, it
-executes with the unprivileged user/group
-[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all
-capabilities stripped. Last and most importantly, a seccomp filter is added to
-tightly restrict which parts of the Linux syscall surface gVisor is allowed
-to access. The allowed host surface is a far smaller set of syscalls than the
-Sentry implements for applications to use. The filter not only restricts which
-syscalls can be called, but also checks that arguments to these syscalls are
-within the expected set. Dangerous syscalls like `execve(2)`,
-`open(2)`, and `socket(2)` are prohibited, so an
-attacker isn’t able to execute binaries or acquire new resources on the host.
-
-If there were a vulnerability in gVisor that allowed an attacker to execute code
-inside the Sentry, the attacker would still have extremely limited privileges on
-the host. In fact, a compromised Sentry is much more restricted than a
-non-compromised regular container. For CVE-2020-14386 in particular, the attack
-would be blocked by more than one security layer: non-privileged user, no
-capabilities, and seccomp filters.
-
-Although the surface is drastically reduced, there is still a chance that there
-is a vulnerability in one of the allowed syscalls. That’s why it’s important to
-keep the surface small and carefully consider what syscalls are allowed. You can
-find the full set of allowed syscalls
-[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/).
-
-Another possible attack vector is resources that are present in the Sentry, like
-open file descriptors. The Sentry has file descriptors that an attacker could
-potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC
-endpoint that allows external communication with the Sentry, and a Netstack
-endpoint that connects the sandbox to the network. The Netstack endpoint in
-particular is a concern because it gives direct access to the network. It’s an
-`AF_PACKET` socket that allows arbitrary L2 packets to be written to the
-network. In the normal case, Netstack assembles packets that go out to the
-network, giving the container control over only the payload. But if the Sentry
-is compromised, an attacker can send crafted packets to the network. In many
-ways this is similar to anyone sending random packets over the internet, but
-this is still a place where the host kernel surface exposed is larger than we
-would like it to be.
-
-## Conclusion
-
-Security comes with many tradeoffs that are often hard to make, such as the
-decision to disable raw sockets by default. However, these tradeoffs have served
-us well, and we've found them to have paid off over time. CVE-2020-14386 offers
-great insight into how multiple layers of protection can be effective against
-such an attack.
-
-We cannot guarantee that a container escape will never happen in gVisor, but we
-do our best to make it as hard as we possibly can.
-
-If you have not tried gVisor yet, it’s easier than you think. Just follow the
-steps in the
-[Quick Start](https://gvisor.dev/docs/user_guide/quick_start/docker/) guide.
-<br>
-<br>
-
---------------------------------------------------------------------------------
-
-[^1]: Those packets are eventually handled by the host, as it needs to route
- them to local containers or send them out the NIC. The packet will be
- handled by many switches, routers, proxies, servers, etc. along the way,
- which may be subject to their own vulnerabilities.