Diffstat (limited to 'website/blog/2020-09-18-containing-a-real-vulnerability.md')
-rw-r--r-- | website/blog/2020-09-18-containing-a-real-vulnerability.md | 224 |
1 files changed, 0 insertions, 224 deletions
diff --git a/website/blog/2020-09-18-containing-a-real-vulnerability.md b/website/blog/2020-09-18-containing-a-real-vulnerability.md deleted file mode 100644 index b71ef63d9..000000000 --- a/website/blog/2020-09-18-containing-a-real-vulnerability.md +++ /dev/null @@ -1,224 +0,0 @@ -# Containing a Real Vulnerability - -In the previous two posts we talked about gVisor's -[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/) -as well as how those are applied in the -[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/). -Recently, a new container escape vulnerability -([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386)) -was announced that ties these topics well together. gVisor is -[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific -issue, but it provides an interesting case study to continue our exploration of -gVisor's security. While gVisor is not immune to vulnerabilities, -[we take several steps](https://gvisor.dev/security/) to minimize the impact and -remediate if a vulnerability is found. - -## Escaping the Container - -First, let’s describe how the discovered vulnerability works. There are numerous -ways one can send and receive bytes over the network with Linux. One of the most -performant ways is to use a ring buffer, which is a memory region shared by the -application and the kernel. These rings are created by calling -[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with -[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for -receiving and -[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for -sending packets. - -The vulnerability is in the code that reads packets when `PACKET_RX_RING` is -enabled. 
There is another option -([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that -asks the kernel to leave some space in the ring buffer before each packet for -anything the application needs, e.g. control structures. When a packet is -received, the kernel calculates where to copy the packet to, taking the amount -reserved before each packet into consideration. If the amount reserved is large, -the kernel performed an incorrect calculation which could cause an overflow -leading to an out-of-bounds write of up to 10 bytes, controlled by the attacker. -The data in the write is easily controlled using the loopback to send a crafted -packet and receiving it using a `PACKET_RX_RING` with a carefully selected -`PACKET_RESERVE` size. - -```c -static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, - struct packet_type *pt, struct net_device *orig_dev) -{ -// ... - if (sk->sk_type == SOCK_DGRAM) { - macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 + - po->tp_reserve; - } else { - unsigned int maclen = skb_network_offset(skb); - // tp_reserve is unsigned int, netoff is unsigned short. Addition can overflow netoff - netoff = TPACKET_ALIGN(po->tp_hdrlen + - (maclen < 16 ? 16 : maclen)) + - po->tp_reserve; - if (po->has_vnet_hdr) { - netoff += sizeof(struct virtio_net_hdr); - do_vnet = true; - } - // Attacker controls netoff and can make macoff be smaller than sizeof(struct virtio_net_hdr) - macoff = netoff - maclen; - } -// ... - // "macoff - sizeof(struct virtio_net_hdr)" can be negative, resulting in a pointer before h.raw - if (do_vnet && - virtio_net_hdr_from_skb(skb, h.raw + macoff - - sizeof(struct virtio_net_hdr), - vio_le(), true, 0)) { -// ... -``` - -The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html) -capability is required to create the socket above. 
However, in order to support -common debugging tools like `ping` and `tcpdump`, Docker containers, including -those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be -able to trigger this vulnerability to elevate privileges and escape the -container. - -Next, we are going to explore why this vulnerability doesn’t work in gVisor, and -how gVisor could prevent the escape even if a similar vulnerability existed -inside gVisor’s kernel. - -## Default Protections - -gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets, -which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature -to support in a sandbox environment. While they allow great customization for -essential tools like `ping`, they may allow packets to be written to the network -without any validation. In general, allowing an untrusted application to write -crafted packets to the network is a questionable idea and a historical source of -vulnerabilities. With that in mind, if `CAP_NET_RAW` is enabled by default, it -would not be _secure by default_ to run untrusted applications. - -After multiple discussions when raw sockets were first implemented, we decided -to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the -application**. Instead, enabling raw sockets in gVisor requires the admin to set -the `--net-raw` flag on runsc when configuring the runtime, in addition to requiring -the `CAP_NET_RAW` capability in the application. This comes at the cost that -some tools may not work out of the box, but as part of our -[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default) -principle, we felt that it was important for the “less secure” configuration to -be explicit. - -Since this bug was due to an overflow in the specific Linux implementation of -the packet ring, gVisor's raw socket implementation is not affected. 
However, if -there were a vulnerability in gVisor, containers would not be allowed to exploit -it by default. - -As an alternative way to implement this same constraint, Kubernetes allows -[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) -to be configured to customize requests. Cloud providers can use this to -implement more stringent policies. For example, GKE implements an admission -controller for gVisor that -[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities) -unless it has been explicitly set in the pod spec. - -## Isolated Kernel - -gVisor has its own application kernel, called the Sentry, that is distinct from -the host kernel. Just like what you would expect from a kernel, gVisor has a -memory management subsystem, virtual file system, and a full network stack. The -host network is only used as a transport to carry packets in and out of the -sandbox[^1]. The loopback interface which is used in the exploit stays -completely inside the sandbox, never reaching the host. - -Therefore, even if the Sentry were vulnerable to the attack, there would be two -factors that would prevent a container escape from happening. First, the -vulnerability would be limited to the Sentry, and the attacker would compromise -only the application kernel, bound by a restricted set of -[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in -depth below. Second, the Sentry is a distinct implementation of the API, written -in Go, which provides bounds checking that would likely have prevented access -past the bounds of the shared region (e.g. 
see -[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor) -or -[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor), -which have similar shared regions). - -Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit -of isolation and a pod can run multiple containers. In other words, each pod is -a gVisor instance, and each container is a set of processes running inside -gVisor, isolated via Sentry-internal namespaces like regular containers inside a -pod. If there were a vulnerability in gVisor, the privilege escalation would -allow a container inside the pod to break out to other **containers inside the -same pod**, but the container still **cannot break out of the pod**. - -## Defense in Depth - -gVisor follows a -[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf) -that the system should have two layers of protection, and those layers should -require different compromises to be broken. We apply this principle by assuming -that the Sentry (first layer of defense) -[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth). -In order to protect the host kernel from a compromised Sentry, we wrap it in -many security and isolation features to ensure only the minimal set of -functionality from the host kernel is exposed. - -![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.") - -First, the sandbox runs inside a cgroup that can limit and throttle host -resources being used. Second, the sandbox joins empty namespaces, including user -and mount, to further isolate from the host. 
Next, it changes the process root -to a read-only directory that contains only `/proc` and nothing else. Then, it -executes with the unprivileged user/group -[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all -capabilities stripped. Last and most importantly, a seccomp filter is added to -tightly restrict which parts of the Linux syscall surface gVisor is allowed -to access. The allowed host surface is a far smaller set of syscalls than the -Sentry implements for applications to use. The filter not only restricts which -syscalls can be called, but also checks that arguments to these syscalls are within the -expected set. Dangerous syscalls like <code>execve(2)</code>, -<code>open(2)</code>, and <code>socket(2)</code> are prohibited, thus an -attacker isn’t able to execute binaries or acquire new resources on the host. - -If there were a vulnerability in gVisor that allowed an attacker to execute code -inside the Sentry, the attacker would still have extremely limited privileges on the -host. In fact, a compromised Sentry is much more restricted than a -non-compromised regular container. For CVE-2020-14386 in particular, the attack -would be blocked by more than one security layer: non-privileged user, no -capability, and seccomp filters. - -Although the surface is drastically reduced, there is still a chance that there -is a vulnerability in one of the allowed syscalls. That’s why it’s important to -keep the surface small and carefully consider what syscalls are allowed. You can -find the full set of allowed syscalls -[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/). - -Another possible attack vector is resources that are present in the Sentry, like -open file descriptors. The Sentry has file descriptors that an attacker could -potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC -endpoint that allows external communication with the Sentry, and a Netstack -endpoint that connects the sandbox to the network. 
The Netstack endpoint in -particular is a concern because it gives direct access to the network. It’s an -`AF_PACKET` socket that allows arbitrary L2 packets to be written to the -network. In the normal case, Netstack assembles packets that go out to the network, -giving the container control over only the payload. But if the Sentry is -compromised, an attacker can craft arbitrary packets and write them to the network. In many ways this is -similar to anyone sending random packets over the internet, but still this is a -place where the host kernel surface exposed is larger than we would like it to -be. - -## Conclusion - -Security comes with many tradeoffs that are often hard to make, such as the -decision to disable raw sockets by default. However, these tradeoffs have served -us well, and we've found them to have paid off over time. CVE-2020-14386 offers -great insight into how multiple layers of protection can be effective against -such an attack. - -We cannot guarantee that a container escape will never happen in gVisor, but we -do our best to make it as hard as we possibly can. - -If you have not tried gVisor yet, it’s easier than you think. Just follow the -steps in the -[Quick Start](https://gvisor.dev/docs/user_guide/quick_start/docker/) guide. -<br> -<br> - --------------------------------------------------------------------------------- - -[^1]: Those packets are eventually handled by the host, as it needs to route - them to local containers or send them out the NIC. The packet will be - handled by many switches, routers, proxies, servers, etc. along the way, - which may be subject to their own vulnerabilities.
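
As a supplement to the `tpacket_rcv` excerpt in the "Escaping the Container" section of the deleted post, the offset arithmetic can be sketched in isolation. This is a simplified model and not kernel code: `compute_offsets`, `struct offsets`, and the sample values are hypothetical names and numbers chosen for illustration, the virtio header path is assumed to be enabled, and the alignment step is assumed to have been applied already. It only shows how a 16-bit `netoff` wraps around when a large `tp_reserve` is added, which leaves `macoff` smaller than the 10-byte virtio header.

```c
#include <assert.h>
#include <stdint.h>

/* sizeof(struct virtio_net_hdr) on Linux; matches the 10-byte write in the text. */
#define VNET_HDR_LEN 10u

/* Hypothetical result type for this sketch; not a kernel structure. */
struct offsets {
    uint16_t netoff;   /* 16 bits wide, like the kernel's unsigned short */
    uint16_t macoff;
    long write_off;    /* offset from h.raw where the vnet header would be written */
};

/*
 * Simplified model of the non-SOCK_DGRAM branch with has_vnet_hdr set.
 * hdrlen_aligned stands in for TPACKET_ALIGN(tp_hdrlen + max(maclen, 16)).
 */
static struct offsets compute_offsets(uint32_t hdrlen_aligned, uint32_t maclen,
                                      uint32_t tp_reserve)
{
    struct offsets o;

    /* The aligned header length always covers the MAC header. */
    assert(maclen <= hdrlen_aligned);

    /* The 32-bit sum is truncated to 16 bits: this is the overflow. */
    o.netoff = (uint16_t)(hdrlen_aligned + tp_reserve);
    o.netoff = (uint16_t)(o.netoff + VNET_HDR_LEN);
    o.macoff = (uint16_t)(o.netoff - maclen);

    /* A negative value means the write starts before the buffer. */
    o.write_off = (long)o.macoff - (long)VNET_HDR_LEN;
    return o;
}
```

With the illustrative inputs `compute_offsets(48, 14, 65492)`, the sum 65540 wraps to 4, `netoff` becomes 14, `macoff` becomes 0, and `write_off` is -10, i.e. the header would be written 10 bytes before the start of the frame. With `tp_reserve` set to 0 the same call gives a `write_off` of 34, which stays inside the buffer.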