From 93fd164fa24e1f70e4c69fc2edfb0d4961a6683b Mon Sep 17 00:00:00 2001 From: Fabricio Voznika Date: Fri, 18 Sep 2020 10:25:52 -0700 Subject: Add "Containing a Real Vulnerability" blog post PiperOrigin-RevId: 332477119 --- images/jekyll/build.sh | 3 +- website/_config.yml | 3 + ...-18-containing-a-real-vulnerability-figure1.png | Bin 0 -> 48602 bytes website/blog/2019-11-18-security-basics.md | 4 +- .../2020-09-18-containing-a-real-vulnerability.md | 224 +++++++++++++++++++++ website/blog/BUILD | 10 + 6 files changed, 242 insertions(+), 2 deletions(-) create mode 100644 website/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png create mode 100644 website/blog/2020-09-18-containing-a-real-vulnerability.md diff --git a/images/jekyll/build.sh b/images/jekyll/build.sh index bfceb2781..010972ea6 100755 --- a/images/jekyll/build.sh +++ b/images/jekyll/build.sh @@ -18,4 +18,5 @@ set -euxo pipefail # Generate the syntax highlighting css file. /usr/gem/bin/rougify style github >/input/_sass/syntax.css -/usr/gem/bin/jekyll build -t -s /input -d /output +# Build website including pages irrespective of date. +/usr/gem/bin/jekyll build --future -t -s /input -d /output diff --git a/website/_config.yml b/website/_config.yml index b08602970..20fbb3d2d 100644 --- a/website/_config.yml +++ b/website/_config.yml @@ -34,3 +34,6 @@ authors: igudger: name: Ian Gudger email: igudger@google.com + fvoznika: + name: Fabricio Voznika + email: fvoznika@google.com diff --git a/website/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png b/website/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png new file mode 100644 index 000000000..c750f0851 Binary files /dev/null and b/website/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png differ diff --git a/website/blog/2019-11-18-security-basics.md b/website/blog/2019-11-18-security-basics.md index 2256ee9d5..b6cf57a77 100644 --- a/website/blog/2019-11-18-security-basics.md +++ b/website/blog/2019-11-18-security-basics.md @@ -279,8 +279,10 @@ weaknesses of each gVisor component. We will also use it to introduce Google's Vulnerability Reward Program[^14], and other ways the community can contribute to help make gVisor safe, fast and stable. +
+
-## Notes +-------------------------------------------------------------------------------- [^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design) [^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/) diff --git a/website/blog/2020-09-18-containing-a-real-vulnerability.md b/website/blog/2020-09-18-containing-a-real-vulnerability.md new file mode 100644 index 000000000..b71ef63d9 --- /dev/null +++ b/website/blog/2020-09-18-containing-a-real-vulnerability.md @@ -0,0 +1,224 @@ +# Containing a Real Vulnerability + +In the previous two posts we talked about gVisor's +[security design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/) +as well as how those are applied in the +[context of networking](https://gvisor.dev/blog/2020/04/02/gvisor-networking-security/). +Recently, a new container escape vulnerability +([CVE-2020-14386](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14386)) +was announced that ties these topics well together. gVisor is +[not vulnerable](https://seclists.org/oss-sec/2020/q3/168) to this specific +issue, but it provides an interesting case study to continue our exploration of +gVisor's security. While gVisor is not immune to vulnerabilities, +[we take several steps](https://gvisor.dev/security/) to minimize the impact and +remediate if a vulnerability is found. + +## Escaping the Container + +First, let’s describe how the discovered vulnerability works. There are numerous +ways one can send and receive bytes over the network with Linux. One of the most +performant ways is to use a ring buffer, which is a memory region shared by the +application and the kernel. These rings are created by calling +[setsockopt(2)](https://man7.org/linux/man-pages/man2/setsockopt.2.html) with +[`PACKET_RX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for +receiving and +[`PACKET_TX_RING`](https://man7.org/linux/man-pages/man7/packet.7.html) for +sending packets. + +The vulnerability is in the code that reads packets when `PACKET_RX_RING` is +enabled. There is another option +([`PACKET_RESERVE`](https://man7.org/linux/man-pages/man7/packet.7.html)) that +asks the kernel to leave some space in the ring buffer before each packet for +anything the application needs, e.g. control structures. When a packet is +received, the kernel calculates where to copy the packet to, taking the amount +reserved before each packet into consideration. If the amount reserved is large, +the kernel performed an incorrect calculation which could cause an overflow +leading to an out-of-bounds write of up to 10 bytes, controlled by the attacker. +The data in the write is easily controlled using the loopback to send a crafted +packet and receiving it using a `PACKET_RX_RING` with a carefully selected +`PACKET_RESERVE` size. + +```c +static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pt, struct net_device *orig_dev) +{ +// ... + if (sk->sk_type == SOCK_DGRAM) { + macoff = netoff = TPACKET_ALIGN(po->tp_hdrlen) + 16 + + po->tp_reserve; + } else { + unsigned int maclen = skb_network_offset(skb); + // tp_reserve is unsigned int, netoff is unsigned short. Addition can overflow netoff + netoff = TPACKET_ALIGN(po->tp_hdrlen + + (maclen < 16 ? 16 : maclen)) + + po->tp_reserve; + if (po->has_vnet_hdr) { + netoff += sizeof(struct virtio_net_hdr); + do_vnet = true; + } + // Attacker controls netoff and can make macoff be smaller than sizeof(struct virtio_net_hdr) + macoff = netoff - maclen; + } +// ... + // "macoff - sizeof(struct virtio_net_hdr)" can be negative, resulting in a pointer before h.raw + if (do_vnet && + virtio_net_hdr_from_skb(skb, h.raw + macoff - + sizeof(struct virtio_net_hdr), + vio_le(), true, 0)) { +// ... +``` + +The [`CAP_NET_RAW`](https://man7.org/linux/man-pages/man7/capabilities.7.html) +capability is required to create the socket above. However, in order to support +common debugging tools like `ping` and `tcpdump`, Docker containers, including +those created for Kubernetes, are given `CAP_NET_RAW` by default and thus may be +able to trigger this vulnerability to elevate privileges and escape the +container. + +Next, we are going to explore why this vulnerability doesn’t work in gVisor, and +how gVisor could prevent the escape even if a similar vulnerability existed +inside gVisor’s kernel. + +## Default Protections + +gVisor does not implement `PACKET_RX_RING`, but **does** support raw sockets +which are required for `PACKET_RX_RING`. Raw sockets are a controversial feature +to support in a sandbox environment. While it allows great customizations for +essential tools like `ping`, it may allow packets to be written to the network +without any validation. In general, allowing an untrusted application to write +crafted packets to the network is a questionable idea and a historical source of +vulnerabilities. With that in mind, if `CAP_NET_RAW` is enabled by default, it +would not be _secure by default_ to run untrusted applications. + +After multiple discussions when raw sockets were first implemented, we decided +to disable raw sockets by default, **even if `CAP_NET_RAW` is given to the +application**. Instead, enabling raw sockets in gVisor requires the admin to set +`--net-raw` flag to runsc when configuring the runtime, in addition to requiring +the `CAP_NET_RAW` capability in the application. It comes at the expense that +some tools may not work out of the box, but as part of our +[secure-by-default](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#secure-by-default) +principle, we felt that it was important for the “less secure” configuration to +be explicit. + +Since this bug was due to an overflow in the specific Linux implementation of +the packet ring, gVisor's raw socket implementation is not affected. However, if +there were a vulnerability in gVisor, containers would not be allowed to exploit +it by default. + +As an alternative way to implement this same constraint, Kubernetes allows +[admission controllers](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/) +to be configured to customize requests. Cloud providers can use this to +implement more stringent policies. For example, GKE implements an admission +controller for gVisor that +[removes `CAP_NET_RAW` from gVisor pods](https://cloud.google.com/kubernetes-engine/docs/concepts/sandbox-pods#capabilities) +unless it has been explicitly set in the pod spec. + +## Isolated Kernel + +gVisor has its own application kernel, called the Sentry, that is distinct from +the host kernel. Just like what you would expect from a kernel, gVisor has a +memory management subsystem, virtual file system, and a full network stack. The +host network is only used as a transport to carry packets in and out the +sandbox[^1]. The loopback interface which is used in the exploit stays +completely inside the sandbox, never reaching the host. + +Therefore, even if the Sentry was vulnerable to the attack, there would be two +factors that would prevent a container escape from happening. First, the +vulnerability would be limited to the Sentry, and the attacker would compromise +only the application kernel, bound by a restricted set of +[seccomp](https://en.wikipedia.org/wiki/Seccomp) filters, discussed more in +depth below. Second, the Sentry is a distinct implementation of the API, written +in Go, which provides bounds checking that would have likely prevented access +past the bounds of the shared region (e.g. see +[aio](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/syscalls/linux/vfs2/aio.go;l=210;drc=a11061d78a58ed75b10606d1a770b035ed944b66?q=file:aio&ss=gvisor%2Fgvisor) +or +[kcov](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/kernel/kcov.go;l=272?q=file:kcov&ss=gvisor%2Fgvisor), +which have similar shared regions). + +Here, Kubernetes warrants slightly more explanation. gVisor makes pods the unit +of isolation and a pod can run multiple containers. In other words, each pod is +a gVisor instance, and each container is a set of processes running inside +gVisor, isolated via Sentry-internal namespaces like regular containers inside a +pod. If there were a vulnerability in gVisor, the privilege escalation would +allow a container inside the pod to break out to other **containers inside the +same pod**, but the container still **cannot break out of the pod**. + +## Defense in Depth + +gVisor follows a +[common security principle used at Google](https://cloud.google.com/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf) +that the system should have two layers of protection, and those layers should +require different compromises to be broken. We apply this principle by assuming +that the Sentry (first layer of defense) +[will be compromised and should not be trusted](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#defense-in-depth). +In order to protect the host kernel from a compromised Sentry, we wrap it around +many security and isolations features to ensure only the minimal set of +functionality from the host kernel is exposed. + +![Figure 1](/assets/images/2020-09-18-containing-a-real-vulnerability-figure1.png "Protection layers.") + +First, the sandbox runs inside a cgroup that can limit and throttle host +resources being used. Second, the sandbox joins empty namespaces, including user +and mount, to further isolate from the host. Next, it changes the process root +to a read-only directory that contains only `/proc` and nothing else. Then, it +executes with the unprivileged user/group +[`nobody`](https://en.wikipedia.org/wiki/Nobody_\(username\)) with all +capabilities stripped. Last and most importantly, a seccomp filter is added to +tightly restrict what parts of the Linux syscall surface that gVisor is allowed +to access. The allowed host surface is a far smaller set of syscalls than the +Sentry implements for applications to use. Not only restricting the syscall +being called, but also checking that arguments to these syscalls are within the +expected set. Dangerous syscalls like execve(2), +open(2), and socket(2) are prohibited, thus an +attacker isn’t able to execute binaries or acquire new resources on the host. + +if there were a vulnerability in gVisor that allowed an attacker to execute code +inside the Sentry, the attacker still has extremely limited privileges on the +host. In fact, a compromised Sentry is much more restricted than a +non-compromised regular container. For CVE-2020-14386 in particular, the attack +would be blocked by more than one security layer: non-privileged user, no +capability, and seccomp filters. + +Although the surface is drastically reduced, there is still a chance that there +is a vulnerability in one of the allowed syscalls. That’s why it’s important to +keep the surface small and carefully consider what syscalls are allowed. You can +find the full set of allowed syscalls +[here](https://cs.opensource.google/gvisor/gvisor/+/master:runsc/boot/filter/). + +Another possible attack vector is resources that are present in the Sentry, like +open file descriptors. The Sentry has file descriptors that an attacker could +potentially use, such as log files, platform files (e.g. `/dev/kvm`), an RPC +endpoint that allows external communication with the Sentry, and a Netstack +endpoint that connects the sandbox to the network. The Netstack endpoint in +particular is a concern because it gives direct access to the network. It’s an +`AF_PACKET` socket that allows arbitrary L2 packets to be written to the +network. In the normal case, Netstack assembles packets that go out the network, +giving the container control over only the payload. But if the Sentry is +compromised, an attacker can craft packets to the network. In many ways this is +similar to anyone sending random packets over the internet, but still this is a +place where the host kernel surface exposed is larger than we would like it to +be. + +## Conclusion + +Security comes with many tradeoffs that are often hard to make, such as the +decision to disable raw sockets by default. However, these tradeoffs have served +us well, and we've found them to have paid off over time. CVE-2020-14386 offers +great insight into how multiple layers of protection can be effective against +such an attack. + +We cannot guarantee that a container escape will never happen in gVisor, but we +do our best to make it as hard as we possibly can. + +If you have not tried gVisor yet, it’s easier than you think. Just follow the +steps in the +[Quick Start](https://gvisor.dev/docs/user_guide/quick_start/docker/) guide. +
+
+ +-------------------------------------------------------------------------------- + +[^1]: Those packets are eventually handled by the host, as it needs to route + them to local containers or send them out the NIC. The packet will be + handled by many switches, routers, proxies, servers, etc. along the way, + which may be subject to their own vulnerabilities. diff --git a/website/blog/BUILD b/website/blog/BUILD index 01c1f5a6e..865e403da 100644 --- a/website/blog/BUILD +++ b/website/blog/BUILD @@ -28,6 +28,16 @@ doc( permalink = "/blog/2020/04/02/gvisor-networking-security/", ) +doc( + name = "containing_a_real_vulnerability", + src = "2020-09-18-containing-a-real-vulnerability.md", + authors = [ + "fvoznika", + ], + layout = "post", + permalink = "/blog/2020/09/18/containing-a-real-vulnerability/", +) + docs( name = "posts", deps = [ -- cgit v1.2.3