1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
|
# gVisor Security Basics - Part 1
This blog is a space for engineers and community members to share perspectives
and deep dives on technology and design within the gVisor project. Though our
logo suggests we're in the business of space exploration (or perhaps fighting
sea monsters), we're actually in the business of sandboxing Linux containers.
When we created gVisor, we had three specific goals in mind; _container-native
security_, _resource efficiency_, and _platform portability_. To put it simply,
gVisor provides _efficient defense-in-depth for containers anywhere_.
This post addresses gVisor's _container-native security_, specifically how
gVisor provides strong isolation between an application and the host OS. Future
posts will address _resource efficiency_ (how gVisor preserves container
benefits like fast starts, smaller snapshots, and less memory overhead than VMs)
and _platform portability_ (run gVisor wherever Linux OCI containers run).
Delivering on each of these goals requires careful security considerations and a
robust design.
## What does "sandbox" mean?
gVisor allows the execution of untrusted containers, preventing them from
adversely affecting the host. This means that the untrusted container is
prevented from attacking or spying on either the host kernel or any other peer
userspace processes on the host.
For example, if you are a cloud container hosting service, running containers
from different customers on the same virtual machine means that compromises
expose customer data. Properly configured, gVisor can provide sufficient
isolation to allow different customers to run containers on the same host. There
are many aspects to the proper configuration, including limiting file and
network access, which we will discuss in future posts.
## The cost of compromise
gVisor was designed around the premise that any security boundary could
potentially be compromised with enough time and resources. We tried to optimize
for a solution that was as costly and time-consuming for an attacker as
possible, at every layer.
Consequently, gVisor was built through a combination of intentional design
principles and specific technology choices that work together to provide the
security isolation needed for running hostile containers on a host. We'll dig
into it in the next section!
# Design Principles
gVisor was designed with some common
[secure design](https://en.wikipedia.org/wiki/Secure_by_design) principles in
mind: Defense-in-Depth, Principle of Least-Privilege, Attack Surface Reduction
and Secure-by-Default[^1].
In general, Design Principles outline good engineering practices, but in the
case of security, they also can be thought of as a set of tactics. In a
real-life castle, there is no single defensive feature. Rather, there are many
in combination: redundant walls, scattered draw bridges, small bottle-neck
entrances, moats, etc.
A simplified version of the design is below
([more detailed version](/docs/))[^2]:
![Figure 1](/assets/images/2019-11-18-security-basics-figure1.png "Simplified design of gVisor.")
In order to discuss design principles, the following components are important to
know:
* runsc - binary that packages the Sentry, platform, and Gofer(s) that run
containers. runsc is the drop-in binary for running gVisor in Docker and
Kubernetes.
* Untrusted Application - container running in the sandbox. Untrusted
application/container are used interchangeably in this article.
* Platform Syscall Switcher - intercepts syscalls from the application and
passes them to the Sentry with no further handling.
* Sentry - The "application kernel" in userspace that serves the untrusted
application. Each application instance has its own Sentry. The Sentry
handles syscalls, routes I/O to gofers, and manages memory and CPU, all in
userspace. The Sentry is allowed to make limited, filtered syscalls to the
host OS.
* Gofer - a process that specifically handles different types of I/O for the
Sentry (usually disk I/O). Gofers are also allowed to make filtered syscalls
to the Host OS.
* Host OS - the actual OS on which gVisor containers are running, always some
flavor of Linux (sorry, Windows/MacOS users).
It is important to emphasize what is being protected from the untrusted
application in this diagram: the host OS and other userspace applications.
In this post, we are only discussing security-related features of gVisor, and
you might ask, "What about performance, compatibility and stability?" We will
cover these considerations in future posts.
## Defense-in-Depth
For gVisor, Defense-in-Depth means each component of the software stack trusts
the other components as little as possible.
It may seem strange that we would want our own software components to distrust
each other. But by limiting the trust between small, discrete components, each
component is forced to defend itself against potentially malicious input. And
when you stack these components on top of each other, you can ensure that
multiple security barriers must be overcome by an attacker.
And this leads us to how Defense-in-Depth is applied to gVisor: no single
vulnerability should compromise the host.
In the "Attacker's Advantage / Defender's Dilemma," the defender must succeed
all the time while the attacker only needs to succeed once. Defense in Depth
inverts this principle: once the attacker successfully compromises any given
software component, they are immediately faced with needing to compromise a
subsequent, distinct layer in order to move laterally or acquire more privilege.
For example, the untrusted container is isolated from the Sentry. The Sentry is
isolated from host I/O operations by serving those requests in separate
processes called Gofers. And both the untrusted container and its associated
Gofers are isolated from the host process that is running the sandbox.
An additional benefit is that this generally leads to more robust and stable
software, forcing interfaces to be strictly defined and tested to ensure all
inputs are properly parsed and bounds checked.
## Least-Privilege
The principle of Least-Privilege implies that each software component has only
the permissions it needs to function, and no more.
Least-Privilege is applied throughout gVisor. Each component and more
importantly, each interface between the components, is designed so that only the
minimum level of permission is required for it to perform its function.
Specifically, the closer you are to the untrusted application, the less
privilege you have.
![Figure 2](/assets/images/2019-11-18-security-basics-figure2.png "runsc components and their privileges.")
This is evident in how runsc (the drop in gVisor binary for Docker/Kubernetes)
constructs the sandbox. The Sentry has the least privilege possible (it can't
even open a file!). Gofers are only allowed file access, so even if it were
compromised, the host network would be unavailable. Only the runsc binary itself
has full access to the host OS, and even runsc's access to the host OS is often
limited through capabilities / chroot / namespacing.
Designing a system with Defense-in-Depth and Least-Privilege in mind encourages
small, separate, single-purpose components, each with very restricted
privileges.
## Attack Surface Reduction
There are no bugs in unwritten code. In other words, gVisor supports a feature
if and only if it is needed to run host Linux containers.
### Host Application/Sentry Interface:
There are a lot of things gVisor does not need to do. For example, it does not
need to support arbitrary device drivers, nor does it need to support video
playback. By not implementing what will not be used, we avoid introducing
potential bugs in our code.
That is not to say gVisor has limited functionality! Quite the opposite, we
analyzed what is actually needed to run Linux containers and today the Sentry
supports 237 syscalls[^3]<sup>,</sup>[^4], along with the range of critical
/proc and /dev files. However, gVisor does not support every syscall in the
Linux kernel. There are about 350 syscalls[^5] within the 5.3.11 version of the
Linux kernel, many of which do not apply to Linux containers that typically host
cloud-like workloads. For example, we don't support old versions of epoll
(epoll_ctl_old, epoll_wait_old), because they are deprecated in Linux and no
supported workloads use them.
Furthermore, any exploited vulnerabilities in the implemented syscalls (or
Sentry code in general) only apply to gaining control of the Sentry. More on
this in a later post.
### Sentry/Host OS Interface:
The Sentry's interactions with the Host OS are restricted in many ways. For
instance, no syscall is "passed-through" from the untrusted application to the
host OS. All syscalls are intercepted and interpreted. In the case where the
Sentry needs to call the Host OS, we severely limit the syscalls that the Sentry
itself is allowed to make to the host kernel[^6].
For example, there are many file-system based attacks, where manipulation of
files or their paths, can lead to compromise of the host[^7]. As a result, the
Sentry does not allow any syscall that creates or opens a file descriptor. All
file descriptors must be donated to the sandbox. By disallowing open or creation
of file descriptors, we eliminate entire categories of these file-based attacks.
This does not affect functionality though. For example, during startup, runsc
will donate FDs the Sentry that allow for mapping STDIN/STDOUT/STDERR to the
sandboxed application. Also the Gofer may donate an FD to the Sentry, allowing
for direct access to some files. And most files will be remotely accessed
through the Gofers, in which case no FDs are donated to the Sentry.
The Sentry itself is only allowed access to specific
[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go).
Without networking, the Sentry needs 53 host syscalls in order to function, and
with networking, it uses an additional 15[^8]. By limiting the whitelist to only
these needed syscalls, we radically reduce the amount of host OS attack surface.
If any attempts are made to call something outside the whitelist, it is
immediately blocked and the sandbox is killed by the Host OS.
### Sentry/Gofer Interface:
The Sentry communicates with the Gofer through a local unix domain socket (UDS)
via a version of the 9P protocol[^9]. The UDS file descriptor is passed to the
sandbox during initialization and all communication between the Sentry and Gofer
happens via 9P. We will go more into how Gofers work in future posts.
### End Result
So, of the 350 syscalls in the Linux kernel, the Sentry needs to implement only
237 of them to support containers. At most, the Sentry only needs to call 68 of
the host Linux syscalls. In other words, with gVisor, applications get the vast
majority (and growing) functionality of Linux containers for only 68 possible
syscalls to the Host OS. 350 syscalls to 68 is attack surface reduction.
![Figure 3](/assets/images/2019-11-18-security-basics-figure3.png "Reduction of Attack Surface of the Syscall Table. Note that the Senty's Syscall Emulation Layer keeps the Containerized Process from ever calling the Host OS.")
## Secure-by-default
The default choice for a user should be safe. If users need to run a less secure
configuration of the sandbox for the sake of performance or application
compatibility, they must make the choice explicitly.
An example of this might be a networking application that is performance
sensitive. Instead of using the safer, Go-based Netstack in the Sentry, the
untrusted container can instead use the host Linux networking stack directly.
However, this means the untrusted container will be directly interacting with
the host, without the safety benefits of the sandbox. It also means that an
attack could directly compromise the host through his path.
These less secure configurations are **not** the default. In fact, the user must
take action to change the configuration and run in a less secure mode.
Additionally, these actions make it very obvious that a less secure
configuration is being used.
This can be as simple as forcing a default runtime flag option to the secure
option. gVisor does this by always using its internal netstack by default.
However, for certain performance sensitive applications, we allow the usage of
the host OS networking stack, but it requires the user to actively set a
flag[^10].
# Technology Choices
Technology choices for gVisor mainly involve things that will give us a security
boundary.
At a higher level, boundaries in software might be describing a great many
things. It may be discussing the boundaries between threads, boundaries between
processes, boundaries between CPU privilege levels, and more.
Security boundaries are interfaces that are designed and built so that entire
classes of bugs/vulnerabilities are eliminated.
For example, the Sentry and Gofers are implemented using Go. Go was chosen for a
number of the features it provided. Go is a fast, statically-typed, compiled
language that has efficient multi-threading support, garbage collection and a
constrained set of "unsafe" operations.
Using these features enabled safe array and pointer handling. This means entire
classes of vulnerabilities were eliminated, such as buffer overflows and
use-after-free.
Another example is our use of very strict syscall switching to ensure that the
Sentry is always the first software component that parses and interprets the
calls being made by the untrusted container. Here is an instance where different
platforms use different solutions, but all of them share this common trait,
whether it is through the use of ptrace "a la PTRACE_ATTACH"[^11] or kvm's
ring0[^12].
Finally, one of the most restrictive choices was to use seccomp, to restrict the
Sentry from being able to open or create a file descriptor on the host. All file
I/O is required to go through Gofers. Preventing the opening or creation of file
descriptions eliminates whole categories of bugs around file permissions
[like this one](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)[^13].
# To be continued - Part 2
In part 2 of this blog post, we will explore gVisor from an attacker's point of
view. We will use it as an opportunity to examine the specific strengths and
weaknesses of each gVisor component.
We will also use it to introduce Google's Vulnerability Reward Program[^14], and
other ways the community can contribute to help make gVisor safe, fast and
stable.
<br>
<br>
--------------------------------------------------------------------------------
[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design)
[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/)
[^3]: [https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64_amd64.go](https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/syscalls.go)
<!-- mdformat off(mdformat formats this into multiple lines) -->
[^4]: Internally that is, it doesn't call to the Host OS to implement them, in fact that is explicitly disallowed, more on that in the future.
<!-- mdformat on -->
[^5]: [https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345](https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345)
[^6]: [https://github.com/google/gvisor/tree/master/runsc/boot/filter](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
[^7]: [https://en.wikipedia.org/wiki/Dirty_COW](https://en.wikipedia.org/wiki/Dirty_COW)
[^8]: [https://github.com/google/gvisor/blob/master/runsc/boot/config.go](https://github.com/google/gvisor/blob/master/runsc/boot/config.go)
<!-- mdformat off(mdformat breaks this url by escaping the parenthesis) -->
[^9]: [https://en.wikipedia.org/wiki/9P_(protocol)](https://en.wikipedia.org/wiki/9P_(protocol))
<!-- mdformat on -->
[^10]: [https://gvisor.dev/docs/user_guide/networking/#network-passthrough](https://gvisor.dev/docs/user_guide/networking/#network-passthrough)
[^11]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390)
[^12]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182)
[^13]: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)
[^14]: [https://www.google.com/about/appsecurity/reward-program/index.html](https://www.google.com/about/appsecurity/reward-program/index.html)
|