1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
|
# gVisor Networking Security
In our
[first blog post](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/),
we covered some secure design principles and how they guided the architecture of
gVisor as a whole. In this post, we will cover how these principles guided the
networking architecture of gVisor, and the tradeoffs involved. In particular, we
will cover how these principles culminated in two networking modes, how they
work, and the properties of each.
## gVisor's security architecture in the context of networking
Linux networking is complicated. The TCP protocol is over 40 years old, and has
been repeatedly extended over the years to keep up with the rapid pace of
network infrastructure improvements, all while maintaining compatibility. On top
of that, Linux networking has a fairly large API surface. Linux supports
[over 150 options](https://github.com/google/gvisor/blob/960f6a975b7e44c0efe8fd38c66b02017c4fe137/pkg/sentry/strace/socket.go#L476-L644)
for the most common socket types alone. In fact, the net subsystem is one of the
largest and fastest growing in Linux at approximately 1.1 million lines of code.
For comparison, that is several times the size of the entire gVisor codebase.
At the same time, networking is increasingly important. The cloud era is
arguably about making everything a network service, and in order to make that
work, the interconnect performance is critical. Adding networking support to
gVisor was difficult, not just due to the inherent complexity, but also because
it has the potential to significantly weaken gVisor's security model.
As outlined in the previous blog post, gVisor's
[secure design principles](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#design-principles)
are:
1. Defense in Depth: each component of the software stack trusts each other
component as little as possible.
1. Least Privilege: each software component has only the permissions it needs
to function, and no more.
1. Attack Surface Reduction: limit the surface area of the host exposed to the
sandbox.
1. Secure by Default: the default choice for a user should be safe.
gVisor manifests these principles as a multi-layered system. An application
running in the sandbox interacts with the Sentry, a userspace kernel, which
mediates all interactions with the Host OS and beyond. The Sentry is written in
pure Go with minimal unsafe code, making it less vulnerable to buffer overflows
and related memory bugs that can lead to a variety of compromises including code
injection. It emulates Linux using only a minimal and audited set of Host OS
syscalls that limit the Host OS's attack surface exposed to the Sentry itself.
The syscall restrictions are enforced by running the Sentry with seccomp
filters, which enforce that the Sentry can only use the expected set of
syscalls. The Sentry runs as an unprivileged user and in namespaces, which,
along with the seccomp filters, ensure that the Sentry is run with the Least
Privilege required.
gVisor's multi-layered design provides Defense in Depth. The Sentry, which does
not trust the application because it may attack the Sentry and try to bypass it,
is the first layer. The sandbox that the Sentry runs in is the second layer. If
the Sentry were compromised, the attacker would still be in a highly restrictive
sandbox which they must also break out of in order to compromise the Host OS.
To enable networking functionality while preserving gVisor's security
properties, we implemented a
[userspace network stack](https://github.com/google/gvisor/tree/master/pkg/tcpip)
in the Sentry, which we creatively named Netstack. Netstack is also written in
Go, not only to avoid unsafe code in the network stack itself, but also to avoid
a complicated and unsafe Foreign Function Interface. Having its own integrated
network stack allows the Sentry to implement networking operations using up to
three Host OS syscalls to read and write packets. These syscalls allow a very
minimal set of operations which are already allowed (either through the same or
a similar syscall). Moreover, because packets typically come from off-host (e.g.
the internet), the Host OS's packet processing code has received a lot of
scrutiny, hopefully resulting in a high degree of hardening.
![Figure 1](/assets/images/2020-04-02-networking-security-figure1.png "Network and gVisor.")
## Writing a network stack
Netstack was written from scratch specifically for gVisor. Because Netstack was
designed and implemented to be modular, flexible and self-contained, there are
now several more projects using Netstack in creative and exciting ways. As we
discussed, a custom network stack has enabled a variety of security-related
goals which would not have been possible any other way. This came at a cost
though. Network stacks are complex and writing a new one comes with many
challenges, mostly related to application compatibility and performance.
Compatibility issues typically come in two forms: missing features, and features
with behavior that differs from Linux (usually due to bugs). Both of these are
inevitable in an implementation of a complex system spanning many quickly
evolving and ambiguous standards. However, we have invested heavily in this
area, and the vast majority of applications have no issues using Netstack. For
example,
[we now support setting 34 different socket options](https://github.com/google/gvisor/blob/815df2959a76e4a19f5882e40402b9bbca9e70be/pkg/sentry/socket/netstack/netstack.go#L830-L1764)
versus
[only 7 in our initial git commit](https://github.com/google/gvisor/blob/d02b74a5dcfed4bfc8f2f8e545bca4d2afabb296/pkg/sentry/socket/epsocket/epsocket.go#L445-L702).
We are continuing to make good progress in this area.
Performance issues typically come from TCP behavior and packet processing speed.
To improve our TCP behavior, we are working on implementing the full set of TCP
RFCs. There are many RFCs which are significant to performance (e.g.
[RACK](https://tools.ietf.org/id/draft-ietf-tcpm-rack-03.html) and
[BBR](https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00))
that we have yet to implement. This mostly affects TCP performance with
non-ideal network conditions (e.g. cross continent connections). Faster packet
processing mostly improves TCP performance when network conditions are very good
(e.g. within a datacenter). Our primary strategy here is to reduce interactions
with the Go runtime, specifically the garbage collector (GC) and scheduler. We
are currently optimizing buffer management to reduce the amount of garbage,
which will lower the GC cost. To reduce scheduler interactions, we are
re-architecting the TCP implementation to use fewer goroutines. Performance
today is good enough for most applications and we are making steady
improvements. For example, since May of 2019, we have improved the Netstack
runsc
[iperf3 download benchmark](https://github.com/google/gvisor/blob/master/benchmarks/suites/network.py)
score by roughly 15% and upload score by around 10,000X. Current numbers are
about 17 Gbps download and about 8 Gbps upload versus about 42 Gbps and 43 Gbps
for native (Linux) respectively.
## An alternative
We also offer an alternative network mode: passthrough. This name can be
misleading as syscalls are never passed through from the app to the Host OS.
Instead, the passthrough mode implements networking in gVisor using the Host
OS's network stack. (This mode is called
[hostinet](https://github.com/google/gvisor/tree/master/pkg/sentry/socket/hostinet)
in the codebase.) Passthrough mode can improve performance for some use cases as
the Host OS's network stack has had an enormous number of person-years poured
into making it highly performant. However, there is a rather large downside to
using passthrough mode: it weakens gVisor's security model by increasing the
Host OS's Attack Surface. This is because using the Host OS's network stack
requires the Sentry to use the Host OS's
[Berkeley socket interface](https://en.wikipedia.org/wiki/Berkeley_sockets). The
Berkeley socket interface is a much larger API surface than the packet interface
that our network stack uses. When passthrough mode is in use, the Sentry is
allowed to use
[15 additional syscalls](https://github.com/google/gvisor/blob/b1576e533223e98ebe4bd1b82b04e3dcda8c4bf1/runsc/boot/filter/config.go#L312-L517).
Further, this set of syscalls includes some that allow the Sentry to create file
descriptors, something that
[we don't normally allow](https://gvisor.dev/blog/2019/11/18/gvisor-security-basics-part-1/#sentry-host-os-interface)
as it opens up classes of file-based attacks.
There are some networking features that we can't implement on top of syscalls
that we feel are safe (most notably those behind
[ioctl](http://man7.org/linux/man-pages/man2/ioctl.2.html)) and therefore are
not supported. Because of this, we actually support fewer networking features in
passthrough mode than we do in Netstack, reducing application compatibility.
That's right: using our networking stack provides better overall application
compatibility than using our passthrough mode.
That said, gVisor with passthrough networking still provides a high level of
isolation. Applications cannot specify host syscall arguments directly, and the
sentry's seccomp policy restricts its syscall use significantly more than a
general purpose seccomp policy.
## Secure by Default
The goal of the Secure by Default principle is to make it easy to securely
sandbox containers. Of course, disabling network access entirely is the most
secure option, but that is not practical for most applications. To make gVisor
Secure by Default, we have made Netstack the default networking mode in gVisor
as we believe that it provides significantly better isolation. For this reason
we strongly caution users from changing the default unless Netstack flat out
won't work for them. The passthrough mode option is still provided, but we want
users to make an informed decision when selecting it.
Another way in which gVisor makes it easy to securely sandbox containers is by
allowing applications to run unmodified, with no special configuration needed.
In order to do this, gVisor needs to support all of the features and syscalls
that applications use. Neither seccomp nor gVisor's passthrough mode can do this
as applications commonly use syscalls which are too dangerous to be included in
a secure policy. Even if this dream isn't fully realized today, gVisor's
architecture with Netstack makes this possible.
## Give Netstack a Try
If you haven't already, try running a workload in gVisor with Netstack. You can
find instructions on how to get started in our
[Quick Start](/docs/user_guide/quick_start/docker/). We want to hear about both
your successes and any issues you encounter. We welcome your contributions,
whether that be verbal feedback or code contributions, via our
[Gitter channel](https://gitter.im/gvisor/community),
[email list](https://groups.google.com/forum/#!forum/gvisor-users),
[issue tracker](https://gvisor.dev/issue/new), and
[Github repository](https://github.com/google/gvisor). Feel free to express
interest in an [open issue](https://gvisor.dev/issue/), or reach out if you
aren't sure where to start.
|