summaryrefslogtreecommitdiffhomepage
path: root/README.md
blob: 3b8ad185725bb43e8b014c0891f0216a1b0fdc3f (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
# gVisor

gVisor is a user-space kernel, written in Go, that implements a substantial
portion of the Linux system surface. It includes an [Open Container Initiative
(OCI)][oci] runtime called `runsc` that provides an isolation boundary between
the application and the host kernel. The `runsc` runtime integrates with Docker
and Kubernetes, making it simple to run sandboxed containers.

gVisor takes a distinct approach to container sandboxing and makes a different
set of technical trade-offs compared to existing sandbox technologies, thus
providing new tools and ideas for the container security landscape.

### Why does gVisor exist?

Containers are not a [**sandbox**][sandbox]. While containers have
revolutionized how we develop, package, and deploy applications, running
untrusted or potentially malicious code without additional isolation is not a
good idea. The efficiency and performance gains from using a single, shared
kernel also mean that container escape is possible with a single vulnerability.

gVisor is a user-space kernel for containers. It limits the host kernel surface
accessible to the application while still giving the application access to all
the features it expects. Unlike most kernels, gVisor does not assume or require
a fixed set of physical resources; instead, it leverages existing host kernel
functionality and runs as a normal user-space process. In other words, gVisor
implements Linux by way of Linux.

gVisor should not be confused with technologies and tools to harden containers
against external threats, provide additional integrity checks, or limit the
scope of access for a service. One should always be careful about what data is
made available to a container.

### How is gVisor different from other container isolation mechanisms?

Two other approaches are commonly taken to provide stronger isolation than
native containers.

![Machine-level virtualization](g3doc/Machine-Virtualization.png "Machine-level
virtualization"){style="display:block;margin:auto"}

**Machine-level virtualization**, such as [KVM][kvm] and [Xen][xen], exposes
virtualized hardware to a guest kernel via a Virtual Machine Monitor (VMM). This
virtualized hardware is generally enlightened (paravirtualized) and additional
mechanisms can be used to improve the visibility between the guest and host
(e.g. balloon drivers, paravirtualized spinlocks). Running containers in
distinct virtual machines can provide great isolation, compatibility and
performance (though nested virtualization may bring challenges in this area),
but for containers it often requires additional proxies and agents, and may
require a larger resource footprint and slower start-up times.

![Rule-based execution](g3doc/Rule-Based-Execution.png "Rule-based execution"){style="display:block;margin:auto"}

**Rule-based execution**, such as [seccomp][seccomp], [SELinux][selinux] and
[AppArmor][apparmor], allows the specification of a fine-grained security policy
for an application or container. These schemes typically rely on hooks
implemented inside the host kernel to enforce the rules. If the surface can be
made small enough (i.e. a sufficiently complete policy defined), then this is
an excellent way to sandbox applications and maintain native performance.
However, in practice it can be extremely difficult (if not impossible) to
reliably define a policy for arbitrary, previously unknown applications,
making this approach challenging to apply universally.

Rule-based execution is often combined with additional layers for
defense-in-depth.

![gVisor](g3doc/Layers.png "gVisor"){style="display:block;margin:auto"}

**gVisor** provides a third isolation mechanism, distinct from those mentioned
above.

gVisor intercepts application system calls and acts as the guest kernel, without
the need for translation through virtualized hardware. gVisor may be thought of
as either a merged guest kernel and VMM, or as seccomp on steroids. This
architecture allows it to provide a flexible resource footprint (i.e. one based
on threads and memory mappings, not fixed guest physical resources) while also
lowering the fixed costs of virtualization. However, this comes at the price of
reduced application compatibility and higher per-system call overhead.

On top of this, gVisor employs rule-based execution to provide defense-in-depth
(details below).

gVisor's approach is similar to [User Mode Linux (UML)][uml], although UML
virtualizes hardware internally and thus provides a fixed resource footprint.

Each of the above approaches may excel in distinct scenarios. For example,
machine-level virtualization will face challenges achieving high density, while
gVisor may provide poor performance for system call heavy workloads.

### Why Go?

gVisor was written in Go in order to avoid security pitfalls that can plague
kernels. With Go, there are strong types, built-in bounds checks, no
uninitialized variables, no use-after-free, no stack overflow, and a built-in
race detector. (The use of Go has its challenges too, and isn't free.)

## Architecture

gVisor intercepts all system calls made by the application, and does the
necessary work to service them. Importantly, gVisor does not simply redirect
application system calls through to the host kernel. Instead, gVisor implements
most kernel primitives (signals, file systems, futexes, pipes, mm, etc.) and has
complete system call handlers built on top of these primitives.

Since gVisor is itself a user-space application, it will make some host system
calls to support its operation, but much like a VMM, it will not allow the
application to directly control the system calls it makes.

### File System Access

![Sentry](g3doc/Sentry-Gofer.png){style="display:block;margin:auto"}

In order to provide defense-in-depth and limit the host system surface, the
gVisor container runtime is normally split into two separate processes. First,
the *Sentry* process includes the kernel and is responsible for executing user
code and handling system calls. Second, file system operations that extend beyond
the sandbox (not internal proc or tmp files, pipes, etc.) are sent to a proxy,
called a *Gofer*, via a 9P connection.

The Gofer acts as a file system proxy by opening host files on behalf of the
application, and passing them to the Sentry process, which has no host file
access itself. Furthermore, the Sentry runs in an empty user namespace, and the
system calls made by gVisor to the host are restricted using seccomp filters in
order to provide defense-in-depth.

### Network Access

The Sentry implements its own network stack (also written in Go) called
[netstack][netstack]. All aspects of the network stack are handled inside the
Sentry — including TCP connection state, control messages, and packet assembly —
keeping it isolated from the host network stack. Data link layer packets are
written directly to the virtual device inside the network namespace setup by
Docker or Kubernetes.

A network passthrough mode is also supported, but comes at the cost of reduced
isolation (see below).

### Platforms

The Sentry requires a *platform* to implement basic context switching and memory
mapping functionality. Today, gVisor supports two platforms:

* The **Ptrace** platform uses SYSEMU functionality to execute user code without
  executing host system calls. This platform can run anywhere that `ptrace`
  works (even VMs without nested virtualization).

* The **KVM** platform allows the Sentry to act as both guest OS and VMM,
  switching back and forth between the two worlds seamlessly. The KVM platform
  can run on bare-metal or on a VM with nested virtualization enabled. While
  there is no virtualized hardware layer -- the sandbox retains a process model
  -- gVisor leverages virtualization extensions available on modern processors
  in order to improve isolation and performance of address space switches.

### Performance

There are several factors influencing performance. The platform choice has the
largest direct impact that varies depending on the specific workload. There is
no best platform: Ptrace works universally, including on VM instances, but
applications may perform at a fraction of their original levels. Beyond the
platform choice, passthrough modes may be useful for improving perfomance at the
cost of some isolation.

## Installation

These instructions will get you up-and-running sandboxed containers with gVisor
and Docker.

### Requirements

* [git][git]
* [Bazel][bazel]
* [Docker version 17.09.0 or greater][docker]

### Getting the source

Clone the gVisor repo:

```
git clone https://gvisor.googlesource.com/gvisor gvisor
cd gvisor
```

### Building

Build and install the `runsc` binary.

It is important to copy this binary to some place that is accessible to all
users, since `runsc` executes itself as user `nobody` to avoid unnecessary
privileges. The `/usr/local/bin` directory is a good choice.

```
bazel build runsc
sudo cp ./bazel-bin/runsc/linux_amd64_pure_stripped/runsc /usr/local/bin
```

### Configuring Docker

Next, configure Docker to use `runsc` by adding a runtime entry to your Docker
configuration (`/etc/docker/daemon.json`). You may have to create this file if
it does not exist. Also, some Docker versions also require you to [specify the
`storage-driver` field][docker-storage-driver].

In the end, the file should look something like:

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
        }
    }
}
```

You must restart the Docker daemon after making changes to this file, typically this is done via:

```
sudo systemctl restart docker
```

Now run your container in `runsc`:

```
docker run --runtime=runsc hello-world
```

Terminal support works too:

```
docker run --runtime=runsc -it ubuntu /bin/bash
```


### Kubernetes Support (Experimental)

gVisor can run sandboxed containers in a Kubernetes cluster with cri-o, although
this is not recommended for production environments yet. Follow [these
instructions][cri-o-k8s] to run [cri-o][cri-o] on a node in a Kubernetes
cluster. Build `runsc` and put it on the node, and set it as the
`runtime_untrusted_workload` in `/etc/crio/crio.conf`.

Any Pod without the `io.kubernetes.cri-o.TrustedSandbox` annotation (or with the
annotation set to false) will be run with `runsc`.

Currently, gVisor only supports Pods with a single container (not counting the
ever-present pause container). Support for multiple containers within a single
Pod is coming soon.

## Advanced Usage

### Testing

The gVisor test suite can be run with Bazel:

```
bazel test ...
```

### Debugging

To enable debug + system call logging, add the `runtimeArgs` below to your
Docker configuration (`/etc/docker/daemon.json`):

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
             "runtimeArgs": [
                "--debug-log-dir=/tmp/runsc",
                "--debug",
                "--strace"
            ]
       }
    }
}
```

You may also want to pass `--log-packets` to troubleshoot network problems. Then
restart the Docker daemon:

```
sudo systemctl restart docker
```

Run your container again, and inspect the files under `/tmp/runsc`. The log file
with name `boot` will contain the strace logs from your application, which can
be useful for identifying missing or broken system calls in gVisor.

### Enabling network passthrough

For high-performance networking applications, you may choose to disable the user
space network stack and instead use the host network stack. Note that this mode
decreases the isolation to the host.

Add the following `runtimeArgs` to your Docker configuration
(`/etc/docker/daemon.json`) and restart the Docker daemon:

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
             "runtimeArgs": [
                "--network=host"
            ]
       }
    }
}
```

### Selecting a different platform

Depending on hardware and performance characteristics, you may choose to use a
different platform. The Ptrace platform is the default, but the KVM platform may
be specified by passing the `--platform` flag to `runsc` in your Docker
configuration (`/etc/docker/daemon.json`):

```
{
    "runtimes": {
        "runsc": {
            "path": "/usr/local/bin/runsc"
             "runtimeArgs": [
                "--platform=kvm"
            ]
       }
    }
}
```

Then restart the Docker daemon.

## FAQ & Known Issues

### What works?

The following applications/images have been tested:

* golang
* httpd
* java8
* jenkins
* mariadb
* memcached
* mongo
* mysql
* node
* php
* prometheus
* python
* redis
* registry
* tomcat
* wordpress

### What doesn't work yet?

The following applications have been tested and may not yet work:

* elasticsearch: Requires unimplemented socket ioctls. See [bug
  #2](https://github.com/google/gvisor/issues/2).
* nginx: Requires `ioctl(FIOASYNC)`, but see workaround in [bug
  #1](https://github.com/google/gvisor/issues/1).
* postgres: Requires SysV shared memory support. See [bug
  #3](https://github.com/google/gvisor/issues/3).

### Will my container work with gVisor?

gVisor implements a large portion of the Linux surface and while we strive to
make it broadly compatible, there are (and always will be) unimplemented
features and bugs. The only real way to know if it will work is to try. If you
find a container that doesn’t work and there is no known issue, please [file a
bug][bug] indicating the full command you used to run the image. Providing the
debug logs is also helpful.

### My container runs fine with *runc* but fails with *runsc*.

If you’re having problems running a container with `runsc` it’s most likely due
to a compatibility issue or a missing feature in gVisor. See **Debugging**,
above.

### I can’t see a file copied with `docker cp` or `kubectl cp`.

For performance reasons, gVisor caches directory contents, and therefore it may
not realize a new file was copied to a given directory. To invalidate the cache
and force a refresh, create a file under the directory in question and list the
contents again.

This bug is tracked in [bug #4](https://github.com/google/gvisor/issues/4).

## Technical details

We plan to release a full paper with technical details and will include it
here when available.

## Community

Join the [gvisor-discuss mailing list][gvisor-discuss-list] to discuss all things
gVisor.

Sensitive security-related questions and comments can be sent to the private
[gvisor-security mailing list][gvisor-security-list].

## Contributing

See [Contributing.md](CONTRIBUTING.md).

## Disclaimer

This is not an official Google product (experimental or otherwise), it is just
code that happens to be owned by Google.

[apparmor]: https://wiki.ubuntu.com/AppArmor
[bazel]: https://bazel.build
[bug]: https://github.com/google/gvisor/issues
[cri-o]: https://github.com/kubernetes-incubator/cri-o
[cri-o-k8s]: https://github.com/kubernetes-incubator/cri-o/blob/master/kubernetes.md
[docker]: https://www.docker.com
[docker-storage-driver]: https://docs.docker.com/engine/reference/commandline/dockerd/#daemon-storage-driver
[git]: https://git-scm.com
[gvisor-discuss-list]: https://groups.google.com/forum/#!forum/gvisor-users
[gvisor-security-list]: https://groups.google.com/forum/#!forum/gvisor-security
[kvm]: https://www.linux-kvm.org
[netstack]: https://github.com/google/netstack
[oci]: https://www.opencontainers.org
[sandbox]: https://en.wikipedia.org/wiki/Sandbox_(computer_security)
[seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
[selinux]: https://selinuxproject.org
[uml]: http://user-mode-linux.sourceforge.net/
[xen]: https://www.xenproject.org