-rw-r--r-- | CONTRIBUTING.md | 74
-rw-r--r-- | README.md | 119
-rw-r--r-- | pkg/sentry/fs/README.md | 8
-rw-r--r-- | pkg/sentry/fs/proc/README.md | 5
-rw-r--r-- | pkg/sentry/kernel/README.md | 80
-rw-r--r-- | pkg/sentry/mm/README.md | 161
-rw-r--r-- | pkg/sentry/usermem/README.md | 42
7 files changed, 252 insertions, 237 deletions
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index fa607113c..7ad19fb02 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -7,15 +7,16 @@ Before we can use your code, you must sign the
 online. The CLA is necessary mainly because you own the copyright to your
 changes, even after your contribution becomes part of our codebase, so we need
 your permission to use and distribute your code. We also need to be sure of
-various other things—for instance that you'll tell us if you know that your
-code infringes on other people's patents. You don't have to sign the CLA until
-after you've submitted your code for review and a member has approved it, but
-you must do it before we can put your code into our codebase. Before you start
-working on a larger contribution, you should get in touch with us first through
-the issue tracker with your idea so that we can help out and possibly guide you.
+various other things—for instance that you'll tell us if you know that your code
+infringes on other people's patents. You don't have to sign the CLA until after
+you've submitted your code for review and a member has approved it, but you must
+do it before we can put your code into our codebase. Before you start working on
+a larger contribution, you should get in touch with us first through the issue
+tracker with your idea so that we can help out and possibly guide you.
 Coordinating up front makes it much easier to avoid frustration later on.

 ### Coding Guidelines
+
 All code should conform to the [Go style guidelines][gostyle].

 As a secure runtime, we need to maintain the safety of all of code included in
@@ -25,34 +26,41 @@ Definitions for the rules below:

 `core`:

- * `//pkg/sentry/...`
- * Transitive dependencies in `//pkg/...`
+* `//pkg/sentry/...`
+* Transitive dependencies in `//pkg/...`

 `runsc`:

- * `//runsc/...`
+* `//runsc/...`

 Rules:

- * No cgo in `core` or `runsc`. The final binary must be a statically-linked
+* No cgo in `core` or `runsc`. The final binary must be a statically-linked
 pure Go binary.

- * Any files importing "unsafe" must have a name ending in `_unsafe.go`.
+* Any files importing "unsafe" must have a name ending in `_unsafe.go`.
+
+* `core` may only depend on the following packages:
+
+ * Itself.
+ * Go standard library.
+ * Except (transitively) package "net" (this will result in a non-cgo
+ binary). Use `//pkg/unet` instead.
+ * `@org_golang_x_sys//unix:go_default_library` (Go import
+ `golang.org/x/sys/unix`).
+ * Generated Go protobuf packages.
+ * `@com_github_golang_protobuf//proto:go_default_library` (Go import
+ `github.com/golang/protobuf/proto`).
+ * `@com_github_golang_protobuf//ptypes:go_default_library` (Go import
+ `github.com/golang/protobuf/ptypes`).

- * `core` may only depend on the following packages:
- * Itself.
- * Go standard library.
- * Except (transitively) package "net" (this will result in a non-cgo
- binary). Use `//pkg/unet` instead.
- * `@org_golang_x_sys//unix:go_default_library` (Go import `golang.org/x/sys/unix`).
- * Generated Go protobuf packages.
- * `@com_github_golang_protobuf//proto:go_default_library` (Go import `github.com/golang/protobuf/proto`).
- * `@com_github_golang_protobuf//ptypes:go_default_library` (Go import `github.com/golang/protobuf/ptypes`).
+* `runsc` may only depend on the following packages:

- * `runsc` may only depend on the following packages:
- * All packages allowed for `core`.
- * `@com_github_google_subcommands//:go_default_library` (Go import `github.com/google/subcommands`).
- * `@com_github_opencontainers_runtime_spec//specs_go:go_default_library` (Go import `github.com/opencontainers/runtime-spec/specs_go`).
+ * All packages allowed for `core`.
+ * `@com_github_google_subcommands//:go_default_library` (Go import
+ `github.com/google/subcommands`).
+ * `@com_github_opencontainers_runtime_spec//specs_go:go_default_library`
+ (Go import `github.com/opencontainers/runtime-spec/specs_go`).

 ### Code reviews

@@ -66,8 +74,8 @@ To submit a patch, first clone the canonical repository.
 git clone https://gvisor.googlesource.com/gvisor
 ```

-From within the cloned directory, install the commit hooks (optional, but if
-you don't you will need to generate Change-Ids manually in your commits).
+From within the cloned directory, install the commit hooks (optional, but if you
+don't you will need to generate Change-Ids manually in your commits).

 ```
 curl -Lo `git rev-parse --git-dir`/hooks/commit-msg https://gerrit-review.googlesource.com/tools/hooks/commit-msg
@@ -79,8 +87,8 @@ changes, remember to organize commits logically. Changes are not reviewed per
 branch (as with a pull request), they are reviewed per commit.

 Before posting a new patch, you will need to generate an appropriate
-authentication cookie. Visit the [repository][repo] and click the
-"Generate Password" link at the top of the page for instructions.
+authentication cookie. Visit the [repository][repo] and click the "Generate
+Password" link at the top of the page for instructions.

 To post a patch for review, push to a special "for" reference.

@@ -90,17 +98,17 @@ git push origin HEAD:refs/for/master

 A change link will be generated for the commit, and a team member will review
 your change request, provide feedback (and submit when appropriate). To address
-feedback, you may be required to amend your commit and repush (don't change
-the Commit-Id in the commit message). This will generate a new version of
-the change.
+feedback, you may be required to amend your commit and repush (don't change the
+Commit-Id in the commit message). This will generate a new version of the
+change.

 When approved, the change will be submitted by a team member and automatically
 merged into the repository.

 ### The small print

-Contributions made by corporations are covered by a different agreement than
-the one above, the
+Contributions made by corporations are covered by a different agreement than the
+one above, the
 [Software Grant and Corporate Contributor License Agreement][gccla].

 [gcla]: https://cla.developers.google.com/about/google-individual
diff --git a/README.md b/README.md
--- a/README.md
+++ b/README.md
@@ -1,10 +1,11 @@
 # gVisor

 gVisor is a user-space kernel, written in Go, that implements a substantial
-portion of the Linux system surface. It includes an [Open Container Initiative
-(OCI)][oci] runtime called `runsc` that provides an isolation boundary between
-the application and the host kernel. The `runsc` runtime integrates with Docker
-and Kubernetes, making it simple to run sandboxed containers.
+portion of the Linux system surface. It includes an
+[Open Container Initiative (OCI)][oci] runtime called `runsc` that provides an
+isolation boundary between the application and the host kernel. The `runsc`
+runtime integrates with Docker and Kubernetes, making it simple to run sandboxed
+containers.

 gVisor takes a distinct approach to container sandboxing and makes a different
 set of technical trade-offs compared to existing sandbox technologies, thus
@@ -51,11 +52,11 @@ require a larger resource footprint and slower start-up times.
 [AppArmor][apparmor], allows the specification of a fine-grained security policy
 for an application or container. These schemes typically rely on hooks
 implemented inside the host kernel to enforce the rules. If the surface can be
-made small enough (i.e. a sufficiently complete policy defined), then this is
-an excellent way to sandbox applications and maintain native performance.
-However, in practice it can be extremely difficult (if not impossible) to
-reliably define a policy for arbitrary, previously unknown applications,
-making this approach challenging to apply universally.
+made small enough (i.e. a sufficiently complete policy defined), then this is an
+excellent way to sandbox applications and maintain native performance. However,
+in practice it can be extremely difficult (if not impossible) to reliably define
+a policy for arbitrary, previously unknown applications, making this approach
+challenging to apply universally.

 ![Rule-based execution](g3doc/Rule-Based-Execution.png "Rule-based execution")

@@ -109,9 +110,9 @@ application to directly control the system calls it makes.
 In order to provide defense-in-depth and limit the host system surface, the
 gVisor container runtime is normally split into two separate processes. First,
 the *Sentry* process includes the kernel and is responsible for executing user
-code and handling system calls. Second, file system operations that extend beyond
-the sandbox (not internal proc or tmp files, pipes, etc.) are sent to a proxy,
-called a *Gofer*, via a 9P connection.
+code and handling system calls. Second, file system operations that extend
+beyond the sandbox (not internal proc or tmp files, pipes, etc.) are sent to a
+proxy, called a *Gofer*, via a 9P connection.

 ![Sentry](g3doc/Sentry-Gofer.png "Sentry and Gofer")

@@ -138,17 +139,17 @@ isolation (see below).
 The Sentry requires a *platform* to implement basic context switching and memory
 mapping functionality. Today, gVisor supports two platforms:

-* The **Ptrace** platform uses SYSEMU functionality to execute user code without
- executing host system calls. This platform can run anywhere that `ptrace`
- works (even VMs without nested virtualization).
+* The **Ptrace** platform uses SYSEMU functionality to execute user code
+ without executing host system calls. This platform can run anywhere that
+ `ptrace` works (even VMs without nested virtualization).

-* The **KVM** platform (experimental) allows the Sentry to act as both guest OS
- and VMM, switching back and forth between the two worlds seamlessly. The KVM
- platform can run on bare-metal or on a VM with nested virtualization enabled.
- While there is no virtualized hardware layer -- the sandbox retains a process
- model -- gVisor leverages virtualization extensions available on modern
- processors in order to improve isolation and performance of address space
- switches.
+* The **KVM** platform (experimental) allows the Sentry to act as both guest
+ OS and VMM, switching back and forth between the two worlds seamlessly. The
+ KVM platform can run on bare-metal or on a VM with nested virtualization
+ enabled. While there is no virtualized hardware layer -- the sandbox retains
+ a process model -- gVisor leverages virtualization extensions available on
+ modern processors in order to improve isolation and performance of address
+ space switches.

 ### Performance

@@ -172,8 +173,8 @@ binaries).

 The easiest way to get `runsc` is from the
 [latest nightly build][runsc-nightly]. After you download the binary, check it
-against the SHA512 [checksum file][runsc-nightly-sha]. Older builds can be
-found here:
+against the SHA512 [checksum file][runsc-nightly-sha]. Older builds can be found
+here:
 `https://storage.googleapis.com/gvisor/releases/nightly/${yyyy-mm-dd}/runsc` and
 `https://storage.googleapis.com/gvisor/releases/nightly/${yyyy-mm-dd}/runsc.sha512`

@@ -193,8 +194,8 @@ sudo mv runsc /usr/local/bin

 Next, configure Docker to use `runsc` by adding a runtime entry to your Docker
 configuration (`/etc/docker/daemon.json`). You may have to create this file if
-it does not exist. Also, some Docker versions also require you to [specify the
-`storage-driver` field][docker-storage-driver].
+it does not exist. Also, some Docker versions also require you to
+[specify the `storage-driver` field][docker-storage-driver].

 In the end, the file should look something like:

@@ -208,7 +209,8 @@ In the end, the file should look something like:
 }
 ```

-You must restart the Docker daemon after making changes to this file, typically this is done via:
+You must restart the Docker daemon after making changes to this file, typically
+this is done via:

 ```
 sudo systemctl restart docker
@@ -229,8 +231,8 @@ docker run --runtime=runsc -it ubuntu /bin/bash
 ### Kubernetes Support (Experimental)

 gVisor can run sandboxed containers in a Kubernetes cluster with cri-o, although
-this is not recommended for production environments yet. Follow [these
-instructions][cri-o-k8s] to run [cri-o][cri-o] on a node in a Kubernetes
+this is not recommended for production environments yet. Follow
+[these instructions][cri-o-k8s] to run [cri-o][cri-o] on a node in a Kubernetes
 cluster. Build `runsc` and put it on the node, and set it as the
 `runtime_untrusted_workload` in `/etc/crio/crio.conf`.

@@ -251,11 +253,11 @@ gVisor currently requires x86\_64 Linux to build.

 Make sure the following dependencies are installed:

-* [git][git]
-* [Bazel][bazel]
-* [Python][python]
-* [Docker version 17.09.0 or greater][docker]
-* Gold linker (e.g. `binutils-gold` package on Ubuntu)
+* [git][git]
+* [Bazel][bazel]
+* [Python][python]
+* [Docker version 17.09.0 or greater][docker]
+* Gold linker (e.g. `binutils-gold` package on Ubuntu)

 #### Getting the source

@@ -275,7 +277,6 @@ bazel build runsc
 sudo cp ./bazel-bin/runsc/linux_amd64_pure_stripped/runsc /usr/local/bin
 ```

-
 ### Testing

 The gVisor test suite can be run with Bazel:
@@ -366,33 +367,33 @@ Then restart the Docker daemon.
 gVisor implements a large portion of the Linux surface and while we strive to
 make it broadly compatible, there are (and always will be) unimplemented
 features and bugs. The only real way to know if it will work is to try. If you
-find a container that doesn’t work and there is no known issue, please [file a
-bug][bug] indicating the full command you used to run the image. Providing the
-debug logs is also helpful.
+find a container that doesn’t work and there is no known issue, please
+[file a bug][bug] indicating the full command you used to run the image.
+Providing the debug logs is also helpful.

 ### What works?

 The following applications/images have been tested:

-* elasticsearch
-* golang
-* httpd
-* java8
-* jenkins
-* mariadb
-* memcached
-* mongo
-* mysql
-* nginx
-* node
-* php
-* postgres
-* prometheus
-* python
-* redis
-* registry
-* tomcat
-* wordpress
+* elasticsearch
+* golang
+* httpd
+* java8
+* jenkins
+* mariadb
+* memcached
+* mongo
+* mysql
+* nginx
+* node
+* php
+* postgres
+* prometheus
+* python
+* redis
+* registry
+* tomcat
+* wordpress

 ### My container runs fine with *runc* but fails with *runsc*.

@@ -416,8 +417,8 @@ This bug is tracked in [bug #4](https://github.com/google/gvisor/issues/4).

 ## Technical details

-We plan to release a full paper with technical details and will include it
-here when available.
+We plan to release a full paper with technical details and will include it here
+when available.

 ## Community

diff --git a/pkg/sentry/fs/README.md b/pkg/sentry/fs/README.md
index 898271ee8..76638cdae 100644
--- a/pkg/sentry/fs/README.md
+++ b/pkg/sentry/fs/README.md
@@ -149,10 +149,10 @@ An `fs.File` references the following filesystem objects:
 fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
 ```

-The `fs.Inode` is restored using its `fs.MountedFilesystem`. The [Mount
-points](#mount-points) section above describes how this happens in detail. The
-`fs.Dirent` restores its pointer to an `fs.Inode`, pointers to parent and
-children `fs.Dirents`, and the basename of the file.
+The `fs.Inode` is restored using its `fs.MountedFilesystem`. The
+[Mount points](#mount-points) section above describes how this happens in
+detail. The `fs.Dirent` restores its pointer to an `fs.Inode`, pointers to
+parent and children `fs.Dirents`, and the basename of the file.

 Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only
 used internally).
diff --git a/pkg/sentry/fs/proc/README.md b/pkg/sentry/fs/proc/README.md
index 6ad7297d2..cec842403 100644
--- a/pkg/sentry/fs/proc/README.md
+++ b/pkg/sentry/fs/proc/README.md
@@ -6,6 +6,7 @@ procfs generally.
 inconsistency, please file a bug.

 [TOC]
+
 ## Kernel data

 The following files are implemented:
@@ -91,6 +92,7 @@ Num currently running processes | Always zero
 Total num processes | Always zero

 TODO: Populate the columns with accurate statistics.
+
 ### meminfo

 ```bash
@@ -122,7 +124,7 @@ Shmem: 0 kB
 Notable divergences:

 Field name | Notes
-:---------------- | :--------------------------------------------------------
+:---------------- | :-----------------------------------------------------
 Buffers | Always zero, no block devices
 SwapCache | Always zero, no swap
 Inactive(anon) | Always zero, see SwapCache
@@ -182,6 +184,7 @@ softirq 0 0 0 0 0 0 0 0 0 0 0
 ```

 All fields except for `btime` are always zero.
+
 TODO: Populate with accurate fields.

 ### sys
diff --git a/pkg/sentry/kernel/README.md b/pkg/sentry/kernel/README.md
index 88760a9bb..427311be8 100644
--- a/pkg/sentry/kernel/README.md
+++ b/pkg/sentry/kernel/README.md
@@ -1,12 +1,12 @@
 This package contains:

-- A (partial) emulation of the "core Linux kernel", which governs task
- execution and scheduling, system call dispatch, and signal handling. See
- below for details.
+- A (partial) emulation of the "core Linux kernel", which governs task
+ execution and scheduling, system call dispatch, and signal handling. See
+ below for details.

-- The top-level interface for the sentry's Linux kernel emulation in general,
- used by the `main` function of all versions of the sentry. This interface
- revolves around the `Env` type (defined in `kernel.go`).
+- The top-level interface for the sentry's Linux kernel emulation in general,
+ used by the `main` function of all versions of the sentry. This interface
+ revolves around the `Env` type (defined in `kernel.go`).

 # Background

@@ -20,15 +20,15 @@ sentry's notion of a task unless otherwise specified.)
 At a high level, Linux application threads can be thought of as repeating a "run
 loop":

-- Some amount of application code is executed in userspace.
+- Some amount of application code is executed in userspace.

-- A trap (explicit syscall invocation, hardware interrupt or exception, etc.)
- causes control flow to switch to the kernel.
+- A trap (explicit syscall invocation, hardware interrupt or exception, etc.)
+ causes control flow to switch to the kernel.

-- Some amount of kernel code is executed in kernelspace, e.g. to handle the
- cause of the trap.
+- Some amount of kernel code is executed in kernelspace, e.g. to handle the
+ cause of the trap.

-- The kernel "returns from the trap" into application code.
+- The kernel "returns from the trap" into application code.

 Analogously, each task in the sentry is associated with a *task goroutine* that
 executes that task's run loop (`Task.run` in `task_run.go`). However, the
@@ -38,24 +38,25 @@ state to, and resuming execution from, checkpoints.
 While in kernelspace, a Linux thread can be descheduled (cease execution) in a
 variety of ways:

-- It can yield or be preempted, becoming temporarily descheduled but still
- runnable. At present, the sentry delegates scheduling of runnable threads to
- the Go runtime.
+- It can yield or be preempted, becoming temporarily descheduled but still
+ runnable. At present, the sentry delegates scheduling of runnable threads to
+ the Go runtime.

-- It can exit, becoming permanently descheduled. The sentry's equivalent is
- returning from `Task.run`, terminating the task goroutine.
+- It can exit, becoming permanently descheduled. The sentry's equivalent is
+ returning from `Task.run`, terminating the task goroutine.

-- It can enter interruptible sleep, a state in which it can be woken by a
- caller-defined wakeup or the receipt of a signal. In the sentry, interruptible
- sleep (which is ambiguously referred to as *blocking*) is implemented by
- making all events that can end blocking (including signal notifications)
- communicated via Go channels and using `select` to multiplex wakeup sources;
- see `task_block.go`.
+- It can enter interruptible sleep, a state in which it can be woken by a
+ caller-defined wakeup or the receipt of a signal. In the sentry,
+ interruptible sleep (which is ambiguously referred to as *blocking*) is
+ implemented by making all events that can end blocking (including signal
+ notifications) communicated via Go channels and using `select` to multiplex
+ wakeup sources; see `task_block.go`.

-- It can enter uninterruptible sleep, a state in which it can only be woken by a
- caller-defined wakeup. Killable sleep is a closely related variant in which
- the task can also be woken by SIGKILL. (These definitions also include Linux's
- "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped" (`TASK_TRACED`) states.)
+- It can enter uninterruptible sleep, a state in which it can only be woken by
+ a caller-defined wakeup. Killable sleep is a closely related variant in
+ which the task can also be woken by SIGKILL. (These definitions also include
+ Linux's "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped"
+ (`TASK_TRACED`) states.)

 To maximize compatibility with Linux, sentry checkpointing appears as a spurious
 signal-delivery interrupt on all tasks; interrupted system calls return `EINTR`
@@ -71,21 +72,22 @@ through sleeping operations.

 We break the task's control flow graph into *states*, delimited by:

-1. Points where uninterruptible and killable sleeps may occur. For example,
-there exists a state boundary between signal dequeueing and signal delivery
-because there may be an intervening ptrace signal-delivery-stop.
+1. Points where uninterruptible and killable sleeps may occur. For example,
+ there exists a state boundary between signal dequeueing and signal delivery
+ because there may be an intervening ptrace signal-delivery-stop.

-2. Points where sleep-induced branches may "rejoin" normal execution. For
-example, the syscall exit state exists because it can be reached immediately
-following a synchronous syscall, or after a task that is sleeping in `execve()`
-or `vfork()` resumes execution.
+2. Points where sleep-induced branches may "rejoin" normal execution. For
+ example, the syscall exit state exists because it can be reached immediately
+ following a synchronous syscall, or after a task that is sleeping in
+ `execve()` or `vfork()` resumes execution.

-3. Points containing large branches. This is strictly for organizational
-purposes. For example, the state that processes interrupt-signaled conditions is
-kept separate from the main "app" state to reduce the size of the latter.
+3. Points containing large branches. This is strictly for organizational
+ purposes. For example, the state that processes interrupt-signaled
+ conditions is kept separate from the main "app" state to reduce the size of
+ the latter.

-4. `SyscallReinvoke`, which does not correspond to anything in Linux, and exists
-solely to serve the autosave feature.
+4. `SyscallReinvoke`, which does not correspond to anything in Linux, and
+ exists solely to serve the autosave feature.

 ![dot -Tpng -Goverlap=false -orun_states.png run_states.dot](g3doc/run_states.png "Task control flow graph")

diff --git a/pkg/sentry/mm/README.md b/pkg/sentry/mm/README.md
index 067733475..e485a5ca5 100644
--- a/pkg/sentry/mm/README.md
+++ b/pkg/sentry/mm/README.md
@@ -38,50 +38,50 @@ forces the kernel to create such a mapping to service the read.

 For a file, doing so consists of several logical phases:

-1. The kernel allocates physical memory to store the contents of the required
- part of the file, and copies file contents to the allocated memory. Supposing
- that the kernel chooses the physical memory at physical address (PA)
- 0x2fb000, the resulting state of the system is:
+1. The kernel allocates physical memory to store the contents of the required
+ part of the file, and copies file contents to the allocated memory.
+ Supposing that the kernel chooses the physical memory at physical address
+ (PA) 0x2fb000, the resulting state of the system is:

 VMA: VA:0x400000 -> /tmp/foo:0x0
 Filemap: /tmp/foo:0x0 -> PA:0x2fb000

- (In Linux the state of the mapping from file offset to physical memory is
- stored in `struct address_space`, but to avoid confusion with other notions
- of address space we will refer to this system as filemap, named after Linux
- kernel source file `mm/filemap.c`.)
+ (In Linux the state of the mapping from file offset to physical memory is
+ stored in `struct address_space`, but to avoid confusion with other notions
+ of address space we will refer to this system as filemap, named after Linux
+ kernel source file `mm/filemap.c`.)

-2. The kernel stores the effective mapping from virtual to physical address in a
- *page table entry* (PTE) in the application's *page tables*, which are used
- by the CPU's virtual memory hardware to perform address translation. The
- resulting state of the system is:
+2. The kernel stores the effective mapping from virtual to physical address in
+ a *page table entry* (PTE) in the application's *page tables*, which are
+ used by the CPU's virtual memory hardware to perform address translation.
+ The resulting state of the system is:

 VMA: VA:0x400000 -> /tmp/foo:0x0
 Filemap: /tmp/foo:0x0 -> PA:0x2fb000
 PTE: VA:0x400000 -----------------> PA:0x2fb000

- The PTE is required for the application to actually use the contents of the
- mapped file as virtual memory. However, the PTE is derived from the VMA and
- filemap state, both of which are independently mutable, such that mutations
- to either will affect the PTE. For example:
-
- - The application may remove the VMA using the `munmap` system call. This
- breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently the
- mapping from VA:0x400000 to PA:0x2fb000. However, it does not necessarily
- break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a future mapping of
- the same file offset may reuse this physical memory.
-
- - The application may invalidate the file's contents by passing a length of 0
- to the `ftruncate` system call. This breaks the mapping from /tmp/foo:0x0
- to PA:0x2fb000, and consequently the mapping from VA:0x400000 to
- PA:0x2fb000. However, it does not break the mapping from VA:0x400000 to
- /tmp/foo:0x0, so future changes to the file's contents may again be made
- visible at VA:0x400000 after another page fault results in the allocation
- of a new physical address.
-
- Note that, in order to correctly break the mapping from VA:0x400000 to
- PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
- from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
+ The PTE is required for the application to actually use the contents of the
+ mapped file as virtual memory. However, the PTE is derived from the VMA and
+ filemap state, both of which are independently mutable, such that mutations
+ to either will affect the PTE. For example:
+
+ - The application may remove the VMA using the `munmap` system call. This
+ breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
+ the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
+ necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
+ future mapping of the same file offset may reuse this physical memory.
+
+ - The application may invalidate the file's contents by passing a length
+ of 0 to the `ftruncate` system call. This breaks the mapping from
+ /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
+ VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
+ VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
+ may again be made visible at VA:0x400000 after another page fault
+ results in the allocation of a new physical address.
+
+ Note that, in order to correctly break the mapping from VA:0x400000 to
+ PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
+ from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.

 [^mmap-anon]: Memory mappings to non-files are discussed in later sections.

@@ -146,30 +146,30 @@ When the application first incurs a page fault on this address, the host kernel
 delivers information about the page fault to the sentry in a platform-dependent
 manner, and the sentry handles the fault:

-1. The sentry allocates memory to store the contents of the required part of the
- file, and copies file contents to the allocated memory. However, since the
- sentry is implemented atop a host kernel, it does not configure mappings to
- physical memory directly. Instead, mappable "memory" in the sentry is
- represented by a host file descriptor and offset, since (as noted in
- "Background") this is the memory mapping primitive provided by the host
- kernel. In general, memory is allocated from a temporary host file using the
- `filemem` package. Supposing that the sentry allocates offset 0x3000 from
- host file "memory-file", the resulting state is:
+1. The sentry allocates memory to store the contents of the required part of
+ the file, and copies file contents to the allocated memory. However, since
+ the sentry is implemented atop a host kernel, it does not configure mappings
+ to physical memory directly. Instead, mappable "memory" in the sentry is
+ represented by a host file descriptor and offset, since (as noted in
+ "Background") this is the memory mapping primitive provided by the host
+ kernel. In general, memory is allocated from a temporary host file using the
+ `filemem` package. Supposing that the sentry allocates offset 0x3000 from
+ host file "memory-file", the resulting state is:

 Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
 Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000

-2. The sentry stores the effective mapping from virtual address to host file in
- a host VMA by invoking the `mmap` system call:
+2. The sentry stores the effective mapping from virtual address to host file in
+ a host VMA by invoking the `mmap` system call:

 Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
 Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
 Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000

-3. The sentry returns control to the application, which immediately incurs the
- page fault again.[^mmap-populate] However, since a host VMA now exists for
- the faulting virtual address, the host kernel now handles the page fault as
- described in "Background":
+3. The sentry returns control to the application, which immediately incurs the
+ page fault again.[^mmap-populate] However, since a host VMA now exists for
+ the faulting virtual address, the host kernel now handles the page fault as
+ described in "Background":

 Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
 Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
@@ -183,12 +183,12 @@ independently mutable, and the desired state of host VMAs is derived from that
 state.

 [^mmap-populate]: The sentry could force the host kernel to establish PTEs when
- it creates the host VMA by passing the `MAP_POPULATE` flag to
- the `mmap` system call, but usually does not. This is because,
- to reduce the number of page faults that require handling by
- the sentry and (correspondingly) the number of host `mmap`
- system calls, the sentry usually creates host VMAs that are
- much larger than the single faulting page.
+ it creates the host VMA by passing the `MAP_POPULATE` flag to
+ the `mmap` system call, but usually does not. This is because,
+ to reduce the number of page faults that require handling by
+ the sentry and (correspondingly) the number of host `mmap`
+ system calls, the sentry usually creates host VMAs that are
+ much larger than the single faulting page.

 ## Private Mappings

@@ -233,45 +233,46 @@ there is no shared zero page.

 In Linux:

-- A virtual address space is represented by `struct mm_struct`.
+- A virtual address space is represented by `struct mm_struct`.

-- VMAs are represented by `struct vm_area_struct`, stored in `struct
- mm_struct::mmap`.
+- VMAs are represented by `struct vm_area_struct`, stored in `struct
+ mm_struct::mmap`.

-- Mappings from file offsets to physical memory are stored in `struct
- address_space`.
+- Mappings from file offsets to physical memory are stored in `struct
+ address_space`.

-- Reverse mappings from file offsets to virtual mappings are stored in `struct
- address_space::i_mmap`.
+- Reverse mappings from file offsets to virtual mappings are stored in `struct
+ address_space::i_mmap`.

-- Physical memory pages are represented by a pointer to `struct page` or an
- index called a *page frame number* (PFN), represented by `pfn_t`.
+- Physical memory pages are represented by a pointer to `struct page` or an
+ index called a *page frame number* (PFN), represented by `pfn_t`.

-- PTEs are represented by architecture-dependent type `pte_t`, stored in a table
- hierarchy rooted at `struct mm_struct::pgd`.
+- PTEs are represented by architecture-dependent type `pte_t`, stored in a
+ table hierarchy rooted at `struct mm_struct::pgd`.

 In the sentry:

-- A virtual address space is represented by type [`mm.MemoryManager`][mm].
+- A virtual address space is represented by type [`mm.MemoryManager`][mm].

-- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
- `mm.MemoryManager.vmas`.
+- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
+ `mm.MemoryManager.vmas`.

-- Mappings from sentry file offsets to host file offsets are abstracted through
- interface method [`memmap.Mappable.Translate`][memmap].
+- Mappings from sentry file offsets to host file offsets are abstracted
+ through interface method [`memmap.Mappable.Translate`][memmap].

-- Reverse mappings from sentry file offsets to virtual mappings are abstracted
- through interface methods [`memmap.Mappable.AddMapping` and
- `memmap.Mappable.RemoveMapping`][memmap].
+- Reverse mappings from sentry file offsets to virtual mappings are abstracted
+ through interface methods
+ [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].

-- Host files that may be mapped into host VMAs are represented by type
- [`platform.File`][platform].
+- Host files that may be mapped into host VMAs are represented by type
+ [`platform.File`][platform].

-- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
- mapping area"), stored in `mm.MemoryManager.pmas`.
+- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
+ mapping area"), stored in `mm.MemoryManager.pmas`.

-- Creation and destruction of host VMAs is abstracted through interface methods
- [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].
+- Creation and destruction of host VMAs is abstracted through interface
+ methods
+ [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].

 [filemem]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/platform/filemem/filemem.go
 [memmap]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/memmap/memmap.go
diff --git a/pkg/sentry/usermem/README.md b/pkg/sentry/usermem/README.md
index 2ebd3bcc1..f6d2137eb 100644
--- a/pkg/sentry/usermem/README.md
+++ b/pkg/sentry/usermem/README.md
@@ -2,30 +2,30 @@ This package defines primitives for sentry access to application memory.

 Major types:

-- The `IO` interface represents a virtual address space and provides I/O methods
- on that address space. `IO` is the lowest-level primitive. The primary
- implementation of the `IO` interface is `mm.MemoryManager`.
+- The `IO` interface represents a virtual address space and provides I/O
+ methods on that address space. `IO` is the lowest-level primitive. The
+ primary implementation of the `IO` interface is `mm.MemoryManager`.

-- `IOSequence` represents a collection of individually-contiguous address ranges
- in a `IO` that is operated on sequentially, analogous to Linux's `struct
- iov_iter`.
+- `IOSequence` represents a collection of individually-contiguous address
+ ranges in a `IO` that is operated on sequentially, analogous to Linux's
+ `struct iov_iter`.

 Major usage patterns:

-- Access to a task's virtual memory, subject to the application's memory
- protections and while running on that task's goroutine, from a context that is
- at or above the level of the `kernel` package (e.g. most syscall
- implementations in `syscalls/linux`); use the `kernel.Task.Copy*` wrappers
- defined in `kernel/task_usermem.go`.
+- Access to a task's virtual memory, subject to the application's memory
+ protections and while running on that task's goroutine, from a context that
+ is at or above the level of the `kernel` package (e.g. most syscall
+ implementations in `syscalls/linux`); use the `kernel.Task.Copy*` wrappers
+ defined in `kernel/task_usermem.go`.
-- Access to a task's virtual memory, from a context that is at or above the - level of the `kernel` package, but where any of the above constraints does not - hold (e.g. `PTRACE_POKEDATA`, which ignores application memory protections); - obtain the task's `mm.MemoryManager` by calling `kernel.Task.MemoryManager`, - and call its `IO` methods directly. +- Access to a task's virtual memory, from a context that is at or above the + level of the `kernel` package, but where any of the above constraints does + not hold (e.g. `PTRACE_POKEDATA`, which ignores application memory + protections); obtain the task's `mm.MemoryManager` by calling + `kernel.Task.MemoryManager`, and call its `IO` methods directly. -- Access to a task's virtual memory, from a context that is below the level of - the `kernel` package (e.g. filesystem I/O); clients must pass I/O arguments - from higher layers, usually in the form of an `IOSequence`. The - `kernel.Task.SingleIOSequence` and `kernel.Task.IovecsIOSequence` functions - in `kernel/task_usermem.go` are convenience functions for doing so.
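
The page-fault walkthrough in the `pkg/sentry/mm/README.md` hunk chains three mappings: sentry VMA (virtual address to sentry file offset), sentry filemap (sentry file offset to host file offset), and host VMA (virtual address to host file offset). The following is a minimal Go sketch of that chain using the README's own example values. Every name in it (`fileOffset`, `sentry`, `newSentry`, `handleFault`) is hypothetical and exists only for illustration; it does not use or mirror the real `mm`, `memmap`, `platform`, or `filemem` APIs, and it tracks single page-aligned addresses in plain maps rather than address ranges.

```go
package main

import "fmt"

// Hypothetical illustration only: these types are not the sentry's real types.

const (
	hostFile = "memory-file" // host file name taken from the README example
	pageSize = 0x1000        // assumed page size for the sketch
)

// fileOffset names a page-aligned offset within a file.
type fileOffset struct {
	file string
	off  uint64
}

// sentry holds the three mappings described in the walkthrough.
type sentry struct {
	vmas     map[uint64]fileOffset     // sentry VMA: VA -> sentry file offset
	filemap  map[fileOffset]fileOffset // sentry filemap: sentry file offset -> host file offset
	hostVMAs map[uint64]fileOffset     // host VMA: VA -> host file offset
	nextFree uint64                    // next unallocated offset in hostFile
}

// newSentry returns an empty sentry whose first allocation lands at firstFree.
func newSentry(firstFree uint64) *sentry {
	return &sentry{
		vmas:     make(map[uint64]fileOffset),
		filemap:  make(map[fileOffset]fileOffset),
		hostVMAs: make(map[uint64]fileOffset),
		nextFree: firstFree,
	}
}

// handleFault models steps 1 and 2 of the walkthrough for a single page.
func (s *sentry) handleFault(va uint64) (fileOffset, error) {
	src, ok := s.vmas[va]
	if !ok {
		return fileOffset{}, fmt.Errorf("no sentry VMA for VA %#x", va)
	}
	// Step 1: back the sentry file offset with host file memory if needed.
	dst, ok := s.filemap[src]
	if !ok {
		dst = fileOffset{file: hostFile, off: s.nextFree}
		s.nextFree += pageSize
		s.filemap[src] = dst
	}
	// Step 2: record the effective VA -> host file mapping (the host mmap).
	s.hostVMAs[va] = dst
	return dst, nil
}

func main() {
	s := newSentry(0x3000)
	s.vmas[0x400000] = fileOffset{file: "/tmp/foo", off: 0x0}

	dst, err := s.handleFault(0x400000)
	if err != nil {
		fmt.Println("fault failed:", err)
		return
	}
	fmt.Printf("Host VMA: VA:%#x -> host:%s:%#x\n", uint64(0x400000), dst.file, dst.off)
}
```

Step 3 of the walkthrough (the application re-faulting and the host kernel installing the PTE) has no counterpart here, since it happens entirely in the host kernel once the host VMA exists.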