Diffstat (limited to 'pkg/sentry/mm/README.md')
-rw-r--r-- pkg/sentry/mm/README.md | 280
1 files changed, 280 insertions, 0 deletions
diff --git a/pkg/sentry/mm/README.md b/pkg/sentry/mm/README.md
new file mode 100644
index 000000000..f4d43d927
--- /dev/null
+++ b/pkg/sentry/mm/README.md
@@ -0,0 +1,280 @@
+This package provides an emulation of Linux semantics for application virtual
+memory mappings.
+
+For completeness, this document also describes aspects of the memory management
+subsystem defined outside this package.
+
+# Background
+
+We begin by describing semantics for virtual memory in Linux.
+
+A virtual address space is defined as a collection of mappings from virtual
+addresses to physical memory. However, userspace applications do not configure
+mappings to physical memory directly. Instead, applications configure memory
+mappings from virtual addresses to offsets into a file using the `mmap` system
+call.[^mmap-anon] For example, a call to:
+
+    mmap(
+        /* addr = */ 0x400000,
+        /* length = */ 0x1000,
+        PROT_READ | PROT_WRITE,
+        MAP_SHARED,
+        /* fd = */ 3,
+        /* offset = */ 0);
+
+creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
+0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within
+the Linux kernel, virtual memory mappings are represented by *virtual memory
+areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
+virtual memory subsystem after the `mmap` call may be depicted as:
+
+    VMA:     VA:0x400000 -> /tmp/foo:0x0
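
The effect of a shared file mapping can be observed from userspace. The following is a minimal Go sketch, assuming a Linux host; the helper name `mapShared` and the temporary file are illustrative and not part of this package. Unlike the C call above, it does not request a fixed address, so the kernel chooses where to place the VMA.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapShared creates a temporary file of the given size, maps it with
// MAP_SHARED, writes msg through the mapping, and returns the file's
// contents as read back through the ordinary file API.
func mapShared(size int, msg string) ([]byte, error) {
	f, err := os.CreateTemp("", "foo")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if err := f.Truncate(int64(size)); err != nil {
		return nil, err
	}
	// This creates the VMA; no physical memory needs to be provisioned yet.
	mem, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(mem)
	// The first access to the mapping may incur a page fault.
	copy(mem, msg)
	return os.ReadFile(f.Name())
}

func main() {
	data, err := mapShared(0x1000, "hello")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", data[:5])
}
```

Because the mapping is shared, the bytes written through `mem` are visible when the file is read back.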
+
+Establishing a virtual memory area does not necessarily establish a mapping to a
+physical address, because Linux may not yet have provisioned physical memory to
+store the file's contents. Thus, if the application attempts to read the
+contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
+forces the kernel to create such a mapping to service the read.
+
+For a file, doing so consists of several logical phases:
+
+1. The kernel allocates physical memory to store the contents of the required
+ part of the file, and copies file contents to the allocated memory.
+ Supposing that the kernel chooses the physical memory at physical address
+ (PA) 0x2fb000, the resulting state of the system is:
+
+        VMA:     VA:0x400000 -> /tmp/foo:0x0
+        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
+
+ (In Linux the state of the mapping from file offset to physical memory is
+ stored in `struct address_space`, but to avoid confusion with other notions
+ of address space we will refer to this system as filemap, named after Linux
+ kernel source file `mm/filemap.c`.)
+
+2. The kernel stores the effective mapping from virtual to physical address in
+ a *page table entry* (PTE) in the application's *page tables*, which are
+ used by the CPU's virtual memory hardware to perform address translation.
+ The resulting state of the system is:
+
+        VMA:     VA:0x400000 -> /tmp/foo:0x0
+        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
+        PTE:     VA:0x400000 -----------------> PA:0x2fb000
+
+ The PTE is required for the application to actually use the contents of the
+ mapped file as virtual memory. However, the PTE is derived from the VMA and
+ filemap state, both of which are independently mutable, such that mutations
+ to either will affect the PTE. For example:
+
+ - The application may remove the VMA using the `munmap` system call. This
+ breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
+ the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
+ necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
+ future mapping of the same file offset may reuse this physical memory.
+
+ - The application may invalidate the file's contents by passing a length
+ of 0 to the `ftruncate` system call. This breaks the mapping from
+ /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
+ VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
+ VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
+ may again be made visible at VA:0x400000 after another page fault
+ results in the allocation of a new physical address.
+
+ Note that, in order to correctly break the mapping from VA:0x400000 to
+ PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
+ from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
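
The relationship between the three kinds of state can be sketched as a toy Go model. This is an illustration only, not how Linux is implemented: the type `space`, its maps, and its method names are all hypothetical, and each mapping covers a single page.

```go
package main

import "fmt"

// fileOff identifies a page-aligned offset within a file.
type fileOff struct {
	file string
	off  uint64
}

// space is a toy model of one address space plus the filemap.
type space struct {
	vmas    map[uint64]fileOff          // VA -> file offset
	filemap map[fileOff]uint64          // file offset -> PA
	ptes    map[uint64]uint64           // VA -> PA, derived from the two maps above
	rmap    map[fileOff]map[uint64]bool // reverse mapping: file offset -> VAs
}

func newSpace() *space {
	return &space{
		vmas:    map[uint64]fileOff{},
		filemap: map[fileOff]uint64{},
		ptes:    map[uint64]uint64{},
		rmap:    map[fileOff]map[uint64]bool{},
	}
}

// mmap creates a VMA and records the reverse mapping.
func (s *space) mmap(va uint64, fo fileOff) {
	s.vmas[va] = fo
	if s.rmap[fo] == nil {
		s.rmap[fo] = map[uint64]bool{}
	}
	s.rmap[fo][va] = true
}

// fault installs a PTE for va. freePA is the physical page to use if the
// file offset is not yet present in the filemap.
func (s *space) fault(va, freePA uint64) bool {
	fo, ok := s.vmas[va]
	if !ok {
		return false // No VMA covers va.
	}
	if _, cached := s.filemap[fo]; !cached {
		s.filemap[fo] = freePA
	}
	s.ptes[va] = s.filemap[fo]
	return true
}

// munmap removes the VMA and its PTE; the filemap entry survives.
func (s *space) munmap(va uint64) {
	delete(s.rmap[s.vmas[va]], va)
	delete(s.vmas, va)
	delete(s.ptes, va)
}

// truncate drops the filemap entry and uses the reverse mapping to find
// and remove the affected PTEs; the VMAs survive.
func (s *space) truncate(fo fileOff) {
	delete(s.filemap, fo)
	for va := range s.rmap[fo] {
		delete(s.ptes, va)
	}
}

func main() {
	s := newSpace()
	foo := fileOff{"/tmp/foo", 0}
	s.mmap(0x400000, foo)
	s.fault(0x400000, 0x2fb000)
	fmt.Printf("PTE: VA:%#x -> PA:%#x\n", 0x400000, s.ptes[0x400000])
	s.truncate(foo)
	_, ok := s.ptes[0x400000]
	fmt.Println("PTE present after truncate:", ok)
}
```

In this model, `munmap` and `truncate` each invalidate the PTE from a different side, which is why the PTE is treated as derived state.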
+
+[^mmap-anon]: Memory mappings to non-files are discussed in later sections.
+
+## Private Mappings
+
+The preceding example considered VMAs created using the `MAP_SHARED` flag, which
+means that PTEs derived from the mapping should always use physical memory that
+represents the current state of the mapped file.[^mmap-dev-zero] Applications
+can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
+Private mappings are *copy-on-write*.
+
+Suppose that the application instead created a private mapping in the previous
+example. In Linux, the state of the system after a read page fault would be:
+
+    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
+    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
+    PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)
+
+Now suppose the application attempts to write to VA:0x400000. For a shared
+mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
+responsible for ensuring that the write is later propagated to the mapped file.
+For a private mapping, the write incurs another page fault since the PTE is
+marked read-only. In response, the kernel allocates physical memory to store the
+mapping's *private copy* of the file's contents, copies file contents to the
+allocated memory, and changes the PTE to map to the private copy. Supposing that
+the kernel chooses the physical memory at physical address (PA) 0x5ea000, the
+resulting state of the system is:
+
+    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
+    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
+    PTE:     VA:0x400000 -----------------> PA:0x5ea000
+
+Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
+but is now irrelevant to this mapping.
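
Copy-on-write can also be observed from userspace. The following is a minimal Go sketch, assuming a Linux host; the helper name `writePrivate` and the temporary file are illustrative only. It writes through a `MAP_PRIVATE` mapping and then compares what the mapping holds with what the file holds.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// writePrivate maps a temporary file holding initial with MAP_PRIVATE,
// writes update through the mapping, and returns the contents seen through
// the mapping and through the file. initial must be non-empty.
func writePrivate(initial, update string) (mapped, onDisk string, err error) {
	f, err := os.CreateTemp("", "foo")
	if err != nil {
		return "", "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.WriteString(initial); err != nil {
		return "", "", err
	}
	mem, err := syscall.Mmap(int(f.Fd()), 0, len(initial),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE)
	if err != nil {
		return "", "", err
	}
	defer syscall.Munmap(mem)
	// The write goes to the mapping's private copy, not to the file.
	copy(mem, update)
	data, err := os.ReadFile(f.Name())
	if err != nil {
		return "", "", err
	}
	return string(mem), string(data), nil
}

func main() {
	mapped, onDisk, err := writePrivate("original", "CHANGED!")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mapping: %q, file: %q\n", mapped, onDisk)
}
```

The mapping reflects the write while the file keeps its original contents.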
+
+[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.
+
+## Anonymous Mappings
+
+Instead of passing a file to the `mmap` system call, applications can request an
+*anonymous* mapping by passing the `MAP_ANONYMOUS` flag. Semantically, an
+anonymous mapping is essentially a mapping to an ephemeral file initially filled
+with zero bytes. In practice, this is how shared anonymous mappings are
+implemented. Private anonymous mappings, however, do not result in the creation
+of an ephemeral file: since there would be no way to modify the contents of the
+underlying file through a private mapping, all private anonymous mappings use a
+single shared page filled with zero bytes until copy-on-write occurs.
+
+# Virtual Memory in the Sentry
+
+The sentry implements application virtual memory atop a host kernel, introducing
+an additional level of indirection to the above.
+
+Consider the same scenario as in the previous section. Since the sentry handles
+application system calls, the effect of an application `mmap` system call is to
+create a VMA in the sentry (as opposed to the host kernel):
+
+    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
+
+When the application first incurs a page fault on this address, the host kernel
+delivers information about the page fault to the sentry in a platform-dependent
+manner, and the sentry handles the fault:
+
+1. The sentry allocates memory to store the contents of the required part of
+ the file, and copies file contents to the allocated memory. However, since
+ the sentry is implemented atop a host kernel, it does not configure mappings
+ to physical memory directly. Instead, mappable "memory" in the sentry is
+ represented by a host file descriptor and offset, since (as noted in
+ "Background") this is the memory mapping primitive provided by the host
+ kernel. In general, memory is allocated from a temporary host file using the
+ [`pgalloc`][pgalloc] package. Supposing that the sentry allocates offset
+ 0x3000 from host file "memory-file", the resulting state is:
+
+        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
+        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
+
+2. The sentry stores the effective mapping from virtual address to host file in
+ a host VMA by invoking the `mmap` system call:
+
+        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
+        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
+        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000
+
+3. The sentry returns control to the application, which immediately incurs the
+ page fault again.[^mmap-populate] However, since a host VMA now exists for
+ the faulting virtual address, the host kernel now handles the page fault as
+ described in "Background":
+
+        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
+        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
+        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000
+        Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
+        Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000
+
+Thus, from an implementation standpoint, host VMAs serve the same purpose in the
+sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
+independently mutable, and the desired state of host VMAs is derived from that
+state.
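
The first two steps of the fault path above can be sketched as a toy Go model. This is not the sentry's actual code: the type `sentry`, its fields, and `handleFault` are hypothetical names, offsets are allocated one page at a time, and the host `mmap` call is replaced by a map update.

```go
package main

import "fmt"

const pageSize = 0x1000

// fileOff identifies a page-aligned offset within a file.
type fileOff struct {
	file string
	off  uint64
}

// sentry is a toy model of the sentry-side state for one address space.
type sentry struct {
	vmas     map[uint64]fileOff  // sentry VMAs: VA -> application file offset
	filemap  map[fileOff]fileOff // sentry filemap: application file offset -> host file offset
	hostVMAs map[uint64]fileOff  // host VMAs created so far: VA -> host file offset
	nextOff  uint64              // next unallocated offset in the memory file
}

// handleFault resolves va through the sentry VMA and sentry filemap, then
// records the host VMA that a real implementation would create with mmap.
func (s *sentry) handleFault(va uint64) (fileOff, bool) {
	fo, ok := s.vmas[va]
	if !ok {
		return fileOff{}, false // No sentry VMA covers va.
	}
	hf, ok := s.filemap[fo]
	if !ok {
		// Step 1: allocate backing memory from the host memory file.
		hf = fileOff{"memory-file", s.nextOff}
		s.nextOff += pageSize
		s.filemap[fo] = hf
	}
	// Step 2: stand-in for the host mmap call that creates the host VMA.
	s.hostVMAs[va] = hf
	return hf, true
}

func main() {
	s := &sentry{
		vmas:     map[uint64]fileOff{0x400000: {"/tmp/foo", 0}},
		filemap:  map[fileOff]fileOff{},
		hostVMAs: map[uint64]fileOff{},
		nextOff:  0x3000,
	}
	hf, ok := s.handleFault(0x400000)
	fmt.Printf("handled=%v host VMA: VA:%#x -> host:%s:%#x\n", ok, 0x400000, hf.file, hf.off)
}
```

Step 3 has no counterpart in the model, because from that point the host kernel, not the sentry, services the fault.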
+
+[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
+ it creates the host VMA by passing the `MAP_POPULATE` flag to
+ the `mmap` system call, but usually does not. This is because,
+ to reduce the number of page faults that require handling by
+ the sentry and (correspondingly) the number of host `mmap`
+ system calls, the sentry usually creates host VMAs that are
+ much larger than the single faulting page.
+
+## Private Mappings
+
+The sentry implements private mappings consistently with Linux. Before
+copy-on-write, the private mapping example given in the Background results in:
+
+    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
+    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
+    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
+    Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
+    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)
+
+When the application attempts to write to this address, the host kernel delivers
+information about the resulting page fault to the sentry. Analogous to Linux,
+the sentry allocates memory to store the mapping's private copy of the file's
+contents, copies file contents to the allocated memory, and changes the host VMA
+to map to the private copy. Supposing that the sentry chooses the offset 0x4000
+in host file `memory-file` to store the private copy, the state of the system
+after copy-on-write is:
+
+    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
+    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
+    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x4000
+    Host filemap:                                  host:memory-file:0x4000 -> PA:0x5ea000
+    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x5ea000
+
+However, this highlights an important difference between Linux and the sentry.
+In Linux, page tables are concrete (architecture-dependent) data structures
+owned by the kernel. Conversely, the sentry has the ability to create and
+destroy host VMAs using host system calls, but it does not have direct access to
+their state. Thus, given only the state depicted above, if the application
+invokes the `munmap` system call to remove the sentry VMA, it is non-trivial for
+the sentry to determine that it should deallocate `host:memory-file:0x4000`.
+This implies that the sentry must retain information about the host VMAs that it
+has created.
+
+## Anonymous Mappings
+
+The sentry implements anonymous mappings consistently with Linux, except that
+there is no shared zero page.
+
+# Implementation Constructs
+
+In Linux:
+
+- A virtual address space is represented by `struct mm_struct`.
+
+- VMAs are represented by `struct vm_area_struct`, stored in `struct
+ mm_struct::mmap`.
+
+- Mappings from file offsets to physical memory are stored in `struct
+ address_space`.
+
+- Reverse mappings from file offsets to virtual mappings are stored in `struct
+ address_space::i_mmap`.
+
+- Physical memory pages are represented by a pointer to `struct page` or an
+ index called a *page frame number* (PFN), represented by `pfn_t`.
+
+- PTEs are represented by architecture-dependent type `pte_t`, stored in a
+ table hierarchy rooted at `struct mm_struct::pgd`.
+
+In the sentry:
+
+- A virtual address space is represented by type [`mm.MemoryManager`][mm].
+
+- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
+ `mm.MemoryManager.vmas`.
+
+- Mappings from sentry file offsets to host file offsets are abstracted
+ through interface method [`memmap.Mappable.Translate`][memmap].
+
+- Reverse mappings from sentry file offsets to virtual mappings are abstracted
+ through interface methods
+ [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].
+
+- Host files that may be mapped into host VMAs are represented by type
+ [`platform.File`][platform].
+
+- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
+ mapping area"), stored in `mm.MemoryManager.pmas`.
+
+- Creation and destruction of host VMAs is abstracted through interface
+ methods
+ [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].
+
+[memmap]: https://github.com/google/gvisor/blob/master/pkg/sentry/memmap/memmap.go
+[mm]: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/mm.go
+[pgalloc]: https://github.com/google/gvisor/blob/master/pkg/sentry/pgalloc/pgalloc.go
+[platform]: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/platform.go