diff options
Diffstat (limited to 'pkg/sentry/mm/README.md')
-rw-r--r-- | pkg/sentry/mm/README.md | 280 |
1 files changed, 280 insertions, 0 deletions
diff --git a/pkg/sentry/mm/README.md b/pkg/sentry/mm/README.md new file mode 100644 index 000000000..f4d43d927 --- /dev/null +++ b/pkg/sentry/mm/README.md @@ -0,0 +1,280 @@ +This package provides an emulation of Linux semantics for application virtual +memory mappings. + +For completeness, this document also describes aspects of the memory management +subsystem defined outside this package. + +# Background + +We begin by describing semantics for virtual memory in Linux. + +A virtual address space is defined as a collection of mappings from virtual +addresses to physical memory. However, userspace applications do not configure +mappings to physical memory directly. Instead, applications configure memory +mappings from virtual addresses to offsets into a file using the `mmap` system +call.[^mmap-anon] For example, a call to: + + mmap( + /* addr = */ 0x400000, + /* length = */ 0x1000, + PROT_READ | PROT_WRITE, + MAP_SHARED, + /* fd = */ 3, + /* offset = */ 0); + +creates a mapping of length 0x1000 bytes, starting at virtual address (VA) +0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within +the Linux kernel, virtual memory mappings are represented by *virtual memory +areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the +virtual memory subsystem after the `mmap` call may be depicted as: + + VMA: VA:0x400000 -> /tmp/foo:0x0 + +Establishing a virtual memory area does not necessarily establish a mapping to a +physical address, because Linux has not necessarily provisioned physical memory +to store the file's contents. Thus, if the application attempts to read the +contents of VA 0x400000, it may incur a *page fault*, a CPU exception that +forces the kernel to create such a mapping to service the read. + +For a file, doing so consists of several logical phases: + +1. The kernel allocates physical memory to store the contents of the required + part of the file, and copies file contents to the allocated memory. + Supposing that the kernel chooses the physical memory at physical address + (PA) 0x2fb000, the resulting state of the system is: + + VMA: VA:0x400000 -> /tmp/foo:0x0 + Filemap: /tmp/foo:0x0 -> PA:0x2fb000 + + (In Linux the state of the mapping from file offset to physical memory is + stored in `struct address_space`, but to avoid confusion with other notions + of address space we will refer to this system as filemap, named after Linux + kernel source file `mm/filemap.c`.) + +2. The kernel stores the effective mapping from virtual to physical address in + a *page table entry* (PTE) in the application's *page tables*, which are + used by the CPU's virtual memory hardware to perform address translation. + The resulting state of the system is: + + VMA: VA:0x400000 -> /tmp/foo:0x0 + Filemap: /tmp/foo:0x0 -> PA:0x2fb000 + PTE: VA:0x400000 -----------------> PA:0x2fb000 + + The PTE is required for the application to actually use the contents of the + mapped file as virtual memory. However, the PTE is derived from the VMA and + filemap state, both of which are independently mutable, such that mutations + to either will affect the PTE. For example: + + - The application may remove the VMA using the `munmap` system call. This + breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently + the mapping from VA:0x400000 to PA:0x2fb000. However, it does not + necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a + future mapping of the same file offset may reuse this physical memory. + + - The application may invalidate the file's contents by passing a length + of 0 to the `ftruncate` system call. This breaks the mapping from + /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from + VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from + VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents + may again be made visible at VA:0x400000 after another page fault + results in the allocation of a new physical address. + + Note that, in order to correctly break the mapping from VA:0x400000 to + PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping* + from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE. + +[^mmap-anon]: Memory mappings to non-files are discussed in later sections. + +## Private Mappings + +The preceding example considered VMAs created using the `MAP_SHARED` flag, which +means that PTEs derived from the mapping should always use physical memory that +represents the current state of the mapped file.[^mmap-dev-zero] Applications +can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*. +Private mappings are *copy-on-write*. + +Suppose that the application instead created a private mapping in the previous +example. In Linux, the state of the system after a read page fault would be: + + VMA: VA:0x400000 -> /tmp/foo:0x0 (private) + Filemap: /tmp/foo:0x0 -> PA:0x2fb000 + PTE: VA:0x400000 -----------------> PA:0x2fb000 (read-only) + +Now suppose the application attempts to write to VA:0x400000. For a shared +mapping, the write would be propagated to PA:0x2fb000, and the kernel would be +responsible for ensuring that the write is later propagated to the mapped file. +For a private mapping, the write incurs another page fault since the PTE is +marked read-only. In response, the kernel allocates physical memory to store the +mapping's *private copy* of the file's contents, copies file contents to the +allocated memory, and changes the PTE to map to the private copy. Supposing that +the kernel chooses the physical memory at physical address (PA) 0x5ea000, the +resulting state of the system is: + + VMA: VA:0x400000 -> /tmp/foo:0x0 (private) + Filemap: /tmp/foo:0x0 -> PA:0x2fb000 + PTE: VA:0x400000 -----------------> PA:0x5ea000 + +Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist, +but is now irrelevant to this mapping. + +[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`. + +## Anonymous Mappings + +Instead of passing a file to the `mmap` system call, applications can instead +request an *anonymous* mapping by passing the `MAP_ANONYMOUS` flag. +Semantically, an anonymous mapping is essentially a mapping to an ephemeral file +initially filled with zero bytes. Practically speaking, this is how shared +anonymous mappings are implemented, but private anonymous mappings do not result +in the creation of an ephemeral file; since there would be no way to modify the +contents of the underlying file through a private mapping, all private anonymous +mappings use a single shared page filled with zero bytes until copy-on-write +occurs. + +# Virtual Memory in the Sentry + +The sentry implements application virtual memory atop a host kernel, introducing +an additional level of indirection to the above. + +Consider the same scenario as in the previous section. Since the sentry handles +application system calls, the effect of an application `mmap` system call is to +create a VMA in the sentry (as opposed to the host kernel): + + Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 + +When the application first incurs a page fault on this address, the host kernel +delivers information about the page fault to the sentry in a platform-dependent +manner, and the sentry handles the fault: + +1. The sentry allocates memory to store the contents of the required part of + the file, and copies file contents to the allocated memory. However, since + the sentry is implemented atop a host kernel, it does not configure mappings + to physical memory directly. Instead, mappable "memory" in the sentry is + represented by a host file descriptor and offset, since (as noted in + "Background") this is the memory mapping primitive provided by the host + kernel. In general, memory is allocated from a temporary host file using the + `pgalloc` package. Supposing that the sentry allocates offset 0x3000 from + host file "memory-file", the resulting state is: + + Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 + Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 + +2. The sentry stores the effective mapping from virtual address to host file in + a host VMA by invoking the `mmap` system call: + + Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 + Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 + Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000 + +3. The sentry returns control to the application, which immediately incurs the + page fault again.[^mmap-populate] However, since a host VMA now exists for + the faulting virtual address, the host kernel now handles the page fault as + described in "Background": + + Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 + Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 + Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000 + Host filemap: host:memory-file:0x3000 -> PA:0x2fb000 + Host PTE: VA:0x400000 --------------------------------------------> PA:0x2fb000 + +Thus, from an implementation standpoint, host VMAs serve the same purpose in the +sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is +independently mutable, and the desired state of host VMAs is derived from that +state. + +[^mmap-populate]: The sentry could force the host kernel to establish PTEs when + it creates the host VMA by passing the `MAP_POPULATE` flag to + the `mmap` system call, but usually does not. This is because, + to reduce the number of page faults that require handling by + the sentry and (correspondingly) the number of host `mmap` + system calls, the sentry usually creates host VMAs that are + much larger than the single faulting page. + +## Private Mappings + +The sentry implements private mappings consistently with Linux. Before +copy-on-write, the private mapping example given in the Background results in: + + Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 (private) + Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 + Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000 (read-only) + Host filemap: host:memory-file:0x3000 -> PA:0x2fb000 + Host PTE: VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only) + +When the application attempts to write to this address, the host kernel delivers +information about the resulting page fault to the sentry. Analogous to Linux, +the sentry allocates memory to store the mapping's private copy of the file's +contents, copies file contents to the allocated memory, and changes the host VMA +to map to the private copy. Supposing that the sentry chooses the offset 0x4000 +in host file `memory-file` to store the private copy, the state of the system +after copy-on-write is: + + Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 (private) + Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 + Host VMA: VA:0x400000 -----------------> host:memory-file:0x4000 + Host filemap: host:memory-file:0x4000 -> PA:0x5ea000 + Host PTE: VA:0x400000 --------------------------------------------> PA:0x5ea000 + +However, this highlights an important difference between Linux and the sentry. +In Linux, page tables are concrete (architecture-dependent) data structures +owned by the kernel. Conversely, the sentry has the ability to create and +destroy host VMAs using host system calls, but it does not have direct access to +their state. Thus, as written, if the application invokes the `munmap` system +call to remove the sentry VMA, it is non-trivial for the sentry to determine +that it should deallocate `host:memory-file:0x4000`. This implies that the +sentry must retain information about the host VMAs that it has created. + +## Anonymous Mappings + +The sentry implements anonymous mappings consistently with Linux, except that +there is no shared zero page. + +# Implementation Constructs + +In Linux: + +- A virtual address space is represented by `struct mm_struct`. + +- VMAs are represented by `struct vm_area_struct`, stored in `struct + mm_struct::mmap`. + +- Mappings from file offsets to physical memory are stored in `struct + address_space`. + +- Reverse mappings from file offsets to virtual mappings are stored in `struct + address_space::i_mmap`. + +- Physical memory pages are represented by a pointer to `struct page` or an + index called a *page frame number* (PFN), represented by `pfn_t`. + +- PTEs are represented by architecture-dependent type `pte_t`, stored in a + table hierarchy rooted at `struct mm_struct::pgd`. + +In the sentry: + +- A virtual address space is represented by type [`mm.MemoryManager`][mm]. + +- Sentry VMAs are represented by type [`mm.vma`][mm], stored in + `mm.MemoryManager.vmas`. + +- Mappings from sentry file offsets to host file offsets are abstracted + through interface method [`memmap.Mappable.Translate`][memmap]. + +- Reverse mappings from sentry file offsets to virtual mappings are abstracted + through interface methods + [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap]. + +- Host files that may be mapped into host VMAs are represented by type + [`platform.File`][platform]. + +- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform + mapping area"), stored in `mm.MemoryManager.pmas`. + +- Creation and destruction of host VMAs is abstracted through interface + methods + [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform]. + +[memmap]: https://github.com/google/gvisor/blob/master/pkg/sentry/memmap/memmap.go +[mm]: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/mm.go +[pgalloc]: https://github.com/google/gvisor/blob/master/pkg/sentry/pgalloc/pgalloc.go +[platform]: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/platform.go |