Diffstat (limited to 'pkg/sentry/mm/README.md')
-rw-r--r-- pkg/sentry/mm/README.md 280
1 files changed, 0 insertions, 280 deletions
diff --git a/pkg/sentry/mm/README.md b/pkg/sentry/mm/README.md
deleted file mode 100644
index e6efbf565..000000000
--- a/pkg/sentry/mm/README.md
+++ /dev/null
@@ -1,280 +0,0 @@
-This package provides an emulation of Linux semantics for application virtual
-memory mappings.
-
-For completeness, this document also describes aspects of the memory management
-subsystem defined outside this package.
-
-# Background
-
-We begin by describing semantics for virtual memory in Linux.
-
-A virtual address space is defined as a collection of mappings from virtual
-addresses to physical memory. However, userspace applications do not configure
-mappings to physical memory directly. Instead, applications configure memory
-mappings from virtual addresses to offsets into a file using the `mmap` system
-call.[^mmap-anon] For example, a call to:
-
-    mmap(
-        /* addr = */ 0x400000,
-        /* length = */ 0x1000,
-        PROT_READ | PROT_WRITE,
-        MAP_SHARED,
-        /* fd = */ 3,
-        /* offset = */ 0);
-
-creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
-0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within
-the Linux kernel, virtual memory mappings are represented by *virtual memory
-areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
-virtual memory subsystem after the `mmap` call may be depicted as:
-
-    VMA: VA:0x400000 -> /tmp/foo:0x0
-
-Establishing a virtual memory area does not necessarily establish a mapping to a
-physical address, because Linux has not necessarily provisioned physical memory
-to store the file's contents. Thus, if the application attempts to read the
-contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
-forces the kernel to create such a mapping to service the read.
-
-For a file, doing so consists of several logical phases:
-
-1. The kernel allocates physical memory to store the contents of the required
- part of the file, and copies file contents to the allocated memory.
- Supposing that the kernel chooses the physical memory at physical address
- (PA) 0x2fb000, the resulting state of the system is:
-
-        VMA:     VA:0x400000 -> /tmp/foo:0x0
-        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
-
- (In Linux the state of the mapping from file offset to physical memory is
- stored in `struct address_space`, but to avoid confusion with other notions
- of address space we will refer to this system as filemap, named after Linux
- kernel source file `mm/filemap.c`.)
-
-2. The kernel stores the effective mapping from virtual to physical address in
- a *page table entry* (PTE) in the application's *page tables*, which are
- used by the CPU's virtual memory hardware to perform address translation.
- The resulting state of the system is:
-
-        VMA:     VA:0x400000 -> /tmp/foo:0x0
-        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
-        PTE:     VA:0x400000 -----------------> PA:0x2fb000
-
- The PTE is required for the application to actually use the contents of the
- mapped file as virtual memory. However, the PTE is derived from the VMA and
- filemap state, both of which are independently mutable, such that mutations
- to either will affect the PTE. For example:
-
- - The application may remove the VMA using the `munmap` system call. This
- breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
- the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
- necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
- future mapping of the same file offset may reuse this physical memory.
-
- - The application may invalidate the file's contents by passing a length
- of 0 to the `ftruncate` system call. This breaks the mapping from
- /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
- VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
- VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
- may again be made visible at VA:0x400000 after another page fault
- results in the allocation of a new physical address.
-
- Note that, in order to correctly break the mapping from VA:0x400000 to
- PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
- from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
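The relationship between the three layers can be sketched as a toy model. This is an illustration only: `addrSpace`, `fileOff`, and every field and method name below are hypothetical and do not correspond to Linux kernel identifiers, and copying file data into the allocated page is elided.

```go
package main

import "fmt"

const pageSize = 0x1000

// fileOff names a page-aligned offset within a file.
type fileOff struct {
	path string
	off  uint64
}

// addrSpace is a toy model of the state discussed above. Every name here is
// hypothetical; none is a Linux kernel identifier.
type addrSpace struct {
	vmas    map[uint64]fileOff          // VMA: VA -> file offset
	filemap map[fileOff]uint64          // filemap: file offset -> PA
	rmap    map[fileOff]map[uint64]bool // reverse mapping: file offset -> VAs with PTEs
	ptes    map[uint64]uint64           // PTE: VA -> PA, derived from vmas and filemap
	nextPA  uint64                      // next unused physical page
}

func newAddrSpace() *addrSpace {
	return &addrSpace{
		vmas:    map[uint64]fileOff{},
		filemap: map[fileOff]uint64{},
		rmap:    map[fileOff]map[uint64]bool{},
		ptes:    map[uint64]uint64{},
		nextPA:  0x2fb000,
	}
}

// fault models a page fault at va: populate filemap if needed, then the PTE.
func (a *addrSpace) fault(va uint64) error {
	fo, ok := a.vmas[va]
	if !ok {
		return fmt.Errorf("no VMA at %#x", va)
	}
	pa, ok := a.filemap[fo]
	if !ok {
		pa = a.nextPA // phase 1: allocate memory (copying file data is elided)
		a.nextPA += pageSize
		a.filemap[fo] = pa
	}
	a.ptes[va] = pa // phase 2: install the PTE and record the reverse mapping
	if a.rmap[fo] == nil {
		a.rmap[fo] = map[uint64]bool{}
	}
	a.rmap[fo][va] = true
	return nil
}

// munmap removes the VMA and its PTE but leaves the filemap entry intact.
func (a *addrSpace) munmap(va uint64) {
	fo, ok := a.vmas[va]
	if !ok {
		return
	}
	delete(a.vmas, va)
	delete(a.ptes, va)
	delete(a.rmap[fo], va)
}

// truncate drops every cached page of path, using rmap to find the PTEs.
// VMAs are left in place, so a later fault can repopulate the mapping.
func (a *addrSpace) truncate(path string) {
	for fo := range a.filemap {
		if fo.path != path {
			continue
		}
		for va := range a.rmap[fo] {
			delete(a.ptes, va)
		}
		delete(a.rmap, fo)
		delete(a.filemap, fo)
	}
}

func main() {
	a := newAddrSpace()
	a.vmas[0x400000] = fileOff{"/tmp/foo", 0} // stands in for mmap
	_ = a.fault(0x400000)
	fmt.Printf("after fault:    PTE -> PA:%#x\n", a.ptes[0x400000])
	a.truncate("/tmp/foo")
	_, present := a.ptes[0x400000]
	fmt.Println("after truncate: PTE present:", present)
	_ = a.fault(0x400000)
	fmt.Printf("after refault:  PTE -> PA:%#x\n", a.ptes[0x400000])
}
```

The model keeps `ptes` strictly derived: nothing writes to it except `fault`, and both `munmap` and `truncate` remove entries from it as a side effect of changing the state it depends on.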
-
-[^mmap-anon]: Memory mappings to non-files are discussed in later sections.
-
-## Private Mappings
-
-The preceding example considered VMAs created using the `MAP_SHARED` flag, which
-means that PTEs derived from the mapping should always use physical memory that
-represents the current state of the mapped file.[^mmap-dev-zero] Applications
-can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
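-Private mappings are *copy-on-write*: a write through the mapping modifies a
-private copy of the affected page and is never propagated to the mapped file.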
-
-Suppose that the application instead created a private mapping in the previous
-example. In Linux, the state of the system after a read page fault would be:
-
-    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
-    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
-    PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)
-
-Now suppose the application attempts to write to VA:0x400000. For a shared
-mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
-responsible for ensuring that the write is later propagated to the mapped file.
-For a private mapping, the write incurs another page fault since the PTE is
-marked read-only. In response, the kernel allocates physical memory to store the
-mapping's *private copy* of the file's contents, copies file contents to the
-allocated memory, and changes the PTE to map to the private copy. Supposing that
-the kernel chooses the physical memory at physical address (PA) 0x5ea000, the
-resulting state of the system is:
-
-    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
-    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
-    PTE:     VA:0x400000 -----------------> PA:0x5ea000
-
-Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
-but is now irrelevant to this mapping.
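The observable effect of copy-on-write can be checked from userspace. The following is a minimal sketch for a Linux host using Go's standard `syscall.Mmap`; the helper name `privateWrite` and the temporary file are assumptions made for illustration. It writes through a `MAP_PRIVATE` mapping and then reads the file back to confirm that the write stayed in the private copy.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"syscall"
)

// privateWrite maps a temporary file with MAP_PRIVATE, writes one byte through
// the mapping, and returns the first byte as seen through the mapping and as
// read back from the file. The function name is hypothetical.
func privateWrite() (mapped, inFile byte, err error) {
	f, err := os.CreateTemp("", "cow-demo-*")
	if err != nil {
		return 0, 0, err
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// Fill one page of the file with 'A'.
	size := os.Getpagesize()
	if _, err := f.Write(bytes.Repeat([]byte{'A'}, size)); err != nil {
		return 0, 0, err
	}

	mem, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Munmap(mem)

	// The write faults; the kernel redirects the mapping to a private copy.
	mem[0] = 'B'

	// Read the file itself, bypassing the mapping.
	buf := make([]byte, 1)
	if _, err := f.ReadAt(buf, 0); err != nil {
		return 0, 0, err
	}
	return mem[0], buf[0], nil
}

func main() {
	mapped, inFile, err := privateWrite()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mapping sees %q, file holds %q\n", mapped, inFile)
}
```

Replacing `MAP_PRIVATE` with `MAP_SHARED` in the same sketch would make the write visible through the file as well, which is the shared-mapping behavior described earlier.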
-
-[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.
-
-## Anonymous Mappings
-
-Rather than passing a file to the `mmap` system call, applications can request
-an *anonymous* mapping by passing the `MAP_ANONYMOUS` flag.
-Semantically, an anonymous mapping is essentially a mapping to an ephemeral file
-initially filled with zero bytes. Practically speaking, this is how shared
-anonymous mappings are implemented, but private anonymous mappings do not result
-in the creation of an ephemeral file; since there would be no way to modify the
-contents of the underlying file through a private mapping, all private anonymous
-mappings use a single shared page filled with zero bytes until copy-on-write
-occurs.
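As a quick check of the zero-fill semantics, the sketch below creates a private anonymous mapping on a Linux host with Go's `syscall.Mmap` (passing `-1` as the file descriptor together with `MAP_ANON`). The helper name `anonymousPage` is an assumption made for this example.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// anonymousPage maps one private anonymous page and reports whether it was
// zero-filled before being written, and whether a subsequent write is visible.
// The function name is hypothetical.
func anonymousPage() (zeroFilled, writable bool, err error) {
	mem, err := syscall.Mmap(-1, 0, os.Getpagesize(),
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return false, false, err
	}
	defer syscall.Munmap(mem)

	// Reading every byte only incurs read faults.
	zeroFilled = true
	for _, b := range mem {
		if b != 0 {
			zeroFilled = false
			break
		}
	}

	// The first write is the point at which copy-on-write would occur.
	mem[0] = 0xff
	return zeroFilled, mem[0] == 0xff, nil
}

func main() {
	zeroFilled, writable, err := anonymousPage()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("zero-filled:", zeroFilled, "writable:", writable)
}
```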
-
-# Virtual Memory in the Sentry
-
-The sentry implements application virtual memory atop a host kernel, introducing
-an additional level of indirection to the above.
-
-Consider the same scenario as in the previous section. Since the sentry handles
-application system calls, the effect of an application `mmap` system call is to
-create a VMA in the sentry (as opposed to the host kernel):
-
-    Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
-
-When the application first incurs a page fault on this address, the host kernel
-delivers information about the page fault to the sentry in a platform-dependent
-manner, and the sentry handles the fault:
-
-1. The sentry allocates memory to store the contents of the required part of
- the file, and copies file contents to the allocated memory. However, since
- the sentry is implemented atop a host kernel, it does not configure mappings
- to physical memory directly. Instead, mappable "memory" in the sentry is
- represented by a host file descriptor and offset, since (as noted in
- "Background") this is the memory mapping primitive provided by the host
- kernel. In general, memory is allocated from a temporary host file using the
- `pgalloc` package. Supposing that the sentry allocates offset 0x3000 from
- host file "memory-file", the resulting state is:
-
-        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
-        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
-
-2. The sentry stores the effective mapping from virtual address to host file in
- a host VMA by invoking the `mmap` system call:
-
-        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
-        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
-        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000
-
-3. The sentry returns control to the application, which immediately incurs the
- page fault again.[^mmap-populate] However, since a host VMA now exists for
- the faulting virtual address, the host kernel now handles the page fault as
- described in "Background":
-
-        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
-        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
-        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000
-        Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
-        Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000
-
-Thus, from an implementation standpoint, host VMAs serve the same purpose in the
-sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
-independently mutable, and the desired state of host VMAs is derived from that
-state.
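The fault-handling phases above can be sketched as a toy model in which the host VMA plays the role of derived state. All names below (`sentry`, `handleFault`, `invalidate`, and the fields) are hypothetical and are not gVisor identifiers; the host `mmap` call and the copying of file contents are represented only by map updates.

```go
package main

import "fmt"

const pageSize = 0x1000

// fileOff names a page-aligned offset within an application-visible file.
type fileOff struct {
	path string
	off  uint64
}

// sentry is a toy model of the sentry-side state. Field names are
// hypothetical; they are not gVisor identifiers.
type sentry struct {
	vmas     map[uint64]fileOff // sentry VMA: application VA -> file offset
	filemap  map[fileOff]uint64 // sentry filemap: file offset -> memory-file offset
	hostVMAs map[uint64]uint64  // host VMA: application VA -> memory-file offset
	nextOff  uint64             // next unallocated memory-file offset
}

func newSentry() *sentry {
	return &sentry{
		vmas:     map[uint64]fileOff{},
		filemap:  map[fileOff]uint64{},
		hostVMAs: map[uint64]uint64{},
		nextOff:  0x3000,
	}
}

// handleFault models the sentry's response to an application page fault.
func (s *sentry) handleFault(va uint64) error {
	fo, ok := s.vmas[va]
	if !ok {
		return fmt.Errorf("no sentry VMA at %#x", va)
	}
	memOff, ok := s.filemap[fo]
	if !ok {
		// Phase 1: allocate from the memory file (copying file data is elided).
		memOff = s.nextOff
		s.nextOff += pageSize
		s.filemap[fo] = memOff
	}
	// Phase 2: this assignment stands in for a host mmap of the memory file.
	s.hostVMAs[va] = memOff
	// Phase 3: on return the application retries and the host kernel
	// resolves the fault through the host VMA.
	return nil
}

// invalidate drops a sentry filemap entry and every host VMA derived from it.
func (s *sentry) invalidate(fo fileOff) {
	memOff, ok := s.filemap[fo]
	if !ok {
		return
	}
	delete(s.filemap, fo)
	for va, off := range s.hostVMAs {
		if off == memOff {
			delete(s.hostVMAs, va) // stands in for a host munmap
		}
	}
}

func main() {
	s := newSentry()
	s.vmas[0x400000] = fileOff{"/tmp/foo", 0} // application mmap
	if err := s.handleFault(0x400000); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("host VMA: VA:0x400000 -> host:memory-file:%#x\n", s.hostVMAs[0x400000])
}
```

The linear scan in `invalidate` is only for brevity; it is the sentry-side analogue of the reverse mapping that the "Background" section describes for Linux.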
-
-[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
- it creates the host VMA by passing the `MAP_POPULATE` flag to
- the `mmap` system call, but usually does not. This is because,
- to reduce the number of page faults that require handling by
- the sentry and (correspondingly) the number of host `mmap`
- system calls, the sentry usually creates host VMAs that are
- much larger than the single faulting page.
-
-## Private Mappings
-
-The sentry implements private mappings consistently with Linux. Before
-copy-on-write, the private mapping example given in "Background" results in:
-
-    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
-    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
-    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
-    Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
-    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)
-
-When the application attempts to write to this address, the host kernel delivers
-information about the resulting page fault to the sentry. Analogous to Linux,
-the sentry allocates memory to store the mapping's private copy of the file's
-contents, copies file contents to the allocated memory, and changes the host VMA
-to map to the private copy. Supposing that the sentry chooses the offset 0x4000
-in host file `memory-file` to store the private copy, the state of the system
-after copy-on-write is:
-
-    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
-    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
-    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x4000
-    Host filemap:                                  host:memory-file:0x4000 -> PA:0x5ea000
-    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x5ea000
-
-However, this highlights an important difference between Linux and the sentry.
-In Linux, page tables are concrete (architecture-dependent) data structures
-owned by the kernel. Conversely, the sentry has the ability to create and
-destroy host VMAs using host system calls, but it does not have direct access to
-their state. Thus, as written, if the application invokes the `munmap` system
-call to remove the sentry VMA, it is non-trivial for the sentry to determine
-that it should deallocate `host:memory-file:0x4000`. This implies that the
-sentry must retain information about the host VMAs that it has created.
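One way to picture that bookkeeping is a per-address record of what was mapped and whether it is a private copy. The sketch below is a hypothetical illustration, not the gVisor data structure: `hostMapping`, its fields, and the `free` list are assumptions introduced here.

```go
package main

import "fmt"

const pageSize = 0x1000

// hostMapping records what the sentry asked the host to map at one VA.
// The type and its fields are hypothetical, not gVisor definitions.
type hostMapping struct {
	memOff  uint64 // offset in the host memory file
	private bool   // true if memOff holds a private (copy-on-write) copy
}

// sentry holds only the bookkeeping relevant to this example.
type sentry struct {
	hostVMAs map[uint64]hostMapping // records of host VMAs created so far
	free     []uint64               // memory-file offsets returned to the allocator
	nextOff  uint64                 // next unallocated memory-file offset
}

func newSentry() *sentry {
	return &sentry{hostVMAs: map[uint64]hostMapping{}, nextOff: 0x4000}
}

// copyOnWrite replaces the mapping at va with a newly allocated private copy.
func (s *sentry) copyOnWrite(va uint64) {
	off := s.nextOff
	s.nextOff += pageSize
	s.hostVMAs[va] = hostMapping{memOff: off, private: true}
}

// munmap consults the record to decide whether the backing memory is owned
// solely by this mapping and can therefore be deallocated.
func (s *sentry) munmap(va uint64) {
	m, ok := s.hostVMAs[va]
	if !ok {
		return
	}
	delete(s.hostVMAs, va)
	if m.private {
		s.free = append(s.free, m.memOff)
	}
}

func main() {
	s := newSentry()
	// Before copy-on-write the VA maps memory shared with the sentry filemap.
	s.hostVMAs[0x400000] = hostMapping{memOff: 0x3000}
	s.copyOnWrite(0x400000)
	s.munmap(0x400000)
	fmt.Printf("freed memory-file offsets: %#x\n", s.free)
}
```

Without the `private` flag, `munmap` could not distinguish offset 0x4000, which only this mapping references, from offset 0x3000, which the sentry filemap still references.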
-
-## Anonymous Mappings
-
-The sentry implements anonymous mappings consistently with Linux, except that
-there is no shared zero page.
-
-# Implementation Constructs
-
-In Linux:
-
-- A virtual address space is represented by `struct mm_struct`.
-
-- VMAs are represented by `struct vm_area_struct`, stored in
- `struct mm_struct::mmap`.
-
-- Mappings from file offsets to physical memory are stored in
- `struct address_space`.
-
-- Reverse mappings from file offsets to virtual mappings are stored in
- `struct address_space::i_mmap`.
-
-- Physical memory pages are represented by a pointer to `struct page` or an
- index called a *page frame number* (PFN), represented by `pfn_t`.
-
-- PTEs are represented by architecture-dependent type `pte_t`, stored in a
- table hierarchy rooted at `struct mm_struct::pgd`.
-
-In the sentry:
-
-- A virtual address space is represented by type [`mm.MemoryManager`][mm].
-
-- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
- `mm.MemoryManager.vmas`.
-
-- Mappings from sentry file offsets to host file offsets are abstracted
- through interface method [`memmap.Mappable.Translate`][memmap].
-
-- Reverse mappings from sentry file offsets to virtual mappings are abstracted
- through interface methods
- [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].
-
-- Host files that may be mapped into host VMAs are represented by type
- [`platform.File`][platform].
-
-- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
- mapping area"), stored in `mm.MemoryManager.pmas`.
-
-- Creation and destruction of host VMAs is abstracted through interface
- methods
- [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].
-
-[memmap]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/memmap/memmap.go
-[mm]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/mm/mm.go
-[pgalloc]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/pgalloc/pgalloc.go
-[platform]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/platform/platform.go