This package provides an emulation of Linux semantics for application virtual
memory mappings.

For completeness, this document also describes aspects of the memory management
subsystem defined outside this package.

# Background

We begin by describing semantics for virtual memory in Linux.

A virtual address space is defined as a collection of mappings from virtual
addresses to physical memory. However, userspace applications do not configure
mappings to physical memory directly. Instead, applications configure memory
mappings from virtual addresses to offsets into a file using the `mmap` system
call.[^mmap-anon] For example, a call to:

    mmap(
        /* addr = */ 0x400000,
        /* length = */ 0x1000,
        PROT_READ | PROT_WRITE,
        MAP_SHARED,
        /* fd = */ 3,
        /* offset = */ 0);

creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within
the Linux kernel, virtual memory mappings are represented by *virtual memory
areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
virtual memory subsystem after the `mmap` call may be depicted as:

    VMA:     VA:0x400000 -> /tmp/foo:0x0
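As a concrete illustration, the following Go sketch creates a shared file mapping like the one above and writes through it. It assumes a Linux host. The helper name `mapShared` and the temporary file are hypothetical stand-ins for `/tmp/foo`, and because Go's `syscall.Mmap` takes no address argument, the kernel chooses the VA rather than using 0x400000.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// mapShared is a hypothetical helper. It writes data (which must be
// non-empty) to a temporary file, maps the file with MAP_SHARED, overwrites
// the first byte through the mapping, and returns the file's contents as
// read back through the file descriptor.
func mapShared(data []byte) ([]byte, error) {
	f, err := os.CreateTemp("", "mmap-shared")
	if err != nil {
		return nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return nil, err
	}
	// Create the VMA: a mapping from a kernel-chosen VA to file offset 0.
	m, err := syscall.Mmap(int(f.Fd()), 0, len(data),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	defer syscall.Munmap(m)
	m[0] = 'X' // Write to the file's contents through the mapping.
	out := make([]byte, len(data))
	if _, err := f.ReadAt(out, 0); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	out, err := mapShared([]byte("hello"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s\n", out)
}
```

Because the mapping is shared, the byte written through the mapping is expected to be visible when the file is read back.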

Establishing a virtual memory area does not necessarily establish a mapping to a
physical address, because Linux has not necessarily provisioned physical memory
to store the file's contents. Thus, if the application attempts to read the
contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
forces the kernel to create such a mapping to service the read.

For a file, doing so consists of two logical phases:

1.  The kernel allocates physical memory to store the contents of the required
    part of the file, and copies file contents to the allocated memory.
    Supposing that the kernel chooses the physical memory at physical address
    (PA) 0x2fb000, the resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000

    (In Linux the state of the mapping from file offset to physical memory is
    stored in `struct address_space`, but to avoid confusion with other notions
    of address space we will refer to this system as filemap, named after Linux
    kernel source file `mm/filemap.c`.)

2.  The kernel stores the effective mapping from virtual to physical address in
    a *page table entry* (PTE) in the application's *page tables*, which are
    used by the CPU's virtual memory hardware to perform address translation.
    The resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
        PTE:     VA:0x400000 -----------------> PA:0x2fb000

    The PTE is required for the application to actually use the contents of the
    mapped file as virtual memory. However, the PTE is derived from the VMA and
    filemap state, both of which are independently mutable, such that mutations
    to either will affect the PTE. For example:

    -   The application may remove the VMA using the `munmap` system call. This
        breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
        the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
        necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
        future mapping of the same file offset may reuse this physical memory.

    -   The application may invalidate the file's contents by passing a length
        of 0 to the `ftruncate` system call. This breaks the mapping from
        /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
        VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
        VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
        may again be made visible at VA:0x400000 after another page fault
        results in the allocation of a new physical address.

    Note that, in order to correctly break the mapping from VA:0x400000 to
    PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
    from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
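The relationship between the three kinds of state can be sketched as a toy model in Go. This is only an illustration of the diagrams above, not how Linux is implemented: all type and method names are hypothetical, each map entry stands for a single page, and the physical address is passed in by the caller instead of coming from a page allocator.

```go
package main

import "fmt"

// fileOff identifies a page-sized piece of a file.
type fileOff struct {
	file string
	off  uint64
}

// model is a hypothetical, single-page-granularity model of the state above.
type model struct {
	vmas    map[uint64]fileOff   // VMA: VA -> file offset
	filemap map[fileOff]uint64   // filemap: file offset -> PA
	rmap    map[fileOff][]uint64 // reverse mapping: file offset -> VAs
	ptes    map[uint64]uint64    // PTE: VA -> PA, derived from the above
}

func newModel() *model {
	return &model{
		vmas:    map[uint64]fileOff{},
		filemap: map[fileOff]uint64{},
		rmap:    map[fileOff][]uint64{},
		ptes:    map[uint64]uint64{},
	}
}

// mmap creates a VMA and records the reverse mapping. No PTE is created.
func (m *model) mmap(va uint64, fo fileOff) {
	m.vmas[va] = fo
	m.rmap[fo] = append(m.rmap[fo], va)
}

// fault handles a page fault at va. newPA is used only if the file offset
// has no physical memory yet. It reports whether a VMA covered va.
func (m *model) fault(va, newPA uint64) bool {
	fo, ok := m.vmas[va]
	if !ok {
		return false
	}
	pa, ok := m.filemap[fo]
	if !ok {
		pa = newPA // Phase 1: allocate memory and fill it from the file.
		m.filemap[fo] = pa
	}
	m.ptes[va] = pa // Phase 2: install the derived VA -> PA mapping.
	return true
}

// munmap removes the VMA and its PTE but leaves filemap state intact.
func (m *model) munmap(va uint64) {
	fo, ok := m.vmas[va]
	if !ok {
		return
	}
	delete(m.vmas, va)
	delete(m.ptes, va)
	vas := m.rmap[fo]
	for i, v := range vas {
		if v == va {
			m.rmap[fo] = append(vas[:i], vas[i+1:]...)
			break
		}
	}
}

// truncate drops the filemap entry and uses the reverse mapping to find and
// remove the PTEs derived from it. VMAs are left in place.
func (m *model) truncate(fo fileOff) {
	delete(m.filemap, fo)
	for _, va := range m.rmap[fo] {
		delete(m.ptes, va)
	}
}

func main() {
	m := newModel()
	foo := fileOff{"/tmp/foo", 0}
	m.mmap(0x400000, foo)
	m.fault(0x400000, 0x2fb000)
	m.truncate(foo)
	_, hasPTE := m.ptes[0x400000]
	_, hasVMA := m.vmas[0x400000]
	fmt.Println(hasPTE, hasVMA) // prints "false true"
}
```

In this model, `truncate` can only find the affected PTE because `mmap` recorded the reverse mapping, which mirrors the note above.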

[^mmap-anon]: Memory mappings to non-files are discussed in later sections.

## Private Mappings

The preceding example considered VMAs created using the `MAP_SHARED` flag, which
means that PTEs derived from the mapping should always use physical memory that
represents the current state of the mapped file.[^mmap-dev-zero] Applications
can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
Private mappings are *copy-on-write*.

Suppose that the application instead created a private mapping in the previous
example. In Linux, the state of the system after a read page fault would be:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)

Now suppose the application attempts to write to VA:0x400000. For a shared
mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
responsible for ensuring that the write is later propagated to the mapped file.
For a private mapping, the write incurs another page fault since the PTE is
marked read-only. In response, the kernel allocates physical memory to store the
mapping's *private copy* of the file's contents, copies file contents to the
allocated memory, and changes the PTE to map to the private copy. Supposing that
the kernel chooses the physical memory at physical address (PA) 0x5ea000, the
resulting state of the system is:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x5ea000

Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
but is now irrelevant to this mapping.
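The application-visible effect of copy-on-write can be observed with a short Go sketch, assuming a Linux host. The helper name `writePrivate` and the temporary file are hypothetical. The sketch writes through a private mapping and then compares the mapping's contents with the file's contents.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// writePrivate is a hypothetical helper. It writes data (which must be
// non-empty) to a temporary file, maps the file with MAP_PRIVATE, and
// overwrites the first byte through the mapping. It returns the contents
// seen through the mapping and the contents read back from the file.
func writePrivate(data []byte) ([]byte, []byte, error) {
	f, err := os.CreateTemp("", "mmap-private")
	if err != nil {
		return nil, nil, err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.Write(data); err != nil {
		return nil, nil, err
	}
	m, err := syscall.Mmap(int(f.Fd()), 0, len(data),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_PRIVATE)
	if err != nil {
		return nil, nil, err
	}
	defer syscall.Munmap(m)
	m[0] = 'X' // This write goes to the mapping's private copy.
	mapped := append([]byte(nil), m...)
	onDisk := make([]byte, len(data))
	if _, err := f.ReadAt(onDisk, 0); err != nil {
		return nil, nil, err
	}
	return mapped, onDisk, nil
}

func main() {
	mapped, onDisk, err := writePrivate([]byte("hello"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("mapping: %s, file: %s\n", mapped, onDisk)
}
```

The write should be visible through the mapping but not in the file, since after copy-on-write the PTE no longer refers to the filemap's physical memory.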

[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.

## Anonymous Mappings

Instead of passing a file to the `mmap` system call, applications can request
an *anonymous* mapping by passing the `MAP_ANONYMOUS` flag. Semantically, an
anonymous mapping is essentially a mapping to an ephemeral file initially
filled with zero bytes. Practically speaking, this is how shared anonymous
mappings are implemented. Private anonymous mappings, however, do not result in
the creation of an ephemeral file: since there would be no way to modify the
contents of the underlying file through a private mapping, all private
anonymous mappings use a single shared page filled with zero bytes until
copy-on-write occurs.
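The zero-fill behavior can be checked from an application with a minimal Go sketch, assuming a Linux host. The helper name `anonZeroed` is hypothetical. The sketch creates a private anonymous mapping, confirms that every byte reads as zero, and then writes one byte.

```go
package main

import (
	"fmt"
	"syscall"
)

// anonZeroed is a hypothetical helper. It creates a private anonymous
// mapping of n bytes (n must be at least 2), checks that it initially reads
// as all zero bytes, and checks that a write to the first byte is visible
// without disturbing the second.
func anonZeroed(n int) (bool, error) {
	// No file is passed: the fd argument is -1 and the offset is 0.
	m, err := syscall.Mmap(-1, 0, n,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return false, err
	}
	defer syscall.Munmap(m)
	for _, b := range m {
		if b != 0 {
			return false, nil
		}
	}
	m[0] = 1 // The write gives this mapping its own copy of the page.
	return m[0] == 1 && m[1] == 0, nil
}

func main() {
	ok, err := anonZeroed(0x1000)
	fmt.Println(ok, err)
}
```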

# Virtual Memory in the Sentry

The sentry implements application virtual memory atop a host kernel, introducing
an additional level of indirection to the above.

Consider the same scenario as in the previous section. Since the sentry handles
application system calls, the effect of an application `mmap` system call is to
create a VMA in the sentry (as opposed to the host kernel):

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0

When the application first incurs a page fault on this address, the host kernel
delivers information about the page fault to the sentry in a platform-dependent
manner, and the sentry handles the fault:

1.  The sentry allocates memory to store the contents of the required part of
    the file, and copies file contents to the allocated memory. However, since
    the sentry is implemented atop a host kernel, it does not configure mappings
    to physical memory directly. Instead, mappable "memory" in the sentry is
    represented by a host file descriptor and offset, since (as noted in
    "Background") this is the memory mapping primitive provided by the host
    kernel. In general, memory is allocated from a temporary host file using the
    [`pgalloc`][pgalloc] package. Supposing that the sentry allocates offset 0x3000 from
    host file "memory-file", the resulting state is:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000

2.  The sentry stores the effective mapping from virtual address to host file in
    a host VMA by invoking the `mmap` system call:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
          Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000

3.  The sentry returns control to the application, which immediately incurs the
    page fault again.[^mmap-populate] However, since a host VMA now exists for
    the faulting virtual address, the host kernel now handles the page fault as
    described in "Background":

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
          Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000
          Host filemap:                                host:memory-file:0x3000 -> PA:0x2fb000
          Host PTE:     VA:0x400000 --------------------------------------------> PA:0x2fb000

Thus, from an implementation standpoint, host VMAs serve the same purpose in the
sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
independently mutable, and the desired state of host VMAs is derived from that
state.
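The fault-handling steps above can be sketched as a toy model in Go. This is not the sentry's code: every name is hypothetical, each map entry stands for a single page, the memory file is reduced to a bump allocator of offsets, and the host `mmap` call is modeled as a map update.

```go
package main

import "fmt"

const pageSize = 0x1000

// fileOff identifies a page-sized piece of an application-visible file.
type fileOff struct {
	file string
	off  uint64
}

// sentry is a hypothetical model of the state in the diagrams above.
type sentry struct {
	vmas     map[uint64]fileOff // sentry VMAs: VA -> application file offset
	filemap  map[fileOff]uint64 // sentry filemap: file offset -> memory-file offset
	hostVMAs map[uint64]uint64  // host VMAs: VA -> memory-file offset
	nextOff  uint64             // next unallocated memory-file offset
}

func newSentry(firstOff uint64) *sentry {
	return &sentry{
		vmas:     map[uint64]fileOff{},
		filemap:  map[fileOff]uint64{},
		hostVMAs: map[uint64]uint64{},
		nextOff:  firstOff,
	}
}

// mmap handles an application mmap: only a sentry VMA is created.
func (s *sentry) mmap(va uint64, fo fileOff) {
	s.vmas[va] = fo
}

// handleFault handles a page fault delivered to the sentry. It reports
// whether a sentry VMA covered va.
func (s *sentry) handleFault(va uint64) bool {
	fo, ok := s.vmas[va]
	if !ok {
		return false
	}
	off, ok := s.filemap[fo]
	if !ok {
		// Step 1: allocate memory-file space and fill it from the file.
		off = s.nextOff
		s.nextOff += pageSize
		s.filemap[fo] = off
	}
	// Step 2: stands in for a host mmap of the memory file at va.
	s.hostVMAs[va] = off
	// Step 3, the host kernel's own fault handling, is outside this model.
	return true
}

func main() {
	s := newSentry(0x3000)
	s.mmap(0x400000, fileOff{"/tmp/foo", 0})
	s.handleFault(0x400000)
	fmt.Printf("%#x\n", s.hostVMAs[0x400000]) // prints "0x3000"
}
```

As in the text, the host VMA plays the role that the PTE plays in the Linux model: it is derived from the sentry VMA and sentry filemap state.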

[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
    it creates the host VMA by passing the `MAP_POPULATE` flag to
    the `mmap` system call, but usually does not. This is because,
    to reduce the number of page faults that require handling by
    the sentry and (correspondingly) the number of host `mmap`
    system calls, the sentry usually creates host VMAs that are
    much larger than the single faulting page.

## Private Mappings

The sentry implements private mappings consistently with Linux. Before
copy-on-write, the private mapping example given in "Background" results in:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
      Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
      Host filemap:                                host:memory-file:0x3000 -> PA:0x2fb000
      Host PTE:     VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)

When the application attempts to write to this address, the host kernel delivers
information about the resulting page fault to the sentry. Analogous to Linux,
the sentry allocates memory to store the mapping's private copy of the file's
contents, copies file contents to the allocated memory, and changes the host VMA
to map to the private copy. Supposing that the sentry chooses the offset 0x4000
in host file `memory-file` to store the private copy, the state of the system
after copy-on-write is:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
      Host VMA:     VA:0x400000 -----------------> host:memory-file:0x4000
      Host filemap:                                host:memory-file:0x4000 -> PA:0x5ea000
      Host PTE:     VA:0x400000 --------------------------------------------> PA:0x5ea000

However, this highlights an important difference between Linux and the sentry.
In Linux, page tables are concrete (architecture-dependent) data structures
owned by the kernel. Conversely, the sentry has the ability to create and
destroy host VMAs using host system calls, but it does not have direct access to
their state. Thus, in the scheme described so far, if the application invokes
the `munmap` system call to remove the sentry VMA, it is non-trivial for the
sentry to determine that it should deallocate `host:memory-file:0x4000`. This
implies that the sentry must retain information about the host VMAs that it has
created.
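One way to picture this bookkeeping is the following Go sketch. It is a simplified illustration under stated assumptions, not the sentry's actual data structure: the `hostVMA` record, its `private` flag, and the `freed` list are all hypothetical, and each entry stands for a single page.

```go
package main

import "fmt"

// hostVMA is a hypothetical sentry-side record of one host VMA.
type hostVMA struct {
	off     uint64 // offset into the memory file
	private bool   // true if off holds a private copy owned by this mapping
}

// sentry is a hypothetical model that remembers the host VMAs it created.
type sentry struct {
	vmas     map[uint64]bool    // sentry VMAs, keyed by VA
	hostVMAs map[uint64]hostVMA // records of host VMAs, keyed by VA
	freed    []uint64           // memory-file offsets deallocated so far
}

// copyOnWrite records that va now maps a private copy at newOff.
func (s *sentry) copyOnWrite(va, newOff uint64) {
	s.hostVMAs[va] = hostVMA{off: newOff, private: true}
}

// munmap removes the sentry VMA and the host VMA record. Because the record
// says whether the mapping owned a private copy, the sentry knows whether
// the memory-file offset must be deallocated.
func (s *sentry) munmap(va uint64) {
	delete(s.vmas, va)
	h, ok := s.hostVMAs[va]
	if !ok {
		return
	}
	delete(s.hostVMAs, va)
	if h.private {
		s.freed = append(s.freed, h.off)
	}
}

func main() {
	s := &sentry{
		vmas:     map[uint64]bool{0x400000: true},
		hostVMAs: map[uint64]hostVMA{0x400000: {off: 0x3000}},
	}
	s.copyOnWrite(0x400000, 0x4000)
	s.munmap(0x400000)
	fmt.Println(len(s.freed), s.freed[0] == 0x4000)
}
```

Without the `private` flag in the record, `munmap` in this model could not distinguish a private copy at offset 0x4000 from filemap-owned memory at offset 0x3000, which must not be deallocated merely because one mapping went away.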

## Anonymous Mappings

The sentry implements anonymous mappings consistently with Linux, except that
there is no shared zero page.

# Implementation Constructs

In Linux:

-   A virtual address space is represented by `struct mm_struct`.

-   VMAs are represented by `struct vm_area_struct`, stored in `struct
    mm_struct::mmap`.

-   Mappings from file offsets to physical memory are stored in `struct
    address_space`.

-   Reverse mappings from file offsets to virtual mappings are stored in `struct
    address_space::i_mmap`.

-   Physical memory pages are represented by a pointer to `struct page` or an
    index called a *page frame number* (PFN), represented by `pfn_t`.

-   PTEs are represented by architecture-dependent type `pte_t`, stored in a
    table hierarchy rooted at `struct mm_struct::pgd`.

In the sentry:

-   A virtual address space is represented by type [`mm.MemoryManager`][mm].

-   Sentry VMAs are represented by type [`mm.vma`][mm], stored in
    `mm.MemoryManager.vmas`.

-   Mappings from sentry file offsets to host file offsets are abstracted
    through interface method [`memmap.Mappable.Translate`][memmap].

-   Reverse mappings from sentry file offsets to virtual mappings are abstracted
    through interface methods
    [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].

-   Host files that may be mapped into host VMAs are represented by type
    [`platform.File`][platform].

-   Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
    mapping area"), stored in `mm.MemoryManager.pmas`.

-   Creation and destruction of host VMAs is abstracted through interface
    methods
    [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].

[memmap]: https://github.com/google/gvisor/blob/master/pkg/sentry/memmap/memmap.go
[mm]: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/mm.go
[pgalloc]: https://github.com/google/gvisor/blob/master/pkg/sentry/pgalloc/pgalloc.go
[platform]: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/platform.go