summaryrefslogtreecommitdiffhomepage
path: root/pkg
diff options
context:
space:
mode:
Diffstat (limited to 'pkg')
-rw-r--r--pkg/sentry/fs/README.md8
-rw-r--r--pkg/sentry/fs/proc/README.md5
-rw-r--r--pkg/sentry/kernel/README.md80
-rw-r--r--pkg/sentry/mm/README.md161
-rw-r--r--pkg/sentry/usermem/README.md42
5 files changed, 151 insertions, 145 deletions
diff --git a/pkg/sentry/fs/README.md b/pkg/sentry/fs/README.md
index 898271ee8..76638cdae 100644
--- a/pkg/sentry/fs/README.md
+++ b/pkg/sentry/fs/README.md
@@ -149,10 +149,10 @@ An `fs.File` references the following filesystem objects:
fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
```
-The `fs.Inode` is restored using its `fs.MountedFilesystem`. The [Mount
-points](#mount-points) section above describes how this happens in detail. The
-`fs.Dirent` restores its pointer to an `fs.Inode`, pointers to parent and
-children `fs.Dirents`, and the basename of the file.
+The `fs.Inode` is restored using its `fs.MountedFilesystem`. The
+[Mount points](#mount-points) section above describes how this happens in
+detail. The `fs.Dirent` restores its pointer to an `fs.Inode`, pointers to
+parent and children `fs.Dirents`, and the basename of the file.
Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only
used internally).
diff --git a/pkg/sentry/fs/proc/README.md b/pkg/sentry/fs/proc/README.md
index 6ad7297d2..cec842403 100644
--- a/pkg/sentry/fs/proc/README.md
+++ b/pkg/sentry/fs/proc/README.md
@@ -6,6 +6,7 @@ procfs generally.
inconsistency, please file a bug.
[TOC]
+
## Kernel data
The following files are implemented:
@@ -91,6 +92,7 @@ Num currently running processes | Always zero
Total num processes | Always zero
TODO: Populate the columns with accurate statistics.
+
### meminfo
```bash
@@ -122,7 +124,7 @@ Shmem: 0 kB
Notable divergences:
Field name | Notes
-:---------------- | :--------------------------------------------------------
+:---------------- | :-----------------------------------------------------
Buffers | Always zero, no block devices
SwapCache | Always zero, no swap
Inactive(anon) | Always zero, see SwapCache
@@ -182,6 +184,7 @@ softirq 0 0 0 0 0 0 0 0 0 0 0
```
All fields except for `btime` are always zero.
+
TODO: Populate with accurate fields.
### sys
diff --git a/pkg/sentry/kernel/README.md b/pkg/sentry/kernel/README.md
index 88760a9bb..427311be8 100644
--- a/pkg/sentry/kernel/README.md
+++ b/pkg/sentry/kernel/README.md
@@ -1,12 +1,12 @@
This package contains:
-- A (partial) emulation of the "core Linux kernel", which governs task
- execution and scheduling, system call dispatch, and signal handling. See
- below for details.
+- A (partial) emulation of the "core Linux kernel", which governs task
+ execution and scheduling, system call dispatch, and signal handling. See
+ below for details.
-- The top-level interface for the sentry's Linux kernel emulation in general,
- used by the `main` function of all versions of the sentry. This interface
- revolves around the `Env` type (defined in `kernel.go`).
+- The top-level interface for the sentry's Linux kernel emulation in general,
+ used by the `main` function of all versions of the sentry. This interface
+ revolves around the `Env` type (defined in `kernel.go`).
# Background
@@ -20,15 +20,15 @@ sentry's notion of a task unless otherwise specified.)
At a high level, Linux application threads can be thought of as repeating a "run
loop":
-- Some amount of application code is executed in userspace.
+- Some amount of application code is executed in userspace.
-- A trap (explicit syscall invocation, hardware interrupt or exception, etc.)
- causes control flow to switch to the kernel.
+- A trap (explicit syscall invocation, hardware interrupt or exception, etc.)
+ causes control flow to switch to the kernel.
-- Some amount of kernel code is executed in kernelspace, e.g. to handle the
- cause of the trap.
+- Some amount of kernel code is executed in kernelspace, e.g. to handle the
+ cause of the trap.
-- The kernel "returns from the trap" into application code.
+- The kernel "returns from the trap" into application code.
Analogously, each task in the sentry is associated with a *task goroutine* that
executes that task's run loop (`Task.run` in `task_run.go`). However, the
@@ -38,24 +38,25 @@ state to, and resuming execution from, checkpoints.
While in kernelspace, a Linux thread can be descheduled (cease execution) in a
variety of ways:
-- It can yield or be preempted, becoming temporarily descheduled but still
- runnable. At present, the sentry delegates scheduling of runnable threads to
- the Go runtime.
+- It can yield or be preempted, becoming temporarily descheduled but still
+ runnable. At present, the sentry delegates scheduling of runnable threads to
+ the Go runtime.
-- It can exit, becoming permanently descheduled. The sentry's equivalent is
- returning from `Task.run`, terminating the task goroutine.
+- It can exit, becoming permanently descheduled. The sentry's equivalent is
+ returning from `Task.run`, terminating the task goroutine.
-- It can enter interruptible sleep, a state in which it can be woken by a
- caller-defined wakeup or the receipt of a signal. In the sentry, interruptible
- sleep (which is ambiguously referred to as *blocking*) is implemented by
- making all events that can end blocking (including signal notifications)
- communicated via Go channels and using `select` to multiplex wakeup sources;
- see `task_block.go`.
+- It can enter interruptible sleep, a state in which it can be woken by a
+ caller-defined wakeup or the receipt of a signal. In the sentry,
+ interruptible sleep (which is ambiguously referred to as *blocking*) is
+ implemented by making all events that can end blocking (including signal
+ notifications) communicated via Go channels and using `select` to multiplex
+ wakeup sources; see `task_block.go`.
-- It can enter uninterruptible sleep, a state in which it can only be woken by a
- caller-defined wakeup. Killable sleep is a closely related variant in which
- the task can also be woken by SIGKILL. (These definitions also include Linux's
- "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped" (`TASK_TRACED`) states.)
+- It can enter uninterruptible sleep, a state in which it can only be woken by
+ a caller-defined wakeup. Killable sleep is a closely related variant in
+ which the task can also be woken by SIGKILL. (These definitions also include
+ Linux's "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped"
+ (`TASK_TRACED`) states.)
To maximize compatibility with Linux, sentry checkpointing appears as a spurious
signal-delivery interrupt on all tasks; interrupted system calls return `EINTR`
@@ -71,21 +72,22 @@ through sleeping operations.
We break the task's control flow graph into *states*, delimited by:
-1. Points where uninterruptible and killable sleeps may occur. For example,
-there exists a state boundary between signal dequeueing and signal delivery
-because there may be an intervening ptrace signal-delivery-stop.
+1. Points where uninterruptible and killable sleeps may occur. For example,
+ there exists a state boundary between signal dequeueing and signal delivery
+ because there may be an intervening ptrace signal-delivery-stop.
-2. Points where sleep-induced branches may "rejoin" normal execution. For
-example, the syscall exit state exists because it can be reached immediately
-following a synchronous syscall, or after a task that is sleeping in `execve()`
-or `vfork()` resumes execution.
+2. Points where sleep-induced branches may "rejoin" normal execution. For
+ example, the syscall exit state exists because it can be reached immediately
+ following a synchronous syscall, or after a task that is sleeping in
+ `execve()` or `vfork()` resumes execution.
-3. Points containing large branches. This is strictly for organizational
-purposes. For example, the state that processes interrupt-signaled conditions is
-kept separate from the main "app" state to reduce the size of the latter.
+3. Points containing large branches. This is strictly for organizational
+ purposes. For example, the state that processes interrupt-signaled
+ conditions is kept separate from the main "app" state to reduce the size of
+ the latter.
-4. `SyscallReinvoke`, which does not correspond to anything in Linux, and exists
-solely to serve the autosave feature.
+4. `SyscallReinvoke`, which does not correspond to anything in Linux, and
+ exists solely to serve the autosave feature.
![dot -Tpng -Goverlap=false -orun_states.png run_states.dot](g3doc/run_states.png "Task control flow graph")
diff --git a/pkg/sentry/mm/README.md b/pkg/sentry/mm/README.md
index 067733475..e485a5ca5 100644
--- a/pkg/sentry/mm/README.md
+++ b/pkg/sentry/mm/README.md
@@ -38,50 +38,50 @@ forces the kernel to create such a mapping to service the read.
For a file, doing so consists of several logical phases:
-1. The kernel allocates physical memory to store the contents of the required
- part of the file, and copies file contents to the allocated memory. Supposing
- that the kernel chooses the physical memory at physical address (PA)
- 0x2fb000, the resulting state of the system is:
+1. The kernel allocates physical memory to store the contents of the required
+ part of the file, and copies file contents to the allocated memory.
+ Supposing that the kernel chooses the physical memory at physical address
+ (PA) 0x2fb000, the resulting state of the system is:
VMA: VA:0x400000 -> /tmp/foo:0x0
Filemap: /tmp/foo:0x0 -> PA:0x2fb000
- (In Linux the state of the mapping from file offset to physical memory is
- stored in `struct address_space`, but to avoid confusion with other notions
- of address space we will refer to this system as filemap, named after Linux
- kernel source file `mm/filemap.c`.)
+ (In Linux the state of the mapping from file offset to physical memory is
+ stored in `struct address_space`, but to avoid confusion with other notions
+ of address space we will refer to this system as filemap, named after Linux
+ kernel source file `mm/filemap.c`.)
-2. The kernel stores the effective mapping from virtual to physical address in a
- *page table entry* (PTE) in the application's *page tables*, which are used
- by the CPU's virtual memory hardware to perform address translation. The
- resulting state of the system is:
+2. The kernel stores the effective mapping from virtual to physical address in
+ a *page table entry* (PTE) in the application's *page tables*, which are
+ used by the CPU's virtual memory hardware to perform address translation.
+ The resulting state of the system is:
VMA: VA:0x400000 -> /tmp/foo:0x0
Filemap: /tmp/foo:0x0 -> PA:0x2fb000
PTE: VA:0x400000 -----------------> PA:0x2fb000
- The PTE is required for the application to actually use the contents of the
- mapped file as virtual memory. However, the PTE is derived from the VMA and
- filemap state, both of which are independently mutable, such that mutations
- to either will affect the PTE. For example:
-
- - The application may remove the VMA using the `munmap` system call. This
- breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently the
- mapping from VA:0x400000 to PA:0x2fb000. However, it does not necessarily
- break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a future mapping of
- the same file offset may reuse this physical memory.
-
- - The application may invalidate the file's contents by passing a length of 0
- to the `ftruncate` system call. This breaks the mapping from /tmp/foo:0x0
- to PA:0x2fb000, and consequently the mapping from VA:0x400000 to
- PA:0x2fb000. However, it does not break the mapping from VA:0x400000 to
- /tmp/foo:0x0, so future changes to the file's contents may again be made
- visible at VA:0x400000 after another page fault results in the allocation
- of a new physical address.
-
- Note that, in order to correctly break the mapping from VA:0x400000 to
- PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
- from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
+ The PTE is required for the application to actually use the contents of the
+ mapped file as virtual memory. However, the PTE is derived from the VMA and
+ filemap state, both of which are independently mutable, such that mutations
+ to either will affect the PTE. For example:
+
+ - The application may remove the VMA using the `munmap` system call. This
+ breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
+ the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
+ necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
+ future mapping of the same file offset may reuse this physical memory.
+
+ - The application may invalidate the file's contents by passing a length
+ of 0 to the `ftruncate` system call. This breaks the mapping from
+ /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
+ VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
+ VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
+ may again be made visible at VA:0x400000 after another page fault
+ results in the allocation of a new physical address.
+
+ Note that, in order to correctly break the mapping from VA:0x400000 to
+ PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
+ from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.
[^mmap-anon]: Memory mappings to non-files are discussed in later sections.
@@ -146,30 +146,30 @@ When the application first incurs a page fault on this address, the host kernel
delivers information about the page fault to the sentry in a platform-dependent
manner, and the sentry handles the fault:
-1. The sentry allocates memory to store the contents of the required part of the
- file, and copies file contents to the allocated memory. However, since the
- sentry is implemented atop a host kernel, it does not configure mappings to
- physical memory directly. Instead, mappable "memory" in the sentry is
- represented by a host file descriptor and offset, since (as noted in
- "Background") this is the memory mapping primitive provided by the host
- kernel. In general, memory is allocated from a temporary host file using the
- `filemem` package. Supposing that the sentry allocates offset 0x3000 from
- host file "memory-file", the resulting state is:
+1. The sentry allocates memory to store the contents of the required part of
+ the file, and copies file contents to the allocated memory. However, since
+ the sentry is implemented atop a host kernel, it does not configure mappings
+ to physical memory directly. Instead, mappable "memory" in the sentry is
+ represented by a host file descriptor and offset, since (as noted in
+ "Background") this is the memory mapping primitive provided by the host
+ kernel. In general, memory is allocated from a temporary host file using the
+ `filemem` package. Supposing that the sentry allocates offset 0x3000 from
+ host file "memory-file", the resulting state is:
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
-2. The sentry stores the effective mapping from virtual address to host file in
- a host VMA by invoking the `mmap` system call:
+2. The sentry stores the effective mapping from virtual address to host file in
+ a host VMA by invoking the `mmap` system call:
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000
-3. The sentry returns control to the application, which immediately incurs the
- page fault again.[^mmap-populate] However, since a host VMA now exists for
- the faulting virtual address, the host kernel now handles the page fault as
- described in "Background":
+3. The sentry returns control to the application, which immediately incurs the
+ page fault again.[^mmap-populate] However, since a host VMA now exists for
+ the faulting virtual address, the host kernel now handles the page fault as
+ described in "Background":
Sentry VMA: VA:0x400000 -> /tmp/foo:0x0
Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000
@@ -183,12 +183,12 @@ independently mutable, and the desired state of host VMAs is derived from that
state.
[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
- it creates the host VMA by passing the `MAP_POPULATE` flag to
- the `mmap` system call, but usually does not. This is because,
- to reduce the number of page faults that require handling by
- the sentry and (correspondingly) the number of host `mmap`
- system calls, the sentry usually creates host VMAs that are
- much larger than the single faulting page.
+ it creates the host VMA by passing the `MAP_POPULATE` flag to
+ the `mmap` system call, but usually does not. This is because,
+ to reduce the number of page faults that require handling by
+ the sentry and (correspondingly) the number of host `mmap`
+ system calls, the sentry usually creates host VMAs that are
+ much larger than the single faulting page.
## Private Mappings
@@ -233,45 +233,46 @@ there is no shared zero page.
In Linux:
-- A virtual address space is represented by `struct mm_struct`.
+- A virtual address space is represented by `struct mm_struct`.
-- VMAs are represented by `struct vm_area_struct`, stored in `struct
- mm_struct::mmap`.
+- VMAs are represented by `struct vm_area_struct`, stored in `struct
+ mm_struct::mmap`.
-- Mappings from file offsets to physical memory are stored in `struct
- address_space`.
+- Mappings from file offsets to physical memory are stored in `struct
+ address_space`.
-- Reverse mappings from file offsets to virtual mappings are stored in `struct
- address_space::i_mmap`.
+- Reverse mappings from file offsets to virtual mappings are stored in `struct
+ address_space::i_mmap`.
-- Physical memory pages are represented by a pointer to `struct page` or an
- index called a *page frame number* (PFN), represented by `pfn_t`.
+- Physical memory pages are represented by a pointer to `struct page` or an
+ index called a *page frame number* (PFN), represented by `pfn_t`.
-- PTEs are represented by architecture-dependent type `pte_t`, stored in a table
- hierarchy rooted at `struct mm_struct::pgd`.
+- PTEs are represented by architecture-dependent type `pte_t`, stored in a
+ table hierarchy rooted at `struct mm_struct::pgd`.
In the sentry:
-- A virtual address space is represented by type [`mm.MemoryManager`][mm].
+- A virtual address space is represented by type [`mm.MemoryManager`][mm].
-- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
- `mm.MemoryManager.vmas`.
+- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
+ `mm.MemoryManager.vmas`.
-- Mappings from sentry file offsets to host file offsets are abstracted through
- interface method [`memmap.Mappable.Translate`][memmap].
+- Mappings from sentry file offsets to host file offsets are abstracted
+ through interface method [`memmap.Mappable.Translate`][memmap].
-- Reverse mappings from sentry file offsets to virtual mappings are abstracted
- through interface methods [`memmap.Mappable.AddMapping` and
- `memmap.Mappable.RemoveMapping`][memmap].
+- Reverse mappings from sentry file offsets to virtual mappings are abstracted
+ through interface methods
+ [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].
-- Host files that may be mapped into host VMAs are represented by type
- [`platform.File`][platform].
+- Host files that may be mapped into host VMAs are represented by type
+ [`platform.File`][platform].
-- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
- mapping area"), stored in `mm.MemoryManager.pmas`.
+- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
+ mapping area"), stored in `mm.MemoryManager.pmas`.
-- Creation and destruction of host VMAs is abstracted through interface methods
- [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].
+- Creation and destruction of host VMAs is abstracted through interface
+ methods
+ [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].
[filemem]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/platform/filemem/filemem.go
[memmap]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/memmap/memmap.go
diff --git a/pkg/sentry/usermem/README.md b/pkg/sentry/usermem/README.md
index 2ebd3bcc1..f6d2137eb 100644
--- a/pkg/sentry/usermem/README.md
+++ b/pkg/sentry/usermem/README.md
@@ -2,30 +2,30 @@ This package defines primitives for sentry access to application memory.
Major types:
-- The `IO` interface represents a virtual address space and provides I/O methods
- on that address space. `IO` is the lowest-level primitive. The primary
- implementation of the `IO` interface is `mm.MemoryManager`.
+- The `IO` interface represents a virtual address space and provides I/O
+ methods on that address space. `IO` is the lowest-level primitive. The
+ primary implementation of the `IO` interface is `mm.MemoryManager`.
-- `IOSequence` represents a collection of individually-contiguous address ranges
- in a `IO` that is operated on sequentially, analogous to Linux's `struct
- iov_iter`.
+- `IOSequence` represents a collection of individually-contiguous address
+ ranges in a `IO` that is operated on sequentially, analogous to Linux's
+ `struct iov_iter`.
Major usage patterns:
-- Access to a task's virtual memory, subject to the application's memory
- protections and while running on that task's goroutine, from a context that is
- at or above the level of the `kernel` package (e.g. most syscall
- implementations in `syscalls/linux`); use the `kernel.Task.Copy*` wrappers
- defined in `kernel/task_usermem.go`.
+- Access to a task's virtual memory, subject to the application's memory
+ protections and while running on that task's goroutine, from a context that
+ is at or above the level of the `kernel` package (e.g. most syscall
+ implementations in `syscalls/linux`); use the `kernel.Task.Copy*` wrappers
+ defined in `kernel/task_usermem.go`.
-- Access to a task's virtual memory, from a context that is at or above the
- level of the `kernel` package, but where any of the above constraints does not
- hold (e.g. `PTRACE_POKEDATA`, which ignores application memory protections);
- obtain the task's `mm.MemoryManager` by calling `kernel.Task.MemoryManager`,
- and call its `IO` methods directly.
+- Access to a task's virtual memory, from a context that is at or above the
+ level of the `kernel` package, but where any of the above constraints does
+ not hold (e.g. `PTRACE_POKEDATA`, which ignores application memory
+ protections); obtain the task's `mm.MemoryManager` by calling
+ `kernel.Task.MemoryManager`, and call its `IO` methods directly.
-- Access to a task's virtual memory, from a context that is below the level of
- the `kernel` package (e.g. filesystem I/O); clients must pass I/O arguments
- from higher layers, usually in the form of an `IOSequence`. The
- `kernel.Task.SingleIOSequence` and `kernel.Task.IovecsIOSequence` functions in
- `kernel/task_usermem.go` are convenience functions for doing so.
+- Access to a task's virtual memory, from a context that is below the level of
+ the `kernel` package (e.g. filesystem I/O); clients must pass I/O arguments
+ from higher layers, usually in the form of an `IOSequence`. The
+ `kernel.Task.SingleIOSequence` and `kernel.Task.IovecsIOSequence` functions
+ in `kernel/task_usermem.go` are convenience functions for doing so.