diff options
Diffstat (limited to 'pkg/sentry')
-rw-r--r-- | pkg/sentry/fs/README.md | 8 | ||||
-rw-r--r-- | pkg/sentry/fs/proc/README.md | 5 | ||||
-rw-r--r-- | pkg/sentry/kernel/README.md | 80 | ||||
-rw-r--r-- | pkg/sentry/mm/README.md | 161 | ||||
-rw-r--r-- | pkg/sentry/usermem/README.md | 42 |
5 files changed, 151 insertions, 145 deletions
diff --git a/pkg/sentry/fs/README.md b/pkg/sentry/fs/README.md index 898271ee8..76638cdae 100644 --- a/pkg/sentry/fs/README.md +++ b/pkg/sentry/fs/README.md @@ -149,10 +149,10 @@ An `fs.File` references the following filesystem objects: fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem ``` -The `fs.Inode` is restored using its `fs.MountedFilesystem`. The [Mount -points](#mount-points) section above describes how this happens in detail. The -`fs.Dirent` restores its pointer to an `fs.Inode`, pointers to parent and -children `fs.Dirents`, and the basename of the file. +The `fs.Inode` is restored using its `fs.MountedFilesystem`. The +[Mount points](#mount-points) section above describes how this happens in +detail. The `fs.Dirent` restores its pointer to an `fs.Inode`, pointers to +parent and children `fs.Dirents`, and the basename of the file. Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only used internally). diff --git a/pkg/sentry/fs/proc/README.md b/pkg/sentry/fs/proc/README.md index 6ad7297d2..cec842403 100644 --- a/pkg/sentry/fs/proc/README.md +++ b/pkg/sentry/fs/proc/README.md @@ -6,6 +6,7 @@ procfs generally. inconsistency, please file a bug. [TOC] + ## Kernel data The following files are implemented: @@ -91,6 +92,7 @@ Num currently running processes | Always zero Total num processes | Always zero TODO: Populate the columns with accurate statistics. + ### meminfo ```bash @@ -122,7 +124,7 @@ Shmem: 0 kB Notable divergences: Field name | Notes -:---------------- | :-------------------------------------------------------- +:---------------- | :----------------------------------------------------- Buffers | Always zero, no block devices SwapCache | Always zero, no swap Inactive(anon) | Always zero, see SwapCache @@ -182,6 +184,7 @@ softirq 0 0 0 0 0 0 0 0 0 0 0 ``` All fields except for `btime` are always zero. + TODO: Populate with accurate fields. ### sys diff --git a/pkg/sentry/kernel/README.md b/pkg/sentry/kernel/README.md index 88760a9bb..427311be8 100644 --- a/pkg/sentry/kernel/README.md +++ b/pkg/sentry/kernel/README.md @@ -1,12 +1,12 @@ This package contains: -- A (partial) emulation of the "core Linux kernel", which governs task - execution and scheduling, system call dispatch, and signal handling. See - below for details. +- A (partial) emulation of the "core Linux kernel", which governs task + execution and scheduling, system call dispatch, and signal handling. See + below for details. -- The top-level interface for the sentry's Linux kernel emulation in general, - used by the `main` function of all versions of the sentry. This interface - revolves around the `Env` type (defined in `kernel.go`). +- The top-level interface for the sentry's Linux kernel emulation in general, + used by the `main` function of all versions of the sentry. This interface + revolves around the `Env` type (defined in `kernel.go`). # Background @@ -20,15 +20,15 @@ sentry's notion of a task unless otherwise specified.) At a high level, Linux application threads can be thought of as repeating a "run loop": -- Some amount of application code is executed in userspace. +- Some amount of application code is executed in userspace. -- A trap (explicit syscall invocation, hardware interrupt or exception, etc.) - causes control flow to switch to the kernel. +- A trap (explicit syscall invocation, hardware interrupt or exception, etc.) + causes control flow to switch to the kernel. -- Some amount of kernel code is executed in kernelspace, e.g. to handle the - cause of the trap. +- Some amount of kernel code is executed in kernelspace, e.g. to handle the + cause of the trap. -- The kernel "returns from the trap" into application code. +- The kernel "returns from the trap" into application code. Analogously, each task in the sentry is associated with a *task goroutine* that executes that task's run loop (`Task.run` in `task_run.go`). However, the @@ -38,24 +38,25 @@ state to, and resuming execution from, checkpoints. While in kernelspace, a Linux thread can be descheduled (cease execution) in a variety of ways: -- It can yield or be preempted, becoming temporarily descheduled but still - runnable. At present, the sentry delegates scheduling of runnable threads to - the Go runtime. +- It can yield or be preempted, becoming temporarily descheduled but still + runnable. At present, the sentry delegates scheduling of runnable threads to + the Go runtime. -- It can exit, becoming permanently descheduled. The sentry's equivalent is - returning from `Task.run`, terminating the task goroutine. +- It can exit, becoming permanently descheduled. The sentry's equivalent is + returning from `Task.run`, terminating the task goroutine. -- It can enter interruptible sleep, a state in which it can be woken by a - caller-defined wakeup or the receipt of a signal. In the sentry, interruptible - sleep (which is ambiguously referred to as *blocking*) is implemented by - making all events that can end blocking (including signal notifications) - communicated via Go channels and using `select` to multiplex wakeup sources; - see `task_block.go`. +- It can enter interruptible sleep, a state in which it can be woken by a + caller-defined wakeup or the receipt of a signal. In the sentry, + interruptible sleep (which is ambiguously referred to as *blocking*) is + implemented by making all events that can end blocking (including signal + notifications) communicated via Go channels and using `select` to multiplex + wakeup sources; see `task_block.go`. -- It can enter uninterruptible sleep, a state in which it can only be woken by a - caller-defined wakeup. Killable sleep is a closely related variant in which - the task can also be woken by SIGKILL. (These definitions also include Linux's - "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped" (`TASK_TRACED`) states.) +- It can enter uninterruptible sleep, a state in which it can only be woken by + a caller-defined wakeup. Killable sleep is a closely related variant in + which the task can also be woken by SIGKILL. (These definitions also include + Linux's "group-stopped" (`TASK_STOPPED`) and "ptrace-stopped" + (`TASK_TRACED`) states.) To maximize compatibility with Linux, sentry checkpointing appears as a spurious signal-delivery interrupt on all tasks; interrupted system calls return `EINTR` @@ -71,21 +72,22 @@ through sleeping operations. We break the task's control flow graph into *states*, delimited by: -1. Points where uninterruptible and killable sleeps may occur. For example, -there exists a state boundary between signal dequeueing and signal delivery -because there may be an intervening ptrace signal-delivery-stop. +1. Points where uninterruptible and killable sleeps may occur. For example, + there exists a state boundary between signal dequeueing and signal delivery + because there may be an intervening ptrace signal-delivery-stop. -2. Points where sleep-induced branches may "rejoin" normal execution. For -example, the syscall exit state exists because it can be reached immediately -following a synchronous syscall, or after a task that is sleeping in `execve()` -or `vfork()` resumes execution. +2. Points where sleep-induced branches may "rejoin" normal execution. For + example, the syscall exit state exists because it can be reached immediately + following a synchronous syscall, or after a task that is sleeping in + `execve()` or `vfork()` resumes execution. -3. Points containing large branches. This is strictly for organizational -purposes. For example, the state that processes interrupt-signaled conditions is -kept separate from the main "app" state to reduce the size of the latter. +3. Points containing large branches. This is strictly for organizational + purposes. For example, the state that processes interrupt-signaled + conditions is kept separate from the main "app" state to reduce the size of + the latter. -4. `SyscallReinvoke`, which does not correspond to anything in Linux, and exists -solely to serve the autosave feature. +4. `SyscallReinvoke`, which does not correspond to anything in Linux, and + exists solely to serve the autosave feature. ![dot -Tpng -Goverlap=false -orun_states.png run_states.dot](g3doc/run_states.png "Task control flow graph") diff --git a/pkg/sentry/mm/README.md b/pkg/sentry/mm/README.md index 067733475..e485a5ca5 100644 --- a/pkg/sentry/mm/README.md +++ b/pkg/sentry/mm/README.md @@ -38,50 +38,50 @@ forces the kernel to create such a mapping to service the read. For a file, doing so consists of several logical phases: -1. The kernel allocates physical memory to store the contents of the required - part of the file, and copies file contents to the allocated memory. Supposing - that the kernel chooses the physical memory at physical address (PA) - 0x2fb000, the resulting state of the system is: +1. The kernel allocates physical memory to store the contents of the required + part of the file, and copies file contents to the allocated memory. + Supposing that the kernel chooses the physical memory at physical address + (PA) 0x2fb000, the resulting state of the system is: VMA: VA:0x400000 -> /tmp/foo:0x0 Filemap: /tmp/foo:0x0 -> PA:0x2fb000 - (In Linux the state of the mapping from file offset to physical memory is - stored in `struct address_space`, but to avoid confusion with other notions - of address space we will refer to this system as filemap, named after Linux - kernel source file `mm/filemap.c`.) + (In Linux the state of the mapping from file offset to physical memory is + stored in `struct address_space`, but to avoid confusion with other notions + of address space we will refer to this system as filemap, named after Linux + kernel source file `mm/filemap.c`.) -2. The kernel stores the effective mapping from virtual to physical address in a - *page table entry* (PTE) in the application's *page tables*, which are used - by the CPU's virtual memory hardware to perform address translation. The - resulting state of the system is: +2. The kernel stores the effective mapping from virtual to physical address in + a *page table entry* (PTE) in the application's *page tables*, which are + used by the CPU's virtual memory hardware to perform address translation. + The resulting state of the system is: VMA: VA:0x400000 -> /tmp/foo:0x0 Filemap: /tmp/foo:0x0 -> PA:0x2fb000 PTE: VA:0x400000 -----------------> PA:0x2fb000 - The PTE is required for the application to actually use the contents of the - mapped file as virtual memory. However, the PTE is derived from the VMA and - filemap state, both of which are independently mutable, such that mutations - to either will affect the PTE. For example: - - - The application may remove the VMA using the `munmap` system call. This - breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently the - mapping from VA:0x400000 to PA:0x2fb000. However, it does not necessarily - break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a future mapping of - the same file offset may reuse this physical memory. - - - The application may invalidate the file's contents by passing a length of 0 - to the `ftruncate` system call. This breaks the mapping from /tmp/foo:0x0 - to PA:0x2fb000, and consequently the mapping from VA:0x400000 to - PA:0x2fb000. However, it does not break the mapping from VA:0x400000 to - /tmp/foo:0x0, so future changes to the file's contents may again be made - visible at VA:0x400000 after another page fault results in the allocation - of a new physical address. - - Note that, in order to correctly break the mapping from VA:0x400000 to - PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping* - from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE. + The PTE is required for the application to actually use the contents of the + mapped file as virtual memory. However, the PTE is derived from the VMA and + filemap state, both of which are independently mutable, such that mutations + to either will affect the PTE. For example: + + - The application may remove the VMA using the `munmap` system call. This + breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently + the mapping from VA:0x400000 to PA:0x2fb000. However, it does not + necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a + future mapping of the same file offset may reuse this physical memory. + + - The application may invalidate the file's contents by passing a length + of 0 to the `ftruncate` system call. This breaks the mapping from + /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from + VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from + VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents + may again be made visible at VA:0x400000 after another page fault + results in the allocation of a new physical address. + + Note that, in order to correctly break the mapping from VA:0x400000 to + PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping* + from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE. [^mmap-anon]: Memory mappings to non-files are discussed in later sections. @@ -146,30 +146,30 @@ When the application first incurs a page fault on this address, the host kernel delivers information about the page fault to the sentry in a platform-dependent manner, and the sentry handles the fault: -1. The sentry allocates memory to store the contents of the required part of the - file, and copies file contents to the allocated memory. However, since the - sentry is implemented atop a host kernel, it does not configure mappings to - physical memory directly. Instead, mappable "memory" in the sentry is - represented by a host file descriptor and offset, since (as noted in - "Background") this is the memory mapping primitive provided by the host - kernel. In general, memory is allocated from a temporary host file using the - `filemem` package. Supposing that the sentry allocates offset 0x3000 from - host file "memory-file", the resulting state is: +1. The sentry allocates memory to store the contents of the required part of + the file, and copies file contents to the allocated memory. However, since + the sentry is implemented atop a host kernel, it does not configure mappings + to physical memory directly. Instead, mappable "memory" in the sentry is + represented by a host file descriptor and offset, since (as noted in + "Background") this is the memory mapping primitive provided by the host + kernel. In general, memory is allocated from a temporary host file using the + `filemem` package. Supposing that the sentry allocates offset 0x3000 from + host file "memory-file", the resulting state is: Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 -2. The sentry stores the effective mapping from virtual address to host file in - a host VMA by invoking the `mmap` system call: +2. The sentry stores the effective mapping from virtual address to host file in + a host VMA by invoking the `mmap` system call: Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 Host VMA: VA:0x400000 -----------------> host:memory-file:0x3000 -3. The sentry returns control to the application, which immediately incurs the - page fault again.[^mmap-populate] However, since a host VMA now exists for - the faulting virtual address, the host kernel now handles the page fault as - described in "Background": +3. The sentry returns control to the application, which immediately incurs the + page fault again.[^mmap-populate] However, since a host VMA now exists for + the faulting virtual address, the host kernel now handles the page fault as + described in "Background": Sentry VMA: VA:0x400000 -> /tmp/foo:0x0 Sentry filemap: /tmp/foo:0x0 -> host:memory-file:0x3000 @@ -183,12 +183,12 @@ independently mutable, and the desired state of host VMAs is derived from that state. [^mmap-populate]: The sentry could force the host kernel to establish PTEs when - it creates the host VMA by passing the `MAP_POPULATE` flag to - the `mmap` system call, but usually does not. This is because, - to reduce the number of page faults that require handling by - the sentry and (correspondingly) the number of host `mmap` - system calls, the sentry usually creates host VMAs that are - much larger than the single faulting page. + it creates the host VMA by passing the `MAP_POPULATE` flag to + the `mmap` system call, but usually does not. This is because, + to reduce the number of page faults that require handling by + the sentry and (correspondingly) the number of host `mmap` + system calls, the sentry usually creates host VMAs that are + much larger than the single faulting page. ## Private Mappings @@ -233,45 +233,46 @@ there is no shared zero page. In Linux: -- A virtual address space is represented by `struct mm_struct`. +- A virtual address space is represented by `struct mm_struct`. -- VMAs are represented by `struct vm_area_struct`, stored in `struct - mm_struct::mmap`. +- VMAs are represented by `struct vm_area_struct`, stored in `struct + mm_struct::mmap`. -- Mappings from file offsets to physical memory are stored in `struct - address_space`. +- Mappings from file offsets to physical memory are stored in `struct + address_space`. -- Reverse mappings from file offsets to virtual mappings are stored in `struct - address_space::i_mmap`. +- Reverse mappings from file offsets to virtual mappings are stored in `struct + address_space::i_mmap`. -- Physical memory pages are represented by a pointer to `struct page` or an - index called a *page frame number* (PFN), represented by `pfn_t`. +- Physical memory pages are represented by a pointer to `struct page` or an + index called a *page frame number* (PFN), represented by `pfn_t`. -- PTEs are represented by architecture-dependent type `pte_t`, stored in a table - hierarchy rooted at `struct mm_struct::pgd`. +- PTEs are represented by architecture-dependent type `pte_t`, stored in a + table hierarchy rooted at `struct mm_struct::pgd`. In the sentry: -- A virtual address space is represented by type [`mm.MemoryManager`][mm]. +- A virtual address space is represented by type [`mm.MemoryManager`][mm]. -- Sentry VMAs are represented by type [`mm.vma`][mm], stored in - `mm.MemoryManager.vmas`. +- Sentry VMAs are represented by type [`mm.vma`][mm], stored in + `mm.MemoryManager.vmas`. -- Mappings from sentry file offsets to host file offsets are abstracted through - interface method [`memmap.Mappable.Translate`][memmap]. +- Mappings from sentry file offsets to host file offsets are abstracted + through interface method [`memmap.Mappable.Translate`][memmap]. -- Reverse mappings from sentry file offsets to virtual mappings are abstracted - through interface methods [`memmap.Mappable.AddMapping` and - `memmap.Mappable.RemoveMapping`][memmap]. +- Reverse mappings from sentry file offsets to virtual mappings are abstracted + through interface methods + [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap]. -- Host files that may be mapped into host VMAs are represented by type - [`platform.File`][platform]. +- Host files that may be mapped into host VMAs are represented by type + [`platform.File`][platform]. -- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform - mapping area"), stored in `mm.MemoryManager.pmas`. +- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform + mapping area"), stored in `mm.MemoryManager.pmas`. -- Creation and destruction of host VMAs is abstracted through interface methods - [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform]. +- Creation and destruction of host VMAs is abstracted through interface + methods + [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform]. [filemem]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/platform/filemem/filemem.go [memmap]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/memmap/memmap.go diff --git a/pkg/sentry/usermem/README.md b/pkg/sentry/usermem/README.md index 2ebd3bcc1..f6d2137eb 100644 --- a/pkg/sentry/usermem/README.md +++ b/pkg/sentry/usermem/README.md @@ -2,30 +2,30 @@ This package defines primitives for sentry access to application memory. Major types: -- The `IO` interface represents a virtual address space and provides I/O methods - on that address space. `IO` is the lowest-level primitive. The primary - implementation of the `IO` interface is `mm.MemoryManager`. +- The `IO` interface represents a virtual address space and provides I/O + methods on that address space. `IO` is the lowest-level primitive. The + primary implementation of the `IO` interface is `mm.MemoryManager`. -- `IOSequence` represents a collection of individually-contiguous address ranges - in a `IO` that is operated on sequentially, analogous to Linux's `struct - iov_iter`. +- `IOSequence` represents a collection of individually-contiguous address + ranges in a `IO` that is operated on sequentially, analogous to Linux's + `struct iov_iter`. Major usage patterns: -- Access to a task's virtual memory, subject to the application's memory - protections and while running on that task's goroutine, from a context that is - at or above the level of the `kernel` package (e.g. most syscall - implementations in `syscalls/linux`); use the `kernel.Task.Copy*` wrappers - defined in `kernel/task_usermem.go`. +- Access to a task's virtual memory, subject to the application's memory + protections and while running on that task's goroutine, from a context that + is at or above the level of the `kernel` package (e.g. most syscall + implementations in `syscalls/linux`); use the `kernel.Task.Copy*` wrappers + defined in `kernel/task_usermem.go`. -- Access to a task's virtual memory, from a context that is at or above the - level of the `kernel` package, but where any of the above constraints does not - hold (e.g. `PTRACE_POKEDATA`, which ignores application memory protections); - obtain the task's `mm.MemoryManager` by calling `kernel.Task.MemoryManager`, - and call its `IO` methods directly. +- Access to a task's virtual memory, from a context that is at or above the + level of the `kernel` package, but where any of the above constraints does + not hold (e.g. `PTRACE_POKEDATA`, which ignores application memory + protections); obtain the task's `mm.MemoryManager` by calling + `kernel.Task.MemoryManager`, and call its `IO` methods directly. -- Access to a task's virtual memory, from a context that is below the level of - the `kernel` package (e.g. filesystem I/O); clients must pass I/O arguments - from higher layers, usually in the form of an `IOSequence`. The - `kernel.Task.SingleIOSequence` and `kernel.Task.IovecsIOSequence` functions in - `kernel/task_usermem.go` are convenience functions for doing so. +- Access to a task's virtual memory, from a context that is below the level of + the `kernel` package (e.g. filesystem I/O); clients must pass I/O arguments + from higher layers, usually in the form of an `IOSequence`. The + `kernel.Task.SingleIOSequence` and `kernel.Task.IovecsIOSequence` functions + in `kernel/task_usermem.go` are convenience functions for doing so. |