This package provides an emulation of Linux semantics for application virtual
memory mappings.

For completeness, this document also describes aspects of the memory management
subsystem defined outside this package.

# Background

We begin by describing semantics for virtual memory in Linux.

A virtual address space is defined as a collection of mappings from virtual
addresses to physical memory. However, userspace applications do not configure
mappings to physical memory directly. Instead, applications configure memory
mappings from virtual addresses to offsets into a file using the `mmap` system
call.[^mmap-anon] For example, a call to:

    mmap(
        /* addr = */ 0x400000,
        /* length = */ 0x1000,
        /* prot = */ PROT_READ | PROT_WRITE,
        /* flags = */ MAP_SHARED,
        /* fd = */ 3,
        /* offset = */ 0);

creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
0x400000 (assuming the kernel honors this address hint), to offset 0 in the file
represented by file descriptor (FD) 3. Within
the Linux kernel, virtual memory mappings are represented by *virtual memory
areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
virtual memory subsystem after the `mmap` call may be depicted as:

    VMA:     VA:0x400000 -> /tmp/foo:0x0

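
The effect of such a shared file mapping can be observed from userspace. The following is a minimal sketch in Python, whose `mmap` module wraps the same system call. It assumes a Linux host, maps a temporary file rather than `/tmp/foo`, and lets the kernel choose the virtual address, because Python does not expose the `addr` argument. The helper name `shared_mapping_demo` is illustrative only.

```python
import mmap
import os
import tempfile


def shared_mapping_demo():
    """Map one page of a temporary file MAP_SHARED and store through it."""
    fd, path = tempfile.mkstemp()
    try:
        # The file must be at least as long as the range being mapped.
        os.ftruncate(fd, mmap.PAGESIZE)
        m = mmap.mmap(fd, mmap.PAGESIZE,
                      flags=mmap.MAP_SHARED,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE,
                      offset=0)
        try:
            m[0:5] = b"hello"  # store through the mapping
            m.flush()          # msync: ask for the store to reach the file
            # Read the same file offset through the file descriptor.
            return os.pread(fd, 5, 0)
        finally:
            m.close()          # removes the mapping (munmap)
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling `shared_mapping_demo()` returns the bytes read back through the file descriptor, which match what was stored through the mapping.
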
Establishing a virtual memory area does not necessarily establish a mapping to a
physical address, because Linux has not necessarily provisioned physical memory
to store the file's contents. Thus, if the application attempts to read the
contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
forces the kernel to create such a mapping to service the read.

For a file, doing so consists of several logical phases:

1.  The kernel allocates physical memory to store the contents of the required
    part of the file, and copies file contents to the allocated memory.
    Supposing that the kernel chooses the physical memory at physical address
    (PA) 0x2fb000, the resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000

    (In Linux the state of the mapping from file offset to physical memory is
    stored in `struct address_space`, but to avoid confusion with other notions
    of address space we will refer to this system as filemap, named after Linux
    kernel source file `mm/filemap.c`.)

2.  The kernel stores the effective mapping from virtual to physical address in
    a *page table entry* (PTE) in the application's *page tables*, which are
    used by the CPU's virtual memory hardware to perform address translation.
    The resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
        PTE:     VA:0x400000 -----------------> PA:0x2fb000


    The PTE is required for the application to actually use the contents of the
    mapped file as virtual memory. However, the PTE is derived from the VMA and
    filemap state, both of which are independently mutable, such that mutations
    to either will affect the PTE. For example:

    -   The application may remove the VMA using the `munmap` system call. This
        breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
        the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
        necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
        future mapping of the same file offset may reuse this physical memory.

    -   The application may invalidate the file's contents by passing a length
        of 0 to the `ftruncate` system call. This breaks the mapping from
        /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
        VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
        VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
        may again be made visible at VA:0x400000 after another page fault
        results in the allocation of a new physical address.

    Note that, in order to correctly break the mapping from VA:0x400000 to
    PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
    from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.

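
The relationship between VMA, filemap, reverse mapping, and PTE state can be sketched as a toy model. This is not how Linux stores any of this state. The class `ToyMM`, its dictionaries, and the assumption that physical pages are handed out sequentially from 0x2fb000 are all hypothetical, chosen only to mirror the diagrams above.

```python
PAGE_SIZE = 0x1000


class ToyMM:
    """Toy model of the VMA, filemap and PTE state described above."""

    def __init__(self, first_pa=0x2fb000):
        self.vmas = {}     # VA -> (file, offset)
        self.filemap = {}  # (file, offset) -> PA
        self.rmap = {}     # (file, offset) -> set of VAs (reverse mapping)
        self.ptes = {}     # VA -> PA
        self._next_pa = first_pa

    def mmap(self, va, file, offset):
        # Creating a VMA does not allocate physical memory or a PTE.
        self.vmas[va] = (file, offset)
        self.rmap.setdefault((file, offset), set()).add(va)

    def fault(self, va):
        key = self.vmas[va]  # a missing VMA raises KeyError in this model
        if key not in self.filemap:
            # Phase 1: allocate "physical memory" for the file page.
            self.filemap[key] = self._next_pa
            self._next_pa += PAGE_SIZE
        # Phase 2: derive the PTE from the VMA and filemap state.
        self.ptes[va] = self.filemap[key]
        return self.ptes[va]

    def munmap(self, va):
        key = self.vmas.pop(va)
        self.rmap[key].discard(va)
        self.ptes.pop(va, None)  # the filemap entry is left intact

    def truncate(self, file):
        for key in [k for k in self.filemap if k[0] == file]:
            del self.filemap[key]
            # The reverse mapping locates the PTEs that must be removed.
            for va in self.rmap.get(key, ()):
                self.ptes.pop(va, None)


mm = ToyMM()
mm.mmap(0x400000, "/tmp/foo", 0)
mm.fault(0x400000)       # establishes filemap entry and PTE
mm.truncate("/tmp/foo")  # removes filemap entry and PTE, keeps the VMA
```

In this model, a second `fault` after `truncate` allocates a fresh page, while `munmap` removes the PTE but leaves the filemap entry available for reuse.
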

[^mmap-anon]: Memory mappings to non-files are discussed in later sections.

## Private Mappings
The preceding example considered VMAs created using the `MAP_SHARED` flag, which
means that PTEs derived from the mapping should always use physical memory that
represents the current state of the mapped file.[^mmap-dev-zero] Applications
can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
Private mappings are *copy-on-write*.

Suppose that the application instead created a private mapping in the previous
example. In Linux, the state of the system after a read page fault would be:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)

Now suppose the application attempts to write to VA:0x400000. For a shared
mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
responsible for ensuring that the write is later propagated to the mapped file.
For a private mapping, the write incurs another page fault since the PTE is
marked read-only. In response, the kernel allocates physical memory to store the
mapping's *private copy* of the file's contents, copies file contents to the
allocated memory, and changes the PTE to map to the private copy. Supposing that
the kernel chooses the physical memory at physical address (PA) 0x5ea000, the
resulting state of the system is:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x5ea000

Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
but is now irrelevant to this mapping.
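
Copy-on-write behavior is visible from userspace as well. The sketch below assumes a Linux host and uses Python's `mmap` module with `MAP_PRIVATE`. The helper name `private_mapping_demo` and the use of a temporary file are illustrative assumptions.

```python
import mmap
import os
import tempfile


def private_mapping_demo():
    """Write through a MAP_PRIVATE mapping and compare with the file."""
    fd, path = tempfile.mkstemp()
    try:
        # Fill one page of the file, starting with b"hello".
        os.write(fd, b"hello".ljust(mmap.PAGESIZE, b"\0"))
        m = mmap.mmap(fd, mmap.PAGESIZE,
                      flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            before = m[0:5]               # read: sees the file's contents
            m[0:5] = b"HELLO"             # write: goes to a private copy
            in_mapping = m[0:5]
            in_file = os.pread(fd, 5, 0)  # the file itself is unchanged
            return before, in_mapping, in_file
        finally:
            m.close()
    finally:
        os.close(fd)
        os.unlink(path)
```

The three returned values show the mapping's contents before the write, the mapping's contents after it, and the file's contents, which the write to the private copy does not affect.
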

[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.

## Anonymous Mappings
Instead of passing a file to the `mmap` system call, applications can request
an *anonymous* mapping by passing the `MAP_ANONYMOUS` flag.

Semantically, an anonymous mapping is essentially a mapping to an ephemeral file
initially filled with zero bytes. Practically speaking, this is how shared
anonymous mappings are implemented. Private anonymous mappings, however, do not
result in the creation of an ephemeral file: since there would be no way to
modify the contents of the underlying file through a private mapping, all
private anonymous mappings use a single shared page filled with zero bytes until
copy-on-write occurs.

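
As a small illustration, the following sketch creates a private anonymous mapping from Python and checks that it starts out zero-filled. It assumes a Linux host, where `mmap.MAP_ANONYMOUS` is available. The helper name `anonymous_mapping_demo` is hypothetical.

```python
import mmap


def anonymous_mapping_demo():
    """Create a private anonymous mapping; check it is zero-filled, then write."""
    # A file descriptor of -1 means that no file backs the mapping.
    m = mmap.mmap(-1, mmap.PAGESIZE,
                  flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                  prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        initially_zero = m[:] == bytes(mmap.PAGESIZE)
        m[0:3] = b"abc"  # the write gives this mapping its own page contents
        return initially_zero, m[0:3]
    finally:
        m.close()
```
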
# Virtual Memory in the Sentry
The sentry implements application virtual memory atop a host kernel, introducing
an additional level of indirection to the above.

Consider the same scenario as in the previous section. Since the sentry handles
application system calls, the effect of an application `mmap` system call is to
create a VMA in the sentry (as opposed to the host kernel):

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0

When the application first incurs a page fault on this address, the host kernel
delivers information about the page fault to the sentry in a platform-dependent
manner, and the sentry handles the fault:
1.  The sentry allocates memory to store the contents of the required part of
    the file, and copies file contents to the allocated memory. However, since
    the sentry is implemented atop a host kernel, it does not configure mappings
    to physical memory directly. Instead, mappable "memory" in the sentry is
    represented by a host file descriptor and offset, since (as noted in
    "Background") this is the memory mapping primitive provided by the host
    kernel. In general, memory is allocated from a temporary host file using the
    `pgalloc` package. Supposing that the sentry allocates offset 0x3000 from
    host file "memory-file", the resulting state is:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000

2.  The sentry stores the effective mapping from virtual address to host file in
    a host VMA by invoking the `mmap` system call:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000

3.  The sentry returns control to the application, which immediately incurs the
    page fault again.[^mmap-populate] However, since a host VMA now exists for
    the faulting virtual address, the host kernel now handles the page fault as
    described in "Background":

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000
        Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
        Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000

Thus, from an implementation standpoint, host VMAs serve the same purpose in the
sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
independently mutable, and the desired state of host VMAs is derived from that
state.
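
The three fault-handling steps can be sketched as a toy model. This is not gVisor code. The class `ToySentry`, its fields, and the sequential allocation of memory-file offsets starting at 0x3000 are assumptions made to match the diagrams, and the host `mmap` call is represented only by a dictionary update.

```python
PAGE_SIZE = 0x1000


class ToySentry:
    """Toy model of sentry VMA, sentry filemap and host VMA state."""

    def __init__(self, first_offset=0x3000):
        self.vmas = {}       # VA -> (file, offset): sentry VMAs
        self.filemap = {}    # (file, offset) -> offset in "memory-file"
        self.host_vmas = {}  # VA -> offset in "memory-file": host VMAs
        self._next = first_offset

    def mmap(self, va, file, offset):
        # An application mmap only creates a sentry VMA; no host state yet.
        self.vmas[va] = (file, offset)

    def handle_fault(self, va):
        key = self.vmas[va]
        if key not in self.filemap:
            # Step 1: allocate backing memory from the memory file.
            self.filemap[key] = self._next
            self._next += PAGE_SIZE
        # Step 2: stand-in for the host mmap that creates the host VMA.
        self.host_vmas[va] = self.filemap[key]
        # Step 3: control returns to the application; the host kernel
        # resolves the repeated fault against this host VMA.
        return ("memory-file", self.host_vmas[va])


sentry = ToySentry()
sentry.mmap(0x400000, "/tmp/foo", 0)
backing = sentry.handle_fault(0x400000)
```

Here `host_vmas` plays the role that PTEs play in the Linux model: it is derived from `vmas` and `filemap` and is only populated on a fault.
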

[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
    it creates the host VMA by passing the `MAP_POPULATE` flag to the `mmap`
    system call, but usually does not. This is because, to reduce the number of
    page faults that require handling by the sentry and (correspondingly) the
    number of host `mmap` system calls, the sentry usually creates host VMAs
    that are much larger than the single faulting page, and populating such a
    VMA eagerly would fault in pages that the application may never touch.

## Private Mappings
The sentry implements private mappings consistently with Linux. Before
copy-on-write, the private mapping example given in the Background results in:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
    Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)

When the application attempts to write to this address, the host kernel delivers
information about the resulting page fault to the sentry. Analogous to Linux,
the sentry allocates memory to store the mapping's private copy of the file's
contents, copies file contents to the allocated memory, and changes the host VMA
to map to the private copy. Supposing that the sentry chooses the offset 0x4000
in host file `memory-file` to store the private copy, the state of the system
after copy-on-write is:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x4000
    Host filemap:                                  host:memory-file:0x4000 -> PA:0x5ea000
    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x5ea000

However, this highlights an important difference between Linux and the sentry.
In Linux, page tables are concrete (architecture-dependent) data structures
owned by the kernel. The sentry, in contrast, can create and destroy host VMAs
using host system calls, but it does not have direct access to their state.
Thus, given only the state shown above, if the application invokes the `munmap`
system call to remove the sentry VMA, it is non-trivial for the sentry to
determine that it should deallocate `host:memory-file:0x4000`. This implies that
the sentry must retain information about the host VMAs that it has created.
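
One way to picture this bookkeeping is a table, kept by the sentry, that records which memory-file offset backs each host VMA and whether that offset is a private copy. The sketch below is a toy model and not gVisor code. `ToyPmaTracker` and its fields are hypothetical, and for simplicity every read fault allocates a new filemap-owned page.

```python
PAGE_SIZE = 0x1000


class ToyPmaTracker:
    """Hypothetical record of host VMAs created by a sentry-like process."""

    def __init__(self, first_offset=0x3000):
        self.pmas = {}          # VA -> (memory-file offset, is_private_copy)
        self.allocated = set()  # memory-file offsets currently in use
        self._next = first_offset

    def _alloc(self):
        off = self._next
        self._next += PAGE_SIZE
        self.allocated.add(off)
        return off

    def read_fault(self, va):
        # Map the page owned by the sentry filemap (shared with the file).
        off = self._alloc()
        self.pmas[va] = (off, False)
        return off

    def write_fault(self, va):
        # Copy-on-write: switch the host VMA to a newly allocated private copy.
        off = self._alloc()
        self.pmas[va] = (off, True)
        return off

    def munmap(self, va):
        off, private = self.pmas.pop(va)
        if private:
            # Only the retained record says that this offset can be freed.
            self.allocated.discard(off)
        return off


tracker = ToyPmaTracker()
tracker.read_fault(0x400000)   # host VMA -> memory-file:0x3000
tracker.write_fault(0x400000)  # host VMA -> memory-file:0x4000 (private copy)
tracker.munmap(0x400000)       # frees 0x4000, keeps the filemap-owned 0x3000
```

Without the `pmas` record, `munmap` would have no way to learn that offset 0x4000 was a private copy belonging to the removed mapping.
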
## Anonymous Mappings
The sentry implements anonymous mappings consistently with Linux, except that
there is no shared zero page.
# Implementation Constructs
In Linux:
- A virtual address space is represented by `struct mm_struct`.
- VMAs are represented by `struct vm_area_struct`, stored in `struct
mm_struct::mmap`.
- Mappings from file offsets to physical memory are stored in `struct
address_space`.
- Reverse mappings from file offsets to virtual mappings are stored in `struct
address_space::i_mmap`.
- Physical memory pages are represented by a pointer to `struct page` or an
index called a *page frame number* (PFN), represented by `pfn_t`.
- PTEs are represented by architecture-dependent type `pte_t`, stored in a
table hierarchy rooted at `struct mm_struct::pgd`.

In the sentry:

- A virtual address space is represented by type [`mm.MemoryManager`][mm].
- Sentry VMAs are represented by type [`mm.vma`][mm], stored in
`mm.MemoryManager.vmas`.
- Mappings from sentry file offsets to host file offsets are abstracted
through interface method [`memmap.Mappable.Translate`][memmap].
- Reverse mappings from sentry file offsets to virtual mappings are abstracted
through interface methods
[`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].
- Host files that may be mapped into host VMAs are represented by type
[`platform.File`][platform].
- Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
mapping area"), stored in `mm.MemoryManager.pmas`.
- Creation and destruction of host VMAs is abstracted through interface
methods
[`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].

[memmap]: https://github.com/google/gvisor/blob/master/pkg/sentry/memmap/memmap.go
[mm]: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/mm.go
[pgalloc]: https://github.com/google/gvisor/blob/master/pkg/sentry/pgalloc/pgalloc.go
[platform]: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/platform.go