summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/fs/README.md
diff options
context:
space:
mode:
Diffstat (limited to 'pkg/sentry/fs/README.md')
-rw-r--r--pkg/sentry/fs/README.md217
1 files changed, 217 insertions, 0 deletions
diff --git a/pkg/sentry/fs/README.md b/pkg/sentry/fs/README.md
new file mode 100644
index 000000000..898271ee8
--- /dev/null
+++ b/pkg/sentry/fs/README.md
@@ -0,0 +1,217 @@
+This package provides an implementation of the Linux virtual filesystem.
+
+[TOC]
+
+## Overview
+
+- An `fs.Dirent` caches an `fs.Inode` in memory at a path in the VFS, giving
+ the `fs.Inode` a relative position with respect to other `fs.Inode`s.
+
+- If an `fs.Dirent` is referenced by two file descriptors, then those file
+ descriptors are coherent with each other: they depend on the same
+ `fs.Inode`.
+
+- A mount point is an `fs.Dirent` for which `fs.Dirent.mounted` is true. It
+ exposes the root of a mounted filesystem.
+
+- The `fs.Inode` produced by a registered filesystem on mount(2) owns an
+ `fs.MountedFilesystem` from which other `fs.Inode`s will be looked up. For a
+ remote filesystem, the `fs.MountedFilesystem` owns the connection to that
+ remote filesystem.
+
+- In general:
+
+```
+fs.Inode <------------------------------
+| |
+| |
+produced by |
+exactly one |
+| responsible for the
+| virtual identity of
+v |
+fs.MountedFilesystem -------------------
+```
+
+Glossary:
+
+- VFS: virtual filesystem.
+
+- inode: a virtual file object holding a cached view of a file on a backing
+ filesystem (includes metadata and page caches).
+
+- superblock: the virtual state of a mounted filesystem (e.g. the virtual
+ inode number set).
+
+- mount namespace: a view of the mounts under a root (during path traversal,
+ the VFS makes visible/follows the mount point that is in the current task's
+ mount namespace).
+
+## Save and restore
+
+An application's hard dependencies on filesystem state can be broken down into
+two categories:
+
+- The state necessary to execute a traversal on or view the *virtual*
+ filesystem hierarchy, regardless of what files an application has open.
+
+- The state necessary to represent open files.
+
+The first is always necessary to save and restore. An application may never have
+any open file descriptors, but across save and restore it should see a coherent
+view of any mount namespace. NOTE: Currently only one "initial"
+mount namespace is supported.
+
+The second is so that system calls across save and restore are coherent with
+each other (e.g. so that unintended re-reads or overwrites do not occur).
+
+Specifically this state is:
+
+- An `fs.MountManager` containing mount points.
+
+- A `kernel.FDMap` containing pointers to open files.
+
+Anything else managed by the VFS that can be easily loaded into memory from a
+filesystem is synced back to those filesystems and is no saved. Examples are
+pages in page caches used for optimizations (i.e. readahead and writeback), and
+directory entries used to accelerate path lookups.
+
+### Mount points
+
+Saving and restoring a mount point means saving and restoring:
+
+- The root of the mounted filesystem.
+
+- Mount flags, which control how the VFS interacts with the mounted
+ filesystem.
+
+- Any relevant metadata about the mounted filesystem.
+
+- All `fs.Inode`s referenced by the application that reside under the mount
+ point.
+
+`fs.MountedFilesystem` is metadata about a filesystem that is mounted. It is
+referenced by every `fs.Inode` loaded into memory under the mount point
+including the `fs.Inode` of the mount point itself. The `fs.MountedFilesystem`
+maps file objects on the filesystem to a virtualized `fs.Inode` number and vice
+versa.
+
+To restore all `fs.Inode`s under a given mount point, each `fs.Inode` leverages
+its dependency on an `fs.MountedFilesystem`. Since the `fs.MountedFilesystem`
+knows how an `fs.Inode` maps to a file object on a backing filesystem, this
+mapping can be trivially consulted by each `fs.Inode` when the `fs.Inode` is
+restored.
+
+In detail, a mount point is saved in two steps:
+
+- First, after the kernel is paused but before state.Save, we walk all mount
+ namespaces and install a mapping from `fs.Inode` numbers to file paths
+ relative to the root of the mounted filesystem in each
+ `fs.MountedFilesystem`. This is subsequently called the set of `fs.Inode`
+ mappings.
+
+- Second, during state.Save, each `fs.MountedFilesystem` decides whether to
+ save the set of `fs.Inode` mappings. In-memory filesystems, like tmpfs, have
+ no need to save a set of `fs.Inode` mappings, since the `fs.Inode`s can be
+ entirely encoded in state file. Each `fs.MountedFilesystem` also optionally
+ saves the device name from when the filesystem was originally mounted. Each
+ `fs.Inode` saves its virtual identifier and a reference to a
+ `fs.MountedFilesystem`.
+
+A mount point is restored in two steps:
+
+- First, before state.Load, all mount configurations are stored in a global
+ `fs.RestoreEnvironment`. This tells us what mount points the user wants to
+ restore and how to re-establish pointers to backing filesystems.
+
+- Second, during state.Load, each `fs.MountedFilesystem` optionally searches
+ for a mount in the `fs.RestoreEnvironment` that matches its saved device
+ name. The `fs.MountedFilesystem` then restablishes a pointer to the root of
+ the mounted filesystem. For example, the mount specification provides the
+ network connection for a mounted remote filesystem client to communicate
+ with its remote file server. The `fs.MountedFilesystem` also trivially loads
+ its set of `fs.Inode` mappings. When an `fs.Inode` is encountered, the
+ `fs.Inode` loads its virtual identifier and its reference a
+ `fs.MountedFilesystem`. It uses the `fs.MountedFilesystem` to obtain the
+ root of the mounted filesystem and the `fs.Inode` mappings to obtain the
+ relative file path to its data. With these, the `fs.Inode` re-establishes a
+ pointer to its file object.
+
+A mount point can trivially restore its `fs.Inode`s in parallel since
+`fs.Inode`s have a restore dependency on their `fs.MountedFilesystem` and not on
+each other.
+
+### Open files
+
+An `fs.File` references the following filesystem objects:
+
+```go
+fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
+```
+
+The `fs.Inode` is restored using its `fs.MountedFilesystem`. The [Mount
+points](#mount-points) section above describes how this happens in detail. The
+`fs.Dirent` restores its pointer to an `fs.Inode`, pointers to parent and
+children `fs.Dirents`, and the basename of the file.
+
+Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only
+used internally).
+
+It may use the `fs.Inode`, which it indirectly holds a reference on through the
+`fs.Dirent`, to restablish an open file handle on the backing filesystem (e.g.
+to continue reading and writing).
+
+## Overlay
+
+The overlay implementation in the fs package takes Linux overlayfs as a frame of
+reference but corrects for several POSIX consistency errors.
+
+In Linux overlayfs, the `struct inode` used for reading and writing to the same
+file may be different. This is because the `struct inode` is dissociated with
+the process of copying up the file from the upper to the lower directory. Since
+flock(2) and fcntl(2) locks, inotify(7) watches, page caches, and a file's
+identity are all stored directly or indirectly off the `struct inode`, these
+properties of the `struct inode` may be stale after the first modification. This
+can lead to file locking bugs, missed inotify events, and inconsistent data in
+shared memory mappings of files, to name a few problems.
+
+The fs package maintains a single `fs.Inode` to represent a directory entry in
+an overlay and defines operations on this `fs.Inode` which synchronize with the
+copy up process. This achieves several things:
+
++ File locks, inotify watches, and the identity of the file need not be copied
+ at all.
+
++ Memory mappings of files coordinate with the copy up process so that if a
+ file in the lower directory is memory mapped, all references to it are
+ invalidated, forcing the application to re-fault on memory mappings of the
+ file under the upper directory.
+
+The `fs.Inode` holds metadata about files in the upper and/or lower directories
+via an `fs.overlayEntry`. The `fs.overlayEntry` implements the `fs.Mappable`
+interface. It multiplexes between upper and lower directory memory mappings and
+stores a copy of memory references so they can be transferred to the upper
+directory `fs.Mappable` when the file is copied up.
+
+The `fs.Inode` also holds a reference to a `fs.MountedFilesystem` that
+normalizes across the mounted filesystem state of the upper and lower
+directories.
+
+When a file is copied from the lower to the upper directory, attempts to
+interact with the file block until the copy completes. All copying synchronizes
+with rename(2).
+
+## Future Work
+
+### Overlay
+
+When a file is copied from a lower directory to an upper directory, several
+locks are taken: the global renamuMu and the copyMu of the `fs.Inode` being
+copied. This blocks operations on the file, including fault handling of memory
+mappings. Performance could be improved by copying files into a temporary
+directory that resides on the same filesystem as the upper directory and doing
+an atomic rename, holding locks only during the rename operation.
+
+Additionally files are copied up synchronously. For large files, this causes a
+noticeable latency. Performance could be improved by pipelining copies at
+non-overlapping file offsets.