Diffstat (limited to 'pkg/sentry/fs/README.md')
-rw-r--r-- | pkg/sentry/fs/README.md | 217 |
1 files changed, 217 insertions, 0 deletions
diff --git a/pkg/sentry/fs/README.md b/pkg/sentry/fs/README.md
new file mode 100644
index 000000000..898271ee8
--- /dev/null
+++ b/pkg/sentry/fs/README.md
@@ -0,0 +1,217 @@
+This package provides an implementation of the Linux virtual filesystem.
+
+[TOC]
+
+## Overview
+
+- An `fs.Dirent` caches an `fs.Inode` in memory at a path in the VFS, giving
+  the `fs.Inode` a relative position with respect to other `fs.Inode`s.
+
+- If an `fs.Dirent` is referenced by two file descriptors, then those file
+  descriptors are coherent with each other: they depend on the same
+  `fs.Inode`.
+
+- A mount point is an `fs.Dirent` for which `fs.Dirent.mounted` is true. It
+  exposes the root of a mounted filesystem.
+
+- The `fs.Inode` produced by a registered filesystem on mount(2) owns an
+  `fs.MountedFilesystem` from which other `fs.Inode`s will be looked up. For a
+  remote filesystem, the `fs.MountedFilesystem` owns the connection to that
+  remote filesystem.
+
+- In general:
+
+```
+fs.Inode <------------------------------
+|                                      |
+|                                      |
+produced by                            |
+exactly one                            |
+|                             responsible for the
+|                             virtual identity of
+v                                      |
+fs.MountedFilesystem -------------------
+```
+
+Glossary:
+
+- VFS: virtual filesystem.
+
+- inode: a virtual file object holding a cached view of a file on a backing
+  filesystem (includes metadata and page caches).
+
+- superblock: the virtual state of a mounted filesystem (e.g. the virtual
+  inode number set).
+
+- mount namespace: a view of the mounts under a root (during path traversal,
+  the VFS makes visible/follows the mount point that is in the current task's
+  mount namespace).
+
+## Save and restore
+
+An application's hard dependencies on filesystem state can be broken down into
+two categories:
+
+- The state necessary to execute a traversal on or view the *virtual*
+  filesystem hierarchy, regardless of what files an application has open.
+
+- The state necessary to represent open files.
+
+The first is always necessary to save and restore.
+An application may never have
+any open file descriptors, but across save and restore it should see a coherent
+view of any mount namespace. NOTE: Currently only one "initial"
+mount namespace is supported.
+
+The second is necessary so that system calls across save and restore are
+coherent with each other (e.g. so that unintended re-reads or overwrites do not
+occur).
+
+Specifically this state is:
+
+- An `fs.MountManager` containing mount points.
+
+- A `kernel.FDMap` containing pointers to open files.
+
+Anything else managed by the VFS that can be easily loaded into memory from a
+filesystem is synced back to those filesystems and is not saved. Examples are
+pages in page caches used for optimizations (i.e. readahead and writeback), and
+directory entries used to accelerate path lookups.
+
+### Mount points
+
+Saving and restoring a mount point means saving and restoring:
+
+- The root of the mounted filesystem.
+
+- Mount flags, which control how the VFS interacts with the mounted
+  filesystem.
+
+- Any relevant metadata about the mounted filesystem.
+
+- All `fs.Inode`s referenced by the application that reside under the mount
+  point.
+
+`fs.MountedFilesystem` is metadata about a filesystem that is mounted. It is
+referenced by every `fs.Inode` loaded into memory under the mount point
+including the `fs.Inode` of the mount point itself. The `fs.MountedFilesystem`
+maps file objects on the filesystem to a virtualized `fs.Inode` number and vice
+versa.
+
+To restore all `fs.Inode`s under a given mount point, each `fs.Inode` leverages
+its dependency on an `fs.MountedFilesystem`. Since the `fs.MountedFilesystem`
+knows how an `fs.Inode` maps to a file object on a backing filesystem, this
+mapping can be trivially consulted by each `fs.Inode` when the `fs.Inode` is
+restored.
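As a rough illustration of the two-way mapping described above, the sketch below keeps a table from virtual inode numbers to file objects and back, modelling a file object as a path relative to the mount root. The type `inodeMappings` and its methods `assign` and `pathOf` are hypothetical names chosen for this example; they are not the package's API.

```go
package main

import "fmt"

// inodeMappings is a hypothetical stand-in for the per-mount table that
// relates virtual inode numbers to file objects (modelled here as paths
// relative to the root of the mounted filesystem).
type inodeMappings struct {
	next   uint64
	toPath map[uint64]string
	toIno  map[string]uint64
}

func newInodeMappings() *inodeMappings {
	return &inodeMappings{
		next:   1,
		toPath: make(map[uint64]string),
		toIno:  make(map[string]uint64),
	}
}

// assign returns the virtual inode number for path, allocating a new
// number the first time the path is seen.
func (m *inodeMappings) assign(path string) uint64 {
	if ino, ok := m.toIno[path]; ok {
		return ino
	}
	ino := m.next
	m.next++
	m.toIno[path] = ino
	m.toPath[ino] = path
	return ino
}

// pathOf is the reverse lookup that a restored inode would consult.
func (m *inodeMappings) pathOf(ino uint64) (string, bool) {
	p, ok := m.toPath[ino]
	return p, ok
}

func main() {
	m := newInodeMappings()
	ino := m.assign("etc/hosts")
	p, ok := m.pathOf(ino)
	fmt.Println(ino, p, ok)
}
```

Keeping both directions in one structure means a lookup in either direction is a single map access, at the cost of storing each path twice.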
+
+In detail, a mount point is saved in two steps:
+
+- First, after the kernel is paused but before state.Save, we walk all mount
+  namespaces and install a mapping from `fs.Inode` numbers to file paths
+  relative to the root of the mounted filesystem in each
+  `fs.MountedFilesystem`. This is subsequently called the set of `fs.Inode`
+  mappings.
+
+- Second, during state.Save, each `fs.MountedFilesystem` decides whether to
+  save the set of `fs.Inode` mappings. In-memory filesystems, like tmpfs, have
+  no need to save a set of `fs.Inode` mappings, since the `fs.Inode`s can be
+  entirely encoded in the state file. Each `fs.MountedFilesystem` also
+  optionally saves the device name from when the filesystem was originally
+  mounted. Each `fs.Inode` saves its virtual identifier and a reference to an
+  `fs.MountedFilesystem`.
+
+A mount point is restored in two steps:
+
+- First, before state.Load, all mount configurations are stored in a global
+  `fs.RestoreEnvironment`. This tells us what mount points the user wants to
+  restore and how to re-establish pointers to backing filesystems.
+
+- Second, during state.Load, each `fs.MountedFilesystem` optionally searches
+  for a mount in the `fs.RestoreEnvironment` that matches its saved device
+  name. The `fs.MountedFilesystem` then re-establishes a pointer to the root of
+  the mounted filesystem. For example, the mount specification provides the
+  network connection for a mounted remote filesystem client to communicate
+  with its remote file server. The `fs.MountedFilesystem` also trivially loads
+  its set of `fs.Inode` mappings. When an `fs.Inode` is encountered, the
+  `fs.Inode` loads its virtual identifier and its reference to an
+  `fs.MountedFilesystem`. It uses the `fs.MountedFilesystem` to obtain the
+  root of the mounted filesystem and the `fs.Inode` mappings to obtain the
+  relative file path to its data. With these, the `fs.Inode` re-establishes a
+  pointer to its file object.
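The restore steps above can be sketched roughly as follows. This is a simplified model, not the package's code: `restoreEnvironment`, `mountedFilesystem`, `restore`, and `resolve` are hypothetical names, the backing root is modelled as a plain path string rather than a connection, and the device names and paths are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"path"
)

// restoreEnvironment is a hypothetical model of the mount configuration
// supplied before load: saved device name -> root of the backing filesystem.
type restoreEnvironment map[string]string

// mountedFilesystem holds only the state this sketch assumes was saved.
type mountedFilesystem struct {
	device   string            // device name saved at mount time
	root     string            // re-established during restore
	mappings map[uint64]string // inode number -> path relative to root
}

// restore looks for a mount in env that matches the saved device name and
// re-establishes the pointer to the root of the mounted filesystem.
func (m *mountedFilesystem) restore(env restoreEnvironment) error {
	root, ok := env[m.device]
	if !ok {
		return errors.New("no mount matches device " + m.device)
	}
	m.root = root
	return nil
}

// resolve returns the backing path that a restored inode would reopen,
// combining the mount root with the inode's saved relative path.
func (m *mountedFilesystem) resolve(ino uint64) (string, error) {
	rel, ok := m.mappings[ino]
	if !ok {
		return "", fmt.Errorf("inode %d has no saved mapping", ino)
	}
	return path.Join(m.root, rel), nil
}

func main() {
	env := restoreEnvironment{"example-device": "/mnt/restored"}
	mfs := &mountedFilesystem{
		device:   "example-device",
		mappings: map[uint64]string{7: "logs/app.log"},
	}
	if err := mfs.restore(env); err != nil {
		fmt.Println("restore failed:", err)
		return
	}
	p, err := mfs.resolve(7)
	fmt.Println(p, err)
}
```

In this model an inode needs nothing beyond its own number and its mounted filesystem to find its data again.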
+
+A mount point can trivially restore its `fs.Inode`s in parallel since
+`fs.Inode`s have a restore dependency on their `fs.MountedFilesystem` and not on
+each other.
+
+### Open files
+
+An `fs.File` references the following filesystem objects:
+
+```go
+fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
+```
+
+The `fs.Inode` is restored using its `fs.MountedFilesystem`. The [Mount
+points](#mount-points) section above describes how this happens in detail. The
+`fs.Dirent` restores its pointer to an `fs.Inode`, pointers to parent and
+child `fs.Dirent`s, and the basename of the file.
+
+Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only
+used internally).
+
+It may use the `fs.Inode`, which it indirectly holds a reference on through the
+`fs.Dirent`, to re-establish an open file handle on the backing filesystem (e.g.
+to continue reading and writing).
+
+## Overlay
+
+The overlay implementation in the fs package takes Linux overlayfs as a frame of
+reference but corrects for several POSIX consistency errors.
+
+In Linux overlayfs, the `struct inode` used for reading and writing to the same
+file may be different. This is because the `struct inode` is dissociated from
+the process of copying up the file from the lower to the upper directory. Since
+flock(2) and fcntl(2) locks, inotify(7) watches, page caches, and a file's
+identity are all stored directly or indirectly off the `struct inode`, these
+properties of the `struct inode` may be stale after the first modification. This
+can lead to file locking bugs, missed inotify events, and inconsistent data in
+shared memory mappings of files, to name a few problems.
+
+The fs package maintains a single `fs.Inode` to represent a directory entry in
+an overlay and defines operations on this `fs.Inode` which synchronize with the
+copy up process. This achieves several things:
+
++ File locks, inotify watches, and the identity of the file need not be copied
+  at all.
+
++ Memory mappings of files coordinate with the copy up process so that if a
+  file in the lower directory is memory mapped, all references to it are
+  invalidated, forcing the application to re-fault on memory mappings of the
+  file under the upper directory.
+
+The `fs.Inode` holds metadata about files in the upper and/or lower directories
+via an `fs.overlayEntry`. The `fs.overlayEntry` implements the `fs.Mappable`
+interface. It multiplexes between upper and lower directory memory mappings and
+stores a copy of memory references so they can be transferred to the upper
+directory `fs.Mappable` when the file is copied up.
+
+The `fs.Inode` also holds a reference to an `fs.MountedFilesystem` that
+normalizes across the mounted filesystem state of the upper and lower
+directories.
+
+When a file is copied from the lower to the upper directory, attempts to
+interact with the file block until the copy completes. All copying synchronizes
+with rename(2).
+
+## Future Work
+
+### Overlay
+
+When a file is copied from a lower directory to an upper directory, several
+locks are taken: the global `renameMu` and the `copyMu` of the `fs.Inode` being
+copied. This blocks operations on the file, including fault handling of memory
+mappings. Performance could be improved by copying files into a temporary
+directory that resides on the same filesystem as the upper directory and doing
+an atomic rename, holding locks only during the rename operation.
+
+Additionally, files are copied up synchronously. For large files, this causes
+noticeable latency. Performance could be improved by pipelining copies at
+non-overlapping file offsets.
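The copy-then-rename idea proposed under Future Work can be sketched on an ordinary host filesystem with the Go standard library. This is only an illustration of the general technique, not the overlay's implementation: `copyUp` and `demo` are hypothetical helpers, no locking is shown, and the temporary file is staged directly inside the upper directory (a simplification that guarantees it shares a filesystem with the destination).

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyUp is a hypothetical helper: it stages a copy of src in a temporary
// file inside upperDir, then renames it to upperDir/name. In the proposal
// above, only the final rename would need to run under the locks.
func copyUp(src, upperDir, name string) (err error) {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	// Staging inside upperDir keeps the temporary file on the same
	// filesystem, which rename needs in order to be atomic.
	tmp, err := os.CreateTemp(upperDir, ".copyup-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			os.Remove(tmp.Name()) // best-effort cleanup of the staged copy
		}
	}()

	if _, err = io.Copy(tmp, in); err != nil {
		tmp.Close()
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), filepath.Join(upperDir, name))
}

// demo exercises copyUp with throwaway directories and returns the bytes
// found at the destination afterwards.
func demo() (string, error) {
	lower, err := os.MkdirTemp("", "lower-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(lower)
	upper, err := os.MkdirTemp("", "upper-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(upper)

	src := filepath.Join(lower, "file.txt")
	if err := os.WriteFile(src, []byte("lower data"), 0o644); err != nil {
		return "", err
	}
	if err := copyUp(src, upper, "file.txt"); err != nil {
		return "", err
	}
	b, err := os.ReadFile(filepath.Join(upper, "file.txt"))
	return string(b), err
}

func main() {
	got, err := demo()
	fmt.Printf("%q %v\n", got, err)
}
```

Because readers only ever see the destination name either absent or fully written, the slow copy can proceed without blocking them; only the rename has to be serialized.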