Diffstat (limited to 'pkg/sentry/vfs/README.md')
-rw-r--r-- pkg/sentry/vfs/README.md | 197
1 files changed, 197 insertions, 0 deletions
diff --git a/pkg/sentry/vfs/README.md b/pkg/sentry/vfs/README.md
new file mode 100644
index 000000000..7847854bc
--- /dev/null
+++ b/pkg/sentry/vfs/README.md
@@ -0,0 +1,197 @@
+# The gVisor Virtual Filesystem
+
+THIS PACKAGE IS CURRENTLY EXPERIMENTAL AND NOT READY OR ENABLED FOR PRODUCTION
+USE. For the filesystem implementation currently used by gVisor, see the `fs`
+package.
+
+## Implementation Notes
+
+### Reference Counting
+
+Filesystem, Dentry, Mount, MountNamespace, and FileDescription are all
+reference-counted. Mount and MountNamespace are exclusively VFS-managed; when
+their reference count reaches zero, VFS releases their resources. Filesystem and
+FileDescription management is shared between VFS and filesystem implementations;
+when their reference count reaches zero, VFS notifies the implementation by
+calling `FilesystemImpl.Release()` or `FileDescriptionImpl.Release()`
+respectively and then releases VFS-owned resources. Dentries are exclusively
+managed by filesystem implementations; reference count changes are abstracted
+through DentryImpl, which should release resources when the reference count
+reaches zero.
+
+Filesystem references are held by:
+
+- Mount: Each referenced Mount holds a reference on the mounted Filesystem.
+
+Dentry references are held by:
+
+- FileDescription: Each referenced FileDescription holds a reference on the
+ Dentry through which it was opened, via `FileDescription.vd.dentry`.
+
+- Mount: Each referenced Mount holds a reference on its mount point and on the
+ mounted filesystem root. The mount point is mutable (`mount(MS_MOVE)`).
+
+Mount references are held by:
+
+- FileDescription: Each referenced FileDescription holds a reference on the
+ Mount on which it was opened, via `FileDescription.vd.mount`.
+
+- Mount: Each referenced Mount holds a reference on its parent, which is the
+ mount containing its mount point.
+
+- VirtualFilesystem: A reference is held on all Mounts that are attached
+ (reachable by Mount traversal).
+
+MountNamespace and FileDescription references are held by users of VFS. The
+expectation is that each `kernel.Task` holds a reference on its corresponding
+MountNamespace, and each file descriptor holds a reference on its represented
+FileDescription.
+
+Notes:
+
+- Dentries do not hold a reference on their owning Filesystem. Instead, all
+ uses of a Dentry occur in the context of a Mount, which holds a reference on
+ the relevant Filesystem (see e.g. the VirtualDentry type). As a corollary,
+ when releasing references on both a Dentry and its corresponding Mount, the
+ Dentry's reference must be released first (because releasing the Mount's
+ reference may release the last reference on the Filesystem, whose state may
+ be required to release the Dentry reference).
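+
+The release-ordering rule above can be illustrated with a minimal sketch. The
+types below (`filesystem`, `mount`, `dentry`, `virtualDentry`) are simplified
+stand-ins invented for this example, not the real `vfs` types, and the `alive`
+flag only models "filesystem state is still available":
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"sync/atomic"
+)
+
+// filesystem is a hypothetical stand-in for a mounted filesystem.
+type filesystem struct {
+	refs  int64
+	alive bool // false once the last reference has been dropped
+}
+
+func (fs *filesystem) decRef() {
+	if atomic.AddInt64(&fs.refs, -1) == 0 {
+		fs.alive = false
+	}
+}
+
+// mount holds a counted reference on its filesystem.
+type mount struct {
+	refs int64
+	fs   *filesystem
+}
+
+func (m *mount) decRef() {
+	if atomic.AddInt64(&m.refs, -1) == 0 {
+		m.fs.decRef()
+	}
+}
+
+// dentry points at its filesystem but does NOT hold a reference on it.
+type dentry struct {
+	refs int64
+	fs   *filesystem
+}
+
+// decRef needs filesystem state, so it fails if the filesystem is gone.
+func (d *dentry) decRef() error {
+	if !d.fs.alive {
+		return errors.New("filesystem released before dentry")
+	}
+	atomic.AddInt64(&d.refs, -1)
+	return nil
+}
+
+// virtualDentry pairs a dentry with the mount it is used through.
+type virtualDentry struct {
+	mount  *mount
+	dentry *dentry
+}
+
+// decRef releases the dentry reference first, then the mount reference.
+func (vd virtualDentry) decRef() error {
+	err := vd.dentry.decRef()
+	vd.mount.decRef()
+	return err
+}
+
+func newVD() virtualDentry {
+	fs := &filesystem{refs: 1, alive: true}
+	return virtualDentry{
+		mount:  &mount{refs: 1, fs: fs},
+		dentry: &dentry{refs: 1, fs: fs},
+	}
+}
+
+func main() {
+	vd := newVD()
+	if err := vd.decRef(); err != nil {
+		fmt.Println("release failed:", err)
+		return
+	}
+	fmt.Println("released dentry, then mount")
+}
+```
+
+In this sketch, swapping the two calls in `virtualDentry.decRef` would drop the
+last filesystem reference first, and the dentry release would then fail.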
+
+### The Inheritance Pattern
+
+Filesystem, Dentry, and FileDescription are all concepts featuring both state
+that must be shared between VFS and filesystem implementations, and operations
+that are implementation-defined. To facilitate this, each of these three
+concepts follows the same pattern, shown below for Dentry:
+
+```go
+// Dentry represents a node in a filesystem tree.
+type Dentry struct {
+ // VFS-required dentry state.
+ parent *Dentry
+ // ...
+
+ // impl is the DentryImpl associated with this Dentry. impl is immutable.
+ // This should be the last field in Dentry.
+ impl DentryImpl
+}
+
+// Init must be called before first use of d.
+func (d *Dentry) Init(impl DentryImpl) {
+ d.impl = impl
+}
+
+// Impl returns the DentryImpl associated with d.
+func (d *Dentry) Impl() DentryImpl {
+ return d.impl
+}
+
+// DentryImpl contains implementation-specific details of a Dentry.
+// Implementations of DentryImpl should contain their associated Dentry by
+// value as their first field.
+type DentryImpl interface {
+ // VFS-required implementation-defined dentry operations.
+ IncRef()
+ // ...
+}
+```
+
+This construction, which is essentially a type-safe analogue to Linux's
+`container_of` pattern, has the following properties:
+
+- VFS works almost exclusively with pointers to Dentry rather than DentryImpl
+ interface objects, such as in the type of `Dentry.parent`. This avoids
+ interface method calls (which are somewhat expensive to perform, and defeat
+ inlining and escape analysis), reduces the size of VFS types (since an
+ interface object is two pointers in size), and allows pointers to be loaded
+ and stored atomically using `sync/atomic`. Implementation-defined behavior
+ is accessed via `Dentry.impl` when required.
+
+- Filesystem implementations can access the implementation-defined state
+ associated with objects of VFS types by type-asserting or type-switching
+ (e.g. `Dentry.Impl().(*myDentry)`). Type assertions to a concrete type
+ require only an equality comparison of the interface object's type pointer
+ to a static constant, and are consequently very fast.
+
+- Filesystem implementations can access the VFS state associated with objects
+ of implementation-defined types directly.
+
+- VFS and implementation-defined state for a given type occupy the same
+ object, minimizing memory allocations and maximizing memory locality. `impl`
+ is the last field in `Dentry`, and `Dentry` is the first field in
+ `DentryImpl` implementations, for similar reasons: this tends to cause
+ fetching of the `Dentry.impl` interface object to also fetch `DentryImpl`
+ fields, either because they are in the same cache line or via next-line
+ prefetching.
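+
+As a rough illustration of these properties, the sketch below shows how a
+filesystem implementation might use the pattern. `Dentry` and `DentryImpl` are
+reduced copies of the types shown earlier; `myDentry`, `newMyDentry`, and
+`parentName` are hypothetical names introduced only for this example, and the
+reference count is a plain integer for brevity:
+
+```go
+package main
+
+import "fmt"
+
+// Dentry is a reduced copy of the VFS-side type shown above.
+type Dentry struct {
+	parent *Dentry
+	impl   DentryImpl
+}
+
+func (d *Dentry) Init(impl DentryImpl) { d.impl = impl }
+func (d *Dentry) Impl() DentryImpl     { return d.impl }
+
+// DentryImpl is a reduced copy of the implementation-side interface.
+type DentryImpl interface {
+	IncRef()
+}
+
+// myDentry is a hypothetical implementation-defined dentry. It contains its
+// Dentry by value as its first field, so both live in one allocation.
+type myDentry struct {
+	vfsd Dentry
+	refs int64 // not atomic here; a real implementation would need to be
+	name string
+}
+
+func (d *myDentry) IncRef() { d.refs++ }
+
+func newMyDentry(name string, parent *myDentry) *myDentry {
+	d := &myDentry{refs: 1, name: name}
+	d.vfsd.Init(d)
+	if parent != nil {
+		// VFS state is reachable directly from the implementation type.
+		d.vfsd.parent = &parent.vfsd
+	}
+	return d
+}
+
+// parentName receives a plain *Dentry, as VFS would pass it, and recovers
+// implementation-defined state with a type assertion.
+func parentName(vfsd *Dentry) string {
+	if vfsd.parent == nil {
+		return ""
+	}
+	return vfsd.parent.Impl().(*myDentry).name
+}
+
+func main() {
+	root := newMyDentry("/", nil)
+	child := newMyDentry("etc", root)
+	child.vfsd.Impl().IncRef() // implementation-defined operation via impl
+	fmt.Println(parentName(&child.vfsd), child.refs)
+}
+```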
+
+## Future Work
+
+- Most `mount(2)` features, and unmounting, are incomplete.
+
+- VFS1 filesystems are not directly compatible with VFS2. It may be possible
+ to write shims that implement `vfs.FilesystemImpl` for
+ `fs.MountNamespace`, `vfs.DentryImpl` for `fs.Dirent`, and
+ `vfs.FileDescriptionImpl` for `fs.File`, which may be adequate for
+ filesystems that are not performance-critical (e.g. sysfs); however, it is
+ not clear that this will be less effort than simply porting the filesystems
+ in question. Practically speaking, the following filesystems will probably
+ need to be ported or made compatible through a shim to evaluate filesystem
+ performance on realistic workloads:
+
+ - devfs/procfs/sysfs, which will realistically be necessary to execute
+ most applications. (Note that procfs and sysfs do not support hard
+ links, so they do not require the complexity of separate inode objects.
+ Also note that Linux's /dev is actually a variant of tmpfs called
+ devtmpfs.)
+
+ - tmpfs. This should be relatively straightforward: copy/paste memfs,
+ store regular file contents in pgalloc-allocated memory instead of
+ `[]byte`, and add support for file timestamps. (In fact, it probably
+ makes more sense to convert memfs to tmpfs and not keep the former.)
+
+ - A remote filesystem, either lisafs (if it is ready by the time that
+ other benchmarking prerequisites are) or v9fs (aka 9P, aka gofers).
+
+ - epoll files.
+
+ Filesystems that will need to be ported before switching to VFS2, but can
+ probably be skipped for early testing:
+
+ - overlayfs, which is needed for (at least) synthetic mount points.
+
+ - Support for host ttys.
+
+ - timerfd files.
+
+ Filesystems that can probably be dropped:
+
+ - ashmem, which is far too incomplete to use.
+
+ - binder, which is similarly far too incomplete to use.
+
+ - whitelistfs, which we are already actively attempting to remove.
+
+- Save/restore. For instance, it is unclear if the current implementation of
+ the `state` package supports the inheritance pattern described above.
+
+- Many features that were previously implemented by VFS must now be
+ implemented by individual filesystems (though, in most cases, this should
+ consist of calls to hooks or libraries provided by `vfs` or other packages).
+ This includes, but is not necessarily limited to:
+
+ - Block and character device special files
+
+ - Inotify
+
+ - File locking
+
+ - `O_ASYNC`
+
+- Reference counts in the `vfs` package do not use the `refs` package since
+ `refs.AtomicRefCount` adds 64 bytes of overhead to each 8-byte reference
+ count, resulting in considerable cache bloat. 24 bytes of this overhead are
+ for weak reference support; weak references perform poorly and will not be
+ used by VFS2. The remaining 40 bytes store a descriptive string and stack
+ trace for reference leak checking; we can support reference leak checking
+ without incurring this space overhead by including the applicable
+ information directly in finalizers for applicable types.
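+
+One way the finalizer-based leak checking in the last item might look is
+sketched below. This is an assumption about a possible design, not the
+package's implementation: `file`, `newFile`, and `leakMessage` are hypothetical
+names, and the description is captured by the finalizer closure registered with
+`runtime.SetFinalizer` instead of being stored next to the 8-byte count:
+
+```go
+package main
+
+import (
+	"fmt"
+	"runtime"
+	"sync/atomic"
+)
+
+// file is a hypothetical reference-counted object. Only the 8-byte counter is
+// spent on reference counting; no description or stack trace is stored here.
+type file struct {
+	refs    int64
+	payload []byte // stands in for resources owned by the object
+}
+
+// leakMessage returns a non-empty report if f still has references.
+func leakMessage(f *file, desc string) string {
+	if n := atomic.LoadInt64(&f.refs); n != 0 {
+		return fmt.Sprintf("leak: %s collected with %d reference(s)", desc, n)
+	}
+	return ""
+}
+
+func newFile(desc string) *file {
+	f := &file{refs: 1, payload: make([]byte, 64)}
+	// desc lives in the finalizer closure rather than in the file itself.
+	// The Go runtime does not guarantee when, or whether, finalizers run.
+	runtime.SetFinalizer(f, func(f *file) {
+		if msg := leakMessage(f, desc); msg != "" {
+			fmt.Println(msg)
+		}
+	})
+	return f
+}
+
+func (f *file) IncRef() { atomic.AddInt64(&f.refs, 1) }
+
+func (f *file) DecRef() {
+	if atomic.AddInt64(&f.refs, -1) == 0 {
+		f.payload = nil // release resources
+	}
+}
+
+func main() {
+	f := newFile("file opened by example")
+	f.IncRef()
+	f.DecRef()
+	f.DecRef()
+	fmt.Println("leak-free:", leakMessage(f, "example file") == "")
+}
+```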