diff options
Diffstat (limited to 'pkg/sentry/vfs/README.md')
-rw-r--r-- | pkg/sentry/vfs/README.md | 197 |
1 files changed, 197 insertions, 0 deletions
diff --git a/pkg/sentry/vfs/README.md b/pkg/sentry/vfs/README.md new file mode 100644 index 000000000..7847854bc --- /dev/null +++ b/pkg/sentry/vfs/README.md @@ -0,0 +1,197 @@ +# The gVisor Virtual Filesystem + +THIS PACKAGE IS CURRENTLY EXPERIMENTAL AND NOT READY OR ENABLED FOR PRODUCTION +USE. For the filesystem implementation currently used by gVisor, see the `fs` +package. + +## Implementation Notes + +### Reference Counting + +Filesystem, Dentry, Mount, MountNamespace, and FileDescription are all +reference-counted. Mount and MountNamespace are exclusively VFS-managed; when +their reference count reaches zero, VFS releases their resources. Filesystem and +FileDescription management is shared between VFS and filesystem implementations; +when their reference count reaches zero, VFS notifies the implementation by +calling `FilesystemImpl.Release()` or `FileDescriptionImpl.Release()` +respectively and then releases VFS-owned resources. Dentries are exclusively +managed by filesystem implementations; reference count changes are abstracted +through DentryImpl, which should release resources when reference count reaches +zero. + +Filesystem references are held by: + +- Mount: Each referenced Mount holds a reference on the mounted Filesystem. + +Dentry references are held by: + +- FileDescription: Each referenced FileDescription holds a reference on the + Dentry through which it was opened, via `FileDescription.vd.dentry`. + +- Mount: Each referenced Mount holds a reference on its mount point and on the + mounted filesystem root. The mount point is mutable (`mount(MS_MOVE)`). + +Mount references are held by: + +- FileDescription: Each referenced FileDescription holds a reference on the + Mount on which it was opened, via `FileDescription.vd.mount`. + +- Mount: Each referenced Mount holds a reference on its parent, which is the + mount containing its mount point. + +- VirtualFilesystem: A reference is held on all Mounts that are attached + (reachable by Mount traversal). + +MountNamespace and FileDescription references are held by users of VFS. The +expectation is that each `kernel.Task` holds a reference on its corresponding +MountNamespace, and each file descriptor holds a reference on its represented +FileDescription. + +Notes: + +- Dentries do not hold a reference on their owning Filesystem. Instead, all + uses of a Dentry occur in the context of a Mount, which holds a reference on + the relevant Filesystem (see e.g. the VirtualDentry type). As a corollary, + when releasing references on both a Dentry and its corresponding Mount, the + Dentry's reference must be released first (because releasing the Mount's + reference may release the last reference on the Filesystem, whose state may + be required to release the Dentry reference). + +### The Inheritance Pattern + +Filesystem, Dentry, and FileDescription are all concepts featuring both state +that must be shared between VFS and filesystem implementations, and operations +that are implementation-defined. To facilitate this, each of these three +concepts follows the same pattern, shown below for Dentry: + +```go +// Dentry represents a node in a filesystem tree. +type Dentry struct { + // VFS-required dentry state. + parent *Dentry + // ... + + // impl is the DentryImpl associated with this Dentry. impl is immutable. + // This should be the last field in Dentry. + impl DentryImpl +} + +// Init must be called before first use of d. +func (d *Dentry) Init(impl DentryImpl) { + d.impl = impl +} + +// Impl returns the DentryImpl associated with d. +func (d *Dentry) Impl() DentryImpl { + return d.impl +} + +// DentryImpl contains implementation-specific details of a Dentry. +// Implementations of DentryImpl should contain their associated Dentry by +// value as their first field. +type DentryImpl interface { + // VFS-required implementation-defined dentry operations. + IncRef() + // ... +} +``` + +This construction, which is essentially a type-safe analogue to Linux's +`container_of` pattern, has the following properties: + +- VFS works almost exclusively with pointers to Dentry rather than DentryImpl + interface objects, such as in the type of `Dentry.parent`. This avoids + interface method calls (which are somewhat expensive to perform, and defeat + inlining and escape analysis), reduces the size of VFS types (since an + interface object is two pointers in size), and allows pointers to be loaded + and stored atomically using `sync/atomic`. Implementation-defined behavior + is accessed via `Dentry.impl` when required. + +- Filesystem implementations can access the implementation-defined state + associated with objects of VFS types by type-asserting or type-switching + (e.g. `Dentry.Impl().(*myDentry)`). Type assertions to a concrete type + require only an equality comparison of the interface object's type pointer + to a static constant, and are consequently very fast. + +- Filesystem implementations can access the VFS state associated with objects + of implementation-defined types directly. + +- VFS and implementation-defined state for a given type occupy the same + object, minimizing memory allocations and maximizing memory locality. `impl` + is the last field in `Dentry`, and `Dentry` is the first field in + `DentryImpl` implementations, for similar reasons: this tends to cause + fetching of the `Dentry.impl` interface object to also fetch `DentryImpl` + fields, either because they are in the same cache line or via next-line + prefetching. + +## Future Work + +- Most `mount(2)` features, and unmounting, are incomplete. + +- VFS1 filesystems are not directly compatible with VFS2. It may be possible + to implement shims that implement `vfs.FilesystemImpl` for + `fs.MountNamespace`, `vfs.DentryImpl` for `fs.Dirent`, and + `vfs.FileDescriptionImpl` for `fs.File`, which may be adequate for + filesystems that are not performance-critical (e.g. sysfs); however, it is + not clear that this will be less effort than simply porting the filesystems + in question. Practically speaking, the following filesystems will probably + need to be ported or made compatible through a shim to evaluate filesystem + performance on realistic workloads: + + - devfs/procfs/sysfs, which will realistically be necessary to execute + most applications. (Note that procfs and sysfs do not support hard + links, so they do not require the complexity of separate inode objects. + Also note that Linux's /dev is actually a variant of tmpfs called + devtmpfs.) + + - tmpfs. This should be relatively straightforward: copy/paste memfs, + store regular file contents in pgalloc-allocated memory instead of + `[]byte`, and add support for file timestamps. (In fact, it probably + makes more sense to convert memfs to tmpfs and not keep the former.) + + - A remote filesystem, either lisafs (if it is ready by the time that + other benchmarking prerequisites are) or v9fs (aka 9P, aka gofers). + + - epoll files. + + Filesystems that will need to be ported before switching to VFS2, but can + probably be skipped for early testing: + + - overlayfs, which is needed for (at least) synthetic mount points. + + - Support for host ttys. + + - timerfd files. + + Filesystems that can be probably dropped: + + - ashmem, which is far too incomplete to use. + + - binder, which is similarly far too incomplete to use. + + - whitelistfs, which we are already actively attempting to remove. + +- Save/restore. For instance, it is unclear if the current implementation of + the `state` package supports the inheritance pattern described above. + +- Many features that were previously implemented by VFS must now be + implemented by individual filesystems (though, in most cases, this should + consist of calls to hooks or libraries provided by `vfs` or other packages). + This includes, but is not necessarily limited to: + + - Block and character device special files + + - Inotify + + - File locking + + - `O_ASYNC` + +- Reference counts in the `vfs` package do not use the `refs` package since + `refs.AtomicRefCount` adds 64 bytes of overhead to each 8-byte reference + count, resulting in considerable cache bloat. 24 bytes of this overhead is + for weak reference support, which have poor performance and will not be used + by VFS2. The remaining 40 bytes is to store a descriptive string and stack + trace for reference leak checking; we can support reference leak checking + without incurring this space overhead by including the applicable + information directly in finalizers for applicable types. |