# The gVisor Virtual Filesystem

THIS PACKAGE IS CURRENTLY EXPERIMENTAL AND NOT READY OR ENABLED FOR PRODUCTION
USE. For the filesystem implementation currently used by gVisor, see the `fs`
package.

## Implementation Notes

### Reference Counting

Filesystem, Dentry, Mount, MountNamespace, and FileDescription are all
reference-counted. Mount and MountNamespace are exclusively VFS-managed; when
their reference count reaches zero, VFS releases their resources. Filesystem and
FileDescription management is shared between VFS and filesystem implementations;
when their reference count reaches zero, VFS notifies the implementation by
calling `FilesystemImpl.Release()` or `FileDescriptionImpl.Release()`
respectively and then releases VFS-owned resources. Dentries are exclusively
managed by filesystem implementations; reference count changes are abstracted
through DentryImpl, which should release resources when the reference count
reaches zero.
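
As a minimal sketch of the shared-management case, the following shows a
FileDescription-like type that notifies its implementation when the last
reference is dropped and then releases its own state. The types here are
reduced, hypothetical stand-ins rather than the actual `vfs` definitions; the
`released` field and `countingImpl` exist only for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FileDescriptionImpl is a reduced, hypothetical stand-in for the interface
// that filesystem implementations provide.
type FileDescriptionImpl interface {
	Release()
}

// FileDescription holds VFS-owned state, including the reference count.
type FileDescription struct {
	refs int64
	impl FileDescriptionImpl

	// released records that VFS-owned resources were dropped. It exists
	// only so that this sketch can be observed; it is an assumption.
	released bool
}

// NewFileDescription returns a FileDescription holding one reference.
func NewFileDescription(impl FileDescriptionImpl) *FileDescription {
	return &FileDescription{refs: 1, impl: impl}
}

// IncRef takes an additional reference.
func (fd *FileDescription) IncRef() {
	atomic.AddInt64(&fd.refs, 1)
}

// DecRef drops a reference. When the count reaches zero, the implementation
// is notified first, and VFS-owned resources are released afterwards.
func (fd *FileDescription) DecRef() {
	refs := atomic.AddInt64(&fd.refs, -1)
	if refs < 0 {
		panic("FileDescription.DecRef() called without holding a reference")
	}
	if refs == 0 {
		fd.impl.Release()
		fd.released = true
	}
}

// countingImpl is a hypothetical implementation that counts Release calls.
type countingImpl struct {
	releases int
}

func (c *countingImpl) Release() { c.releases++ }

func main() {
	impl := &countingImpl{}
	fd := NewFileDescription(impl)
	fd.IncRef()
	fd.DecRef() // one reference remains; nothing is released
	fd.DecRef() // last reference; impl.Release() runs, then VFS cleanup
	fmt.Println("impl releases:", impl.releases, "vfs released:", fd.released)
}
```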

Filesystem references are held by:

-   Mount: Each referenced Mount holds a reference on the mounted Filesystem.

Dentry references are held by:

-   FileDescription: Each referenced FileDescription holds a reference on the
    Dentry through which it was opened, via `FileDescription.vd.dentry`.

-   Mount: Each referenced Mount holds a reference on its mount point and on the
    mounted filesystem root. The mount point is mutable (`mount(MS_MOVE)`).

Mount references are held by:

-   FileDescription: Each referenced FileDescription holds a reference on the
    Mount on which it was opened, via `FileDescription.vd.mount`.

-   Mount: Each referenced Mount holds a reference on its parent, which is the
    mount containing its mount point.

-   VirtualFilesystem: A reference is held on all Mounts that are attached
    (reachable by Mount traversal).

MountNamespace and FileDescription references are held by users of VFS. The
expectation is that each `kernel.Task` holds a reference on its corresponding
MountNamespace, and each file descriptor holds a reference on its represented
FileDescription.

Notes:

-   Dentries do not hold a reference on their owning Filesystem. Instead, all
    uses of a Dentry occur in the context of a Mount, which holds a reference on
    the relevant Filesystem (see e.g. the VirtualDentry type). As a corollary,
    when releasing references on both a Dentry and its corresponding Mount, the
    Dentry's reference must be released first (because releasing the Mount's
    reference may release the last reference on the Filesystem, whose state may
    be required to release the Dentry reference).
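
The ordering requirement in the note above can be sketched as follows. All
types are reduced, hypothetical stand-ins with plain (non-atomic) counters, and
the error returned by `Dentry.DecRef` is an illustrative device for making a
wrong ordering visible; it is not part of the real interface.

```go
package main

import (
	"errors"
	"fmt"
)

// Filesystem is a hypothetical stand-in; released marks its state as gone.
type Filesystem struct {
	refs     int
	released bool
}

func (fs *Filesystem) DecRef() {
	fs.refs--
	if fs.refs == 0 {
		fs.released = true
	}
}

// Mount holds a counted reference on the mounted Filesystem.
type Mount struct {
	refs int
	fs   *Filesystem
}

func (m *Mount) DecRef() {
	m.refs--
	if m.refs == 0 {
		m.fs.DecRef() // may drop the last reference on the Filesystem
	}
}

// Dentry uses Filesystem state but does NOT hold a counted reference on it.
type Dentry struct {
	refs int
	fs   *Filesystem
}

// DecRef reports an error if the Filesystem state it needs is already gone.
func (d *Dentry) DecRef() error {
	if d.fs.released {
		return errors.New("dentry released after its filesystem")
	}
	d.refs--
	return nil
}

// VirtualDentry pairs a Dentry with the Mount through which it is used.
type VirtualDentry struct {
	mount  *Mount
	dentry *Dentry
}

// DecRef releases the Dentry reference first, then the Mount reference.
func (vd VirtualDentry) DecRef() error {
	err := vd.dentry.DecRef()
	vd.mount.DecRef()
	return err
}

func main() {
	fs := &Filesystem{refs: 1}
	vd := VirtualDentry{
		mount:  &Mount{refs: 1, fs: fs},
		dentry: &Dentry{refs: 1, fs: fs},
	}
	fmt.Println("release error:", vd.DecRef(), "filesystem released:", fs.released)
}
```

Reversing the two calls in `VirtualDentry.DecRef` would, in this sketch, make
the Dentry release observe an already-released Filesystem.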

### The Inheritance Pattern

Filesystem, Dentry, and FileDescription are all concepts featuring both state
that must be shared between VFS and filesystem implementations, and operations
that are implementation-defined. To facilitate this, each of these three
concepts follows the same pattern, shown below for Dentry:

```go
// Dentry represents a node in a filesystem tree.
type Dentry struct {
  // VFS-required dentry state.
  parent *Dentry
  // ...

  // impl is the DentryImpl associated with this Dentry. impl is immutable.
  // This should be the last field in Dentry.
  impl DentryImpl
}

// Init must be called before first use of d.
func (d *Dentry) Init(impl DentryImpl) {
  d.impl = impl
}

// Impl returns the DentryImpl associated with d.
func (d *Dentry) Impl() DentryImpl {
  return d.impl
}

// DentryImpl contains implementation-specific details of a Dentry.
// Implementations of DentryImpl should contain their associated Dentry by
// value as their first field.
type DentryImpl interface {
  // VFS-required implementation-defined dentry operations.
  IncRef()
  // ...
}
```

This construction, which is essentially a type-safe analogue to Linux's
`container_of` pattern, has the following properties:

-   VFS works almost exclusively with pointers to Dentry rather than DentryImpl
    interface objects, such as in the type of `Dentry.parent`. This avoids
    interface method calls (which are somewhat expensive to perform, and defeat
    inlining and escape analysis), reduces the size of VFS types (since an
    interface object is two pointers in size), and allows pointers to be loaded
    and stored atomically using `sync/atomic`. Implementation-defined behavior
    is accessed via `Dentry.impl` when required.

-   Filesystem implementations can access the implementation-defined state
    associated with objects of VFS types by type-asserting or type-switching
    (e.g. `Dentry.Impl().(*myDentry)`). Type assertions to a concrete type
    require only an equality comparison of the interface object's type pointer
    to a static constant, and are consequently very fast.

-   Filesystem implementations can access the VFS state associated with objects
    of implementation-defined types directly.

-   VFS and implementation-defined state for a given type occupy the same
    object, minimizing memory allocations and maximizing memory locality. `impl`
    is the last field in `Dentry`, and `Dentry` is the first field in
    `DentryImpl` implementations, for similar reasons: this tends to cause
    fetching of the `Dentry.impl` interface object to also fetch `DentryImpl`
    fields, either because they are in the same cache line or via next-line
    prefetching.
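
To illustrate the implementation side of the pattern, here is a minimal sketch
of a filesystem-specific dentry type. `myDentry`, its `vfsd`, `refs`, and `name`
fields, and the `newMyDentry` and `parentName` helpers are hypothetical names
chosen for this example, and `Dentry` and `DentryImpl` are reduced copies of the
types shown above so that the example is self-contained.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// DentryImpl is a reduced copy of the interface shown above.
type DentryImpl interface {
	IncRef()
}

// Dentry is a reduced copy of the VFS type shown above.
type Dentry struct {
	parent *Dentry
	impl   DentryImpl // last field
}

func (d *Dentry) Init(impl DentryImpl) { d.impl = impl }
func (d *Dentry) Impl() DentryImpl     { return d.impl }

// myDentry is a hypothetical implementation-defined dentry. It contains its
// Dentry by value as its first field, so both live in one allocation.
type myDentry struct {
	vfsd Dentry
	refs int64
	name string
}

// IncRef implements DentryImpl.IncRef.
func (d *myDentry) IncRef() { atomic.AddInt64(&d.refs, 1) }

// newMyDentry allocates a myDentry and links it to its embedded Dentry.
func newMyDentry(name string, parent *myDentry) *myDentry {
	d := &myDentry{refs: 1, name: name}
	d.vfsd.Init(d)
	if parent != nil {
		// Implementation to VFS state: direct field access.
		d.vfsd.parent = &parent.vfsd
	}
	return d
}

// parentName goes from VFS state to implementation state by type-asserting
// the DentryImpl to the concrete type.
func parentName(vfsd *Dentry) string {
	if vfsd.parent == nil {
		return ""
	}
	return vfsd.parent.Impl().(*myDentry).name
}

func main() {
	root := newMyDentry("/", nil)
	child := newMyDentry("etc", root)
	fmt.Println("parent of", child.name, "is", parentName(&child.vfsd))
}
```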

## Future Work

-   Most `mount(2)` features, and unmounting, are incomplete.

-   VFS1 filesystems are not directly compatible with VFS2. It may be possible
    to implement shims that implement `vfs.FilesystemImpl` for
    `fs.MountNamespace`, `vfs.DentryImpl` for `fs.Dirent`, and
    `vfs.FileDescriptionImpl` for `fs.File`, which may be adequate for
    filesystems that are not performance-critical (e.g. sysfs); however, it is
    not clear that this will be less effort than simply porting the filesystems
    in question. Practically speaking, the following filesystems will probably
    need to be ported or made compatible through a shim to evaluate filesystem
    performance on realistic workloads:

    -   devfs/procfs/sysfs, which will realistically be necessary to execute
        most applications. (Note that procfs and sysfs do not support hard
        links, so they do not require the complexity of separate inode objects.
        Also note that Linux's /dev is actually a variant of tmpfs called
        devtmpfs.)

    -   tmpfs. This should be relatively straightforward: copy/paste memfs,
        store regular file contents in pgalloc-allocated memory instead of
        `[]byte`, and add support for file timestamps. (In fact, it probably
        makes more sense to convert memfs to tmpfs and not keep the former.)

    -   A remote filesystem, either lisafs (if it is ready by the time that
        other benchmarking prerequisites are) or v9fs (aka 9P, aka gofers).

    -   epoll files.

    Filesystems that will need to be ported before switching to VFS2, but can
    probably be skipped for early testing:

    -   overlayfs, which is needed for (at least) synthetic mount points.

    -   Support for host ttys.

    -   timerfd files.

    Filesystems that can probably be dropped:

    -   ashmem, which is far too incomplete to use.

    -   binder, which is similarly far too incomplete to use.

    -   whitelistfs, which we are already actively attempting to remove.

-   Save/restore. For instance, it is unclear whether the current implementation
    of the `state` package supports the inheritance pattern described above.

-   Many features that were previously implemented by VFS must now be
    implemented by individual filesystems (though, in most cases, this should
    consist of calls to hooks or libraries provided by `vfs` or other packages).
    This includes, but is not necessarily limited to:

    -   Block and character device special files

    -   Inotify

    -   File locking

    -   `O_ASYNC`

-   Reference counts in the `vfs` package do not use the `refs` package since
    `refs.AtomicRefCount` adds 64 bytes of overhead to each 8-byte reference
    count, resulting in considerable cache bloat. 24 bytes of this overhead are
    for weak reference support, which performs poorly and will not be used by
    VFS2. The remaining 40 bytes store a descriptive string and stack trace for
    reference leak checking; we can support reference leak checking without
    incurring this space overhead by including the applicable information
    directly in finalizers for applicable types.
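
One possible shape for finalizer-based leak checking is sketched below. This is
an assumption about how the idea could be realized, not the `vfs` package's
actual code: `Mount` is a reduced stand-in, `leakMessage` is a hypothetical
helper, and the sketch records no stack trace. The descriptive type name lives
in the finalizer code rather than in each object, so the per-object cost stays
at the 8-byte counter.

```go
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

// Mount is a reduced, hypothetical stand-in whose only reference-counting
// state is an 8-byte counter.
type Mount struct {
	refs int64
}

// NewMount returns a Mount holding one reference and registers a finalizer
// that reports a leak if the object is collected with references remaining.
func NewMount() *Mount {
	m := &Mount{refs: 1}
	// The finalizer receives the object as its argument and does not capture
	// it in a closure, so it does not keep the object reachable.
	runtime.SetFinalizer(m, func(m *Mount) {
		if msg := m.leakMessage(); msg != "" {
			fmt.Println(msg)
		}
	})
	return m
}

// leakMessage returns a description of the leak, or "" if no references
// remain. The type name is a constant here, not per-object state.
func (m *Mount) leakMessage() string {
	if refs := atomic.LoadInt64(&m.refs); refs != 0 {
		return fmt.Sprintf("vfs.Mount %p collected with %d leaked reference(s)", m, refs)
	}
	return ""
}

// IncRef takes an additional reference.
func (m *Mount) IncRef() { atomic.AddInt64(&m.refs, 1) }

// DecRef drops a reference.
func (m *Mount) DecRef() {
	if atomic.AddInt64(&m.refs, -1) < 0 {
		panic("Mount.DecRef() called without holding a reference")
	}
}

func main() {
	m := NewMount()
	fmt.Println("leak before DecRef:", m.leakMessage() != "")
	m.DecRef()
	fmt.Println("leak after DecRef:", m.leakMessage() != "")
}
```

Because the Go runtime does not guarantee when, or whether, a finalizer runs, a
check like this can only report leaks on a best-effort basis.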