diff options
Diffstat (limited to 'pkg/lisafs/README.md')
-rw-r--r-- | pkg/lisafs/README.md | 363 |
1 files changed, 0 insertions, 363 deletions
diff --git a/pkg/lisafs/README.md b/pkg/lisafs/README.md deleted file mode 100644 index 51d0d40e5..000000000 --- a/pkg/lisafs/README.md +++ /dev/null @@ -1,363 +0,0 @@ -# Replacing 9P - -## Background - -The Linux filesystem model consists of the following key aspects (modulo mounts, -which are outside the scope of this discussion): - -- A `struct inode` represents a "filesystem object", such as a directory or a - regular file. "Filesystem object" is most precisely defined by the practical - properties of an inode, such as an immutable type (regular file, directory, - symbolic link, etc.) and its independence from the path originally used to - obtain it. - -- A `struct dentry` represents a node in a filesystem tree. Semantically, each - dentry is immutably associated with an inode representing the filesystem - object at that position. (Linux implements optimizations involving reuse of - unreferenced dentries, which allows their associated inodes to change, but - this is outside the scope of this discussion.) - -- A `struct file` represents an open file description (hereafter FD) and is - needed to perform I/O. Each FD is immutably associated with the dentry - through which it was opened. - -The current gVisor virtual filesystem implementation (hereafter VFS1) closely -imitates the Linux design: - -- `struct inode` => `fs.Inode` - -- `struct dentry` => `fs.Dirent` - -- `struct file` => `fs.File` - -gVisor accesses most external filesystems through a variant of the 9P2000.L -protocol, including extensions for performance (`walkgetattr`) and for features -not supported by vanilla 9P2000.L (`flushf`, `lconnect`). The 9P protocol family -is inode-based; 9P fids represent a file (equivalently "file system object"), -and the protocol is structured around alternatively obtaining fids to represent -files (with `walk` and, in gVisor, `walkgetattr`) and performing operations on -those fids. - -In the sections below, a **shared** filesystem is a filesystem that is *mutably* -accessible by multiple concurrent clients, such that a **non-shared** filesystem -is a filesystem that is either read-only or accessible by only a single client. - -## Problems - -### Serialization of Path Component RPCs - -Broadly speaking, VFS1 traverses each path component in a pathname, alternating -between verifying that each traversed dentry represents an inode that represents -a searchable directory and moving to the next dentry in the path. - -In the context of a remote filesystem, the structure of this traversal means -that - modulo caching - a path involving N components requires at least N-1 -*sequential* RPCs to obtain metadata for intermediate directories, incurring -significant latency. (In vanilla 9P2000.L, 2(N-1) RPCs are required: N-1 `walk` -and N-1 `getattr`. We added the `walkgetattr` RPC to reduce this overhead.) On -non-shared filesystems, this overhead is primarily significant during -application startup; caching mitigates much of this overhead at steady state. On -shared filesystems, where correct caching requires revalidation (requiring RPCs -for each revalidated directory anyway), this overhead is consistently ruinous. - -### Inefficient RPCs - -9P is not exceptionally economical with RPCs in general. In addition to the -issue described above: - -- Opening an existing file in 9P involves at least 2 RPCs: `walk` to produce - an unopened fid representing the file, and `lopen` to open the fid. - -- Creating a file also involves at least 2 RPCs: `walk` to produce an unopened - fid representing the parent directory, and `lcreate` to create the file and - convert the fid to an open fid representing the created file. In practice, - both the Linux and gVisor 9P clients expect to have an unopened fid for the - created file (necessitating an additional `walk`), as well as attributes for - the created file (necessitating an additional `getattr`), for a total of 4 - RPCs. (In a shared filesystem, where whether a file already exists can - change between RPCs, a correct implementation of `open(O_CREAT)` would have - to alternate between these two paths (plus `clunk`ing the temporary fid - between alternations, since the nature of the `fid` differs between the two - paths). Neither Linux nor gVisor implement the required alternation, so - `open(O_CREAT)` without `O_EXCL` can spuriously fail with `EEXIST` on both.) - -- Closing (`clunk`ing) a fid requires an RPC. VFS1 issues this RPC - asynchronously in an attempt to reduce critical path latency, but scheduling - overhead makes this not clearly advantageous in practice. - -- `read` and `readdir` can return partial reads without a way to indicate EOF, - necessitating an additional final read to detect EOF. - -- Operations that affect filesystem state do not consistently return updated - filesystem state. In gVisor, the client implementation attempts to handle - this by tracking what it thinks updated state "should" be; this is complex, - and especially brittle for timestamps (which are often not arbitrarily - settable). In Linux, the client implemtation invalidates cached metadata - whenever it performs such an operation, and reloads it when a dentry - corresponding to an inode with no valid cached metadata is revalidated; this - is simple, but necessitates an additional `getattr`. - -### Dentry/Inode Ambiguity - -As noted above, 9P's documentation tends to imply that unopened fids represent -an inode. In practice, most filesystem APIs present very limited interfaces for -working with inodes at best, such that the interpretation of unopened fids -varies: - -- Linux's 9P client associates unopened fids with (dentry, uid) pairs. When - caching is enabled, it also associates each inode with the first fid opened - writably that references that inode, in order to support page cache - writeback. - -- gVisor's 9P client associates unopened fids with inodes, and also caches - opened fids in inodes in a manner similar to Linux. - -- The runsc fsgofer associates unopened fids with both "dentries" (host - filesystem paths) and "inodes" (host file descriptors); which is used - depends on the operation invoked on the fid. - -For non-shared filesystems, this confusion has resulted in correctness issues -that are (in gVisor) currently handled by a number of coarse-grained locks that -serialize renames with all other filesystem operations. For shared filesystems, -this means inconsistent behavior in the presence of concurrent mutation. - -## Design - -Almost all Linux filesystem syscalls describe filesystem resources in one of two -ways: - -- Path-based: A filesystem position is described by a combination of a - starting position and a sequence of path components relative to that - position, where the starting position is one of: - - - The VFS root (defined by mount namespace and chroot), for absolute paths - - - The VFS position of an existing FD, for relative paths passed to `*at` - syscalls (e.g. `statat`) - - - The current working directory, for relative paths passed to non-`*at` - syscalls and `*at` syscalls with `AT_FDCWD` - -- File-description-based: A filesystem object is described by an existing FD, - passed to a `f*` syscall (e.g. `fstat`). - -Many of our issues with 9P arise from its (and VFS') interposition of a model -based on inodes between the filesystem syscall API and filesystem -implementations. We propose to replace 9P with a protocol that does not feature -inodes at all, and instead closely follows the filesystem syscall API by -featuring only path-based and FD-based operations, with minimal deviations as -necessary to ameliorate deficiencies in the syscall interface (see below). This -approach addresses the issues described above: - -- Even on shared filesystems, most application filesystem syscalls are - translated to a single RPC (possibly excepting special cases described - below), which is a logical lower bound. - -- The behavior of application syscalls on shared filesystems is - straightforwardly predictable: path-based syscalls are translated to - path-based RPCs, which will re-lookup the file at that path, and FD-based - syscalls are translated to FD-based RPCs, which use an existing open file - without performing another lookup. (This is at least true on gofers that - proxy the host local filesystem; other filesystems that lack support for - e.g. certain operations on FDs may have different behavior, but this - divergence is at least still predictable and inherent to the underlying - filesystem implementation.) - -Note that this approach is only feasible in gVisor's next-generation virtual -filesystem (VFS2), which does not assume the existence of inodes and allows the -remote filesystem client to translate whole path-based syscalls into RPCs. Thus -one of the unavoidable tradeoffs associated with such a protocol vs. 9P is the -inability to construct a Linux client that is performance-competitive with -gVisor. - -### File Permissions - -Many filesystem operations are side-effectual, such that file permissions must -be checked before such operations take effect. The simplest approach to file -permission checking is for the sentry to obtain permissions from the remote -filesystem, then apply permission checks in the sentry before performing the -application-requested operation. However, this requires an additional RPC per -application syscall (which can't be mitigated by caching on shared filesystems). -Alternatively, we may delegate file permission checking to gofers. In general, -file permission checks depend on the following properties of the accessor: - -- Filesystem UID/GID - -- Supplementary GIDs - -- Effective capabilities in the accessor's user namespace (i.e. the accessor's - effective capability set) - -- All UIDs and GIDs mapped in the accessor's user namespace (which determine - if the accessor's capabilities apply to accessed files) - -We may choose to delay implementation of file permission checking delegation, -although this is potentially costly since it doubles the number of required RPCs -for most operations on shared filesystems. We may also consider compromise -options, such as only delegating file permission checks for accessors in the -root user namespace. - -### Symbolic Links - -gVisor usually interprets symbolic link targets in its VFS rather than on the -filesystem containing the symbolic link; thus e.g. a symlink to -"/proc/self/maps" on a remote filesystem resolves to said file in the sentry's -procfs rather than the host's. This implies that: - -- Remote filesystem servers that proxy filesystems supporting symlinks must - check if each path component is a symlink during path traversal. - -- Absolute symlinks require that the sentry restart the operation at its - contextual VFS root (which is task-specific and may not be on a remote - filesystem at all), so if a remote filesystem server encounters an absolute - symlink during path traversal on behalf of a path-based operation, it must - terminate path traversal and return the symlink target. - -- Relative symlinks begin target resolution in the parent directory of the - symlink, so in theory most relative symlinks can be handled automatically - during the path traversal that encounters the symlink, provided that said - traversal is supplied with the number of remaining symlinks before `ELOOP`. - However, the new path traversed by the symlink target may cross VFS mount - boundaries, such that it's only safe for remote filesystem servers to - speculatively follow relative symlinks for side-effect-free operations such - as `stat` (where the sentry can simply ignore results that are inapplicable - due to crossing mount boundaries). We may choose to delay implementation of - this feature, at the cost of an additional RPC per relative symlink (note - that even if the symlink target crosses a mount boundary, the sentry will - need to `stat` the path to the mount boundary to confirm that each traversed - component is an accessible directory); until it is implemented, relative - symlinks may be handled like absolute symlinks, by terminating path - traversal and returning the symlink target. - -The possibility of symlinks (and the possibility of a compromised sentry) means -that the sentry may issue RPCs with paths that, in the absence of symlinks, -would traverse beyond the root of the remote filesystem. For example, the sentry -may issue an RPC with a path like "/foo/../..", on the premise that if "/foo" is -a symlink then the resulting path may be elsewhere on the remote filesystem. To -handle this, path traversal must also track its current depth below the remote -filesystem root, and terminate path traversal if it would ascend beyond this -point. - -### Path Traversal - -Since path-based VFS operations will translate to path-based RPCs, filesystem -servers will need to handle path traversal. From the perspective of a given -filesystem implementation in the server, there are two basic approaches to path -traversal: - -- Inode-walk: For each path component, obtain a handle to the underlying - filesystem object (e.g. with `open(O_PATH)`), check if that object is a - symlink (as described above) and that that object is accessible by the - caller (e.g. with `fstat()`), then continue to the next path component (e.g. - with `openat()`). This ensures that the checked filesystem object is the one - used to obtain the next object in the traversal, which is intuitively - appealing. However, while this approach works for host local filesystems, it - requires features that are not widely supported by other filesystems. - -- Path-walk: For each path component, use a path-based operation to determine - if the filesystem object currently referred to by that path component is a - symlink / is accessible. This is highly portable, but suffers from quadratic - behavior (at the level of the underlying filesystem implementation, the - first path component will be traversed a number of times equal to the number - of path components in the path). - -The implementation should support either option by delegating path traversal to -filesystem implementations within the server (like VFS and the remote filesystem -protocol itself), as inode-walking is still safe, efficient, amenable to FD -caching, and implementable on non-shared host local filesystems (a sufficiently -common case as to be worth considering in the design). - -Both approaches are susceptible to race conditions that may permit sandboxed -filesystem escapes: - -- Under inode-walk, a malicious application may cause a directory to be moved - (with `rename`) during path traversal, such that the filesystem - implementation incorrectly determines whether subsequent inodes are located - in paths that should be visible to sandboxed applications. - -- Under path-walk, a malicious application may cause a non-symlink file to be - replaced with a symlink during path traversal, such that following path - operations will incorrectly follow the symlink. - -Both race conditions can, to some extent, be mitigated in filesystem server -implementations by synchronizing path traversal with the hazardous operations in -question. However, shared filesystems are frequently used to share data between -sandboxed and unsandboxed applications in a controlled way, and in some cases a -malicious sandboxed application may be able to take advantage of a hazardous -filesystem operation performed by an unsandboxed application. In some cases, -filesystem features may be available to ensure safety even in such cases (e.g. -[the new openat2() syscall](https://man7.org/linux/man-pages/man2/openat2.2.html)), -but it is not clear how to solve this problem in general. (Note that this issue -is not specific to our design; rather, it is a fundamental limitation of -filesystem sandboxing.) - -### Filesystem Multiplexing - -A given sentry may need to access multiple distinct remote filesystems (e.g. -different volumes for a given container). In many cases, there is no advantage -to serving these filesystems from distinct filesystem servers, or accessing them -through distinct connections (factors such as maximum RPC concurrency should be -based on available host resources). Therefore, the protocol should support -multiplexing of distinct filesystem trees within a single session. 9P supports -this by allowing multiple calls to the `attach` RPC to produce fids representing -distinct filesystem trees, but this is somewhat clunky; we propose a much -simpler mechanism wherein each message that conveys a path also conveys a -numeric filesystem ID that identifies a filesystem tree. - -## Alternatives Considered - -### Additional Extensions to 9P - -There are at least three conceptual aspects to 9P: - -- Wire format: messages with a 4-byte little-endian size prefix, strings with - a 2-byte little-endian size prefix, etc. Whether the wire format is worth - retaining is unclear; in particular, it's unclear that the 9P wire format - has a significant advantage over protobufs, which are substantially easier - to extend. Note that the official Go protobuf implementation is widely known - to suffer from a significant number of performance deficiencies, so if we - choose to switch to protobuf, we may need to use an alternative toolchain - such as `gogo/protobuf` (which is also widely used in the Go ecosystem, e.g. - by Kubernetes). - -- Filesystem model: fids, qids, etc. Discarding this is one of the motivations - for this proposal. - -- RPCs: Twalk, Tlopen, etc. In addition to previously-described - inefficiencies, most of these are dependent on the filesystem model and - therefore must be discarded. - -### FUSE - -The FUSE (Filesystem in Userspace) protocol is frequently used to provide -arbitrary userspace filesystem implementations to a host Linux kernel. -Unfortunately, FUSE is also inode-based, and therefore doesn't address any of -the problems we have with 9P. - -### virtio-fs - -virtio-fs is an ongoing project aimed at improving Linux VM filesystem -performance when accessing Linux host filesystems (vs. virtio-9p). In brief, it -is based on: - -- Using a FUSE client in the guest that communicates over virtio with a FUSE - server in the host. - -- Using DAX to map the host page cache into the guest. - -- Using a file metadata table in shared memory to avoid VM exits for metadata - updates. - -None of these improvements seem applicable to gVisor: - -- As explained above, FUSE is still inode-based, so it is still susceptible to - most of the problems we have with 9P. - -- Our use of host file descriptors already allows us to leverage the host page - cache for file contents. - -- Our need for shared filesystem coherence is usually based on a user - requirement that an out-of-sandbox filesystem mutation is guaranteed to be - visible by all subsequent observations from within the sandbox, or vice - versa; it's not clear that this can be guaranteed without a synchronous - signaling mechanism like an RPC. |