Diffstat (limited to 'pkg/sentry/fs/g3doc')
-rw-r--r-- pkg/sentry/fs/g3doc/.gitignore | 1
-rw-r--r-- pkg/sentry/fs/g3doc/fuse.md | 360
-rw-r--r-- pkg/sentry/fs/g3doc/inotify.md | 122
3 files changed, 0 insertions, 483 deletions
diff --git a/pkg/sentry/fs/g3doc/.gitignore b/pkg/sentry/fs/g3doc/.gitignore
deleted file mode 100644
index 2d19fc766..000000000
--- a/pkg/sentry/fs/g3doc/.gitignore
+++ /dev/null
@@ -1 +0,0 @@
-*.html
diff --git a/pkg/sentry/fs/g3doc/fuse.md b/pkg/sentry/fs/g3doc/fuse.md
deleted file mode 100644
index 05e043583..000000000
--- a/pkg/sentry/fs/g3doc/fuse.md
+++ /dev/null
@@ -1,360 +0,0 @@
-# Foreword
-
-This document describes an on-going project to support FUSE filesystems within
-the sentry. This is intended to become the final documentation for this
-subsystem, and is therefore written in the past tense. However, FUSE support is
-currently incomplete and the document will be updated as things progress.
-
-# FUSE: Filesystem in Userspace
-
-The sentry supports dispatching filesystem operations to a FUSE server, allowing
-FUSE filesystems to be used within a sandbox.
-
-## Overview
-
-FUSE has two main components:
-
-1. A client kernel driver (canonically `fuse.ko` in Linux), which forwards
- filesystem operations (usually initiated by syscalls) to the server.
-
-2. A server, which is a userspace daemon that implements the actual filesystem.
-
-The sentry implements the client component, which allows a server daemon running
-within the sandbox to implement a filesystem within the sandbox.
-
-A FUSE filesystem is initialized with `mount(2)`, typically with the help of a
-utility like `fusermount(1)`. Various mount options exist for establishing
-ownership and access permissions on the filesystem, but the most important mount
-option is a file descriptor used to establish communication between the client
-and server.
-
-The FUSE device FD is obtained by opening `/dev/fuse`. During regular operation,
-the client and server use the FUSE protocol described in `fuse(4)` to service
-filesystem operations. See the "Protocol" section below for more information
-about this protocol. The core of the sentry support for FUSE is the client-side
-implementation of this protocol.
-
-## FUSE in the Sentry
-
-The sentry's FUSE client targets VFS2 and has the following components:
-
-- An implementation of `/dev/fuse`.
-
-- A VFS2 filesystem for mapping syscalls to FUSE ops. Since we're targeting
- VFS2, one point of contention may be the lack of inodes in VFS2. We can
- tentatively implement a kernfs-based filesystem to bridge the gap in APIs.
- The kernfs base functionality can serve the role of the Linux inode cache,
- and the filesystem can map VFS2 syscalls to kernfs inode operations; see
- the `kernfs.Inode` interface.
-
-The FUSE protocol lends itself well to marshaling with `go_marshal`. The various
-request and response packets can be defined in the ABI package and converted to
-and from the wire format using `go_marshal`.
-
-### Design Goals
-
-- While filesystem performance is always important, the sentry's FUSE support
- is primarily concerned with compatibility, with performance as a secondary
- concern.
-
-- Avoid deadlocks from a hung server daemon.
-
-- Consider the potential for denial of service from a malicious server daemon.
- Protecting itself from userspace is already a design goal for the sentry,
- but needs additional consideration for FUSE. Normally, an operating system
- doesn't rely on userspace to make progress with filesystem operations. Since
- this changes with FUSE, it opens up the possibility of creating a chain of
- dependencies controlled by userspace, which could affect an entire sandbox.
- For example: a FUSE op can block a syscall, which could be holding a
- subsystem lock, which can then block another task goroutine.
-
-### Milestones
-
-Below are some broad goals to aim for while implementing FUSE in the sentry.
-Many FUSE ops can be grouped into broad categories of functionality, and most
-ops can be implemented in parallel.
-
-#### Minimal client that can mount a trivial FUSE filesystem.
-
-- Implement `/dev/fuse` - a character device used to establish an FD for
- communication between the sentry and the server daemon.
-
-- Implement basic FUSE ops like `FUSE_INIT`.
-
-#### Read-only mount with basic file operations
-
-- Implement the majority of file, directory and file descriptor FUSE ops. For
- this milestone, we can skip uncommon or complex operations like mmap, mknod,
- file locking, poll, and extended attributes. We can stub these out along
- with any ops that modify the filesystem. The exact list of required ops is
- to be determined, but the goal is to mount a real filesystem as read-only,
- and be able to read contents from the filesystem in the sentry.
-
-#### Full read-write support
-
-- Implement the remaining FUSE ops and decide if we can omit rarely used
- operations like ioctl.
-
-### Design Details
-
-#### Lifecycle for a FUSE Request
-
-- User invokes a syscall
-- Sentry prepares corresponding request
- - If FUSE device is available
- - Write the request in binary
- - If FUSE device is full
- - Kernel task blocked until available
-- Sentry notifies the readers of fuse device that it's ready for read
-- FUSE daemon reads the request and processes it
-- Sentry waits until a reply is written to the FUSE device
- - but returns directly for async requests
-- FUSE daemon writes to the fuse device
-- Sentry processes the reply
- - For sync requests, unblock blocked kernel task
- - For async requests, execute pre-specified callback if any
-- Sentry returns the syscall to the user
-
-#### Channels and Queues for Requests in Different Stages
-
-`connection.initializedChan`
-
-- a channel on which requests issued before connection initialization
- block.
-
-`fd.queue`
-
-- a queue of requests that haven’t been read by the FUSE daemon yet.
-
-`fd.completions`
-
-- a map of the requests that have been prepared but have not yet received a
- response, including the ones on the `fd.queue`.
-
-`fd.waitQueue`
-
-- a queue of waiters, such as the FUSE daemon, that are waiting for the FUSE
- device FD to become available.
-
-`fd.fullQueueCh`
-
-- a channel that the kernel task will be blocked on when the fd is not
- available.
-
-#### Basic I/O Implementation
-
-Currently we have implemented basic read and write functionality for FUSE. We
-describe the design and ways to improve it here:
-
-##### Basic FUSE Read
-
-The vfs2 expects implementations of `vfs.FileDescriptionImpl.Read()` and
-`vfs.FileDescriptionImpl.PRead()`. When a syscall is made, it will eventually
-reach our implementation of those interface functions located at
-`pkg/sentry/fsimpl/fuse/regular_file.go` for regular files.
-
-After validating the input, the sentry sends `FUSE_READ` requests to the FUSE
-daemon. The daemon returns the data after the `fuse_out_header` in each
-response. For the first version, we copy this data into kernel memory, where it
-is represented as a byte slice in the marshalled struct. This currently happens
-for all FUSE responses in `pkg/sentry/fsimpl/fuse/dev.go:writeLocked()`. We
-then copy directly from this intermediate buffer to the input buffer provided
-by the read syscall.
-
-There is an extra requirement for FUSE: When mounting the FUSE fs, the mounter
-or the FUSE daemon can specify a `max_read` or a `max_pages` parameter. These
-set the upper bound on the bytes read in each `FUSE_READ` request. We
-implemented the code to handle the fragmented reads.
-
-To improve performance, ideally we should copy the data from the FUSE daemon's
-responses into a buffer cache, as several other sentry filesystem
-implementations do, instead of a single-use temporary buffer. Directly mapping
-the memory of one process into another could also boost performance, but we
-chose not to do so in order to keep the processes isolated.
-
-##### Basic FUSE Write
-
-The vfs2 invokes implementations of `vfs.FileDescriptionImpl.Write()` and
-`vfs.FileDescriptionImpl.PWrite()` on the regular file descriptor of FUSE when a
-user makes a `write(2)` or `pwrite(2)` syscall.
-
-For valid writes, the sentry sends the bytes to write after a `FUSE_WRITE`
-header (this can be regarded as a request with two payloads) to the FUSE daemon.
-For the first version, we allocate a buffer inside kernel memory to store the
-bytes from the user, and copy directly from that buffer to the memory of the
-FUSE daemon. This happens at `pkg/sentry/fsimpl/fuse/dev.go:readLocked()`.
-
-The parameters `max_write` and `max_pages` restrict the number of bytes in one
-`FUSE_WRITE`. The current implementation handles fragmented writes.
-
-To improve performance, the extra copy created to store the bytes to write can
-be replaced by the buffer cache as well.
-
-# Appendix
-
-## FUSE Protocol
-
-The FUSE protocol is a request-response protocol. All requests are initiated by
-the client. The wire-format for the protocol is raw C structs serialized to
-memory.
-
-All FUSE requests begin with the following request header:
-
-```c
-struct fuse_in_header {
- uint32_t len; // Length of the request, including this header.
- uint32_t opcode; // Requested operation.
- uint64_t unique; // A unique identifier for this request.
- uint64_t nodeid; // ID of the filesystem object being operated on.
- uint32_t uid; // UID of the requesting process.
- uint32_t gid; // GID of the requesting process.
- uint32_t pid; // PID of the requesting process.
- uint32_t padding;
-};
-```
-
-The request is then followed by a payload specific to the `opcode`.
-
-All responses begin with this response header:
-
-```c
-struct fuse_out_header {
- uint32_t len; // Length of the response, including this header.
- int32_t error; // Status of the request, 0 if success.
- uint64_t unique; // The unique identifier from the corresponding request.
-};
-```
-
-The response payload also depends on the request `opcode`. If `error != 0`, the
-response payload must be empty.
-
-### Operations
-
-The following is a list of all FUSE operations used in `fuse_in_header.opcode`
-as of Linux v4.4, and a brief description of their purpose. These are defined in
-`uapi/linux/fuse.h`. Many of these have a corresponding request and response
-payload struct; `fuse(4)` has details for some of these. We also note how these
-operations map to the sentry virtual filesystem.
-
-#### FUSE meta-operations
-
-These operations are specific to FUSE and don't have a corresponding action in a
-generic filesystem.
-
-- `FUSE_INIT`: This operation initializes a new FUSE filesystem, and is the
- first message sent by the client after mount. This is used for version and
- feature negotiation. This is related to `mount(2)`.
-- `FUSE_DESTROY`: Tears down a FUSE filesystem, related to `umount(2)`.
-- `FUSE_INTERRUPT`: Interrupts an in-flight operation, specified by the
- `fuse_in_header.unique` value provided in the corresponding request header.
- The client can send at most one of these per request, and will enter an
- uninterruptible wait for a reply. The server is expected to reply promptly.
-- `FUSE_FORGET`: A hint to the server that it should evict the indicated node
- from any caches. This is wired up to `(struct
- super_operations).evict_inode` in Linux, which is in turn hooked into the
- inode cache shrinker, which is typically triggered by system memory pressure.
-- `FUSE_BATCH_FORGET`: Batch version of `FUSE_FORGET`.
-
-#### Filesystem Syscalls
-
-These FUSE ops map directly to an equivalent filesystem syscall, or family of
-syscalls. The relevant syscalls have a similar name to the operation, unless
-otherwise noted.
-
-Node creation:
-
-- `FUSE_MKNOD`
-- `FUSE_MKDIR`
-- `FUSE_CREATE`: This is equivalent to `open(2)` and `creat(2)`, which
- atomically create and open a node.
-
-Node attributes and extended attributes:
-
-- `FUSE_GETATTR`
-- `FUSE_SETATTR`
-- `FUSE_SETXATTR`
-- `FUSE_GETXATTR`
-- `FUSE_LISTXATTR`
-- `FUSE_REMOVEXATTR`
-
-Node link manipulation:
-
-- `FUSE_READLINK`
-- `FUSE_LINK`
-- `FUSE_SYMLINK`
-- `FUSE_UNLINK`
-
-Directory operations:
-
-- `FUSE_RMDIR`
-- `FUSE_RENAME`
-- `FUSE_RENAME2`
-- `FUSE_OPENDIR`: `open(2)` for directories.
-- `FUSE_RELEASEDIR`: `close(2)` for directories.
-- `FUSE_READDIR`
-- `FUSE_READDIRPLUS`
-- `FUSE_FSYNCDIR`: `fsync(2)` for directories.
-- `FUSE_LOOKUP`: Establishes a unique identifier for a FS node. This is
- reminiscent of `VirtualFilesystem.GetDentryAt` in that it resolves a path
- component to a node. However, the returned identifier is opaque to the
- client. The server must remember this mapping, as this is how the client
- will reference the node in the future.
-
-File operations:
-
-- `FUSE_OPEN`: `open(2)` for files.
-- `FUSE_RELEASE`: `close(2)` for files.
-- `FUSE_FSYNC`
-- `FUSE_FALLOCATE`
-- `FUSE_COPY_FILE_RANGE`
-- `FUSE_SETUPMAPPING`: Creates a memory map on a file for `mmap(2)`.
-- `FUSE_REMOVEMAPPING`: Removes a memory map for `munmap(2)`.
-
-File locking:
-
-- `FUSE_GETLK`
-- `FUSE_SETLK`
-- `FUSE_SETLKW`
-
-File descriptor operations:
-
-- `FUSE_IOCTL`
-- `FUSE_POLL`
-- `FUSE_LSEEK`
-
-Filesystem operations:
-
-- `FUSE_STATFS`
-
-#### Permissions
-
-- `FUSE_ACCESS` is used to check if a node is accessible, as part of many
- syscall implementations. Maps to `vfs.FilesystemImpl.AccessAt` in the
- sentry.
-
-#### I/O Operations
-
-These ops are used to read and write file pages. They're used to implement I/O
-syscalls like `read(2)` and `write(2)`, as well as `mmap(2)`.
-
-- `FUSE_READ`
-- `FUSE_WRITE`
-
-#### Miscellaneous
-
-- `FUSE_FLUSH`: Used by the client to indicate when a file descriptor is
- closed. Distinct from `FUSE_FSYNC`, which corresponds to an `fsync(2)`
- syscall from the user. Maps to `vfs.FileDescriptionImpl.Release` in the
- sentry.
-- `FUSE_BMAP`: Old address space API for block defrag. Probably not needed.
-- `FUSE_NOTIFY_REPLY`: [TODO: what does this do?]
-
-# References
-
-- [fuse(4) Linux manual page](https://www.man7.org/linux/man-pages/man4/fuse.4.html)
-- [Linux kernel FUSE documentation](https://www.kernel.org/doc/html/latest/filesystems/fuse.html)
-- [The reference implementation of the Linux FUSE (Filesystem in Userspace)
- interface](https://github.com/libfuse/libfuse)
-- [The kernel interface of FUSE](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h)
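The request and response headers described in the "FUSE Protocol" appendix of the deleted `fuse.md` could be marshaled roughly as in the Go sketch below. This is illustrative only and is not the sentry's `go_marshal`-based code: the type names `InHeader` and `OutHeader` and the helpers `encodeRequest` and `decodeReply` are hypothetical, the standard `encoding/binary` package stands in for `go_marshal`, and little-endian byte order is an assumption (the protocol uses raw C structs, so the real order is that of the host).

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// InHeader mirrors struct fuse_in_header (40 bytes on the wire).
type InHeader struct {
	Len     uint32 // Length of the request, including this header.
	Opcode  uint32 // Requested operation.
	Unique  uint64 // Unique identifier for this request.
	NodeID  uint64 // ID of the filesystem object being operated on.
	UID     uint32
	GID     uint32
	PID     uint32
	Padding uint32
}

// OutHeader mirrors struct fuse_out_header (16 bytes on the wire).
type OutHeader struct {
	Len    uint32 // Length of the response, including this header.
	Error  int32  // 0 on success.
	Unique uint64 // Identifier copied from the corresponding request.
}

// opInit is the value of FUSE_INIT in uapi/linux/fuse.h.
const opInit = 26

// byteOrder is an assumption made for this sketch (see the lead-in).
var byteOrder binary.ByteOrder = binary.LittleEndian

// encodeRequest (hypothetical helper) fills in hdr.Len and serializes the
// header followed by the opcode-specific payload.
func encodeRequest(hdr InHeader, payload []byte) ([]byte, error) {
	hdr.Len = uint32(binary.Size(hdr) + len(payload))
	var buf bytes.Buffer
	if err := binary.Write(&buf, byteOrder, hdr); err != nil {
		return nil, err
	}
	buf.Write(payload)
	return buf.Bytes(), nil
}

// decodeReply (hypothetical helper) parses the response header, checks the
// advertised length, and returns the remaining bytes as the payload.
func decodeReply(b []byte) (OutHeader, []byte, error) {
	var hdr OutHeader
	n := binary.Size(hdr)
	if len(b) < n {
		return hdr, nil, fmt.Errorf("short reply: %d bytes", len(b))
	}
	if err := binary.Read(bytes.NewReader(b[:n]), byteOrder, &hdr); err != nil {
		return hdr, nil, err
	}
	if int(hdr.Len) != len(b) {
		return hdr, nil, fmt.Errorf("length mismatch: header says %d, got %d", hdr.Len, len(b))
	}
	// The document states that an error response must have an empty payload.
	if hdr.Error != 0 && len(b) > n {
		return hdr, nil, fmt.Errorf("error reply carries a payload")
	}
	return hdr, b[n:], nil
}

func main() {
	req, err := encodeRequest(InHeader{Opcode: opInit, Unique: 1}, []byte{1, 2, 3, 4})
	if err != nil {
		panic(err)
	}
	fmt.Printf("request is %d bytes\n", len(req))

	// A hand-built success reply with no payload for request 1.
	reply := []byte{16, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0}
	hdr, payload, err := decodeReply(reply)
	fmt.Println(hdr.Unique, len(payload), err)
}
```

Matching a reply to its request by `Unique` is what a structure like the `fd.completions` map described in the document would be keyed on.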
diff --git a/pkg/sentry/fs/g3doc/inotify.md b/pkg/sentry/fs/g3doc/inotify.md
deleted file mode 100644
index 85063d4e6..000000000
--- a/pkg/sentry/fs/g3doc/inotify.md
+++ /dev/null
@@ -1,122 +0,0 @@
-# Inotify
-
-Inotify implements the like-named filesystem event notification system for the
-sentry; see `inotify(7)`.
-
-## Architecture
-
-For the most part, the sentry implementation of inotify mirrors the Linux
-architecture. Inotify instances (i.e. the fd returned by inotify_init(2)) are
-backed by a pseudo-filesystem. Events are generated from various places in the
-sentry, including the [syscall layer][syscall_dir], the [vfs layer][dirent] and
-the [process fd table][fd_table]. Watches are stored in inodes and generated
-events are queued to the inotify instance owning the watches for delivery to the
-user.
-
-## Objects
-
-Here is a brief description of the existing and new objects involved in the
-sentry inotify mechanism, and how they interact:
-
-### [`fs.Inotify`][inotify]
-
-- An inotify instance, created by inotify_init(2)/inotify_init1(2).
-- The inotify fd has an `fs.Dirent` and supports filesystem syscalls to read
- events.
-- Has multiple `fs.Watch`es, with at most one watch per target inode, per
- inotify instance.
-- Has an instance `id` which is globally unique. This is *not* the fd number
- for this instance, since the fd can be duped. This `id` is not externally
- visible.
-
-### [`fs.Watch`][watch]
-
-- An inotify watch, created/deleted by
- inotify_add_watch(2)/inotify_rm_watch(2).
-- Owned by an `fs.Inotify` instance, each watch keeps a pointer to the
- `owner`.
-- Associated with a single `fs.Inode`, which is the watch `target`. While the
- watch is active, it indirectly pins `target` to memory. See the "Reference
- Model" section for a detailed explanation.
-- Filesystem operations on `target` generate `fs.Event`s.
-
-### [`fs.Event`][event]
-
-- A simple struct encapsulating all the fields for an inotify event.
-- Generated by `fs.Watch`es and forwarded to the watches' `owner`s.
-- Serialized to the user during read(2) syscalls on the associated
- `fs.Inotify`'s fd.
-
-### [`fs.Dirent`][dirent]
-
-- Many inotify events are generated inside dirent methods. Events are
- generated in the dirent methods rather than `fs.Inode` methods because some
- events carry the name of the subject node, and node names are generally
- unavailable in an `fs.Inode`.
-- Dirents do not directly contain state for any watches. Instead, they forward
- notifications to the underlying `fs.Inode`.
-
-### [`fs.Inode`][inode]
-
-- Interacts with inotify through `fs.Watch`es.
-- Inodes contain a map of all active `fs.Watch`es on them.
-- An `fs.Inotify` instance can have at most one `fs.Watch` per inode.
- `fs.Watch`es on an inode are indexed by their `owner`'s `id`.
-- All inotify logic is encapsulated in the [`Watches`][inode_watches] struct
- in an inode. Logically, `Watches` is the set of inotify watches on the
- inode.
-
-## Reference Model
-
-The sentry inotify implementation has a complex reference model. An inotify
-watch observes a single inode. For efficient lookup, the state for a watch is
-stored directly on the target inode. This state needs to be persistent for the
-lifetime of the watch. Unlike usual filesystem metadata, the watch state has no
-"on-disk" representation, so it cannot be reconstructed by the filesystem if
-the inode is flushed from memory. This effectively means we need to keep any
-inodes with active watches pinned to memory.
-
-We can't just hold an extra ref on the inode to pin it to memory because some
-filesystems (such as gofer-based filesystems) don't have persistent inodes. In
-such a filesystem, if we just pin the inode, nothing prevents the enclosing
-dirent from being GCed. Once the dirent is GCed, the pinned inode is
-unreachable -- these filesystems generate a new inode by re-reading the node
-state on the next walk. Incidentally, hardlinks also don't work on these
-filesystems for this reason.
-
-To prevent the above scenario, when a new watch is added on an inode, we *pin*
-the dirent we used to reach the inode. Note that due to hardlinks, this dirent
-may not be the only dirent pointing to the inode. Attempting to set an inotify
-watch via multiple hardlinks to the same file results in the same watch being
-returned for both links. However, for each new dirent we use to reach the same
-inode, we add a new pin. We need a new pin for each new dirent used to reach the
-inode because we have no guarantees about the deletion order of the different
-links to the inode.
-
-## Lock Ordering
-
-There are 4 locks related to the inotify implementation:
-
-- `Inotify.mu`: the inotify instance lock.
-- `Inotify.evMu`: the inotify event queue lock.
-- `Watch.mu`: the watch lock, used to protect pins.
-- `fs.Watches.mu`: the inode watch set lock, used to protect the collection of
- watches on the inode.
-
-The correct lock ordering for inotify code is:
-
-`Inotify.mu` -> `fs.Watches.mu` -> `Watch.mu` -> `Inotify.evMu`.
-
-We need a distinct lock for the event queue because by the time a goroutine
-attempts to queue a new event, it is already holding `fs.Watches.mu`. If we used
-`Inotify.mu` to also protect the event queue, this would violate the above lock
-ordering.
-
-[dirent]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/dirent.go
-[event]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify_event.go
-[fd_table]: https://github.com/google/gvisor/blob/master/pkg/sentry/kernel/fd_table.go
-[inode]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inode.go
-[inode_watches]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inode_inotify.go
-[inotify]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify.go
-[syscall_dir]: https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/
-[watch]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify_watch.go
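The lock ordering stated in the deleted `inotify.md` (`Inotify.mu` -> `fs.Watches.mu` -> `Watch.mu` -> `Inotify.evMu`) can be illustrated with the simplified Go sketch below. It is not the sentry's code: the struct fields, `AddWatch`, `Notify`, `Events`, and `NewWatches` are hypothetical stand-ins chosen for this example, and events are plain strings. It only shows why a separate `evMu` lets an event be queued while `Watches.mu` is already held.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a stand-in for an inotify instance.
type Inotify struct {
	id uint64

	mu sync.Mutex // Instance lock; first in the lock order.

	evMu   sync.Mutex // Event queue lock; last in the lock order.
	events []string
}

// Watch is a stand-in for a single watch owned by an Inotify instance.
type Watch struct {
	mu    sync.Mutex // Third in the lock order.
	owner *Inotify
}

// Watches is a stand-in for the set of watches stored on an inode,
// indexed by the owner's id.
type Watches struct {
	mu sync.Mutex // Second in the lock order.
	ws map[uint64]*Watch
}

func NewWatches() *Watches {
	return &Watches{ws: make(map[uint64]*Watch)}
}

// AddWatch takes Inotify.mu, then Watches.mu. At most one watch exists per
// (instance, inode) pair, so an existing watch is returned as-is.
func (i *Inotify) AddWatch(target *Watches) *Watch {
	i.mu.Lock()
	defer i.mu.Unlock()
	target.mu.Lock()
	defer target.mu.Unlock()
	if w, ok := target.ws[i.id]; ok {
		return w
	}
	w := &Watch{owner: i}
	target.ws[i.id] = w
	return w
}

// Notify takes Watches.mu, then each Watch.mu, then the owner's evMu.
// Taking Inotify.mu here instead of evMu would invert the lock order.
func (ws *Watches) Notify(ev string) {
	ws.mu.Lock()
	defer ws.mu.Unlock()
	for _, w := range ws.ws {
		w.mu.Lock()
		w.owner.evMu.Lock()
		w.owner.events = append(w.owner.events, ev)
		w.owner.evMu.Unlock()
		w.mu.Unlock()
	}
}

// Events returns a copy of the queued events under evMu only.
func (i *Inotify) Events() []string {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	return append([]string(nil), i.events...)
}

func main() {
	in := &Inotify{id: 1}
	target := NewWatches()
	w1 := in.AddWatch(target)
	w2 := in.AddWatch(target) // Same instance and inode: same watch.
	target.Notify("IN_MODIFY")
	fmt.Println(w1 == w2, in.Events())
}
```

If the event queue were guarded by `Inotify.mu`, `Notify` would need to acquire `Inotify.mu` after `Watches.mu`, the reverse of the order `AddWatch` uses, which is the deadlock risk the document describes.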