Diffstat (limited to 'pkg/sentry/fs/g3doc/inotify.md')
-rw-r--r-- | pkg/sentry/fs/g3doc/inotify.md | 122 |
1 files changed, 122 insertions, 0 deletions
diff --git a/pkg/sentry/fs/g3doc/inotify.md b/pkg/sentry/fs/g3doc/inotify.md
new file mode 100644
index 000000000..1e99a3357
--- /dev/null
+++ b/pkg/sentry/fs/g3doc/inotify.md
@@ -0,0 +1,122 @@
+# Inotify
+
+Inotify implements the like-named filesystem event notification system for the
+sentry, see `inotify(7)`.
+
+## Architecture
+
+For the most part, the sentry implementation of inotify mirrors the Linux
+architecture. Inotify instances (i.e. the fd returned by inotify_init(2)) are
+backed by a pseudo-filesystem. Events are generated from various places in the
+sentry, including the [syscall layer][syscall_dir], the [vfs layer][dirent] and
+the [process fd table][fd_map]. Watches are stored in inodes and generated
+events are queued to the inotify instance owning the watches for delivery to the
+user.
+
+## Objects
+
+Here is a brief description of the existing and new objects involved in the
+sentry inotify mechanism, and how they interact:
+
+### [`fs.Inotify`][inotify]
+
+- An inotify instance, created by inotify_init(2)/inotify_init1(2).
+- The inotify fd has an `fs.Dirent` and supports filesystem syscalls to read
+  events.
+- Has multiple `fs.Watch`es, with at most one watch per target inode, per
+  inotify instance.
+- Has an instance `id` which is globally unique. This is *not* the fd number
+  for this instance, since the fd can be duped. This `id` is not externally
+  visible.
+
+### [`fs.Watch`][watch]
+
+- An inotify watch, created/deleted by
+  inotify_add_watch(2)/inotify_rm_watch(2).
+- Owned by an `fs.Inotify` instance; each watch keeps a pointer to the
+  `owner`.
+- Associated with a single `fs.Inode`, which is the watch `target`. While the
+  watch is active, it indirectly pins `target` to memory. See the "Reference
+  Model" section for a detailed explanation.
+- Filesystem operations on `target` generate `fs.Event`s.
+
+### [`fs.Event`][event]
+
+- A simple struct encapsulating all the fields for an inotify event.
+- Generated by `fs.Watch`es and forwarded to the watches' `owner`s.
+- Serialized to the user during read(2) syscalls on the associated
+  `fs.Inotify`'s fd.
+
+### [`fs.Dirent`][dirent]
+
+- Many inotify events are generated inside dirent methods. Events are
+  generated in the dirent methods rather than `fs.Inode` methods because some
+  events carry the name of the subject node, and node names are generally
+  unavailable in an `fs.Inode`.
+- Dirents do not directly contain state for any watches. Instead, they forward
+  notifications to the underlying `fs.Inode`.
+
+### [`fs.Inode`][inode]
+
+- Interacts with inotify through `fs.Watch`es.
+- Inodes contain a map of all active `fs.Watch`es on them.
+- An `fs.Inotify` instance can have at most one `fs.Watch` per inode.
+  `fs.Watch`es on an inode are indexed by their `owner`'s `id`.
+- All inotify logic is encapsulated in the [`Watches`][inode_watches] struct
+  in an inode. Logically, `Watches` is the set of inotify watches on the
+  inode.
+
+## Reference Model
+
+The sentry inotify implementation has a complex reference model. An inotify
+watch observes a single inode. For efficient lookup, the state for a watch is
+stored directly on the target inode. This state needs to be persistent for the
+lifetime of the watch. Unlike usual filesystem metadata, the watch state has no
+"on-disk" representation, so it cannot be reconstructed by the filesystem if
+the inode is flushed from memory. This effectively means we need to keep any
+inodes with active watches pinned to memory.
+
+We can't just hold an extra ref on the inode to pin it to memory because some
+filesystems (such as gofer-based filesystems) don't have persistent inodes. In
+such a filesystem, if we just pin the inode, nothing prevents the enclosing
+dirent from being GCed. Once the dirent is GCed, the pinned inode is
+unreachable -- these filesystems generate a new inode by re-reading the node
+state on the next walk.
+Incidentally, hardlinks also don't work on these
+filesystems for this reason.
+
+To prevent the above scenario, when a new watch is added on an inode, we *pin*
+the dirent we used to reach the inode. Note that due to hardlinks, this dirent
+may not be the only dirent pointing to the inode. Attempting to set an inotify
+watch via multiple hardlinks to the same file results in the same watch being
+returned for both links. However, for each new dirent we use to reach the same
+inode, we add a new pin. We need a new pin for each new dirent used to reach the
+inode because we have no guarantees about the deletion order of the different
+links to the inode.
+
+## Lock Ordering
+
+There are 4 locks related to the inotify implementation:
+
+- `Inotify.mu`: the inotify instance lock.
+- `Inotify.evMu`: the inotify event queue lock.
+- `Watch.mu`: the watch lock, used to protect pins.
+- `fs.Watches.mu`: the inode watch set lock, used to protect the collection of
+  watches on the inode.
+
+The correct lock ordering for inotify code is:
+
+`Inotify.mu` -> `fs.Watches.mu` -> `Watch.mu` -> `Inotify.evMu`.
+
+We need a distinct lock for the event queue because by the time a goroutine
+attempts to queue a new event, it is already holding `fs.Watches.mu`. If we used
+`Inotify.mu` to also protect the event queue, this would violate the above lock
+ordering.
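
The lock ordering above can be illustrated with a small standalone sketch. It is not the sentry code: the types below are simplified, hypothetical stand-ins that reuse only the lock names from the list (`mu`, `evMu`), and the `AddWatch`, `Notify` and `queueEvent` helpers, the `pins` counter and the `inModify` constant are assumptions made for the example. It shows why the notification path, which already holds the inode's watch set lock, can only queue an event safely if the queue has a lock of its own that comes last in the order.

```go
package main

import (
	"fmt"
	"sync"
)

// inModify mirrors the value of Linux's IN_MODIFY; it is used here only as
// an example event mask.
const inModify = 0x2

// Event is a simplified, hypothetical stand-in for fs.Event.
type Event struct {
	Mask uint32
	Name string
}

// Inotify is a simplified stand-in for fs.Inotify.
type Inotify struct {
	mu     sync.Mutex // instance lock, first in the lock order
	evMu   sync.Mutex // event queue lock, last in the lock order
	id     uint64
	events []Event
}

// Watch is a simplified stand-in for fs.Watch.
type Watch struct {
	mu    sync.Mutex // protects pins
	owner *Inotify
	mask  uint32 // guarded by the enclosing Watches.mu in this sketch
	pins  int    // assumption: a counter instead of real dirent pins
}

// Watches is a simplified stand-in for the per-inode watch set.
type Watches struct {
	mu sync.Mutex
	ws map[uint64]*Watch // indexed by the owner's id
}

// AddWatch takes the locks in the documented order:
// Inotify.mu -> Watches.mu -> Watch.mu.
func (i *Inotify) AddWatch(target *Watches, mask uint32) *Watch {
	i.mu.Lock()
	defer i.mu.Unlock()
	target.mu.Lock()
	defer target.mu.Unlock()
	w, ok := target.ws[i.id]
	if !ok {
		w = &Watch{owner: i}
		target.ws[i.id] = w
	}
	w.mask |= mask
	w.mu.Lock()
	w.pins++
	w.mu.Unlock()
	return w
}

// queueEvent touches only evMu, so it is safe to call while Watches.mu is held.
func (i *Inotify) queueEvent(ev Event) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// Notify runs with Watches.mu held and then takes Inotify.evMu. Taking
// Inotify.mu at this point would invert the documented order.
func (ws *Watches) Notify(name string, mask uint32) {
	ws.mu.Lock()
	defer ws.mu.Unlock()
	for _, w := range ws.ws {
		if w.mask&mask != 0 {
			w.owner.queueEvent(Event{Mask: mask, Name: name})
		}
	}
}

func main() {
	in := &Inotify{id: 1}
	target := &Watches{ws: make(map[uint64]*Watch)}
	in.AddWatch(target, inModify)
	target.Notify("file.txt", inModify)
	fmt.Println("queued events:", len(in.events))
}
```

If `queueEvent` locked `Inotify.mu` instead of `evMu`, one goroutine in `AddWatch` (holding `Inotify.mu`, waiting on `Watches.mu`) and another in `Notify` (holding `Watches.mu`, waiting on `Inotify.mu`) could deadlock.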
+
+[dirent]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/fs/dirent.go
+[event]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/fs/inotify_event.go
+[fd_map]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/kernel/fd_map.go
+[inode]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/fs/inode.go
+[inode_watches]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/fs/inode_inotify.go
+[inotify]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/fs/inotify.go
+[syscall_dir]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/syscalls/linux/
+[watch]: https://gvisor.googlesource.com/gvisor/+/master/pkg/sentry/fs/inotify_watch.go
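
As a reading aid for the "Objects" and "Reference Model" sections of the added file, the following standalone sketch models two of the rules stated there: an inode keeps at most one watch per inotify instance, indexed by the owner's `id`, and every distinct dirent used to reach the inode gets its own pin. All types and helpers here (`Dirent.refs`, `Watch.Pin`, `Watch.Unpin`, the free function `AddWatch`) are hypothetical simplifications, not the sentry's API, and locking is omitted for brevity.

```go
package main

import "fmt"

// Dirent is a hypothetical stand-in for fs.Dirent with a plain ref count.
type Dirent struct {
	name string
	refs int
}

func (d *Dirent) IncRef() { d.refs++ }
func (d *Dirent) DecRef() { d.refs-- }

// Watch is a hypothetical stand-in for fs.Watch. It remembers which dirents
// it has pinned so that each dirent is pinned at most once.
type Watch struct {
	ownerID uint64
	pins    map[*Dirent]bool
}

// Pin takes a reference on d the first time d is used to reach the target.
func (w *Watch) Pin(d *Dirent) {
	if !w.pins[d] {
		w.pins[d] = true
		d.IncRef()
	}
}

// Unpin drops the reference taken by Pin, if any.
func (w *Watch) Unpin(d *Dirent) {
	if w.pins[d] {
		delete(w.pins, d)
		d.DecRef()
	}
}

// Inode is a hypothetical stand-in for fs.Inode; watches is indexed by the
// owning inotify instance's id.
type Inode struct {
	watches map[uint64]*Watch
}

// AddWatch returns the existing watch for ownerID if there is one, and in
// either case pins the dirent that was used to reach the inode.
func AddWatch(ownerID uint64, d *Dirent, inode *Inode) *Watch {
	w, ok := inode.watches[ownerID]
	if !ok {
		w = &Watch{ownerID: ownerID, pins: make(map[*Dirent]bool)}
		inode.watches[ownerID] = w
	}
	w.Pin(d)
	return w
}

func main() {
	inode := &Inode{watches: make(map[uint64]*Watch)}
	a := &Dirent{name: "a"} // two hardlinks to the same inode
	b := &Dirent{name: "b"}

	w1 := AddWatch(1, a, inode)
	w2 := AddWatch(1, b, inode)
	fmt.Println("same watch:", w1 == w2)
	fmt.Println("pins:", len(w1.pins), "refs:", a.refs, b.refs)
}
```

Tracking pins per dirent rather than as a single count lets the watch release exactly the references it took, whichever hardlink is removed first.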