# Inotify

Inotify is a mechanism for monitoring filesystem events in Linux; see
inotify(7). An inotify instance can be used to monitor files and directories for
modifications, creation/deletion, etc. The inotify API consists of system calls
that create inotify instances (inotify_init/inotify_init1) and add/remove
watches on files to an instance (inotify_add_watch/inotify_rm_watch). Events are
generated from various places in the sentry, including the syscall layer, the
vfs layer, the process fd table, and within each filesystem implementation. This
document outlines the implementation details of inotify in VFS2.

## Inotify Objects

Inotify data structures are implemented in the vfs package.

### vfs.Inotify

Inotify instances are represented by vfs.Inotify objects, which implement
vfs.FileDescriptionImpl. As in Linux, inotify fds are backed by a
pseudo-filesystem (anonfs). Each inotify instance receives events from a set of
vfs.Watch objects, which can be modified with inotify_add_watch(2) and
inotify_rm_watch(2). An application can retrieve events by reading the inotify
fd.

### vfs.Watches

The set of all watches held on a single file (i.e., the watch target) is stored
in vfs.Watches. Each watch will belong to a different inotify instance (an
instance can only have one watch on any watch target). The watches are stored in
a map indexed by their vfs.Inotify owner’s id. Hard links and file descriptions
to a single file will all share the same vfs.Watches. Activity on the target
causes its vfs.Watches to generate notifications on its watches’ inotify
instances.

### vfs.Watch

A single watch, owned by one inotify instance and applied to one watch target.
Both the vfs.Inotify owner and vfs.Watches on the target will hold a vfs.Watch,
which leads to some complicated locking behavior (see Lock Ordering). Whenever a
watch is notified of an event on its target, it will queue events to its inotify
instance for delivery to the user.
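
The relationship between the three objects can be sketched as follows. This is a minimal illustration, not the actual vfs code: the field names, the constructors, and the `AddWatch` signature are assumptions, and all locking is omitted. It shows that a watch is held by both its owner and its target, and that keying the target's set by owner id allows at most one watch per instance.

```go
package main

import "fmt"

// Watch is a single watch: one owner, one target.
type Watch struct {
	owner  *Inotify
	target *Watches
	wd     int32  // watch descriptor handed back to the application
	mask   uint32 // events the owner is interested in
}

// Inotify stands in for vfs.Inotify; it tracks its watches by descriptor.
type Inotify struct {
	id      uint64
	nextWD  int32
	watches map[int32]*Watch
}

// Watches stands in for vfs.Watches: every watch on one target, keyed by the
// owning instance's id, so an instance has at most one watch per target.
type Watches struct {
	ws map[uint64]*Watch
}

func NewInotify(id uint64) *Inotify {
	return &Inotify{id: id, watches: make(map[int32]*Watch)}
}

func NewWatches() *Watches {
	return &Watches{ws: make(map[uint64]*Watch)}
}

// AddWatch registers the watch with both holders. If the instance already
// watches the target, the existing watch is updated instead (hypothetical
// simplification: the mask is simply replaced).
func (i *Inotify) AddWatch(t *Watches, mask uint32) int32 {
	if w, ok := t.ws[i.id]; ok {
		w.mask = mask
		return w.wd
	}
	i.nextWD++
	w := &Watch{owner: i, target: t, wd: i.nextWD, mask: mask}
	i.watches[w.wd] = w
	t.ws[i.id] = w
	return w.wd
}

func main() {
	a, b := NewInotify(1), NewInotify(2)
	file := NewWatches()
	wd1 := a.AddWatch(file, 0x2)
	wd2 := a.AddWatch(file, 0x3) // same instance, same target: same watch
	b.AddWatch(file, 0x2)
	fmt.Println(wd1 == wd2, len(file.ws)) // prints: true 2
}
```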

### vfs.Event

vfs.Event is a simple struct encapsulating all the fields for an inotify event.
It is generated by vfs.Watches and forwarded to the watches' owners. It is
serialized to the user during read(2) syscalls on the associated vfs.Inotify's
fd.
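
For illustration, here is one way the serialization could look. The layout follows `struct inotify_event` from inotify(7): a 16-byte header (`wd`, `mask`, `cookie`, `len`) followed by a NUL-padded name. The `Event` type, `Marshal`, and `Unmarshal` are hypothetical names for this sketch. It also assumes a little-endian host and that the name is padded to a multiple of the header size, which is believed to match Linux but is not confirmed by this document.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

const headerLen = 16 // wd + mask + cookie + len, 4 bytes each

// Event is a simplified stand-in for vfs.Event.
type Event struct {
	WD     int32
	Mask   uint32
	Cookie uint32
	Name   string
}

// Marshal encodes the event as it would appear in a read(2) buffer.
func (e Event) Marshal() []byte {
	padded := 0
	if e.Name != "" {
		// Room for the name plus at least one NUL, rounded up (assumption).
		padded = (len(e.Name) + 1 + headerLen - 1) / headerLen * headerLen
	}
	buf := make([]byte, headerLen+padded)
	binary.LittleEndian.PutUint32(buf[0:], uint32(e.WD))
	binary.LittleEndian.PutUint32(buf[4:], e.Mask)
	binary.LittleEndian.PutUint32(buf[8:], e.Cookie)
	binary.LittleEndian.PutUint32(buf[12:], uint32(padded))
	copy(buf[headerLen:], e.Name)
	return buf
}

// Unmarshal decodes one event and reports how many bytes it consumed, so a
// caller can walk a buffer that holds several events back to back.
func Unmarshal(buf []byte) (Event, int, error) {
	if len(buf) < headerLen {
		return Event{}, 0, fmt.Errorf("short buffer: %d bytes", len(buf))
	}
	nameLen := int(binary.LittleEndian.Uint32(buf[12:]))
	if len(buf)-headerLen < nameLen {
		return Event{}, 0, fmt.Errorf("truncated name: want %d bytes", nameLen)
	}
	name := buf[headerLen : headerLen+nameLen]
	if i := bytes.IndexByte(name, 0); i >= 0 {
		name = name[:i] // strip NUL padding
	}
	ev := Event{
		WD:     int32(binary.LittleEndian.Uint32(buf[0:])),
		Mask:   binary.LittleEndian.Uint32(buf[4:]),
		Cookie: binary.LittleEndian.Uint32(buf[8:]),
		Name:   string(name),
	}
	return ev, headerLen + nameLen, nil
}

func main() {
	buf := Event{WD: 3, Mask: 0x2, Name: "bar"}.Marshal()
	ev, n, err := Unmarshal(buf)
	fmt.Println(len(buf), n, ev.Name, err) // prints: 32 32 bar <nil>
}
```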

## Lock Ordering

There are three locks related to the inotify implementation:

*   Inotify.mu: the inotify instance lock.
*   Inotify.evMu: the inotify event queue lock.
*   Watches.mu: the watch set lock, used to protect the collection of watches
    on a target.

The correct lock ordering for inotify code is:

Inotify.mu -> Watches.mu -> Inotify.evMu.

Note that we use a distinct lock to protect the inotify event queue. If we
simply used Inotify.mu, we could simultaneously have locks being acquired in the
order of Inotify.mu -> Watches.mu and Watches.mu -> Inotify.mu, which would
cause deadlocks. For instance, adding a watch to an inotify instance would
require locking Inotify.mu, and then adding the same watch to the target would
cause Watches.mu to be held. At the same time, generating an event on the target
would require Watches.mu to be held before iterating through each watch, and
then notifying the owner of each watch would cause Inotify.mu to be held.

See the vfs package comment to understand how inotify locks fit into the overall
ordering of filesystem locks.
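
The ordering can be illustrated with a small sketch. The types are simplified stand-ins rather than the real vfs code, and `stress` is a hypothetical helper. The point is that the add-watch path takes Inotify.mu then Watches.mu, while the notify path takes Watches.mu then Inotify.evMu and never touches Inotify.mu, so the two paths can run concurrently without a lock-order inversion.

```go
package main

import (
	"fmt"
	"sync"
)

type Inotify struct {
	id uint64

	mu      sync.Mutex // instance lock: protects watches
	watches map[*Watches]*Watch

	evMu   sync.Mutex // event queue lock: protects events
	events []uint32
}

type Watch struct {
	owner *Inotify
	mask  uint32
}

type Watches struct {
	mu sync.RWMutex // watch set lock: protects ws
	ws map[uint64]*Watch
}

func NewInotify(id uint64) *Inotify {
	return &Inotify{id: id, watches: make(map[*Watches]*Watch)}
}

func NewWatches() *Watches {
	return &Watches{ws: make(map[uint64]*Watch)}
}

// AddWatch locks in the order Inotify.mu -> Watches.mu.
func (i *Inotify) AddWatch(t *Watches, mask uint32) {
	i.mu.Lock()
	defer i.mu.Unlock()
	t.mu.Lock()
	defer t.mu.Unlock()
	w := &Watch{owner: i, mask: mask}
	i.watches[t] = w
	t.ws[i.id] = w
}

// Notify locks in the order Watches.mu -> Inotify.evMu.
func (t *Watches) Notify(events uint32) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	for _, w := range t.ws {
		if w.mask&events != 0 {
			w.owner.queueEvent(events)
		}
	}
}

// queueEvent only needs evMu. If it took Inotify.mu instead, Notify would
// acquire Watches.mu -> Inotify.mu, the reverse of AddWatch.
func (i *Inotify) queueEvent(ev uint32) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// stress runs n concurrent AddWatch/Notify pairs and returns the number of
// queued events.
func stress(n int) int {
	in, target := NewInotify(1), NewWatches()
	in.AddWatch(target, 0x2)
	var wg sync.WaitGroup
	for k := 0; k < n; k++ {
		wg.Add(2)
		go func() { defer wg.Done(); in.AddWatch(target, 0x2) }()
		go func() { defer wg.Done(); target.Notify(0x2) }()
	}
	wg.Wait()
	return len(in.events)
}

func main() {
	fmt.Println(stress(1000)) // prints: 1000
}
```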

## Watch Targets in Different Filesystem Implementations

In Linux, watches reside on inodes at the virtual filesystem layer. As a result,
all hard links and file descriptions on a single file share the same
watch set. In VFS2, there is no common inode structure across filesystem types
(some may not even have inodes), so we have to plumb inotify support through
each specific filesystem implementation. Some of the technical considerations
are outlined below.

### Tmpfs

For filesystems with inodes, like tmpfs, the design is quite similar to that of
Linux, where watches reside on the inode.

### Pseudo-filesystems

Technically, because inotify is implemented at the vfs layer in Linux,
pseudo-filesystems on top of kernfs support inotify passively. However, watches
can only track explicit filesystem operations like read/write, open/close,
mknod, etc., so watches on a target like /proc/self/fd will not generate events
every time a new fd is added or removed. As of this writing, we leave inotify
unimplemented in kernfs and anonfs; it does not seem particularly useful.

### Gofer Filesystem (fsimpl/gofer)

The gofer filesystem has several traits that make it difficult to support
inotify:

*   **There are no inodes.** A file is represented as a dentry that holds an
    unopened p9 file (and possibly an open FID), through which the Sentry
    interacts with the gofer.
    *   *Solution:* Because there is no inode structure stored in the sandbox,
        inotify watches must be held on the dentry. This would be an issue in
        the presence of hard links, where multiple dentries would need to share
        the same set of watches, but in VFS2, we do not support the internal
        creation of hard links on gofer fs. As a result, we make the assumption
        that every dentry corresponds to a unique inode. However, the next point
        raises an issue with this assumption:
*   **The Sentry cannot always be aware of hard links on the remote
    filesystem.** There is no way for us to confirm whether two files on the
    remote filesystem are actually links to the same inode. QIDs and inodes are
    not always 1:1. The assumption that dentries and inodes are 1:1 is
    inevitably broken if there are remote hard links that we cannot detect.
    *   *Solution:* this is an issue with gofer fs in general, not only inotify,
        and we will have to live with it.
*   **Dentries can be cached, and then evicted.** Dentry lifetime does not
    correspond to file lifetime. Because gofer fs is not entirely in-memory, the
    absence of a dentry does not mean that the corresponding file does not
    exist, nor does a dentry reaching zero references mean that the
    corresponding file no longer exists. When a dentry reaches zero references,
    it will be cached, in case the file at that path is needed again in the
    future. However, the dentry may be evicted from the cache, which will cause
    a new dentry to be created next time the same file path is used. The
    existing watches will be lost.
    *   *Solution:* When a dentry reaches zero references, do not cache it if it
        has any watches, so we can avoid eviction/destruction. Note that if the
        dentry was deleted or invalidated (d.vfsd.IsDead()), we should still
        destroy it along with its watches. Additionally, when a dentry’s last
        watch is removed, we cache it if it also has zero references. This way,
        the dentry can eventually be evicted from memory if it is no longer
        needed.
*   **Dentries can be invalidated.** Another issue with dentry lifetime is that
    the remote file at the file path represented may change from underneath the
    dentry. In this case, the next time that the dentry is used, it will be
    invalidated and a new dentry will replace it. It is not clear what should be
    done with the watches on the old dentry.
    *   *Solution:* Silently destroy the watches when invalidation occurs. We
        have no way of knowing exactly what happened, or when. Inotify
        instances on NFS files in Linux probably behave in a similar fashion,
        since inotify is implemented at the vfs layer and is not aware of the
        complexities of remote file systems.
    *   An alternative would be to issue some kind of event upon invalidation,
        e.g. a delete event, but this has several issues:
        *   We cannot discern whether the remote file was invalidated because it
            was moved, deleted, etc. This information is crucial, because these
            cases should result in different events. Furthermore, the watches
            should only be destroyed if the file has been deleted.
        *   Moreover, the mechanism for detecting whether the underlying file
            has changed is to check whether a new QID is given by the gofer.
            This may result in false positives, e.g. suppose that the server
            closed and re-opened the same file, which may result in a new QID.
        *   Finally, the time of the event may be completely different from the
            time of the file modification, since a dentry is not immediately
            notified when the underlying file has changed. It would be quite
            unexpected to receive the notification when invalidation was
            triggered, i.e. the next time the file was accessed within the
            sandbox, because then the read/write/etc. operation on the file
            would not result in the expected event.
    *   Another point in favor of the first solution: inotify in Linux can
        already be lossy on local filesystems (one of the sacrifices made so
        that filesystem performance isn’t killed), and it is lossy on NFS for
        similar reasons to gofer fs. Therefore, it is better for inotify to be
        silent than to emit incorrect notifications.
*   **There may be external users of the remote filesystem.** We can only track
    operations performed on the file within the sandbox. This is sufficient
    under InteropModeExclusive, but whenever there are external users, the set
    of actions we are aware of is incomplete.
    *   *Solution:* We could either return an error or just issue a warning when
        inotify is used without InteropModeExclusive. Although faulty, VFS1
        allows it when the filesystem is shared, and Linux does the same for
        remote filesystems (as mentioned above, inotify sits at the vfs level).
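
The caching rule for watched dentries described above can be summarized as a small decision function. This is only a sketch of the policy as stated in this document, not the gofer implementation: the `dentry` fields, the `action` values, and `checkCaching` are hypothetical names, and `dead` stands in for the d.vfsd.IsDead() check.

```go
package main

import "fmt"

type action int

const (
	keep    action = iota // referenced or watched: stays in memory, uncached
	cache                 // unreferenced and unwatched: eligible for eviction
	destroy               // deleted or invalidated: dropped with its watches
)

// dentry is a hypothetical stand-in for a gofer dentry.
type dentry struct {
	refs    int64
	watches int
	dead    bool // deleted or invalidated
}

// checkCaching decides what happens to a dentry whose reference count or
// watch count has just changed.
func (d *dentry) checkCaching() action {
	switch {
	case d.refs > 0:
		return keep
	case d.dead:
		// A dead dentry is destroyed even if it still has watches.
		return destroy
	case d.watches > 0:
		// Keeping it out of the cache avoids eviction, which would
		// silently lose the watches.
		return keep
	default:
		return cache
	}
}

// removeLastWatch models the point where the final watch goes away: the
// dentry becomes cacheable if it also has zero references.
func (d *dentry) removeLastWatch() action {
	d.watches = 0
	return d.checkCaching()
}

func main() {
	d := &dentry{refs: 0, watches: 1}
	fmt.Println(d.checkCaching() == keep)     // prints: true
	fmt.Println(d.removeLastWatch() == cache) // prints: true
}
```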

## Dentry Interface

For events that must be generated above the vfs layer, we provide the following
DentryImpl methods to allow interactions with targets on any FilesystemImpl:

*   **InotifyWithParent()** generates events on the dentry’s watches as well as
    its parent’s.
*   **Watches()** retrieves the watch set of the target represented by the
    dentry. This is used to access and modify watches on a target.
*   **OnZeroWatches()** performs cleanup tasks after the last watch is removed
    from a dentry. This is needed by gofer fs, which must allow a watched dentry
    to be cached once it has no more watches. Most implementations can just do
    nothing. Note that OnZeroWatches() must be called after all inotify locks
    are released to preserve lock ordering, since it may acquire
    FilesystemImpl-specific locks.
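
A rough sketch of how these methods fit together is shown below. The signatures are deliberately simplified and should be treated as assumptions (the real methods take more parameters), and `RemoveWatch` and `memDentry` are hypothetical. The part worth noting is that `OnZeroWatches()` is invoked only after the watch set lock has been released.

```go
package main

import (
	"fmt"
	"sync"
)

// Watches is a simplified watch set: owner id -> event mask.
type Watches struct {
	mu sync.Mutex
	ws map[uint64]uint32
}

// DentryImpl lists the three inotify hooks with simplified signatures.
type DentryImpl interface {
	InotifyWithParent(events uint32)
	Watches() *Watches
	OnZeroWatches()
}

// RemoveWatch drops ownerID's watch on d. The cleanup hook runs only after
// the watch set lock is released, since it may take filesystem locks.
func RemoveWatch(d DentryImpl, ownerID uint64) {
	ws := d.Watches()
	ws.mu.Lock()
	_, had := ws.ws[ownerID]
	delete(ws.ws, ownerID)
	nowEmpty := had && len(ws.ws) == 0
	ws.mu.Unlock()

	if nowEmpty {
		d.OnZeroWatches()
	}
}

// memDentry is a toy in-memory dentry that counts deliveries and cleanups.
type memDentry struct {
	parent    *memDentry
	watches   Watches
	delivered int
	zeroCalls int
}

func newMemDentry(parent *memDentry) *memDentry {
	return &memDentry{parent: parent, watches: Watches{ws: make(map[uint64]uint32)}}
}

// notify counts one delivery per watch whose mask matches the events.
func (d *memDentry) notify(events uint32) {
	d.watches.mu.Lock()
	defer d.watches.mu.Unlock()
	for _, mask := range d.watches.ws {
		if mask&events != 0 {
			d.delivered++
		}
	}
}

// InotifyWithParent notifies the parent's watches as well as the dentry's.
func (d *memDentry) InotifyWithParent(events uint32) {
	if d.parent != nil {
		d.parent.notify(events)
	}
	d.notify(events)
}

func (d *memDentry) Watches() *Watches { return &d.watches }

// OnZeroWatches would be a no-op for most filesystems; here it just counts.
func (d *memDentry) OnZeroWatches() { d.zeroCalls++ }

func main() {
	dir := newMemDentry(nil)
	file := newMemDentry(dir)
	dir.watches.ws[1] = 0x2
	file.watches.ws[1] = 0x2

	var d DentryImpl = file
	d.InotifyWithParent(0x2)
	RemoveWatch(d, 1)
	fmt.Println(dir.delivered, file.delivered, file.zeroCalls) // prints: 1 1 1
}
```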

## IN_EXCL_UNLINK

There are several options that can be set for a watch, specified as part of the
mask in inotify_add_watch(2). In particular, IN_EXCL_UNLINK requires some
additional support in each filesystem.

A watch with IN_EXCL_UNLINK will not generate events for its target if it
corresponds to a path that was unlinked. For instance, if an fd is opened on
“foo/bar” and “foo/bar” is subsequently unlinked, any reads/writes/etc. on the
fd will be ignored by watches on “foo” or “foo/bar” with IN_EXCL_UNLINK. This
requires each DentryImpl to keep track of whether it has been unlinked, in order
to determine whether events should be sent to watches with IN_EXCL_UNLINK.
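
The filtering step could look roughly like this. The mask values are the ones Linux defines for these flags. The `dentry` and `Watch` types and the `wants` method are hypothetical, and the sketch does not model any further distinctions a real implementation might draw between kinds of events.

```go
package main

import "fmt"

const (
	inAccess     uint32 = 0x00000001
	inModify     uint32 = 0x00000002
	inExclUnlink uint32 = 0x04000000
)

// dentry is a hypothetical target that remembers whether it was unlinked.
type dentry struct {
	unlinked bool
}

type Watch struct {
	mask uint32
}

// wants reports whether the watch should receive events generated through
// the given dentry.
func (w *Watch) wants(events uint32, via *dentry) bool {
	if via.unlinked && w.mask&inExclUnlink != 0 {
		return false // the path is gone and the watch opted out
	}
	return w.mask&events != 0
}

func main() {
	bar := &dentry{}
	plain := &Watch{mask: inModify}
	excl := &Watch{mask: inModify | inExclUnlink}

	fmt.Println(plain.wants(inModify, bar), excl.wants(inModify, bar)) // prints: true true
	bar.unlinked = true                                                // "foo/bar" is unlinked; an fd stays open
	fmt.Println(plain.wants(inModify, bar), excl.wants(inModify, bar)) // prints: true false
}
```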

## IN_ONESHOT

One-shot watches expire after generating a single event. When an event occurs,
all one-shot watches on the target that successfully generated an event are
removed. Lock ordering can cause the management of one-shot watches to be quite
expensive; see Watches.Notify() for more information.
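
The sketch below shows why removal is awkward under the lock ordering given earlier. The types are simplified stand-ins, and the `expired` flag is an assumption of this sketch rather than something this document describes. Notification runs under Watches.mu, but removing a watch needs Inotify.mu first, so fired one-shot watches are collected, Watches.mu is released, and each is then removed with the locks taken in the correct order.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	inModify  uint32 = 0x00000002
	inOneshot uint32 = 0x80000000
)

type Inotify struct {
	id      uint64
	mu      sync.Mutex // protects watches
	watches map[*Watches]*Watch
	evMu    sync.Mutex // protects events
	events  []uint32
}

type Watch struct {
	owner   *Inotify
	mask    uint32
	expired int32 // set to 1 once a one-shot watch has fired (assumption)
}

type Watches struct {
	mu sync.RWMutex // protects ws
	ws map[uint64]*Watch
}

func NewInotify(id uint64) *Inotify {
	return &Inotify{id: id, watches: make(map[*Watches]*Watch)}
}

func NewWatches() *Watches { return &Watches{ws: make(map[uint64]*Watch)} }

func (i *Inotify) AddWatch(t *Watches, mask uint32) {
	i.mu.Lock()
	defer i.mu.Unlock()
	t.mu.Lock()
	defer t.mu.Unlock()
	w := &Watch{owner: i, mask: mask}
	i.watches[t] = w
	t.ws[i.id] = w
}

func (i *Inotify) queue(ev uint32) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// removeWatch takes Inotify.mu -> Watches.mu, matching the lock ordering.
func (i *Inotify) removeWatch(t *Watches, w *Watch) {
	i.mu.Lock()
	defer i.mu.Unlock()
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.ws[i.id] == w { // skip if the watch was replaced in the meantime
		delete(t.ws, i.id)
		delete(i.watches, t)
	}
}

// Notify queues events, then removes fired one-shot watches after dropping
// Watches.mu so that removeWatch can lock in the right order.
func (t *Watches) Notify(events uint32) {
	var fired []*Watch
	t.mu.RLock()
	for _, w := range t.ws {
		if w.mask&events == 0 {
			continue
		}
		if w.mask&inOneshot != 0 {
			if !atomic.CompareAndSwapInt32(&w.expired, 0, 1) {
				continue // already fired; removal is pending
			}
			fired = append(fired, w)
		}
		w.owner.queue(events)
	}
	t.mu.RUnlock()
	for _, w := range fired {
		w.owner.removeWatch(t, w)
	}
}

func main() {
	in, target := NewInotify(1), NewWatches()
	in.AddWatch(target, inModify|inOneshot)
	target.Notify(inModify)
	target.Notify(inModify)                     // the watch is already gone
	fmt.Println(len(in.events), len(target.ws)) // prints: 1 0
}
```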