Age | Commit message (Collapse) | Author |
|
Previously, mount could discover a hierarchy being destroyed
concurrently, which resulted in mount attempting to take a ref on an
already destroyed cgroupfs.
Reported-by: syzbot+062c0a67798a200f23ee@syzkaller.appspotmail.com
PiperOrigin-RevId: 374959054
|
|
O_PATH is now implemented in vfs2.
Fixes #2782.
PiperOrigin-RevId: 373861410
|
|
Previously, registration was racy because we were publishing
hierarchies in the registry without fully initializing the underlying
filesystem. This led to concurrent mount(2)s discovering the partially
intialized filesystems and dropping the final refs on them which cause
them to be freed prematurely.
Reported-by: syzbot+13f54e77bdf59f0171f0@syzkaller.appspotmail.com
Reported-by: syzbot+2c7f0a9127ac6a84f17e@syzkaller.appspotmail.com
PiperOrigin-RevId: 373824552
|
|
The newly added Weirdness metric with fields should be used instead of them.
Simple query for weirdness metric: http://shortn/_DGNk0z2Up6
PiperOrigin-RevId: 370578132
|
|
Weirdness metric contains fields to track the number of clock fallback,
partial result and vsyscalls. This metric will avoid the overhead of
having three different metrics (fallbackMetric, partialResultMetric,
vsyscallCount).
PiperOrigin-RevId: 369970218
|
|
Reported-by: syzbot+a6ef0f95a2c9e7da26f3@syzkaller.appspotmail.com
Reported-by: syzbot+2eaf8a9f115edec468fe@syzkaller.appspotmail.com
PiperOrigin-RevId: 368093861
|
|
A skeleton implementation of cgroupfs. It supports trivial cpu and
memory controllers with no support for hierarchies.
PiperOrigin-RevId: 366561126
|
|
Split usermem package to help remove syserror dependency in go_marshal.
New hostarch package contains code not dependent on syserror.
PiperOrigin-RevId: 365651233
|
|
This is necessary since ptraceClone() mutates tracer.ptraceTracees.
PiperOrigin-RevId: 365152396
|
|
On Linux these are meant to be equivalent to POLLIN/POLLOUT. Rather
than hack these on in sys_poll etc it felt cleaner to just cleanup
the call sites to notify for both events. This is what linux does
as well.
Fixes #5544
PiperOrigin-RevId: 364859977
|
|
This change is inspired by Adin's cl/355256448.
PiperOrigin-RevId: 364695931
|
|
The syscall package has been deprecated in favor of golang.org/x/sys.
Note that syscall is still used in the following places:
- pkg/sentry/socket/hostinet/stack.go: some netlink related functionalities
are not yet available in golang.org/x/sys.
- syscall.Stat_t is still used in some places because os.FileInfo.Sys() still
returns it and not unix.Stat_t.
Updates #214
PiperOrigin-RevId: 360701387
|
|
PiperOrigin-RevId: 359591577
|
|
Restrict ptrace(2) according to the default configurations of the YAMA security
module (mode 1), which is a common default among various Linux distributions.
The new access checks only permit the tracer to proceed if one of the following
conditions is met:
a) The tracer is already attached to the tracee.
b) The target is a descendant of the tracer.
c) The target has explicitly given permission to the tracer through the
PR_SET_PTRACER prctl.
d) The tracer has CAP_SYS_PTRACE.
See security/yama/yama_lsm.c for more details.
Note that these checks are added to CanTrace, which is checked for
PTRACE_ATTACH as well as some other operations, e.g., checking a process'
memory layout through /proc/[pid]/mem.
Since this patch adds restrictions to ptrace, it may break compatibility for
applications run by non-root users that, for instance, rely on being able to
trace processes that are not descended from the tracer (e.g., `gdb -p`). YAMA
restrictions can be turned off by setting /proc/sys/kernel/yama/ptrace_scope
to 0, or exceptions can be made on a per-process basis with the PR_SET_PTRACER
prctl.
Reported-by: syzbot+622822d8bca08c99e8c8@syzkaller.appspotmail.com
PiperOrigin-RevId: 359237723
|
|
This removes a three-lock deadlock between fdnotifier.notifier.mu,
epoll.EventPoll.listsMu, and baseEndpoint.mu.
A lock order comment was added to epoll/epoll.go.
Also fix unsafe access of baseEndpoint.connected/receiver.
PiperOrigin-RevId: 358515191
|
|
PiperOrigin-RevId: 357015186
|
|
Reported-by: syzbot+9ffc71246fe72c73fc25@syzkaller.appspotmail.com
PiperOrigin-RevId: 356536113
|
|
PiperOrigin-RevId: 356450303
|
|
This was missed in cl/351911375; pipe.VFSPipeFD.SpliceFromNonPipe already calls
Notify.
PiperOrigin-RevId: 355246655
|
|
This improves type-assertion safety.
PiperOrigin-RevId: 353931228
|
|
IN_CLOSE should only be generated when a file description loses its last
reference; not when a file descriptor is closed.
See fs/file_table.c:__fput.
Updates #5348.
PiperOrigin-RevId: 353810697
|
|
Fixes #5113.
PiperOrigin-RevId: 353313374
|
|
Fixes #1509.
PiperOrigin-RevId: 353295589
|
|
PiperOrigin-RevId: 352904728
|
|
- Remove the pipe package's dependence on the buffer package, which becomes
unused as a result. The buffer package is currently intended to serve two use
cases, pipes and temporary buffers, and does neither optimally as a result;
this change facilitates retooling the buffer package to better serve the
latter.
- Pass callbacks taking safemem.BlockSeq to the internal pipe I/O methods,
which makes most callbacks trivial.
- Fix VFS1's splice() and tee() to immediately return if a pipe returns a
partial write.
PiperOrigin-RevId: 351911375
|
|
These are primarily simplification and lint mistakes. However, minor
fixes are also included and tests added where appropriate.
PiperOrigin-RevId: 351425971
|
|
Syzkaller discovered this bug in pipefs by doing something quite strange:
creat(&(0x7f0000002a00)='./file1\x00', 0x0)
mount(&(0x7f0000000440)=ANY=[], &(0x7f00000002c0)='./file1\x00', &(0x7f0000000300)='devtmpfs\x00', 0x20000d, 0x0)
creat(&(0x7f0000000000)='./file1/file0\x00', 0x0)
This can be reproduced with:
touch mymount
mkfifo /dev/mypipe
mount -o ro -t devtmpfs devtmpfs mymount
echo 123 > mymount/mypipe
PiperOrigin-RevId: 349687714
|
|
PiperOrigin-RevId: 347711998
|
|
We should not assert that all resources are dropped after saving.
PiperOrigin-RevId: 347420131
|
|
PiperOrigin-RevId: 347047550
|
|
PiperOrigin-RevId: 346973338
|
|
PiperOrigin-RevId: 345589628
|
|
These options allow overriding the signal that gets sent to the process when
I/O operations are available on the file descriptor, rather than the default
`SIGIO` signal. Doing so also populates `siginfo` to contain extra information
about which file descriptor caused the event (`si_fd`) and what events happened
on it (`si_band`). The logic around which FD is populated within `si_fd`
matches Linux's, which means it has some weird edge cases where that value may
not actually refer to a file descriptor that is still valid.
This CL also ports extra S/R logic regarding async handler in VFS2.
Without this, async I/O handlers aren't properly re-registered after S/R.
PiperOrigin-RevId: 345436598
|
|
`slice := *(*[]unsafe.Pointer)(...)` makes a copy of the slice header, which
then escapes because of the conditional `atomic.StorePointer(&f.slice, &slice)`
from table expansion. This occurs even when the table doesn't expand, and when
it can't (e.g. `close()` => `f.setAll(nil)`). Fix this by avoiding the copy
until after table expansion.
Before this CL:
```
TEXT pkg/sentry/kernel/kernel.(*FDTable).setAll(SB) pkg/sentry/kernel/fd_table_unsafe.go
fd_table_unsafe.go:119 0x7f00005f50e0 64488b0c25f8ffffff MOVQ FS:0xfffffff8, CX
fd_table_unsafe.go:119 0x7f00005f50e9 483b6110 CMPQ 0x10(CX), SP
fd_table_unsafe.go:119 0x7f00005f50ed 0f864d040000 JBE 0x7f00005f5540
fd_table_unsafe.go:119 0x7f00005f50f3 4883c480 ADDQ $-0x80, SP
fd_table_unsafe.go:119 0x7f00005f50f7 48896c2478 MOVQ BP, 0x78(SP)
fd_table_unsafe.go:119 0x7f00005f50fc 488d6c2478 LEAQ 0x78(SP), BP
fd_table_unsafe.go:120 0x7f00005f5101 488b8424a8000000 MOVQ 0xa8(SP), AX
fd_table_unsafe.go:120 0x7f00005f5109 4885c0 TESTQ AX, AX
fd_table_unsafe.go:120 0x7f00005f510c 7411 JE 0x7f00005f511f
fd_table_unsafe.go:120 0x7f00005f510e 488b8c24b0000000 MOVQ 0xb0(SP), CX
fd_table_unsafe.go:120 0x7f00005f5116 4885c9 TESTQ CX, CX
fd_table_unsafe.go:120 0x7f00005f5119 0f8500040000 JNE 0x7f00005f551f
fd_table_unsafe.go:124 0x7f00005f511f 488d05da115700 LEAQ 0x5711da(IP), AX
fd_table_unsafe.go:124 0x7f00005f5126 48890424 MOVQ AX, 0(SP)
fd_table_unsafe.go:124 0x7f00005f512a e8d19fa1ff CALL runtime.newobject(SB)
fd_table_unsafe.go:124 0x7f00005f512f 488b7c2408 MOVQ 0x8(SP), DI
fd_table_unsafe.go:124 0x7f00005f5134 488b842488000000 MOVQ 0x88(SP), AX
fd_table_unsafe.go:124 0x7f00005f513c 488b4820 MOVQ 0x20(AX), CX
fd_table_unsafe.go:124 0x7f00005f5140 488b5108 MOVQ 0x8(CX), DX
fd_table_unsafe.go:124 0x7f00005f5144 488b19 MOVQ 0(CX), BX
fd_table_unsafe.go:124 0x7f00005f5147 488b4910 MOVQ 0x10(CX), CX
fd_table_unsafe.go:124 0x7f00005f514b 48895708 MOVQ DX, 0x8(DI)
fd_table_unsafe.go:124 0x7f00005f514f 48894f10 MOVQ CX, 0x10(DI)
fd_table_unsafe.go:124 0x7f00005f5153 833df6e1120100 CMPL $0x0, runtime.writeBarrier(SB)
fd_table_unsafe.go:124 0x7f00005f515a 660f1f440000 NOPW 0(AX)(AX*1)
fd_table_unsafe.go:124 0x7f00005f5160 0f8589030000 JNE 0x7f00005f54ef
fd_table_unsafe.go:124 0x7f00005f5166 48891f MOVQ BX, 0(DI)
fd_table_unsafe.go:124 0x7f00005f5169 48897c2470 MOVQ DI, 0x70(SP)
fd_table_unsafe.go:127 0x7f00005f516e 8bb424a0000000 MOVL 0xa0(SP), SI
fd_table_unsafe.go:127 0x7f00005f5175 39d6 CMPL DX, SI
fd_table_unsafe.go:127 0x7f00005f5177 0f8c5f030000 JL 0x7f00005f54dc
...
```
After this CL:
```
TEXT pkg/sentry/kernel/kernel.(*FDTable).setAll(SB) pkg/sentry/kernel/fd_table_unsafe.go
fd_table_unsafe.go:119 0x7f00005f50e0 64488b0c25f8ffffff MOVQ FS:0xfffffff8, CX
fd_table_unsafe.go:119 0x7f00005f50e9 488d4424e8 LEAQ -0x18(SP), AX
fd_table_unsafe.go:119 0x7f00005f50ee 483b4110 CMPQ 0x10(CX), AX
fd_table_unsafe.go:119 0x7f00005f50f2 0f868e040000 JBE 0x7f00005f5586
fd_table_unsafe.go:119 0x7f00005f50f8 4881ec98000000 SUBQ $0x98, SP
fd_table_unsafe.go:119 0x7f00005f50ff 4889ac2490000000 MOVQ BP, 0x90(SP)
fd_table_unsafe.go:119 0x7f00005f5107 488dac2490000000 LEAQ 0x90(SP), BP
fd_table_unsafe.go:120 0x7f00005f510f 488b9424c0000000 MOVQ 0xc0(SP), DX
fd_table_unsafe.go:120 0x7f00005f5117 660f1f840000000000 NOPW 0(AX)(AX*1)
fd_table_unsafe.go:120 0x7f00005f5120 4885d2 TESTQ DX, DX
fd_table_unsafe.go:120 0x7f00005f5123 0f8406040000 JE 0x7f00005f552f
fd_table_unsafe.go:120 0x7f00005f5129 488b9c24c8000000 MOVQ 0xc8(SP), BX
fd_table_unsafe.go:120 0x7f00005f5131 4885db TESTQ BX, BX
fd_table_unsafe.go:120 0x7f00005f5134 0f852b040000 JNE 0x7f00005f5565
fd_table_unsafe.go:124 0x7f00005f513a 488bb424a0000000 MOVQ 0xa0(SP), SI
fd_table_unsafe.go:124 0x7f00005f5142 488b7e20 MOVQ 0x20(SI), DI
fd_table_unsafe.go:127 0x7f00005f5146 4c8b4708 MOVQ 0x8(DI), R8
fd_table_unsafe.go:127 0x7f00005f514a 448b8c24b8000000 MOVL 0xb8(SP), R9
fd_table_unsafe.go:127 0x7f00005f5152 4539c1 CMPL R8, R9
fd_table_unsafe.go:127 0x7f00005f5155 0f8d4a020000 JGE 0x7f00005f53a5
...
```
PiperOrigin-RevId: 345363242
|
|
PiperOrigin-RevId: 345178956
|
|
PiperOrigin-RevId: 343123278
|
|
As part of this, change Task.interrupted() to not drain Task.interruptChan, and
do so explicitly using new function Task.unsetInterrupted() instead.
PiperOrigin-RevId: 342768365
|
|
PiperOrigin-RevId: 342373580
|
|
Checks in Task.block() and Task.Value() are conditional on race detection being
enabled, since these functions are relatively hot. Checks in Task.SleepStart()
and Task.UninterruptibleSleepStart() are enabled unconditionally, since these
functions are not thought to lie on any critical paths, and misuse of these
functions is required for b/168241471 to manifest.
PiperOrigin-RevId: 342342175
|
|
kernel.Task can only be used as context.Context by that Task's task goroutine.
This is violated in at least two places:
- In any case where one thread accesses the /proc/[tid] of any other thread,
passing the kernel.Task for [tid] as the context.Context is incorrect.
- Task.rebuildTraceContext() may be called by Kernel.RebuildTraceContexts()
outside the scope of any task goroutine.
Fix these (as well as a data race on Task.traceContext discovered during the
course of finding the latter).
PiperOrigin-RevId: 342174404
|
|
This reduces confusion with context.Context (which is also relevant to
kernel.Tasks) and is consistent with existing function kernel.LoadTaskImage().
PiperOrigin-RevId: 342167298
|
|
This lets us avoid treating a value of 0 as one reference. All references
using the refsvfs2 template must call InitRefs() before the reference is
incremented/decremented, or else a panic will occur. Therefore, it should be
pretty easy to identify missing InitRef calls during testing.
Updates #1486.
PiperOrigin-RevId: 341411151
|
|
PiperOrigin-RevId: 341154192
|
|
PiperOrigin-RevId: 340536306
|
|
The default pipe size already matched linux, and is unchanged.
Furthermore `atomicIOBytes` is made a proper constant (as it is in Linux). We
were plumbing usermem.PageSize everywhere, so this is no functional change.
PiperOrigin-RevId: 340497006
|
|
PiperOrigin-RevId: 340389884
|
|
kernel.copyContext{t} cannot be used outside of t's task goroutine, for three
reasons:
- t.CopyScratchBuffer() is task-goroutine-local.
- Calling t.MemoryManager() without running on t's task goroutine or locking
t.mu violates t.MemoryManager()'s preconditions.
- kernel.copyContext passes t as context.Context to MM IO methods, which is
illegal outside of t's task goroutine (cf. kernel.Task.Value()).
Fix this by splitting AsCopyContext() into CopyContext() (which takes an
explicit context.Context and is usable outside of the task goroutine) and
OwnCopyContext() (which uses t as context.Context, but is only usable by t's
task goroutine).
PiperOrigin-RevId: 339933809
|
|
PiperOrigin-RevId: 339166854
|
|
Inode number consistency checks are now skipped in save/restore tests for
reasons described in greatest detail in StatTest.StateDoesntChangeAfterRename.
They pass in VFS1 due to the bug described in new test case
SimpleStatTest.DifferentFilesHaveDifferentDeviceInodeNumberPairs.
Fixes #1663
PiperOrigin-RevId: 338776148
|
|
Our current reference leak checker uses finalizers to verify whether an object
has reached zero references before it is garbage collected. There are multiple
problems with this mechanism, so a rewrite is in order.
With finalizers, there is no way to guarantee that a finalizer will run before
the program exits. When an unreachable object with a finalizer is garbage
collected, its finalizer will be added to a queue and run asynchronously. The
best we can do is run garbage collection upon sandbox exit to make sure that
all finalizers are enqueued.
Furthermore, if there is a chain of finalized objects, e.g. A points to B
points to C, garbage collection needs to run multiple times before all of the
finalizers are enqueued. The first GC run will register the finalizer for A but
not free it. It takes another GC run to free A, at which point B's finalizer
can be registered. As a result, we need to run GC as many times as the length
of the longest such chain to have a somewhat reliable leak checker.
Finally, a cyclical chain of structs pointing to one another will never be
garbage collected if a finalizer is set. This is a well-known issue with Go
finalizers (https://github.com/golang/go/issues/7358). Using leak checking on
filesystem objects that produce cycles will not work and even result in memory
leaks.
The new leak checker stores reference counted objects in a global map when
leak check is enabled and removes them once they are destroyed. At sandbox
exit, any remaining objects in the map are considered as leaked. This provides
a deterministic way of detecting leaks without relying on the complexities of
finalizers and garbage collection.
This approach has several benefits over the former, including:
- Always detects leaks of objects that should be destroyed very close to
sandbox exit. The old checker very rarely detected these leaks, because it
relied on garbage collection to be run in a short window of time.
- Panics if we forgot to enable leak check on a ref-counted object (we will try
to remove it from the map when it is destroyed, but it will never have been
added).
- Can store extra logging information in the map values without adding to the
size of the ref count struct itself. With the size of just an int64, the ref
count object remains compact, meaning frequent operations like IncRef/DecRef
are more cache-efficient.
- Can aggregate leak results in a single report after the sandbox exits.
Instead of having warnings littered in the log, which were
non-deterministically triggered by garbage collection, we can print all
warning messages at once. Note that this could also be a limitation--the
sandbox must exit properly for leaks to be detected.
Some basic benchmarking indicates that this change does not significantly
affect performance when leak checking is enabled, which is understandable
since registering/unregistering is only done once for each filesystem object.
Updates #1486.
PiperOrigin-RevId: 338685972
|