Age | Commit message (Collapse) | Author |
|
add renameMu.Lock when oldParent == newParent
in order to avoid data race in following report:
WARNING: DATA RACE
Read at 0x00c000ba2160 by goroutine 405:
gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).fullName()
pkg/sentry/fs/dirent.go:246 +0x6c
gvisor.googlesource.com/gvisor/pkg/sentry/fs.(*Dirent).FullName()
pkg/sentry/fs/dirent.go:356 +0x8b
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*FDMap).String()
pkg/sentry/kernel/fd_map.go:135 +0x1e0
fmt.(*pp).handleMethods()
GOROOT/src/fmt/print.go:603 +0x404
fmt.(*pp).printArg()
GOROOT/src/fmt/print.go:686 +0x255
fmt.(*pp).doPrintf()
GOROOT/src/fmt/print.go:1003 +0x33f
fmt.Fprintf()
GOROOT/src/fmt/print.go:188 +0x7f
gvisor.googlesource.com/gvisor/pkg/log.(*Writer).Emit()
pkg/log/log.go:121 +0x89
gvisor.googlesource.com/gvisor/pkg/log.GoogleEmitter.Emit()
pkg/log/glog.go:162 +0x1acc
gvisor.googlesource.com/gvisor/pkg/log.(*GoogleEmitter).Emit()
<autogenerated>:1 +0xe1
gvisor.googlesource.com/gvisor/pkg/log.(*BasicLogger).Debugf()
pkg/log/log.go:177 +0x111
gvisor.googlesource.com/gvisor/pkg/log.Debugf()
pkg/log/log.go:235 +0x66
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Debugf()
pkg/sentry/kernel/task_log.go:48 +0xfe
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).DebugDumpState()
pkg/sentry/kernel/task_log.go:66 +0x11f
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
pkg/sentry/kernel/task_run.go:272 +0xc80
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
pkg/sentry/kernel/task_run.go:91 +0x24b
Previous write at 0x00c000ba2160 by goroutine 423:
gvisor.googlesource.com/gvisor/pkg/sentry/fs.Rename()
pkg/sentry/fs/dirent.go:1628 +0x61f
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1.1()
pkg/sentry/syscalls/linux/sys_file.go:1864 +0x1f8
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt( gvisor.googlesource.com/g/linux/sys_file.go:51 +0x20f
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt.func1()
pkg/sentry/syscalls/linux/sys_file.go:1852 +0x218
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.fileOpAt()
pkg/sentry/syscalls/linux/sys_file.go:51 +0x20f
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.renameAt()
pkg/sentry/syscalls/linux/sys_file.go:1840 +0x180
gvisor.googlesource.com/gvisor/pkg/sentry/syscalls/linux.Rename()
pkg/sentry/syscalls/linux/sys_file.go:1873 +0x60
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).executeSyscall()
pkg/sentry/kernel/task_syscall.go:165 +0x17a
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallInvoke()
pkg/sentry/kernel/task_syscall.go:283 +0xb4
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscallEnter()
pkg/sentry/kernel/task_syscall.go:244 +0x10c
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).doSyscall()
pkg/sentry/kernel/task_syscall.go:219 +0x1e3
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runApp).execute()
pkg/sentry/kernel/task_run.go:215 +0x15a9
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run()
pkg/sentry/kernel/task_run.go:91 +0x24b
Reported-by: syzbot+e1babbf756fab380dfff@syzkaller.appspotmail.com
Change-Id: Icd2620bb3ea28b817bf0672d454a22b9d8ee189a
PiperOrigin-RevId: 242938741
|
|
PiperOrigin-RevId: 242919489
Change-Id: Ie3267b3bcd8a54b54bc16a6556369a19e843376f
|
|
DirentCache is already a savable type, and it ensures that it is empty at the
point of Save. There is no reason not to save it along with the MountSource.
This did uncover an issue where not all MountSources were properly flushed
before Save. If a mount point has an open file and is then unmounted, we save
the MountSource without flushing it first. This CL also fixes that by flushing
all MountSources for all open FDs on Save.
PiperOrigin-RevId: 242906637
Change-Id: I3acd9d52b6ce6b8c989f835a408016cb3e67018f
|
|
This also applies these permissions to other static proc files.
Change-Id: I4167e585fed49ad271aa4e1f1260babb3239a73d
PiperOrigin-RevId: 242898575
|
|
From a recent test failure:
"State:\tD (disk sleep)\n"
"disk sleep" does not match \w+. We need to allow spaces.
PiperOrigin-RevId: 242762469
Change-Id: Ic8d05a16669412a72c1e76b498373e5b22fe64c4
|
|
From sendfile spec and also the linux kernel code, we should
limit the count arg to 'MAX_RW_COUNT'. This patch export
'MAX_RW_COUNT' in kernel pkg and use it in the implementation
of sendfile syscall.
Signed-off-by: Li Qiang <pangpei.lq@antfin.com>
Change-Id: I1086fec0685587116984555abd22b07ac233fbd2
PiperOrigin-RevId: 242745831
|
|
Otherwise, we will not have capabilities in the user namespace.
And this patch adds the noexec option for mounts.
https://github.com/google/gvisor/issues/145
PiperOrigin-RevId: 242706519
Change-Id: I1b78b77d6969bd18038c71616e8eb7111b71207c
|
|
PiperOrigin-RevId: 242704699
Change-Id: I87db368ca343b3b4bf4f969b17d3aa4ce2f8bd4f
|
|
PiperOrigin-RevId: 242690968
Change-Id: I1ac2248b5ab3bcd95beed52ecddbb9f34eeb3775
|
|
PiperOrigin-RevId: 242647530
Change-Id: I1bf9ac1d664f452dc47ca670d408a73538cb482f
|
|
PiperOrigin-RevId: 242573252
Change-Id: Ibb4c6bfae2c2e322bf1cec23181a0ab663d8530a
|
|
Also add kernel.SignalInfoNoInfo, and use it in RLIMIT_FSIZE checks.
PiperOrigin-RevId: 242562428
Change-Id: I4887c0e1c8f5fddcabfe6d4281bf76d2f2eafe90
|
|
PiperOrigin-RevId: 242531141
Change-Id: I2a3bd815bda09f392f511f47120d5d9e6e86a40d
|
|
We construct a ramfs tree of "scaffolding" directories for all mount points, so
that a directory exists that each mount point can be mounted over.
We were creating these directories without write permissions, which meant that
they were not wribable even when underlayed under a writable filesystem. They
should be writable.
PiperOrigin-RevId: 242507789
Change-Id: I86645e35417560d862442ff5962da211dbe9b731
|
|
PiperOrigin-RevId: 242493066
Change-Id: I2b2b590799d208895c5c16606e4f854dfd112dba
|
|
PiperOrigin-RevId: 242226319
Change-Id: Iefc78656841315f6b7d48bd85db451486850264d
|
|
Strings are a better fit for this usage because they are immutable in Go, and
can contain arbitrary bytes. It also allows us to avoid casting bytes to string
(and the associated allocation) in the hot path when checking for overlay
whiteouts.
PiperOrigin-RevId: 242208856
Change-Id: I7699ae6302492eca71787dd0b72e0a5a217a3db2
|
|
This CL merges all RBE-specific configuration from .bazelrc_rbe into .bazelrc
so that it will be picked up by default by users running bazel.
It also checks in a bazelrc from the upstream bazel-toolchains repository, and
imports that into our repo-specific .bazelrc. This makes it easier to maintain
and update the bazelrc going forward.
Documentation was added to the README.
PiperOrigin-RevId: 242208733
Change-Id: Iea32de9be85b024bd74f88909b56b2a8ab34851a
|
|
From the SDM: "The least-significant byte in register EAX (register AL)
will always return 01H. Software should ignore this value and not
interpret it as an informational descriptor."
Unfortunately, online docs [1] [2] (likely based on an old version of the SDM)
say: "The least-significant byte in register EAX (register AL) indicates
the number of times the CPUID instruction must be executed with an input
value of 2 to get a complete description of the processor's caches and
TLBs."
dlang uses this second interpretation [3] and will loop 2^32 times if we
return zero. Fix this by specifying the fixed value of one. We still
don't support exposing the actual cache information, leaving all other
bytes empty. A zero byte means: "Null descriptor, this byte contains no
information."
[1] http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h
[2] https://c9x.me/x86/html/file_module_x86_id_45.html
[3] https://github.com/dlang/druntime/blob/424640864c2aa001731467e96f637bd3e704e481/src/core/cpuid.d#L533-L534
PiperOrigin-RevId: 242046629
Change-Id: Ic0f0a5f974b20f71391cb85645bdcd4003e5fe88
|
|
https://github.com/google/gvisor/issues/145
PiperOrigin-RevId: 242044115
Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
|
|
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid,
respectively, where removing defer saves ~100ns. This may be a small
improvement to application logging, which may call gettid/getpid
frequently.
PiperOrigin-RevId: 242039616
Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
|
|
Change-Id: Ibd6d8a1a63826af6e62a0f0669f8f0866c8091b4
PiperOrigin-RevId: 242037969
|
|
Change-Id: Ibb77656c46942eb123cd6cff8b471a526468d2dd
PiperOrigin-RevId: 242007583
|
|
PiperOrigin-RevId: 241867632
Change-Id: I29459f2758ac4835882b491ff25c6aca9a37d41d
|
|
This will save copies when preemption is not caused by a CPU migration.
PiperOrigin-RevId: 241844399
Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
|
|
Dirent.exists() is called in Create to check whether a child with the given
name already exists.
Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu
while calling d.Inode.Lookup. During this existence check, a racing Rename()
can acquire d.mu and create a new child of the dirent with the same name.
(Note that the source and destination of the rename must be in the same
directory, otherwise renameMu will be taken preventing the race.) In this
case, d.exists() can return false, even though a child with the same name
actually does exist.
This CL changes d.exists() so that it does not release d.mu while walking, thus
preventing the race with Rename.
It also adds comments noting that lockForRename may not take renameMu if the
source and destination are in the same directory, as this is a bit surprising
(at least it was to me).
PiperOrigin-RevId: 241842579
Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
|
|
If there are thousands of threads, ThreadGroupsAppend becomes very
expensive as it must iterate over all Tasks to find the ThreadGroup
leaders.
Reduce the cost by maintaining a map of ThreadGroups which can be used
to grab them all directly.
The one somewhat visible change is to convert PID namespace init
children zapping to a group-directed SIGKILL, as Linux did in
82058d668465 "signal: Use group_send_sig_info to kill all processes in a
pid namespace".
In a benchmark that creates N threads which sleep for two minutes, we
see approximately this much CPU time in ThreadGroupsAppend:
Before:
1 thread: 0ms
1024 threads: 30ms - 9130ms
4096 threads: 50ms - 2000ms
8192 threads: 18160ms
16384 threads: 17210ms
After:
1 thread: 0ms
1024 threads: 0ms
4096 threads: 0ms
8192 threads: 0ms
16384 threads: 0ms
The profiling is actually extremely noisy (likely due to cache effects),
as some runs show almost no samples at 1024, 4096 threads, but obviously
this does not scale to lots of threads.
PiperOrigin-RevId: 241828039
Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
|
|
The previous implementation revolved around runes instead of bytes, which caused
weird behavior when converting between the two. For example, peekRune would read
the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an
alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for
?. However, peekRune also returned the length of the byte (1). When calling
utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte
character ?.
tl;dr: I apparently didn't understand runes when I wrote this.
PiperOrigin-RevId: 241789081
Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
|
|
Also makes the safemem reading and writing inline, as it makes it easier to see
what locks are held.
PiperOrigin-RevId: 241775201
Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
|
|
Added syscall annotations for unimplemented syscalls for later generation into
reference docs. Annotations are of the form:
@Syscall(<name>, <key:value>, ...)
Supported args and values are:
- arg: A syscall option. This entry only applies to the syscall when given this
option.
- support: Indicates support level
- UNIMPLEMENTED: Unimplemented (implies returns:ENOSYS)
- PARTIAL: Partial support. Details should be provided in note.
- FULL: Full support
- returns: Indicates a known return value. Values are
syscall errors. This is treated as a string so you can use something
like "returns:EPERM or ENOSYS".
- issue: A Github issue number.
- note: A note
Example:
// @Syscall(mmap, arg:MAP_PRIVATE, support:FULL, note:Private memory fully supported)
// @Syscall(mmap, arg:MAP_SHARED, support:UNIMPLEMENTED, issue:123, note:Shared memory not supported)
// @Syscall(setxattr, returns:ENOTSUP, note:Requires file system support)
Annotations should be placed as close to their implementation as possible
(preferrably as part of a supporting function's Godoc) and should be updated as
syscall support changes.
PiperOrigin-RevId: 241697482
Change-Id: I7a846135db124e1271dc5057d788cba82ca312d4
|
|
$ docker run --rm --runtime=runsc -it --cap-add=SYS_PTRACE debian bash -c "apt-get update && apt-get install strace && strace ls"
...
Setting up strace (4.15-2) ...
execve("/bin/ls", ["ls"], [/* 6 vars */]) = 0
brk(NULL) = 0x5646d8c1e000
uname({sysname="Linux", nodename="114ef93d2db3", ...}) = 0
...
PiperOrigin-RevId: 241643321
Change-Id: Ie4bce27a7fb147eef07bbae5895c6ef3f529e177
|
|
bazel test test/syscalls:raw_socket_ipv4_test_{native,runsc_ptrace,runsc_kvm}
PiperOrigin-RevId: 241640049
Change-Id: Iac4dbdd7fd1827399a472059ac7d85fb6b506577
|
|
Also remove comments in InodeOperations that required that implementation of
some Create* operations ensure that the name does not already exist, since
these checks are all centralized in the Dirent.
PiperOrigin-RevId: 241637335
Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
|
|
PiperOrigin-RevId: 241637164
Change-Id: I65476a739cf38f1818dc47f6ce60638dec8b77a8
|
|
PiperOrigin-RevId: 241630409
Change-Id: Ie0df5f5a2f20c2d32e615f16e2ba43c88f963181
|
|
Current gvisor doesn't give devices a right major and minor number.
When testing golang supporting of gvisor, I run the test case below:
```
$ docker run -ti --runtime runsc golang:1.12.1 bash -c "cd /usr/local/go/src && ./run.bash "
```
And it reports some errors, one of them is:
"--- FAIL: TestDevices (0.00s)
--- FAIL: TestDevices//dev/null_1:3 (0.00s)
dev_linux_test.go:45: for /dev/null Major(0x0) == 0, want 1
dev_linux_test.go:48: for /dev/null Minor(0x0) == 0, want 3
dev_linux_test.go:51: for /dev/null Mkdev(1, 3) == 0x103, want 0x0
--- FAIL: TestDevices//dev/zero_1:5 (0.00s)
dev_linux_test.go:45: for /dev/zero Major(0x0) == 0, want 1
dev_linux_test.go:48: for /dev/zero Minor(0x0) == 0, want 5
dev_linux_test.go:51: for /dev/zero Mkdev(1, 5) == 0x105, want 0x0
--- FAIL: TestDevices//dev/random_1:8 (0.00s)
dev_linux_test.go:45: for /dev/random Major(0x0) == 0, want 1
dev_linux_test.go:48: for /dev/random Minor(0x0) == 0, want 8
dev_linux_test.go:51: for /dev/random Mkdev(1, 8) == 0x108, want 0x0
--- FAIL: TestDevices//dev/full_1:7 (0.00s)
dev_linux_test.go:45: for /dev/full Major(0x0) == 0, want 1
dev_linux_test.go:48: for /dev/full Minor(0x0) == 0, want 7
dev_linux_test.go:51: for /dev/full Mkdev(1, 7) == 0x107, want 0x0
--- FAIL: TestDevices//dev/urandom_1:9 (0.00s)
dev_linux_test.go:45: for /dev/urandom Major(0x0) == 0, want 1
dev_linux_test.go:48: for /dev/urandom Minor(0x0) == 0, want 9
dev_linux_test.go:51: for /dev/urandom Mkdev(1, 9) == 0x109, want 0x0
"
So I think we'd better assign to them correct major/minor numbers following linux spec.
Signed-off-by: Wei Zhang <zhangwei198900@gmail.com>
Change-Id: I4521ee7884b4e214fd3a261929e3b6dac537ada9
PiperOrigin-RevId: 241609021
|
|
PiperOrigin-RevId: 241567897
Change-Id: I580eac04f52bb15f4aab7df9822c4aa92e743021
|
|
Having raw socket code together will make it easier to add support for other raw
network protocols. Currently, only ICMP uses the raw endpoint. However, adding
support for other protocols such as UDP shouldn't be much more difficult than
adding a few switch cases.
PiperOrigin-RevId: 241564875
Change-Id: I77e03adafe4ce0fd29ba2d5dfdc547d2ae8f25bf
|
|
PiperOrigin-RevId: 241434161
Change-Id: I9ec734e50cef5b39203e8bf37de2d91d24943f1e
|
|
PiperOrigin-RevId: 241421671
Change-Id: Ic0cebfe3efd458dc42c49f7f812c13318705199a
|
|
We weren't saving simple devices' last allocated inode numbers, which
caused inode number reuse across S/R.
PiperOrigin-RevId: 241414245
Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
|
|
This reveals a bug in the tests that require CAP_SET{UID,GID}: After the
child process enters the new user namespace, it ceases to have the
relevant capability in the parent user namespace, so the privileged
write must be done by the parent process. Change tests accordingly.
PiperOrigin-RevId: 241412765
Change-Id: I587c1f24aa6f2180fb2e5e5c0162691ba5bac1bc
|
|
'ls' will hang if there is any FIFO in this path. So
return EPERM if unsupported file occurs and add NONBLOCK flag
when opening file to avoid blocking on FIFO read.
Signed-off-by: Liu Hua <sdu.liu@huawei.com>
Change-Id: I8b9a2a48322118d8ad531dd226395438123eb047
PiperOrigin-RevId: 241406726
|
|
PiperOrigin-RevId: 241403847
Change-Id: I4631ca05734142da6e80cdfa1a1d63ed68aa05cc
|
|
ilist:generic_list works faster (cl/240185278) and
the code looks cleaner without type casting.
PiperOrigin-RevId: 241381175
Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
|
|
PiperOrigin-RevId: 241350917
Change-Id: Ieacaa9ce2e41e22f1bae8900170879f549606782
|
|
- Make the body of InForkedProcess async-signal-safe.
- Pass the correct path to open().
PiperOrigin-RevId: 241348774
Change-Id: I753dfa36e4fb05521e659c173e3b7db0c7fc159b
|
|
The linux packet socket can handle GSO packets, so we can segment packets to
64K instead of the MTU which is usually 1500.
Here are numbers for the nginx-1m test:
runsc: 579330.01 [Kbytes/sec] received
runsc-gso: 1794121.66 [Kbytes/sec] received
runc: 2122139.06 [Kbytes/sec] received
and for tcp_benchmark:
$ tcp_benchmark --duration 15 --ideal
[ 4] 0.0-15.0 sec 86647 MBytes 48456 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal
[ 4] 0.0-15.0 sec 2173 MBytes 1214 Mbits/sec
$ tcp_benchmark --client --duration 15 --ideal --gso 65536
[ 4] 0.0-15.0 sec 19357 MBytes 10825 Mbits/sec
PiperOrigin-RevId: 241072403
Change-Id: I20b03063a1a6649362b43609cbbc9b59be06e6d5
|
|
PiperOrigin-RevId: 241072126
Change-Id: Ib4d9f58f550732ac4c5153d3cf159a5b1a9749da
|
|
PiperOrigin-RevId: 241056805
Change-Id: I13ea8f5dbfb01ca02a3b0ab887b8c3bdf4d556a6
|