Age | Commit message (Collapse) | Author |
|
This patch is to solve problem that vCPU timer mess up when
adding vCPU dynamically on ARM64, for detailed information
please refer to:
https://github.com/google/gvisor/issues/5739
There is no influence on x86 and here are main changes for
ARM64:
1. create maxVCPUs number of vCPU in machine initialization
2. we want to sync gvisor vCPU number with host CPU number,
so use smaller number between runtime.NumCPU and
KVM_CAP_MAX_VCPUS to be maxVCPUS
3. put unused vCPUs into architecture-specific map initialvCPUs
4. When machine need to bind a new vCPU with tid, rather
than creating new one, it would pick a vCPU from map initalvCPUs
5. change the setSystemTime function. When vCPU number increasing,
the time cost for function setTSC(use syscall to set cntvoff) is
liner growth from around 300 ns to 100000 ns, and this leads to
the function setSystemTimeLegacy can not get correct offset
value.
6. initializing StdioFDs and goferFD before a platform to avoid
StdioFDs confects with vCPU fds
Signed-off-by: howard zhang <howard.zhang@arm.com>
|
|
--file-access-mounts flag is similar to --file-access, but controls
non-root mounts that were previously mounted in shared mode only.
This gives more flexibility to control how mounts are shared within
a container.
PiperOrigin-RevId: 364669882
|
|
Also adds support for clearing the setuid bit when appropriate (writing,
truncating, changing size, changing UID, or changing GID).
VFS2 only.
PiperOrigin-RevId: 364661835
|
|
These host calls are needed for Verity fs to generate/verify hashes.
PiperOrigin-RevId: 364598180
|
|
containerd usually configures both /dev and /dev/shm as tmpfs mounts, e.g.:
```
"mounts": [
...
{
"destination": "/dev",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
...
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=67108864"
]
},
...
```
(This is mostly consistent with how Linux is usually configured, except that
/dev is conventionally devtmpfs, not regular tmpfs. runc/libcontainer
implements OCI-runtime-spec-undocumented behavior to create
/dev/{ptmx,fd,stdin,stdout,stderr} in non-bind /dev mounts. runsc silently
switches /dev to devtmpfs. In VFS1, this is necessary to get device files like
/dev/null at all, since VFS1 doesn't support real device special files, only
what is hardcoded in devfs. VFS2 does support device special files, but using
devtmpfs is the easiest way to get pre-created files in /dev.)
runsc ignores many /dev submounts in the spec, including /dev/shm. In VFS1,
this appears to be to avoid introducing a submount overlay for /dev, and is
mostly fine since the typical mode for the /dev/shm mount is ~consistent with
the mode of the /dev/shm directory provided by devfs (modulo the sticky bit).
In VFS2, this is vestigial (VFS2 does not use submount overlays), and devtmpfs'
/dev/shm mode is correct for the mount point but not the mount. So turn off
this behavior for VFS2.
After this change:
```
$ docker run --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwt 2 root root 40 Mar 18 00:16 .
drwxr-xr-x 5 root root 360 Mar 18 00:16 ..
$ docker run --runtime=runsc --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwx 1 root root 0 Mar 18 00:16 .
dr-xr-xr-x 1 root root 0 Mar 18 00:16 ..
$ docker run --runtime=runsc-vfs2 --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwt 2 root root 40 Mar 18 00:16 .
drwxr-xr-x 5 root root 320 Mar 18 00:16 ..
```
Fixes #5687
PiperOrigin-RevId: 363699385
|
|
PiperOrigin-RevId: 362406813
|
|
PiperOrigin-RevId: 362360425
|
|
PiperOrigin-RevId: 361962416
|
|
panic: interface conversion: interface {} is syscall.WaitStatus, not unix.WaitStatus
goroutine 1 [running]:
main.runTestCaseNative(0xc0001fc000, 0xe3, 0xc000119b60, 0x1, 0x1, 0x0, 0x0)
test/runner/runner.go:185 +0xa94
main.main()
test/runner/runner.go:118 +0x745
PiperOrigin-RevId: 361957796
|
|
PiperOrigin-RevId: 361689477
|
|
The syscall package has been deprecated in favor of golang.org/x/sys.
Note that syscall is still used in some places because the following don't seem
to have an equivalent in unix package:
- syscall.SysProcIDMap
- syscall.Credential
Updates #214
PiperOrigin-RevId: 361381490
|
|
Add reverse operation to mitigate that just enables
all CPUs.
PiperOrigin-RevId: 360511215
|
|
PiperOrigin-RevId: 359334029
|
|
Syzkaller hosts contains many audit messages that runsc tries
to call the clock_nanosleep syscall.
PiperOrigin-RevId: 359331413
|
|
`runsc ps` currently return pid for a task's immediate pid namespace,
which is confusing when there're multiple pid namespaces. We should
return only pids in the root namespace.
Before:
```
1000 1 0 0 ? 02:24 250ms chrome
1000 1 0 0 ? 02:24 40ms dumb-init
1000 1 0 0 ? 02:24 240ms chrome
1000 2 1 0 ? 02:24 2.78s node
```
After:
```
UID PID PPID C TTY STIME TIME CMD
1000 1 0 0 ? 12:35 0s dumb-init
1000 2 1 7 ? 12:35 240ms node
1000 13 2 21 ? 12:35 2.33s chrome
1000 27 13 3 ? 12:35 260ms chrome
```
Signed-off-by: Daniel Dao <dqminh@cloudflare.com>
|
|
Only detect and mitigate on mds for the mitigate command.
PiperOrigin-RevId: 358924466
|
|
Updates #3481
Closes #5430
PiperOrigin-RevId: 358923208
|
|
Previously, loader.signalProcess was inconsitently using both root and
container's PID namespace to find the process. It used root namespace
for the exec'd process and container's PID namespace for other processes.
This fixes the code to use the root PID namespace across the board, which
is the same PID reported in `runsc ps` (or soon will after
https://github.com/google/gvisor/pull/5519).
PiperOrigin-RevId: 358836297
|
|
Operations are now shut down automatically by the main Stop
command, and it is not necessary to call Stop during Destroy.
Fixes #5454
PiperOrigin-RevId: 357295930
|
|
rt_sigaction may be called by Go runtime when trying to panic:
https://cs.opensource.google/go/go/+/master:src/runtime/signal_unix.go;drc=ed3e4afa12d655a0c5606bcf3dd4e1cdadcb1476;bpv=1;bpt=1;l=780?q=rt_sigaction&ss=go
Updates #5038
PiperOrigin-RevId: 357013186
|
|
PiperOrigin-RevId: 356772367
|
|
Panic seen at some code path like control.ExecAsync where
ctx does not have a Task.
Reported-by: syzbot+55ce727161cf94a7b7d6@syzkaller.appspotmail.com
PiperOrigin-RevId: 355960596
|
|
Some versions of the Go runtime call getcpu(), so add it for compatibility. The
hostcpu package already uses getcpu() on arm64.
PiperOrigin-RevId: 355717757
|
|
PiperOrigin-RevId: 355242055
|
|
Because we lack gVisor-internal cgroups, we take the CPU usage of the entire pod
and divide it proportionally according to sentry-internal usage stats.
This fixes `kubectl top pods`, which gets a pod's CPU usage by summing the usage
of its containers.
Addresses #172.
PiperOrigin-RevId: 355229833
|
|
Updates #1663
PiperOrigin-RevId: 355077816
|
|
PiperOrigin-RevId: 354568091
|
|
PiperOrigin-RevId: 354170726
|
|
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
|
|
in nested container, we see paths from host in /proc/self/cgroup, so we
need to re-process that path to get a relative path to be used inside
the container.
Without it, runsc generates ugly paths that may trip other cgroup
watchers that expect clean paths. An example of ugly path is:
```
/sys/fs/cgroup/memory/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93/cgroupPath
```
Notice duplication of `docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93`
`/proc/1/cgroup` looks like
```
12:perf_event:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
11:blkio:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
10:freezer:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
9:hugetlb:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
8:devices:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
7:rdma:/
6:pids:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
5:cpuset:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
4:cpu,cpuacct:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
3:memory:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
2:net_cls,net_prio:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
1:name=systemd:/docker/e383892b29290ae8005d535f2dadc4a583bb354d5bb1ba8c10bf900d92c4db93
0::/system.slice/containerd.service
```
This is not necessary when the parent container was created with cgroup
namespace, but that setup is not very common right now.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
|
|
PiperOrigin-RevId: 353274135
|
|
abi package is to be used by the Sentry to implement the Linux ABI.
Code dealing with the host should use x/sys/unix.
PiperOrigin-RevId: 353272679
|
|
Updates #5226
PiperOrigin-RevId: 353262133
|
|
Previously fsgofer was skipping chown call if the uid and gid
were the same as the current user/group. However, when setgid
is set, the group may not be the same as the caller. Instead,
compare the actual uid/gid of the file after it has been
created and change ownership only if needed.
Updates #180
PiperOrigin-RevId: 353118733
|
|
Whether the variable was found is already returned by syscall.Getenv.
os.Getenv drops this value while os.Lookupenv passes it along.
PiperOrigin-RevId: 351674032
|
|
These are primarily simplification and lint mistakes. However, minor
fixes are also included and tests added where appropriate.
PiperOrigin-RevId: 351425971
|
|
Closes #5226
PiperOrigin-RevId: 351259576
|
|
This includes minor fix-ups:
* Handle SIGTERM in runsc debug, to exit gracefully.
* Fix cmd.debug.go opening all profiles as RDONLY.
* Fix the test name in fio_test.go, and encode the block size in the test.
PiperOrigin-RevId: 350205718
|
|
PiperOrigin-RevId: 350159657
|
|
Closes #5052
PiperOrigin-RevId: 349579814
|
|
This allows for a model of profiling when you can start collection, and
it will terminate when the sandbox terminates. Without this synchronous
call, it is effectively impossible to collect length blocking and mutex
profiles.
PiperOrigin-RevId: 349483418
|
|
PiperOrigin-RevId: 348106699
|
|
PiperOrigin-RevId: 348055514
|
|
This allows to find all containers inside a sandbox more efficiently.
This operation is required every time a container starts and stops,
and previously required loading *all* container state files to check
whether the container belonged to the sandbox.
Apert from being inneficient, it has caused problems when state files
are stale or corrupt, causing inavalability to create any container.
Also adjust commands `list` and `debug` to skip over files that fail
to load.
Resolves #5052
PiperOrigin-RevId: 348050637
|
|
Closes #5048
PiperOrigin-RevId: 348050472
|
|
- Skip chown call in case owner change is not needed
- Skip filepath.Clean() calls when joining paths
- Pass unix.Stat_t by value to reduce runtime.duffcopy calls.
This change allows for better inlining in localFile.walk().
Change Baseline Improvement
BenchmarkWalkOne-6 2912 ns/op 3082 ns/op 5.5%
BenchmarkCreate-6 15915 ns/op 19126 ns/op 16.8%
BenchmarkCreateDiffOwner-6 18795 ns/op 19741 ns/op 4.8%
PiperOrigin-RevId: 347667833
|
|
There are surprisingly few syscall tests that run with hostinet. For example
running the following command only returns two results:
`bazel query test/syscalls:all | grep hostnet`
I think as a result, as our control messages evolved, hostinet was left
behind. Update it to support all control messages netstack supports.
This change also updates sentry's control message parsing logic to make it up to
date with all the control messages we support.
PiperOrigin-RevId: 347508892
|
|
This command takes instruction pointers from stdin and converts them into their
corresponding file names and line/column numbers in the runsc source code. The
inputs are not interpreted as actual addresses, but as synthetic values that are
exposed through /sys/kernel/debug/kcov. One can extract coverage information
from kcov and translate those values into locations in the source code by
running symbolize on the same runsc binary.
This will allow us to generate syzkaller coverage reports.
PiperOrigin-RevId: 347089624
|
|
PiperOrigin-RevId: 347047550
|
|
fdbased endpoint was enabling fragment reassembly on the host AF_PACKET socket
to ensure that fragments are delivered inorder to the right dispatcher. But this
prevents fragments from being delivered to gvisor at all and makes testing of
gvisor's fragment reassembly code impossible.
The potential impact from this is minimal since IP Fragmentation is not really
that prevelant and in cases where we do get fragments we may deliver the
fragment out of order to the TCP layer as multiple network dispatchers may
process the fragments and deliver a reassembled fragment after the next packet
has been delivered to the TCP endpoint. While not desirable I believe the impact
from this is minimal due to low prevalence of fragmentation.
Also removed PktType and Hatype fields when binding the socket as these are not
used when binding. Its just confusing to have them specified.
See: https://man7.org/linux/man-pages/man7/packet.7.html
"Fields used for binding are
sll_family (should be AF_PACKET), sll_protocol, and sll_ifindex."
Fixes #5055
PiperOrigin-RevId: 346919439
|