Age | Commit message (Collapse) | Author |
|
Add global flags -profile-{block,cpu,heap,mutex} and -trace which
enable collection of the specified profile for the entire duration of a
container execution. This provides a way to definitively start profiling
before that application starts, rather than attempting to race with an
out-of-band `runsc debug`.
Note that only the main boot process is profiled.
This exposed a bug in Task.traceExecEvent: a crash when tracing and
-race are enabled. traceExecEvent is called off of the task goroutine,
but uses the Task as a context, which is a violation of the Task
contract. Switching to the AsyncContext fixes the issue.
Fixes #220
|
|
...through the loopback interface, only.
This change only supports sending on packet sockets through the loopback
interface as the loopback interface is the only interface used in packet
socket syscall tests - the other link endpoints are not excercised with
the existing test infrastructure.
Support for sending on packet sockets through the other interfaces will
be added as needed.
BUG: https://fxbug.dev/81592
PiperOrigin-RevId: 394368899
|
|
...to match Linux behaviour.
We can see evidence of Linux representing loopback as an ethernet-based
device below:
```
# EUI-48 based MAC addresses.
$ ip link show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# tcpdump showing ethernet frames when sniffing loopback and logging the
# link-type as EN10MB (Ethernet).
$ sudo tcpdump -i lo -e -c 2 -n
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
03:09:05.002034 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.9557 > 127.0.0.1.36828: Flags [.], ack 3562800815, win 15342, options [nop,nop,TS val 843174495 ecr 843159493], length 0
03:09:05.002094 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.36828 > 127.0.0.1.9557: Flags [.], ack 1, win 6160, options [nop,nop,TS val 843174496 ecr 843159493], length 0
2 packets captured
116 packets received by filter
0 packets dropped by kernel
```
Wireshark shows a similar result as the tcpdump example above.
Linux's loopback setup: https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/drivers/net/loopback.c#L162
PiperOrigin-RevId: 391836719
|
|
The gofer session is killed when a gofer backed volume is unmounted. The
gofer monitor catches the disconnect and kills the container. This changes
the gofer monitor to only care about the rootfs connections, which cannot
be unmounted.
Fixes #6259
PiperOrigin-RevId: 385929039
|
|
It was confusing to find functions relating to root and non-root
containers. Replace "non-root" and "subcontainer" and make naming
consistent in Sandbox and controller.
PiperOrigin-RevId: 384512518
|
|
Set stdio ownership based on the container's user to ensure the
user can open/read/write to/from stdios.
1. stdios in the host are changed to have the owner be the same
uid/gid of the process running the sandbox. This ensures that the
sandbox has full control over it.
2. stdios owner owner inside the sandbox is changed to match the
container's user to give access inside the container and make it
behave the same as runc.
Fixes #6180
PiperOrigin-RevId: 384347009
|
|
PiperOrigin-RevId: 383705129
|
|
...and pass it explicitly.
This reverts commit b63e61828d0652ad1769db342c17a3529d2d24ed.
PiperOrigin-RevId: 380039167
|
|
PiperOrigin-RevId: 378726430
|
|
Fixes #214
PiperOrigin-RevId: 378680466
|
|
Avoids a race condition at kernel initialization.
Updates #6057.
PiperOrigin-RevId: 377357723
|
|
...except in tests.
Note this replaces some uses of a cryptographic RNG with a plain RNG.
PiperOrigin-RevId: 376070666
|
|
PiperOrigin-RevId: 375843579
|
|
Remove useless conversions. Avoid unhandled errors.
PiperOrigin-RevId: 375834275
|
|
PiperOrigin-RevId: 372608247
|
|
This patch is to solve problem that vCPU timer mess up when
adding vCPU dynamically on ARM64, for detailed information
please refer to:
https://github.com/google/gvisor/issues/5739
There is no influence on x86 and here are main changes for
ARM64:
1. create maxVCPUs number of vCPU in machine initialization
2. we want to sync gvisor vCPU number with host CPU number,
so use smaller number between runtime.NumCPU and
KVM_CAP_MAX_VCPUS to be maxVCPUS
3. put unused vCPUs into architecture-specific map initialvCPUs
4. When machine need to bind a new vCPU with tid, rather
than creating new one, it would pick a vCPU from map initalvCPUs
5. change the setSystemTime function. When vCPU number increasing,
the time cost for function setTSC(use syscall to set cntvoff) is
liner growth from around 300 ns to 100000 ns, and this leads to
the function setSystemTimeLegacy can not get correct offset
value.
6. initializing StdioFDs and goferFD before a platform to avoid
StdioFDs confects with vCPU fds
Signed-off-by: howard zhang <howard.zhang@arm.com>
|
|
Weirdness metric contains fields to track the number of clock fallback,
partial result and vsyscalls. This metric will avoid the overhead of
having three different metrics (fallbackMetric, partialResultMetric,
vsyscallCount).
PiperOrigin-RevId: 369970218
|
|
In the previous spot, there was a roughly 50% chance that leak checking would
actually run. Move it to the waitContainer() call on the root container, where
it is guaranteed to run before the sandbox process is terminated. Add it to
runsc/cli/main.go as well for good measure, in case the sandbox exit path does
not involve waitContainer().
PiperOrigin-RevId: 369329796
|
|
Add a coverage-report flag that will cause the sandbox to generate a coverage
report (with suffix .cov) in the debug log directory upon exiting. For the
report to be generated, runsc must have been built with the following Bazel
flags: `--collect_code_coverage --instrumentation_filter=...`.
With coverage reports, we should be able to aggregate results across all tests
to surface code coverage statistics for the project as a whole.
The report is simply a text file with each line representing a covered block
as `file:start_line.start_col,end_line.end_col`. Note that this is similar to
the format of coverage reports generated with `go test -coverprofile`,
although we omit the count and number of statements, which are not useful for
us.
Some simple ways of getting coverage reports:
bazel test <some_test> --collect_code_coverage \
--instrumentation_filter=//pkg/...
bazel build //runsc --collect_code_coverage \
--instrumentation_filter=//pkg/...
runsc -coverage-report=dir/ <other_flags> do ...
PiperOrigin-RevId: 368952911
|
|
A skeleton implementation of cgroupfs. It supports trivial cpu and
memory controllers with no support for hierarchies.
PiperOrigin-RevId: 366561126
|
|
containerd usually configures both /dev and /dev/shm as tmpfs mounts, e.g.:
```
"mounts": [
...
{
"destination": "/dev",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/tmpfs",
"options": [
"nosuid",
"strictatime",
"mode=755",
"size=65536k"
]
},
...
{
"destination": "/dev/shm",
"type": "tmpfs",
"source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/shm",
"options": [
"nosuid",
"noexec",
"nodev",
"mode=1777",
"size=67108864"
]
},
...
```
(This is mostly consistent with how Linux is usually configured, except that
/dev is conventionally devtmpfs, not regular tmpfs. runc/libcontainer
implements OCI-runtime-spec-undocumented behavior to create
/dev/{ptmx,fd,stdin,stdout,stderr} in non-bind /dev mounts. runsc silently
switches /dev to devtmpfs. In VFS1, this is necessary to get device files like
/dev/null at all, since VFS1 doesn't support real device special files, only
what is hardcoded in devfs. VFS2 does support device special files, but using
devtmpfs is the easiest way to get pre-created files in /dev.)
runsc ignores many /dev submounts in the spec, including /dev/shm. In VFS1,
this appears to be to avoid introducing a submount overlay for /dev, and is
mostly fine since the typical mode for the /dev/shm mount is ~consistent with
the mode of the /dev/shm directory provided by devfs (modulo the sticky bit).
In VFS2, this is vestigial (VFS2 does not use submount overlays), and devtmpfs'
/dev/shm mode is correct for the mount point but not the mount. So turn off
this behavior for VFS2.
After this change:
```
$ docker run --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwt 2 root root 40 Mar 18 00:16 .
drwxr-xr-x 5 root root 360 Mar 18 00:16 ..
$ docker run --runtime=runsc --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwx 1 root root 0 Mar 18 00:16 .
dr-xr-xr-x 1 root root 0 Mar 18 00:16 ..
$ docker run --runtime=runsc-vfs2 --rm -it ubuntu:focal ls -lah /dev/shm
total 0
drwxrwxrwt 2 root root 40 Mar 18 00:16 .
drwxr-xr-x 5 root root 320 Mar 18 00:16 ..
```
Fixes #5687
PiperOrigin-RevId: 363699385
|
|
Previously, loader.signalProcess was inconsitently using both root and
container's PID namespace to find the process. It used root namespace
for the exec'd process and container's PID namespace for other processes.
This fixes the code to use the root PID namespace across the board, which
is the same PID reported in `runsc ps` (or soon will after
https://github.com/google/gvisor/pull/5519).
PiperOrigin-RevId: 358836297
|
|
Operations are now shut down automatically by the main Stop
command, and it is not necessary to call Stop during Destroy.
Fixes #5454
PiperOrigin-RevId: 357295930
|
|
Because we lack gVisor-internal cgroups, we take the CPU usage of the entire pod
and divide it proportionally according to sentry-internal usage stats.
This fixes `kubectl top pods`, which gets a pod's CPU usage by summing the usage
of its containers.
Addresses #172.
PiperOrigin-RevId: 355229833
|
|
These are primarily simplification and lint mistakes. However, minor
fixes are also included and tests added where appropriate.
PiperOrigin-RevId: 351425971
|
|
Closes #5226
PiperOrigin-RevId: 351259576
|
|
This allows for a model of profiling when you can start collection, and
it will terminate when the sandbox terminates. Without this synchronous
call, it is effectively impossible to collect length blocking and mutex
profiles.
PiperOrigin-RevId: 349483418
|
|
Closes #5048
PiperOrigin-RevId: 348050472
|
|
PiperOrigin-RevId: 346101076
|
|
Fixes #2714
PiperOrigin-RevId: 342950412
|
|
When OOM score adjustment needs to be set, all the containers need to be
loaded to find all containers that belong to the sandbox. However, each
load signals the container to ensure it is still alive. OOM score
adjustment is set during creation and deletion of every container, generating
a flood of signals to all containers. The fix removes the signal check
when it's not needed.
There is also a race fetching OOM score adjustment value from the parent when
the sandbox exits at the same time (the time it took to signal containers above
made this window quite large). The fix is to store the original value
in the sandbox state file and use it when the value needs to be restored.
Also add more logging and made the existing ones more consistent to help with
debugging.
PiperOrigin-RevId: 340940799
|
|
PiperOrigin-RevId: 339721152
|
|
Also refactor the template and CheckedObject interface to make this cleaner.
Updates #1486.
PiperOrigin-RevId: 339577120
|
|
Our current reference leak checker uses finalizers to verify whether an object
has reached zero references before it is garbage collected. There are multiple
problems with this mechanism, so a rewrite is in order.
With finalizers, there is no way to guarantee that a finalizer will run before
the program exits. When an unreachable object with a finalizer is garbage
collected, its finalizer will be added to a queue and run asynchronously. The
best we can do is run garbage collection upon sandbox exit to make sure that
all finalizers are enqueued.
Furthermore, if there is a chain of finalized objects, e.g. A points to B
points to C, garbage collection needs to run multiple times before all of the
finalizers are enqueued. The first GC run will register the finalizer for A but
not free it. It takes another GC run to free A, at which point B's finalizer
can be registered. As a result, we need to run GC as many times as the length
of the longest such chain to have a somewhat reliable leak checker.
Finally, a cyclical chain of structs pointing to one another will never be
garbage collected if a finalizer is set. This is a well-known issue with Go
finalizers (https://github.com/golang/go/issues/7358). Using leak checking on
filesystem objects that produce cycles will not work and even result in memory
leaks.
The new leak checker stores reference counted objects in a global map when
leak check is enabled and removes them once they are destroyed. At sandbox
exit, any remaining objects in the map are considered as leaked. This provides
a deterministic way of detecting leaks without relying on the complexities of
finalizers and garbage collection.
This approach has several benefits over the former, including:
- Always detects leaks of objects that should be destroyed very close to
sandbox exit. The old checker very rarely detected these leaks, because it
relied on garbage collection to be run in a short window of time.
- Panics if we forgot to enable leak check on a ref-counted object (we will try
to remove it from the map when it is destroyed, but it will never have been
added).
- Can store extra logging information in the map values without adding to the
size of the ref count struct itself. With the size of just an int64, the ref
count object remains compact, meaning frequent operations like IncRef/DecRef
are more cache-efficient.
- Can aggregate leak results in a single report after the sandbox exits.
Instead of having warnings littered in the log, which were
non-deterministically triggered by garbage collection, we can print all
warning messages at once. Note that this could also be a limitation--the
sandbox must exit properly for leaks to be detected.
Some basic benchmarking indicates that this change does not significantly
affect performance when leak checking is enabled, which is understandable
since registering/unregistering is only done once for each filesystem object.
Updates #1486.
PiperOrigin-RevId: 338685972
|
|
Subcontainers are only configured when the container starts, however because
start doesn't load the spec, flag annotations that may override flags were
not getting applied to the configuration.
Updates #3494
PiperOrigin-RevId: 338610953
|
|
This fixes reference leaks related to accidentally forgetting to DecRef()
after calling one or the other.
PiperOrigin-RevId: 336918922
|
|
PiperOrigin-RevId: 336694658
|
|
When all container tasks finish, they release the mount which in turn
will close the 9P session to the gofer. The gofer exits when the connection
closes, triggering the gofer monitor. The gofer monitor will _think_ that
the gofer died prematurely and destroy the container. Then when the caller
attempts to wait for the container, e.g. to get the exit code, wait fails
saying the container doesn't exist.
Gofer monitor now just SIGKILLs the container, and let the normal teardown
process to happen, which will evetually destroy the container at the right
time. Also, fixed an issue with exec racing with container's init process
exiting.
Closes #1487
PiperOrigin-RevId: 335537350
|
|
PiperOrigin-RevId: 335516972
|
|
With cgroups configured NumCPU is correct, however GOMAXPROCS is still derived from total host core count and ignores cgroup restrictions. This can lead to different and undesired behavior across different hosts.
For example, the total number of threads in the guest process will be larger on machines with more cores.
This change configures the go runtime for the sandbox to only use the number of threads consistent with its restrictions.
|
|
Network or transport protocols may want to reach the stack. Support this
by letting the stack create the protocol instances so it can pass a
reference to itself at protocol creation time.
Note, protocols do not yet use the stack in this CL but later CLs will
make use of the stack from protocols.
PiperOrigin-RevId: 334260210
|
|
OCI configuration includes support for specifying seccomp filters. In runc,
these filter configurations are converted into seccomp BPF programs and loaded
into the kernel via libseccomp. runsc needs to be a static binary so, for
runsc, we cannot rely on a C library and need to implement the functionality
in Go.
The generator added here implements basic support for taking OCI seccomp
configuration and converting it into a seccomp BPF program with the same
behavior as a program generated by libseccomp.
- New conditional operations were added to pkg/seccomp to support operations
available in OCI.
- AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect
that syscalls matching the conditionals result in the provided action not
simply SCMP_RET_ALLOW.
- BuildProgram in pkg/seccomp no longer panics if provided an empty list of
rules. It now builds a program with the architecture sanity check only.
- ProgramBuilder now allows adding labels that are unused. However, backwards
jumps are still not permitted.
Fixes #510
PiperOrigin-RevId: 331938697
|
|
The existing implementation for TransportProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.
This change introduces marker interfaces for transport protocol options
that may be set or queried which transport protocol option types
implement to ensure that invalid types are caught at compile time.
Different interfaces are used to allow the compiler to enforce read-only
or set-only socket options.
RELNOTES: n/a
PiperOrigin-RevId: 330559811
|
|
VFS1 and VFS2 host FDs have different dupping behavior,
making error prone to code for both. Change the contract
so that FDs are released as they are used, so the caller
can simple defer a block that closes all remaining files.
This also addresses handling of partial failures.
With this fix, more VFS2 tests can be enabled.
Updates #1487
PiperOrigin-RevId: 330112266
|
|
Currently the stdio FDs are not dupped and will be closed
unexpectedly in VFS2 when starting a child container. This
patch fixes this issue.
Fixes: #3821
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
|
|
The existing implementation for NetworkProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.
This change introduces marker interfaces for network protocol options
that may be set or queried which network protocol option types implement
to ensure that invalid types are caught at compile time. Different
interfaces are used to allow the compiler to enforce read-only or
set-only socket options.
PiperOrigin-RevId: 328980359
|
|
Updates #3494
PiperOrigin-RevId: 327548511
|
|
Running garbage collection enqueues all finalizers, which are used by the
refs/refs_vfs2 packages to detect reference leaks. Note that even with GC,
there is no guarantee that all finalizers will be run before the program exits.
This is a best effort attempt to activate leak checks as much as possible.
Updates #3545.
PiperOrigin-RevId: 325834438
|
|
Also removes `--profile-goroutine` because it's equivalent
to `debug --stacks`.
PiperOrigin-RevId: 325061502
|
|
The loader dup's stdio FD into stable FD's starting at a fixed
number. During tests, it's possible that the target FD is already
in use. Added check to error early so it's easier to debug failures.
Also bumped up the starting FD number to prevent collisions.
PiperOrigin-RevId: 324917299
|