summaryrefslogtreecommitdiffhomepage
path: root/runsc/boot/loader.go
AgeCommit message (Collapse)Author
2021-09-16runsc: add global profile collection flagsMichael Pratt
Add global flags -profile-{block,cpu,heap,mutex} and -trace which enable collection of the specified profile for the entire duration of a container execution. This provides a way to definitively start profiling before that application starts, rather than attempting to race with an out-of-band `runsc debug`. Note that only the main boot process is profiled. This exposed a bug in Task.traceExecEvent: a crash when tracing and -race are enabled. traceExecEvent is called off of the task goroutine, but uses the Task as a context, which is a violation of the Task contract. Switching to the AsyncContext fixes the issue. Fixes #220
2021-09-01Support sending with packet socketsGhanan Gowripalan
...through the loopback interface, only. This change only supports sending on packet sockets through the loopback interface as the loopback interface is the only interface used in packet socket syscall tests - the other link endpoints are not excercised with the existing test infrastructure. Support for sending on packet sockets through the other interfaces will be added as needed. BUG: https://fxbug.dev/81592 PiperOrigin-RevId: 394368899
2021-08-19Add loopback interface as an ethernet-based deviceGhanan Gowripalan
...to match Linux behaviour. We can see evidence of Linux representing loopback as an ethernet-based device below: ``` # EUI-48 based MAC addresses. $ ip link show lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 # tcpdump showing ethernet frames when sniffing loopback and logging the # link-type as EN10MB (Ethernet). $ sudo tcpdump -i lo -e -c 2 -n tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes 03:09:05.002034 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.9557 > 127.0.0.1.36828: Flags [.], ack 3562800815, win 15342, options [nop,nop,TS val 843174495 ecr 843159493], length 0 03:09:05.002094 00:00:00:00:00:00 > 00:00:00:00:00:00, ethertype IPv4 (0x0800), length 66: 127.0.0.1.36828 > 127.0.0.1.9557: Flags [.], ack 1, win 6160, options [nop,nop,TS val 843174496 ecr 843159493], length 0 2 packets captured 116 packets received by filter 0 packets dropped by kernel ``` Wireshark shows a similar result as the tcpdump example above. Linux's loopback setup: https://github.com/torvalds/linux/blob/5bfc75d92efd494db37f5c4c173d3639d4772966/drivers/net/loopback.c#L162 PiperOrigin-RevId: 391836719
2021-07-20Don't kill container when volume is unmountedFabricio Voznika
The gofer session is killed when a gofer backed volume is unmounted. The gofer monitor catches the disconnect and kills the container. This changes the gofer monitor to only care about the rootfs connections, which cannot be unmounted. Fixes #6259 PiperOrigin-RevId: 385929039
2021-07-13Use consistent naming for subcontainersFabricio Voznika
It was confusing to find functions relating to root and non-root containers. Replace "non-root" and "subcontainer" and make naming consistent in Sandbox and controller. PiperOrigin-RevId: 384512518
2021-07-12Fix stdios ownershipFabricio Voznika
Set stdio ownership based on the container's user to ensure the user can open/read/write to/from stdios. 1. stdios in the host are changed to have the owner be the same uid/gid of the process running the sandbox. This ensures that the sandbox has full control over it. 2. stdios owner owner inside the sandbox is changed to match the container's user to give access inside the container and make it behave the same as runc. Fixes #6180 PiperOrigin-RevId: 384347009
2021-07-08Replace kernel.ExitStatus with linux.WaitStatus.Jamie Liu
PiperOrigin-RevId: 383705129
2021-06-17Move tcpip.Clock impl to TimekeeperTamir Duberstein
...and pass it explicitly. This reverts commit b63e61828d0652ad1769db342c17a3529d2d24ed. PiperOrigin-RevId: 380039167
2021-06-10Set RLimits during `runsc exec`Fabricio Voznika
PiperOrigin-RevId: 378726430
2021-06-10[op] Move SignalInfo to abi/linux package.Ayush Ranjan
Fixes #214 PiperOrigin-RevId: 378680466
2021-06-03Initialize metrics at initTamir Duberstein
Avoids a race condition at kernel initialization. Updates #6057. PiperOrigin-RevId: 377357723
2021-05-26Use the stack RNG everywhereTamir Duberstein
...except in tests. Note this replaces some uses of a cryptographic RNG with a plain RNG. PiperOrigin-RevId: 376070666
2021-05-25Initialize Kernel.Timekeeper before network NSTamir Duberstein
PiperOrigin-RevId: 375843579
2021-05-25Use specific fmt verbs (avoid %v)Tamir Duberstein
Remove useless conversions. Avoid unhandled errors. PiperOrigin-RevId: 375834275
2021-05-07Merge pull request #5758 from zhlhahaha:2125gVisor bot
PiperOrigin-RevId: 372608247
2021-05-07Init all vCPU when initializing machine on ARM64howard zhang
This patch is to solve problem that vCPU timer mess up when adding vCPU dynamically on ARM64, for detailed information please refer to: https://github.com/google/gvisor/issues/5739 There is no influence on x86 and here are main changes for ARM64: 1. create maxVCPUs number of vCPU in machine initialization 2. we want to sync gvisor vCPU number with host CPU number, so use smaller number between runtime.NumCPU and KVM_CAP_MAX_VCPUS to be maxVCPUS 3. put unused vCPUs into architecture-specific map initialvCPUs 4. When machine need to bind a new vCPU with tid, rather than creating new one, it would pick a vCPU from map initalvCPUs 5. change the setSystemTime function. When vCPU number increasing, the time cost for function setTSC(use syscall to set cntvoff) is liner growth from around 300 ns to 100000 ns, and this leads to the function setSystemTimeLegacy can not get correct offset value. 6. initializing StdioFDs and goferFD before a platform to avoid StdioFDs confects with vCPU fds Signed-off-by: howard zhang <howard.zhang@arm.com>
2021-04-22Add weirdness sentry metric.Nayana Bidari
Weirdness metric contains fields to track the number of clock fallback, partial result and vsyscalls. This metric will avoid the overhead of having three different metrics (fallbackMetric, partialResultMetric, vsyscallCount). PiperOrigin-RevId: 369970218
2021-04-19Move runsc reference leak checking to better locations.Dean Deng
In the previous spot, there was a roughly 50% chance that leak checking would actually run. Move it to the waitContainer() call on the root container, where it is guaranteed to run before the sandbox process is terminated. Add it to runsc/cli/main.go as well for good measure, in case the sandbox exit path does not involve waitContainer(). PiperOrigin-RevId: 369329796
2021-04-16Allow runsc to generate coverage reports.Dean Deng
Add a coverage-report flag that will cause the sandbox to generate a coverage report (with suffix .cov) in the debug log directory upon exiting. For the report to be generated, runsc must have been built with the following Bazel flags: `--collect_code_coverage --instrumentation_filter=...`. With coverage reports, we should be able to aggregate results across all tests to surface code coverage statistics for the project as a whole. The report is simply a text file with each line representing a covered block as `file:start_line.start_col,end_line.end_col`. Note that this is similar to the format of coverage reports generated with `go test -coverprofile`, although we omit the count and number of statements, which are not useful for us. Some simple ways of getting coverage reports: bazel test <some_test> --collect_code_coverage \ --instrumentation_filter=//pkg/... bazel build //runsc --collect_code_coverage \ --instrumentation_filter=//pkg/... runsc -coverage-report=dir/ <other_flags> do ... PiperOrigin-RevId: 368952911
2021-04-02Implement cgroupfs.Rahat Mahmood
A skeleton implementation of cgroupfs. It supports trivial cpu and memory controllers with no support for hierarchies. PiperOrigin-RevId: 366561126
2021-03-18Skip /dev submount hack on VFS2.Jamie Liu
containerd usually configures both /dev and /dev/shm as tmpfs mounts, e.g.: ``` "mounts": [ ... { "destination": "/dev", "type": "tmpfs", "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ] }, ... { "destination": "/dev/shm", "type": "tmpfs", "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/shm", "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=67108864" ] }, ... ``` (This is mostly consistent with how Linux is usually configured, except that /dev is conventionally devtmpfs, not regular tmpfs. runc/libcontainer implements OCI-runtime-spec-undocumented behavior to create /dev/{ptmx,fd,stdin,stdout,stderr} in non-bind /dev mounts. runsc silently switches /dev to devtmpfs. In VFS1, this is necessary to get device files like /dev/null at all, since VFS1 doesn't support real device special files, only what is hardcoded in devfs. VFS2 does support device special files, but using devtmpfs is the easiest way to get pre-created files in /dev.) runsc ignores many /dev submounts in the spec, including /dev/shm. In VFS1, this appears to be to avoid introducing a submount overlay for /dev, and is mostly fine since the typical mode for the /dev/shm mount is ~consistent with the mode of the /dev/shm directory provided by devfs (modulo the sticky bit). In VFS2, this is vestigial (VFS2 does not use submount overlays), and devtmpfs' /dev/shm mode is correct for the mount point but not the mount. So turn off this behavior for VFS2. After this change: ``` $ docker run --rm -it ubuntu:focal ls -lah /dev/shm total 0 drwxrwxrwt 2 root root 40 Mar 18 00:16 . drwxr-xr-x 5 root root 360 Mar 18 00:16 .. $ docker run --runtime=runsc --rm -it ubuntu:focal ls -lah /dev/shm total 0 drwxrwxrwx 1 root root 0 Mar 18 00:16 . dr-xr-xr-x 1 root root 0 Mar 18 00:16 .. $ docker run --runtime=runsc-vfs2 --rm -it ubuntu:focal ls -lah /dev/shm total 0 drwxrwxrwt 2 root root 40 Mar 18 00:16 . drwxr-xr-x 5 root root 320 Mar 18 00:16 .. ``` Fixes #5687 PiperOrigin-RevId: 363699385
2021-02-22Fix `runsc kill --pid`Fabricio Voznika
Previously, loader.signalProcess was inconsitently using both root and container's PID namespace to find the process. It used root namespace for the exec'd process and container's PID namespace for other processes. This fixes the code to use the root PID namespace across the board, which is the same PID reported in `runsc ps` (or soon will after https://github.com/google/gvisor/pull/5519). PiperOrigin-RevId: 358836297
2021-02-12Stop the control server only once.Adin Scannell
Operations are now shut down automatically by the main Stop command, and it is not necessary to call Stop during Destroy. Fixes #5454 PiperOrigin-RevId: 357295930
2021-02-02Stub out basic `runsc events --stat` CPU functionalityKevin Krakauer
Because we lack gVisor-internal cgroups, we take the CPU usage of the entire pod and divide it proportionally according to sentry-internal usage stats. This fixes `kubectl top pods`, which gets a pod's CPU usage by summing the usage of its containers. Addresses #172. PiperOrigin-RevId: 355229833
2021-01-12Fix simple mistakes identified by goreportcard.Adin Scannell
These are primarily simplification and lint mistakes. However, minor fixes are also included and tests added where appropriate. PiperOrigin-RevId: 351425971
2021-01-11OCI spec may contain duplicate environment variablesFabricio Voznika
Closes #5226 PiperOrigin-RevId: 351259576
2020-12-29Make profiling commands synchronous.Adin Scannell
This allows for a model of profiling when you can start collection, and it will terminate when the sandbox terminates. Without this synchronous call, it is effectively impossible to collect length blocking and mutex profiles. PiperOrigin-RevId: 349483418
2020-12-17Set max memory not minFabricio Voznika
Closes #5048 PiperOrigin-RevId: 348050472
2020-12-07Support icmpv6 transport protocolPeter Johnston
PiperOrigin-RevId: 346101076
2020-11-17Add support for TTY in multi-containerFabricio Voznika
Fixes #2714 PiperOrigin-RevId: 342950412
2020-11-05Fix failure setting OOM score adjustmentFabricio Voznika
When OOM score adjustment needs to be set, all the containers need to be loaded to find all containers that belong to the sandbox. However, each load signals the container to ensure it is still alive. OOM score adjustment is set during creation and deletion of every container, generating a flood of signals to all containers. The fix removes the signal check when it's not needed. There is also a race fetching OOM score adjustment value from the parent when the sandbox exits at the same time (the time it took to signal containers above made this window quite large). The fix is to store the original value in the sandbox state file and use it when the value needs to be restored. Also add more logging and made the existing ones more consistent to help with debugging. PiperOrigin-RevId: 340940799
2020-10-29Keep magic constants out of netstackKevin Krakauer
PiperOrigin-RevId: 339721152
2020-10-28Add logging option to leak checker.Dean Deng
Also refactor the template and CheckedObject interface to make this cleaner. Updates #1486. PiperOrigin-RevId: 339577120
2020-10-23Rewrite reference leak checker without finalizers.Dean Deng
Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. - Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
2020-10-22Load spec during "runsc start" to process flag overridesFabricio Voznika
Subcontainers are only configured when the container starts, however because start doesn't load the spec, flag annotations that may override flags were not getting applied to the configuration. Updates #3494 PiperOrigin-RevId: 338610953
2020-10-13[vfs2] Don't take reference in Task.MountNamespaceVFS2 and MountNamespace.Root.Dean Deng
This fixes reference leaks related to accidentally forgetting to DecRef() after calling one or the other. PiperOrigin-RevId: 336918922
2020-10-12[vfs2] Don't leak disconnected mounts.Dean Deng
PiperOrigin-RevId: 336694658
2020-10-05Fix gofer monitor prematurely destroying containerFabricio Voznika
When all container tasks finish, they release the mount which in turn will close the 9P session to the gofer. The gofer exits when the connection closes, triggering the gofer monitor. The gofer monitor will _think_ that the gofer died prematurely and destroy the container. Then when the caller attempts to wait for the container, e.g. to get the exit code, wait fails saying the container doesn't exist. Gofer monitor now just SIGKILLs the container, and let the normal teardown process to happen, which will evetually destroy the container at the right time. Also, fixed an issue with exec racing with container's init process exiting. Closes #1487 PiperOrigin-RevId: 335537350
2020-10-05Merge pull request #3970 from benbuzbee:gomaxprocsgVisor bot
PiperOrigin-RevId: 335516972
2020-09-30Use consistent thread configuration for sandbox go runtimeBen Buzbee
With cgroups configured NumCPU is correct, however GOMAXPROCS is still derived from total host core count and ignores cgroup restrictions. This can lead to different and undesired behavior across different hosts. For example, the total number of threads in the guest process will be larger on machines with more cores. This change configures the go runtime for the sandbox to only use the number of threads consistent with its restrictions.
2020-09-28Support creating protocol instances with Stack refGhanan Gowripalan
Network or transport protocols may want to reach the stack. Support this by letting the stack create the protocol instances so it can pass a reference to itself at protocol creation time. Note, protocols do not yet use the stack in this CL but later CLs will make use of the stack from protocols. PiperOrigin-RevId: 334260210
2020-09-15Add support for OCI seccomp filters in the sandbox.Ian Lewis
OCI configuration includes support for specifying seccomp filters. In runc, these filter configurations are converted into seccomp BPF programs and loaded into the kernel via libseccomp. runsc needs to be a static binary so, for runsc, we cannot rely on a C library and need to implement the functionality in Go. The generator added here implements basic support for taking OCI seccomp configuration and converting it into a seccomp BPF program with the same behavior as a program generated by libseccomp. - New conditional operations were added to pkg/seccomp to support operations available in OCI. - AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect that syscalls matching the conditionals result in the provided action not simply SCMP_RET_ALLOW. - BuildProgram in pkg/seccomp no longer panics if provided an empty list of rules. It now builds a program with the architecture sanity check only. - ProgramBuilder now allows adding labels that are unused. However, backwards jumps are still not permitted. Fixes #510 PiperOrigin-RevId: 331938697
2020-09-08Improve type safety for transport protocol optionsGhanan Gowripalan
The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-04Simplify FD handling for container start/execFabricio Voznika
VFS1 and VFS2 host FDs have different dupping behavior, making error prone to code for both. Change the contract so that FDs are released as they are used, so the caller can simple defer a block that closes all remaining files. This also addresses handling of partial failures. With this fix, more VFS2 tests can be enabled. Updates #1487 PiperOrigin-RevId: 330112266
2020-09-01Dup stdio FDs for VFS2 when starting a child containerTiwei Bie
Currently the stdio FDs are not dupped and will be closed unexpectedly in VFS2 when starting a child container. This patch fixes this issue. Fixes: #3821 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-08-28Improve type safety for network protocol optionsGhanan Gowripalan
The existing implementation for NetworkProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for network protocol options that may be set or queried which network protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. PiperOrigin-RevId: 328980359
2020-08-19Move boot.Config to its own packageFabricio Voznika
Updates #3494 PiperOrigin-RevId: 327548511
2020-08-10Run GC before sandbox exit when leak checking is enabled.Dean Deng
Running garbage collection enqueues all finalizers, which are used by the refs/refs_vfs2 packages to detect reference leaks. Note that even with GC, there is no guarantee that all finalizers will be run before the program exits. This is a best effort attempt to activate leak checks as much as possible. Updates #3545. PiperOrigin-RevId: 325834438
2020-08-05Stop profiling when the sentry exitsFabricio Voznika
Also removes `--profile-goroutine` because it's equivalent to `debug --stacks`. PiperOrigin-RevId: 325061502
2020-08-04Error if dup'ing stdio FDs will clobber another FDFabricio Voznika
The loader dup's stdio FD into stable FD's starting at a fixed number. During tests, it's possible that the target FD is already in use. Added check to error early so it's easier to debug failures. Also bumped up the starting FD number to prevent collisions. PiperOrigin-RevId: 324917299