Age | Commit message (Collapse) | Author |
|
OCI configuration includes support for specifying seccomp filters. In runc,
these filter configurations are converted into seccomp BPF programs and loaded
into the kernel via libseccomp. runsc needs to be a static binary so, for
runsc, we cannot rely on a C library and need to implement the functionality
in Go.
The generator added here implements basic support for taking OCI seccomp
configuration and converting it into a seccomp BPF program with the same
behavior as a program generated by libseccomp.
- New conditional operations were added to pkg/seccomp to support operations
available in OCI.
- AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect
that syscalls matching the conditionals result in the provided action not
simply SCMP_RET_ALLOW.
- BuildProgram in pkg/seccomp no longer panics if provided an empty list of
rules. It now builds a program with the architecture sanity check only.
- ProgramBuilder now allows adding labels that are unused. However, backwards
jumps are still not permitted.
Fixes #510
PiperOrigin-RevId: 331938697
|
|
The existing implementation for TransportProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.
This change introduces marker interfaces for transport protocol options
that may be set or queried which transport protocol option types
implement to ensure that invalid types are caught at compile time.
Different interfaces are used to allow the compiler to enforce read-only
or set-only socket options.
RELNOTES: n/a
PiperOrigin-RevId: 330559811
|
|
VFS1 and VFS2 host FDs have different dupping behavior,
making error prone to code for both. Change the contract
so that FDs are released as they are used, so the caller
can simple defer a block that closes all remaining files.
This also addresses handling of partial failures.
With this fix, more VFS2 tests can be enabled.
Updates #1487
PiperOrigin-RevId: 330112266
|
|
Currently the stdio FDs are not dupped and will be closed
unexpectedly in VFS2 when starting a child container. This
patch fixes this issue.
Fixes: #3821
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
|
|
The existing implementation for NetworkProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.
This change introduces marker interfaces for network protocol options
that may be set or queried which network protocol option types implement
to ensure that invalid types are caught at compile time. Different
interfaces are used to allow the compiler to enforce read-only or
set-only socket options.
PiperOrigin-RevId: 328980359
|
|
Updates #3494
PiperOrigin-RevId: 327548511
|
|
Running garbage collection enqueues all finalizers, which are used by the
refs/refs_vfs2 packages to detect reference leaks. Note that even with GC,
there is no guarantee that all finalizers will be run before the program exits.
This is a best effort attempt to activate leak checks as much as possible.
Updates #3545.
PiperOrigin-RevId: 325834438
|
|
Also removes `--profile-goroutine` because it's equivalent
to `debug --stacks`.
PiperOrigin-RevId: 325061502
|
|
The loader dup's stdio FD into stable FD's starting at a fixed
number. During tests, it's possible that the target FD is already
in use. Added check to error early so it's easier to debug failures.
Also bumped up the starting FD number to prevent collisions.
PiperOrigin-RevId: 324917299
|
|
context is passed to DecRef() and Release() which is
needed for SO_LINGER implementation.
PiperOrigin-RevId: 324672584
|
|
PiperOrigin-RevId: 321411758
|
|
- Combine process creation code that is shared between
root and subcontainer processes
- Move root container information into a struct for
clarity
Updates #2714
PiperOrigin-RevId: 321204798
|
|
This change gates all FUSE commands (by gating /dev/fuse) behind a runsc
flag. In order to use FUSE commands, use the --fuse flag with the --vfs2
flag. Check if FUSE is enabled by running dmesg in the sandbox.
|
|
Container restart test is disabled for VFS2 for now.
Updates #1487
PiperOrigin-RevId: 320296401
|
|
Removed VDSO dependency on VFS1.
Resolves #2921
PiperOrigin-RevId: 320122176
|
|
Linux controls socket send/receive buffers using a few sysctl variables
- net.core.rmem_default
- net.core.rmem_max
- net.core.wmem_max
- net.core.wmem_default
- net.ipv4.tcp_rmem
- net.ipv4.tcp_wmem
The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).
The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.
Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.
This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values and the default value buffers for UDP/RAW sockets is
updated to match the linux value of 212KiB up from the really low current value
of 32 KiB.
Updates #3043
Fixes #3043
PiperOrigin-RevId: 318089805
|
|
Metadata was useful for debugging and safety, but enough tests exist that we
should see failures when (de)serialization is broken. It made stack
initialization more cumbersome and it's also getting in the way of ip6tables.
PiperOrigin-RevId: 317210653
|
|
Updates #173,#6
Fixes #2888
PiperOrigin-RevId: 317087652
|
|
Fixes #701
PiperOrigin-RevId: 316025635
|
|
IPTables.connections contains a sync.RWMutex. Copying it will trigger copylocks
analysis. Tested by manually enabling nogo tests.
sync.RWMutex is added to IPTables for the additional race condition discovered.
PiperOrigin-RevId: 314817019
|
|
Updates #1197, #1198, #1672
PiperOrigin-RevId: 310432006
|
|
We can register any number of tables with any number of architectures, and
need not limit the definitions to the architecture in question. This allows
runsc to generate documentation for all architectures simultaneously.
Similarly, this simplifies the VFSv2 patching process.
PiperOrigin-RevId: 310224827
|
|
Updates #1623, #1487
PiperOrigin-RevId: 309777922
|
|
This change includes:
- Modifications to loader_test.go to get TestCreateMountNamespace to
pass with VFS2.
- Changes necessary to get TestHelloWorld in image tests to pass with
VFS2. This means runsc can run the hello-world container with docker
on VSF2.
Note: Containers that use sockets will not run with these changes.
See "//test/image/...". Any tests here with sockets currently fail
(which is all of them but HelloWorld).
PiperOrigin-RevId: 308363072
|
|
This is needed to set up host fds passed through a Unix socket. Note that
the host package depends on kernel, so we cannot set up the hostfs mount
directly in Kernel.Init as we do for sockfs and pipefs.
Also, adjust sockfs to make its setup look more like hostfs's and pipefs's.
PiperOrigin-RevId: 308274053
|
|
PiperOrigin-RevId: 307977689
|
|
Included:
- loader_test.go RunTest and TestStartSignal VFS2
- container_test.go TestAppExitStatus on VFS2
- experimental flag added to runsc to turn on VFS2
Note: shared mounts are not yet supported.
PiperOrigin-RevId: 307070753
|
|
PiperOrigin-RevId: 306477639
|
|
Suppose I start a runsc container using kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donating FD and the corresponding cmdline for runsc-sandbox is:
D0313 17:50:12.608203 44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214 44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224 44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229 44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234 44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238 44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242 44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246 44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249 44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253 44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257 44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls=--strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]
Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12.
If I restore a container from the checkpoint image which is derived
by checkpointing the above rootbash container, but either omit the
platform switch or specify to use ptrace platform explicitely:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash
the donating FD and corresponding cmdline for runsc-sandbox is:
D0313 17:50:15.258632 44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640 44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645 44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648 44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653 44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657 44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661 44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675 44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680 44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684 44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]
Note this time, stdioFDs starts from 9 and stderr's FD is 11(so the
saved host.descritor.origFD which is 12 for stderr is no longer valid).
For the three host FD based files, The s.Dev and s.Ino derived from
fstat(fd) shall all be the same and since the two fields are used
as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId which is
the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same.
Note that for MultiDevice m, m.cache records the mapping of key to value
and m.rcache records the mapping of value to key. If same value doesn't
map to the same key, it will panic on restore.
Now that stderr's origFD 12 is no longer valid(it happens to be
/memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() will neither be
correct. But its InodeID is still the same as saved, MultiDevice.Load()
will complain about the same value(InodeID) being mapped to different
keys (different from stdin and stdout's) and panic with: "MultiDevice's
caches are inconsistent".
Solve this problem by making sure stdioFDs for root container's init
task are always the same on initial start and on restore time, no matter
what cmdline user has used: debug log specified or not, platform changed
or not etc. shall not affect the ability to restore.
Fixes #1844.
|
|
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will have the same behavior as it is now.
Before the userspace is able to create device, the default loopback device can
be used to test.
/proc/net and /sys/net will still be connected to the root network stack; this
is the same behavior now.
Issue #1833
PiperOrigin-RevId: 296309389
|
|
This is to fix a data race between sending an external signal to
a ThreadGroup and kernel saving state for S/R.
PiperOrigin-RevId: 295244281
|
|
- Added fsbridge package with interface that can be used to open
and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup
Updates #1623
PiperOrigin-RevId: 295183950
|
|
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.
Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.
Updates #1487
Updates #1623
PiperOrigin-RevId: 292023223
|
|
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
|
|
PiperOrigin-RevId: 284320186
|
|
This patch also include a minor change to replace syscall.Dup2
with syscall.Dup3 which was missed in a previous commit(ref a25a976).
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3
|
|
NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events.
gVisor doesn't have any device events, so our sockets don't need to do anything
once created.
systemd's device manager needs to be able to create one of these sockets. It
also wants to install a BPF filter on the socket. Since we'll never send any
messages, the filter would never be invoked, thus we just fake it out.
Fixes #1117
Updates #1119
PiperOrigin-RevId: 278405893
|
|
The watchdog currently can find stuck tasks, but has no way to tell if the
sandbox is stuck before the application starts executing.
This CL adds a startup timeout and action to the watchdog. If Start() is not
called before the given timeout (if non-zero), then the watchdog will take the
action.
PiperOrigin-RevId: 277970577
|
|
It is required to guarantee the same order of endpoints after save/restore.
PiperOrigin-RevId: 277598665
|
|
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With
runsc, you'll need to pass `--net-raw=true` to enable them.
Binding isn't supported yet.
PiperOrigin-RevId: 275909366
|
|
Also change the default TTL to 64 to match Linux.
PiperOrigin-RevId: 273430341
|
|
PiperOrigin-RevId: 273365058
|
|
Also removes the need for protocol names.
PiperOrigin-RevId: 271186030
|
|
We already do this for `runsc run`, but need to do the same for `runsc exec`.
PiperOrigin-RevId: 270793459
|
|
Some processes are reparented to the root container depending
on the kill order and the root container would not reap in time.
So some zombie processes were still present when the test checked.
Fix it by running the second container inside a PID namespace.
PiperOrigin-RevId: 267278591
|
|
PiperOrigin-RevId: 266226714
|
|
This used to be the case, but regressed after a recent change.
Also made a few fixes around it and clean up the code a bit.
Closes #720
PiperOrigin-RevId: 265717496
|
|
|
|
|
|
This was done in commit 04cbb13ce9b151cf906f42e3f18ce3a875f01f63
PiperOrigin-RevId: 261414748
|