Age | Commit message (Collapse) | Author |
|
Suppose I start a runsc container using the kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donated FDs and the corresponding cmdline for runsc-sandbox are:
D0313 17:50:12.608203 44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214 44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224 44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229 44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234 44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238 44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242 44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246 44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249 44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253 44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257 44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls= --strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]
Note that with the kvm platform stdioFDs start at 10, so stderr's FD is 12.
If I restore a container from the checkpoint image produced by
checkpointing the above rootbash container, but either omit the
platform switch or explicitly specify the ptrace platform:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash
the donated FDs and the corresponding cmdline for runsc-sandbox are:
D0313 17:50:15.258632 44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640 44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645 44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648 44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653 44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657 44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661 44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675 44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680 44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684 44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]
Note that this time stdioFDs start at 9 and stderr's FD is 11 (so the
saved host.descriptor.origFD, which is 12 for stderr, is no longer valid).
For the three host-FD-based files, the s.Dev and s.Ino derived from
fstat(fd) should all be the same. Since these two fields are used as the
device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId, which is
the value of MultiDevice.Map(MultiDeviceKey), should also be the same
for all three.
Note that for a MultiDevice m, m.cache records the mapping from key to
value and m.rcache records the mapping from value to key. If the same
value does not map to the same key, restore panics.
Now that stderr's origFD 12 is no longer valid (it happens to be
/memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() will not be correct
either. Because its InodeID is still the same as the saved one,
MultiDevice.Load() detects the same value (InodeID) being mapped to a
different key (different from stdin's and stdout's) and panics with
"MultiDevice's caches are inconsistent".
Solve this problem by making sure the stdioFDs for the root container's
init task are always the same at initial start and at restore time,
whatever cmdline the user has used: whether or not a debug log is
specified, whether or not the platform has changed, and so on should not
affect the ability to restore.
Fixes #1844.
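The consistency rule described above can be illustrated with a minimal sketch. It assumes a simplified two-map structure; the names deviceKey, multiDevice, and load are hypothetical stand-ins and not the actual gVisor types, and the device and inode numbers are made up.

```go
package main

import "fmt"

// deviceKey is a hypothetical stand-in for device.MultiDeviceKey: the host
// (device, inode) pair obtained from fstat on a host FD.
type deviceKey struct {
	Dev uint64
	Ino uint64
}

// multiDevice keeps a forward map (key -> inode ID) and a reverse map
// (inode ID -> key), mirroring the cache and rcache described above.
type multiDevice struct {
	cache  map[deviceKey]uint64
	rcache map[uint64]deviceKey
}

func newMultiDevice() *multiDevice {
	return &multiDevice{
		cache:  make(map[deviceKey]uint64),
		rcache: make(map[uint64]deviceKey),
	}
}

// load records a saved (key, id) pair and reports false if the pair
// contradicts one that was loaded earlier.
func (m *multiDevice) load(k deviceKey, id uint64) bool {
	if old, ok := m.rcache[id]; ok && old != k {
		return false // same inode ID, different host file
	}
	if old, ok := m.cache[k]; ok && old != id {
		return false // same host file, different inode ID
	}
	m.cache[k] = id
	m.rcache[id] = k
	return true
}

func main() {
	m := newMultiDevice()
	tty := deviceKey{Dev: 24, Ino: 3}   // stdin and stdout: the same host file
	memfd := deviceKey{Dev: 1, Ino: 99} // what a stale FD might point at on restore

	fmt.Println(m.load(tty, 7))   // stdin: true
	fmt.Println(m.load(tty, 7))   // stdout: true
	fmt.Println(m.load(memfd, 7)) // stderr via stale FD: false
}
```

In this sketch the third call fails because inode ID 7 is already bound to a different key, which corresponds to the condition that triggers the panic on restore.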
|
|
In the case of other signals (preemption), inject a normal bounce and
defer the signal until the vCPU has been returned from guest mode.
PiperOrigin-RevId: 303799678
|
|
Using the host-defined file owner matches VFS1. It is more correct to use the
host-defined mode, since the cached value may become out of date. However,
kernfs.Inode.Mode() does not return an error--other filesystems on kernfs are
in-memory so retrieving mode should not fail. Therefore, if the host syscall
fails, we rely on a cached value instead.
Updates #1672.
PiperOrigin-RevId: 303220864
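A minimal sketch of the fallback described above, assuming a hypothetical hostInode type with a cached mode. This is not the kernfs or hostfs code, only an illustration of a Mode() method that cannot return an error.

```go
package main

import (
	"fmt"
	"syscall"
)

// hostInode is a hypothetical type used only for this sketch.
type hostInode struct {
	hostFD     int
	cachedMode uint32
}

// Mode has no error return, so a failed fstat falls back to the cached value.
func (i *hostInode) Mode() uint32 {
	var st syscall.Stat_t
	if err := syscall.Fstat(i.hostFD, &st); err != nil {
		return i.cachedMode
	}
	// Prefer the host-defined mode and refresh the cache with it.
	i.cachedMode = uint32(st.Mode)
	return i.cachedMode
}

func main() {
	bad := &hostInode{hostFD: -1, cachedMode: 0644}
	// fstat on FD -1 fails, so the cached mode is returned.
	fmt.Printf("%o\n", bad.Mode())
}
```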
|
|
utimensat is used by hostfs for setting timestamps on imported fds. Previously,
this would crash the sandbox since utimensat was not allowed.
Correct the VFS2 version of hostfs to match the call in VFS1.
PiperOrigin-RevId: 301970121
|
|
PiperOrigin-RevId: 301949722
|
|
- When setting up the virtual filesystem, mount a host.filesystem to contain
all files that need to be imported.
- Make read/preadv syscalls to the host in cases where preadv2 may not be
supported yet (likewise for writing).
- Make save/restore functions in kernel/kernel.go return early if vfs2 is
enabled.
PiperOrigin-RevId: 300922353
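The read fallback in the second item can be sketched as follows. This is a simplification under stated assumptions: it uses a single buffer with read/pread rather than the vectored preadv, and readFromHostFD and demoRead are hypothetical names. The offset value -1 meaning "use the current file position" is also an assumption of the sketch.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// readFromHostFD reads into dst. An offset of -1 means "use the FD's current
// position" (plain read); any other offset uses pread. Neither call needs
// preadv2 support on the host.
func readFromHostFD(fd int, dst []byte, offset int64) (int, error) {
	if offset == -1 {
		return syscall.Read(fd, dst)
	}
	return syscall.Pread(fd, dst, offset)
}

// demoRead writes "hello world" to a temporary file and reads 5 bytes
// starting at offset 6.
func demoRead() (string, error) {
	f, err := os.CreateTemp("", "hostfd-sketch")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.WriteString("hello world"); err != nil {
		return "", err
	}
	buf := make([]byte, 5)
	n, err := readFromHostFD(int(f.Fd()), buf, 6)
	if err != nil {
		return "", err
	}
	return string(buf[:n]), nil
}

func main() {
	s, err := demoRead()
	fmt.Println(s, err)
}
```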
|
|
When the sandbox runs in attached mode, e.g. runsc do, runsc run, the
sandbox lifetime is controlled by the parent process. This wasn't working
in all cases because the parent-death signal (PR_SET_PDEATHSIG) doesn't
propagate through execve when the process changes uid/gid. So it was
getting dropped when the sandbox execve's to change to user nobody.
PiperOrigin-RevId: 300601247
|
|
Asynchronous goroutine preemption is a new feature of Go 1.14.
When we switched to Go 1.14 (cl/297915917) in the bazel config,
the kokoro syscall-kvm job started failing permanently. Let's
temporarily set asyncpreemptoff for the kvm platform to unblock tests.
PiperOrigin-RevId: 300372387
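Go 1.14 exposes this switch through the GODEBUG environment variable (GODEBUG=asyncpreemptoff=1). How runsc applies it is not shown here; the sketch below only illustrates one way to add the setting to a child process environment. The helper name withAsyncPreemptOff is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// withAsyncPreemptOff returns a copy of env in which GODEBUG contains
// asyncpreemptoff=1, preserving any GODEBUG settings already present.
func withAsyncPreemptOff(env []string) []string {
	const opt = "asyncpreemptoff=1"
	out := make([]string, 0, len(env)+1)
	found := false
	for _, kv := range env {
		if strings.HasPrefix(kv, "GODEBUG=") {
			found = true
			if val := strings.TrimPrefix(kv, "GODEBUG="); val != "" {
				kv = "GODEBUG=" + val + "," + opt
			} else {
				kv = "GODEBUG=" + opt
			}
		}
		out = append(out, kv)
	}
	if !found {
		out = append(out, "GODEBUG="+opt)
	}
	return out
}

func main() {
	env := []string{"PATH=/bin", "GODEBUG=gctrace=1"}
	fmt.Println(withAsyncPreemptOff(env))
}
```

The result could then be assigned to the Env field of an exec.Cmd before starting the sandbox process.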
|
|
PiperOrigin-RevId: 299233818
|
|
A parser of test results doesn't expect to see any extra messages.
PiperOrigin-RevId: 299174138
|
|
A parser of test results doesn't expect to see any extra messages.
PiperOrigin-RevId: 298966577
|
|
Go's runtime calls the write system call twice to print "panic:"
and "the reason of this panic", so there is a race window in which
other threads can print something to the log, and we will see
something like this:
panic: log messages from another thread
The reason of the panic.
This confuses the syzkaller blacklist and dedup detection.
It also makes the logs generally difficult to read. e.g.,
data races often have one side of the race, followed by
a large "diagnosis" dump, finally followed by the other
side of the race.
PiperOrigin-RevId: 297887895
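The message above does not say how the race window is closed. As a general illustration only, the sketch below shows why emitting a message with one write call instead of two leaves no gap for another writer. The names countingWriter and writeOnce are hypothetical.

```go
package main

import (
	"fmt"
	"io"
)

// countingWriter records how many Write calls it receives and what was written.
type countingWriter struct {
	calls int
	data  []byte
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	c.data = append(c.data, p...)
	return len(p), nil
}

// writeOnce joins the parts into one buffer and issues a single Write, so
// nothing can be interleaved between the parts by another writer.
func writeOnce(w io.Writer, parts ...string) error {
	var buf []byte
	for _, p := range parts {
		buf = append(buf, p...)
	}
	_, err := w.Write(buf)
	return err
}

func main() {
	cw := &countingWriter{}
	if err := writeOnce(cw, "panic: ", "the reason of this panic\n"); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Printf("%d write call(s): %s", cw.calls, cw.data)
}
```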
|
|
Updates #1873
PiperOrigin-RevId: 297695241
|
|
|
|
pipe and pipe2 aren't ported, pending a slight rework of pipe FDs for VFS2.
mount and umount2 aren't ported out of temporary laziness. access and faccessat
need additional FSImpl methods to implement properly, but are stubbed to
prevent googletest from CHECK-failing. Other syscalls require additional
plumbing.
Updates #1623
PiperOrigin-RevId: 297188448
|
|
TestMultiContainerKillAll timed out under --race. Without logging,
we cannot tell if the process list is still increasing, but slowly,
or is stuck.
PiperOrigin-RevId: 297158834
|
|
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will behave the same as they do now.
Until userspace is able to create devices, the default loopback device can
be used for testing.
/proc/net and /sys/net will still be connected to the root network stack; this
is the same as the current behavior.
Issue #1833
PiperOrigin-RevId: 296309389
|
|
PiperOrigin-RevId: 296105337
|
|
This is to fix a data race between sending an external signal to
a ThreadGroup and kernel saving state for S/R.
PiperOrigin-RevId: 295244281
|
|
- Added fsbridge package with interface that can be used to open
and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup
Updates #1623
PiperOrigin-RevId: 295183950
|
|
PiperOrigin-RevId: 294500858
|
|
PiperOrigin-RevId: 294300437
|
|
PiperOrigin-RevId: 294297004
|
|
Note that these are only implemented for tmpfs, and other impls will still
return EOPNOTSUPP.
PiperOrigin-RevId: 293899385
|
|
Sometimes we get this error under TSAN:
"""
error getting process data from container: connecting to control server at PID
XXXX: connection refused
"""
The theory is that the top "sleep 20" was too short for TSAN, and the container
had already exited, so we get connection refused. This commit changes the test to
have the container signal that it is running by touching a file repeatedly for
the duration of the test.
PiperOrigin-RevId: 293710957
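The test side of this handshake can be sketched as a polling wait for the file to appear. The helper name waitForFile and the 10 ms polling interval are assumptions of this sketch, not taken from the test itself.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForFile polls until path exists or the timeout expires.
func waitForFile(path string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, err := os.Stat(path); err == nil {
			return nil // the container has touched the file, so it is running
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("%s did not appear within %v", path, timeout)
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	// "/" always exists, so this returns immediately with a nil error.
	fmt.Println(waitForFile("/", time.Second))
}
```

Because the container keeps touching the file, the test can remove it and wait again whenever it needs to confirm that the container is still alive.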
|
|
The host /etc can contain config files which affect tests.
For example, bash reads /etc/passwd, and if it is too big
a test can time out.
PiperOrigin-RevId: 293670637
|
|
These were out-of-band notes that can help provide additional context
and simplify automated imports.
PiperOrigin-RevId: 293525915
|
|
PiperOrigin-RevId: 293243342
|
|
container_test was flaking because a small percentage of runs timed out. Tested
this fix with --runs_per_test=100.
PiperOrigin-RevId: 293240102
|
|
Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g
stack to prevent register corruption. We need to allow this syscall until it is
removed from Go.
PiperOrigin-RevId: 293212935
|
|
* Tests are picked for a shard differently. It now picks one test from each
block, instead of picking the whole block. This spreads tests of the same
kind across different shards.
* Reduce the number of connect() calls in TCPListenClose.
PiperOrigin-RevId: 293019281
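The sharding scheme in the first item can be sketched as a strided selection, assuming a block is a run of `total` consecutive tests. The function name testsForShard is hypothetical.

```go
package main

import "fmt"

// testsForShard picks one test from each block of `total` consecutive tests:
// shard k takes indices k, k+total, k+2*total, and so on.
func testsForShard(tests []string, shard, total int) []string {
	if total <= 0 || shard < 0 || shard >= total {
		return nil
	}
	var picked []string
	for i := shard; i < len(tests); i += total {
		picked = append(picked, tests[i])
	}
	return picked
}

func main() {
	tests := []string{"a1", "a2", "a3", "b1", "b2", "b3"}
	for shard := 0; shard < 3; shard++ {
		fmt.Println(shard, testsForShard(tests, shard, 3))
	}
}
```

With three shards, the neighbouring tests a1, a2, and a3 land on shards 0, 1, and 2 rather than all running on the same shard.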
|
|
PiperOrigin-RevId: 292974323
|
|
Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g
stack to prevent register corruption. We need to allow this syscall until it is
removed from Go.
PiperOrigin-RevId: 292967478
|
|
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.
Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.
Updates #1487
Updates #1623
PiperOrigin-RevId: 292023223
|
|
In general, we've learned that logging must be avoided at all
costs in the hot path. It's unlikely that the optimizations
here were significant in any case, since buffer would certainly
escape.
This also adds a test to ensure that the caller identification
works as expected, and so that logging can be benchmarked.
Original:
BenchmarkGoogleLogging-6 1222255 949 ns/op
With this change:
BenchmarkGoogleLogging-6 517323 2346 ns/op
Fixes #184
PiperOrigin-RevId: 291815420
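Caller identification of the kind tested here can be done with runtime.Caller. The sketch below is a generic illustration, not the logging package's code, and callerSite is a hypothetical name.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// callerSite returns "file.go:line" for the function that called it.
func callerSite() string {
	// Skip 1 frame so that the caller of callerSite is reported,
	// not callerSite itself.
	_, file, line, ok := runtime.Caller(1)
	if !ok {
		return "???:0"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

func main() {
	fmt.Println(callerSite())
}
```

Looking up the caller on every log call costs time, which is one reason logging is best kept out of the hot path.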
|
|
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.
PiperOrigin-RevId: 291811289
|
|
The preferred Copyright holder is "The gVisor Authors".
PiperOrigin-RevId: 291786657
|
|
PiperOrigin-RevId: 291745021
|
|
There was a very bare get/setxattr in the InodeOperations interface. Add
context.Context to both, size to getxattr, and flags to setxattr.
Note that extended attributes are passed around as strings in this
implementation, so size is automatically encoded into the value. Size is
added in getxattr so that implementations can return ERANGE if a value is larger
than can fit in the user-allocated buffer. This prevents us from unnecessarily
passing around an arbitrarily large xattr when the user buffer is actually too
small.
Don't use the existing xattrwalk and xattrcreate messages and define our
own, mainly for the sake of simplicity.
Extended attributes will be implemented in future commits.
PiperOrigin-RevId: 290121300
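The size check described above can be sketched with xattrs held as strings in a map. The function getXattr and its map-based storage are hypothetical, and returning ENODATA for a missing attribute is an assumption of the sketch.

```go
package main

import (
	"fmt"
	"syscall"
)

// getXattr returns the value of name if it fits in a user buffer of the given
// size. Values are strings, so their length is known without extra bookkeeping.
func getXattr(attrs map[string]string, name string, size uint64) (string, error) {
	v, ok := attrs[name]
	if !ok {
		return "", syscall.ENODATA
	}
	if uint64(len(v)) > size {
		// Too large for the caller's buffer: fail early instead of
		// passing the whole value around.
		return "", syscall.ERANGE
	}
	return v, nil
}

func main() {
	attrs := map[string]string{"user.comment": "hello"}
	fmt.Println(getXattr(attrs, "user.comment", 16))
	fmt.Println(getXattr(attrs, "user.comment", 2))
}
```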
|
|
Updates #231
PiperOrigin-RevId: 289897881
|
|
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
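A minimal sketch of the alias approach, assuming the aliases live next to the code that uses them; in the real tree they would be in their own package. The counter type exists only for the example.

```go
package main

import (
	"fmt"
	stdsync "sync"
)

// Aliases for the standard library types. Callers use these names, so the
// underlying implementation can later be swapped in one place.
type (
	Mutex     = stdsync.Mutex
	RWMutex   = stdsync.RWMutex
	WaitGroup = stdsync.WaitGroup
)

// counter is a small example type guarded by the aliased Mutex.
type counter struct {
	mu Mutex
	n  int
}

func (c *counter) inc() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.n++
}

func main() {
	var c counter
	var wg WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.inc()
		}()
	}
	wg.Wait()
	fmt.Println(c.n)
}
```

Because these are type aliases rather than new types, values remain interchangeable with the standard library types while the migration is in progress.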
|
|
PiperOrigin-RevId: 288779416
|
|
...enabling us to remove the "CreateNamedLoopbackNIC" variant of
CreateNIC and all the plumbing to connect it through to where the value
is read in FindRoute.
PiperOrigin-RevId: 288713093
|
|
It can take more than 10 seconds when running under --race.
PiperOrigin-RevId: 286296060
|
|
Remove the newly introduced CPUNumMin config and hard-code the minimum as 2.
|
|
* Add `--cpu-num-min` flag to control minimum CPUs
* Only lower CPU count
* Fix comments
|
|
When an application is not cgroups-aware, it can spawn an excessive number of
threads, since the thread count often defaults to the number of CPUs.
Introduce an opt-in flag that sets the CPU number according to the CPU
quota (if available).
Fixes #1391
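One way to derive a CPU count from a CFS quota is to round quota/period up, which is an assumption of this sketch. It also applies the follow-up rules from the later commits in this log: a minimum of 2, and only ever lowering the count. The function name and parameters are hypothetical.

```go
package main

import "fmt"

// minCPUs is the hard-coded lower bound mentioned in the follow-up commit.
const minCPUs = 2

// cpuNumFromQuota derives a CPU count from a cgroup CPU quota and period
// (both in microseconds). It never raises the count above hostCPUs.
func cpuNumFromQuota(quotaUs, periodUs int64, hostCPUs int) int {
	if quotaUs <= 0 || periodUs <= 0 {
		return hostCPUs // no quota available: leave the CPU count alone
	}
	n := int((quotaUs + periodUs - 1) / periodUs) // round up
	if n < minCPUs {
		n = minCPUs
	}
	if n > hostCPUs {
		return hostCPUs // only lower the CPU count, never raise it
	}
	return n
}

func main() {
	fmt.Println(cpuNumFromQuota(250000, 100000, 32)) // 2.5 CPUs of quota -> 3
	fmt.Println(cpuNumFromQuota(50000, 100000, 32))  // 0.5 CPUs of quota -> 2
	fmt.Println(cpuNumFromQuota(-1, 100000, 32))     // no quota -> 32
}
```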
|
|
It would be preferable to test iptables via syscall tests, but there are some
problems with that approach:
* We're limited to loopback-only, as syscall tests involve only a single
container. Other link interfaces (e.g. fdbased) should be tested.
* We'd have to shell out to call iptables anyways, as the iptables syscall
interface itself is too large and complex to work with alone.
* Running the Linux/native version of the syscall test will require root, which
is a pain to configure, is inherently unsafe, and could leave host iptables
misconfigured.
Using the go_test target allows there to be no new test runner.
PiperOrigin-RevId: 285274275
|
|
Fixes #1341
PiperOrigin-RevId: 285108973
|
|
runsc debug --ps lists all processes with all threads. This option is added to
the debug command but not to the ps command, because it is going to be used for
debug purposes and we want to add any useful information without thinking about
backward compatibility.
This will help to investigate syzkaller issues.
PiperOrigin-RevId: 285013668
|