Age | Commit message (Collapse) | Author |
|
Included:
- loader_test.go RunTest and TestStartSignal VFS2
- container_test.go TestAppExitStatus on VFS2
- experimental flag added to runsc to turn on VFS2
Note: shared mounts are not yet supported.
PiperOrigin-RevId: 307070753
|
|
PiperOrigin-RevId: 306477639
|
|
The sentry doesn't allow execve, but it's a good defense
in-depth measure.
PiperOrigin-RevId: 305958737
|
|
This required minor restructuring of how system call tables were saved
and restored, but it makes way more sense this way.
Updates #2243
|
|
Issue #2270
Issue #1765
PiperOrigin-RevId: 305385436
|
|
Suppose I start a runsc container using kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donating FD and the corresponding cmdline for runsc-sandbox is:
D0313 17:50:12.608203 44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214 44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224 44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229 44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234 44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238 44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242 44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246 44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249 44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253 44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257 44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls=--strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]
Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12.
If I restore a container from the checkpoint image which is derived
by checkpointing the above rootbash container, but either omit the
platform switch or specify to use ptrace platform explicitely:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash
the donating FD and corresponding cmdline for runsc-sandbox is:
D0313 17:50:15.258632 44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640 44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645 44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648 44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653 44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657 44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661 44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675 44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680 44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684 44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]
Note this time, stdioFDs starts from 9 and stderr's FD is 11(so the
saved host.descritor.origFD which is 12 for stderr is no longer valid).
For the three host FD based files, The s.Dev and s.Ino derived from
fstat(fd) shall all be the same and since the two fields are used
as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId which is
the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same.
Note that for MultiDevice m, m.cache records the mapping of key to value
and m.rcache records the mapping of value to key. If same value doesn't
map to the same key, it will panic on restore.
Now that stderr's origFD 12 is no longer valid(it happens to be
/memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() will neither be
correct. But its InodeID is still the same as saved, MultiDevice.Load()
will complain about the same value(InodeID) being mapped to different
keys (different from stdin and stdout's) and panic with: "MultiDevice's
caches are inconsistent".
Solve this problem by making sure stdioFDs for root container's init
task are always the same on initial start and on restore time, no matter
what cmdline user has used: debug log specified or not, platform changed
or not etc. shall not affect the ability to restore.
Fixes #1844.
|
|
Using the host-defined file owner matches VFS1. It is more correct to use the
host-defined mode, since the cached value may become out of date. However,
kernfs.Inode.Mode() does not return an error--other filesystems on kernfs are
in-memory so retrieving mode should not fail. Therefore, if the host syscall
fails, we rely on a cached value instead.
Updates #1672.
PiperOrigin-RevId: 303220864
|
|
utimensat is used by hostfs for setting timestamps on imported fds. Previously,
this would crash the sandbox since utimensat was not allowed.
Correct the VFS2 version of hostfs to match the call in VFS1.
PiperOrigin-RevId: 301970121
|
|
- When setting up the virtual filesystem, mount a host.filesystem to contain
all files that need to be imported.
- Make read/preadv syscalls to the host in cases where preadv2 may not be
supported yet (likewise for writing).
- Make save/restore functions in kernel/kernel.go return early if vfs2 is
enabled.
PiperOrigin-RevId: 300922353
|
|
PiperOrigin-RevId: 299233818
|
|
GO's runtime calls the write system call twice to print "panic:"
and "the reason of this panic", so here is a race window when
other threads can print something to the log and we will see
something like this:
panic: log messages from another thread
The reason of the panic.
This confuses the syzkaller blacklist and dedup detection.
It also makes the logs generally difficult to read. e.g.,
data races often have one side of the race, followed by
a large "diagnosis" dump, finally followed by the other
side of the race.
PiperOrigin-RevId: 297887895
|
|
|
|
pipe and pipe2 aren't ported, pending a slight rework of pipe FDs for VFS2.
mount and umount2 aren't ported out of temporary laziness. access and faccessat
need additional FSImpl methods to implement properly, but are stubbed to
prevent googletest from CHECK-failing. Other syscalls require additional
plumbing.
Updates #1623
PiperOrigin-RevId: 297188448
|
|
TCP/IP will work with netstack networking. hostinet doesn't work, and sockets
will have the same behavior as it is now.
Before the userspace is able to create device, the default loopback device can
be used to test.
/proc/net and /sys/net will still be connected to the root network stack; this
is the same behavior now.
Issue #1833
PiperOrigin-RevId: 296309389
|
|
This is to fix a data race between sending an external signal to
a ThreadGroup and kernel saving state for S/R.
PiperOrigin-RevId: 295244281
|
|
- Added fsbridge package with interface that can be used to open
and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup
Updates #1623
PiperOrigin-RevId: 295183950
|
|
These were out-of-band notes that can help provide additional context
and simplify automated imports.
PiperOrigin-RevId: 293525915
|
|
Go 1.14 has a workaround for a Linux 5.2-5.4 bug which requires mlock'ing the g
stack to prevent register corruption. We need to allow this syscall until it is
removed from Go.
PiperOrigin-RevId: 292967478
|
|
FD table now holds both VFS1 and VFS2 types and uses the correct
one based on what's set.
Parts of this CL are just initial changes (e.g. sys_read.go,
runsc/main.go) to serve as a template for the remaining changes.
Updates #1487
Updates #1623
PiperOrigin-RevId: 292023223
|
|
In general, we've learned that logging must be avoided at all
costs in the hot path. It's unlikely that the optimizations
here were significant in any case, since buffer would certainly
escape.
This also adds a test to ensure that the caller identification
works as expected, and so that logging can be benchmarked.
Original:
BenchmarkGoogleLogging-6 1222255 949 ns/op
With this change:
BenchmarkGoogleLogging-6 517323 2346 ns/op
Fixes #184
PiperOrigin-RevId: 291815420
|
|
Because the abi will depend on the core types for marshalling (usermem,
context, safemem, safecopy), these need to be flattened from the sentry
directory. These packages contain no sentry-specific details.
PiperOrigin-RevId: 291811289
|
|
PiperOrigin-RevId: 291745021
|
|
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
|
|
PiperOrigin-RevId: 288779416
|
|
...enabling us to remove the "CreateNamedLoopbackNIC" variant of
CreateNIC and all the plumbing to connect it through to where the value
is read in FindRoute.
PiperOrigin-RevId: 288713093
|
|
Remove introduced CPUNumMin config and hard-code it as 2.
|
|
* Add `--cpu-num-min` flag to control minimum CPUs
* Only lower CPU count
* Fix comments
|
|
When application is not cgroups-aware, it can spawn excessive threads
which often defaults to CPU number.
Introduce a opt-in flag that will set CPU number accordingly to CPU
quota (if available).
Fixes #1391
|
|
Fixes #1341
PiperOrigin-RevId: 285108973
|
|
PiperOrigin-RevId: 284320186
|
|
PiperOrigin-RevId: 284305935
|
|
This adds meaningful annotations to the trace generated by the runtime/trace
package.
PiperOrigin-RevId: 284290115
|
|
Changed annotation to follow the standard defined here:
https://github.com/opencontainers/image-spec/blob/master/annotations.md
PiperOrigin-RevId: 284254847
|
|
There are two potential ways of sending a TOS byte with outgoing packets:
including a control message in sendmsg, or setting the IP_TOS/IPV6_TCLASS
socket options (for IPV4 and IPV6 respectively). This change lets hostinet
support the latter.
Fixes #1188
PiperOrigin-RevId: 283550925
|
|
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I3fd5e552f5f03b5144ed52647f75af3b8253b1d6
|
|
This involves allowing getsockopt/setsockopt for the corresponding socket
options, as well as allowing hostinet to process control messages received from
the actual recvmsg syscall.
PiperOrigin-RevId: 282851425
|
|
PiperOrigin-RevId: 282401165
|
|
PiperOrigin-RevId: 282382564
|
|
This patch also include a minor change to replace syscall.Dup2
with syscall.Dup3 which was missed in a previous commit(ref a25a976).
Signed-off-by: Haibo Xu <haibo.xu@arm.com>
Change-Id: I00beb9cc492e44c762ebaa3750201c63c1f7c2f3
|
|
NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events.
gVisor doesn't have any device events, so our sockets don't need to do anything
once created.
systemd's device manager needs to be able to create one of these sockets. It
also wants to install a BPF filter on the socket. Since we'll never send any
messages, the filter would never be invoked, thus we just fake it out.
Fixes #1117
Updates #1119
PiperOrigin-RevId: 278405893
|
|
The watchdog currently can find stuck tasks, but has no way to tell if the
sandbox is stuck before the application starts executing.
This CL adds a startup timeout and action to the watchdog. If Start() is not
called before the given timeout (if non-zero), then the watchdog will take the
action.
PiperOrigin-RevId: 277970577
|
|
PiperOrigin-RevId: 277623766
|
|
It is required to guarantee the same order of endpoints after save/restore.
PiperOrigin-RevId: 277598665
|
|
fix a typo
|
|
remove a duplicated period
|
|
Right now, we send each tcp packet separately, we call one system
call per-packet. This patch allows to generate multiple tcp packets
and send them by sendmmsg.
The arguable part of this CL is a way how to handle multiple headers.
This CL adds the next field to the Prepandable buffer.
Nginx test results:
Server Software: nginx/1.15.9
Server Hostname: 10.138.0.2
Server Port: 8080
Document Path: /10m.txt
Document Length: 10485760 bytes
w/o gso:
Concurrency Level: 5
Time taken for tests: 5.491 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 18.21 [#/sec] (mean)
Time per request: 274.525 [ms] (mean)
Time per request: 54.905 [ms] (mean, across all concurrent requests)
Transfer rate: 186508.03 [Kbytes/sec] received
sw-gso:
Concurrency Level: 5
Time taken for tests: 3.852 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 1048600200 bytes
HTML transferred: 1048576000 bytes
Requests per second: 25.96 [#/sec] (mean)
Time per request: 192.576 [ms] (mean)
Time per request: 38.515 [ms] (mean, across all concurrent requests)
Transfer rate: 265874.92 [Kbytes/sec] received
w/o gso:
$ ./tcp_benchmark --client --duration 15 --ideal
[SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec
software gso:
$ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso
[SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec
PiperOrigin-RevId: 276112677
|
|
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With
runsc, you'll need to pass `--net-raw=true` to enable them.
Binding isn't supported yet.
PiperOrigin-RevId: 275909366
|
|
Linux kernel before 4.19 doesn't implement a feature that updates
open FD after a file is open for write (and is copied to the upper
layer). Already open FD will continue to read the old file content
until they are reopened. This is especially problematic for gVisor
because it caches open files.
Flag was added to force readonly files to be reopenned when the
same file is open for write. This is only needed if using kernels
prior to 4.19.
Closes #1006
It's difficult to really test this because we never run on tests
on older kernels. I'm adding a test in GKE which uses kernels
with the overlayfs problem for 1.14 and lower.
PiperOrigin-RevId: 275115289
|
|
PiperOrigin-RevId: 273630282
|
|
Options that do not change mount behavior inside the Sentry are
irrelevant and should not be used when looking for possible
incompatibilities between master and slave mounts.
PiperOrigin-RevId: 273593486
|