gvisor - Container Runtime Sandbox

Age	Commit message	Author
2021-05-07	Init all vCPU when initializing machine on ARM64	howard zhang
	This patch fixes the vCPU timer being messed up when vCPUs are added dynamically on ARM64; for details, see https://github.com/google/gvisor/issues/5739. x86 is not affected. The main changes for ARM64 are:
	1. Create maxVCPUs vCPUs during machine initialization.
	2. To keep the gVisor vCPU count in sync with the host CPU count, set maxVCPUs to the smaller of runtime.NumCPU and KVM_CAP_MAX_VCPUS.
	3. Put unused vCPUs into the architecture-specific map initialvCPUs.
	4. When the machine needs to bind a new vCPU to a tid, pick a vCPU from the initialvCPUs map rather than creating a new one.
	5. Change the setSystemTime function. As the number of vCPUs increases, the time taken by setTSC (which uses a syscall to set cntvoff) grows linearly from around 300 ns to 100000 ns, which prevents setSystemTimeLegacy from getting the correct offset value.
	6. Initialize StdioFDs and goferFD before the platform so that StdioFDs do not conflict with vCPU fds.
	Signed-off-by: howard zhang <howard.zhang@arm.com>
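	The pre-creation scheme in this commit can be pictured with a minimal sketch. It is not the gVisor implementation: the vCPU and machine types, the kvmMaxVCPUs parameter (standing in for the value reported by KVM_CAP_MAX_VCPUS), and the bind method are hypothetical, and locking is omitted for brevity.

```go
package main

import (
	"fmt"
	"runtime"
)

// vCPU is a hypothetical stand-in for a KVM vCPU handle.
type vCPU struct {
	id  int
	tid uint64 // thread the vCPU is bound to; 0 while unbound
}

// machine pre-creates every vCPU and later binds them to threads on demand.
type machine struct {
	maxVCPUs     int
	initialvCPUs map[int]*vCPU    // created but not yet bound
	vCPUsByTID   map[uint64]*vCPU // already bound to a thread
}

// newMachine creates min(runtime.NumCPU(), kvmMaxVCPUs) vCPUs up front.
func newMachine(kvmMaxVCPUs int) *machine {
	n := runtime.NumCPU()
	if kvmMaxVCPUs < n {
		n = kvmMaxVCPUs
	}
	m := &machine{
		maxVCPUs:     n,
		initialvCPUs: make(map[int]*vCPU, n),
		vCPUsByTID:   make(map[uint64]*vCPU),
	}
	for i := 0; i < n; i++ {
		m.initialvCPUs[i] = &vCPU{id: i}
	}
	return m
}

// bind returns the vCPU bound to tid, taking one from the pre-created pool
// if needed. It returns nil when the pool is exhausted.
func (m *machine) bind(tid uint64) *vCPU {
	if c, ok := m.vCPUsByTID[tid]; ok {
		return c
	}
	for id, c := range m.initialvCPUs {
		delete(m.initialvCPUs, id)
		c.tid = tid
		m.vCPUsByTID[tid] = c
		return c
	}
	return nil
}

func main() {
	m := newMachine(4)
	c := m.bind(1234)
	fmt.Println(c != nil, m.bind(1234) == c, len(m.initialvCPUs) == m.maxVCPUs-1)
}
```

	Because no vCPU is created after initialization, the per-vCPU timer offset only has to be set once, which is the property the commit message relies on.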
2021-03-29	[syserror] Split usermem package	Zach Koopmans
	Split usermem package to help remove syserror dependency in go_marshal. New hostarch package contains code not dependent on syserror. PiperOrigin-RevId: 365651233
2021-03-29	Merge pull request #5728 from zhlhahaha:2091	gVisor bot
	PiperOrigin-RevId: 365613394
2021-03-29	[perf] Reduce contention in ptrace.threadPool.lookupOrCreate().	Ayush Ranjan
	lookupOrCreate is called from subprocess.switchToApp() and subprocess.syscall(). lookupOrCreate() looks for a thread already created for the current TID. If a thread exists (common case), it returns immediately. Otherwise it creates a new one. This change switches to using a sync.RWMutex. The initial thread existence lookup is now done only with the read lock. So multiple successful lookups can occur concurrently. Only when a new thread is created will it acquire the lock for writing and update the map (which is not the common case). Discovered in mutex profiles from the various ptrace benchmarks. Example: https://gvisor.dev/profile/gvisor-buildkite/fd14bfad-b30f-44dc-859b-80ebac50beb4/843827db-da50-4dc9-a2ea-ecf734dde2d5/tmp/profile/ptrace/BenchmarkFio/operation.write/blockSize.4K/filesystem.tmpfs/benchmarks/fio/mutex.pprof/flamegraph PiperOrigin-RevId: 365612094
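	The read-mostly locking pattern described above can be sketched as follows. The thread and threadPool types and the newThread callback are simplified, hypothetical stand-ins rather than the ptrace platform's actual code; the re-check under the write lock is a general-purpose safeguard that the real code may not need.

```go
package main

import (
	"fmt"
	"sync"
)

// thread is a hypothetical stand-in for a traced thread.
type thread struct{ tid int32 }

// threadPool maps thread IDs to threads; lookups vastly outnumber inserts.
type threadPool struct {
	mu      sync.RWMutex
	threads map[int32]*thread
}

// lookupOrCreate returns the thread for tid, creating it on first use.
func (tp *threadPool) lookupOrCreate(tid int32, newThread func() *thread) *thread {
	// Fast path: only the read lock, so concurrent lookups do not contend.
	tp.mu.RLock()
	t, ok := tp.threads[tid]
	tp.mu.RUnlock()
	if ok {
		return t
	}

	// Slow path: take the write lock and re-check before inserting, in case
	// another goroutine created the entry between the two critical sections.
	tp.mu.Lock()
	defer tp.mu.Unlock()
	if t, ok := tp.threads[tid]; ok {
		return t
	}
	t = newThread()
	tp.threads[tid] = t
	return t
}

func main() {
	tp := &threadPool{threads: make(map[int32]*thread)}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(tid int32) {
			defer wg.Done()
			tp.lookupOrCreate(tid, func() *thread { return &thread{tid: tid} })
		}(int32(i % 2))
	}
	wg.Wait()
	fmt.Println(len(tp.threads))
}
```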
2021-03-25	Use seqfile.SeqHandles correctly in VFS1 /proc/net/.	Jamie Liu
	Before this change:
	```
	$ docker run --runtime=runsc --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
	#1: read(128) = 128
	#2: read(1024) = EOF
	$ docker run --runtime=runsc-vfs2 --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
	#1: read(128) = 128
	#2: read(1024) = 256
	```
	After this change:
	```
	$ docker run --runtime=runsc --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
	#1: read(128) = 128
	#2: read(1024) = 256
	$ docker run --runtime=runsc-vfs2 --rm -it -v ~/tmp:/hosttmp ubuntu:focal /hosttmp/issue5732 --bytes1=128 --bytes2=1024
	#1: read(128) = 128
	#2: read(1024) = 256
	```
	Fixes #5732
	PiperOrigin-RevId: 365178386
2021-03-25	Lock TaskSet mutex for writing in ptraceClone().	Jamie Liu
	This is necessary since ptraceClone() mutates tracer.ptraceTracees. PiperOrigin-RevId: 365152396
2021-03-25	setgid: skip tests when we can't find usable GIDs	Kevin Krakauer
	PiperOrigin-RevId: 365092320
2021-03-25	Fix comments error	Howard Zhang
	Signed-off-by: Howard Zhang <howard.zhang@arm.com>
2021-03-25	Fix nogo test error	Howard Zhang
	Signed-off-by: Howard Zhang <howard.zhang@arm.com>
2021-03-24	Fix path to runsc in CNI tutorial.	Ian Lewis
	PiperOrigin-RevId: 364931406
2021-03-24	Fix highlighting sidebar menu on the website	Ian Lewis
	Highlighting previously highlighted multiple items in the sidebar if they had the same page name (not full URL). This change simplifies this by adding the highlight class in the Jekyll template rather than in JavaScript, and highlights only the correct page. PiperOrigin-RevId: 364931350
2021-03-24	Add POLLRDNORM/POLLWRNORM support.	Bhasker Hariharan
	On Linux these are meant to be equivalent to POLLIN/POLLOUT. Rather than hack these on in sys_poll etc., it felt cleaner to just clean up the call sites to notify for both events. This is what Linux does as well. Fixes #5544 PiperOrigin-RevId: 364859977
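	The idea of notifying for both events at the call site can be sketched like this. The constant values are the Linux x86-64 poll bits; the names readableEvents, writableEvents, and ready are illustrative only and are not claimed to match gVisor's waiter package.

```go
package main

import "fmt"

// Poll event bits as defined by Linux on x86-64.
const (
	pollIn     uint16 = 0x0001
	pollOut    uint16 = 0x0004
	pollRdNorm uint16 = 0x0040
	pollWrNorm uint16 = 0x0100

	// Call sites raise the combined masks so that a waiter registered for
	// either spelling of "readable" or "writable" is woken up.
	readableEvents = pollIn | pollRdNorm
	writableEvents = pollOut | pollWrNorm
)

// ready reports which of the requested events are present in raised.
func ready(requested, raised uint16) uint16 {
	return requested & raised
}

func main() {
	// A waiter that asked only for POLLRDNORM is still notified on readable.
	fmt.Printf("%#x\n", ready(pollRdNorm, readableEvents))
	// A waiter that asked for POLLIN|POLLWRNORM sees only the writable half.
	fmt.Printf("%#x\n", ready(pollIn|pollWrNorm, writableEvents))
}
```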
2021-03-24	Fix data race in fdbased when accessing fanoutID.	Bhasker Hariharan
	PiperOrigin-RevId: 364859173
2021-03-24	Unexpose immutable fields in stack.Route	Nick Brown
	This change sets the inner `routeInfo` struct to be a named private member and replaces direct access with access through getters. Note that direct access to the fields of `routeInfo` is still possible through the `RouteInfo` struct. Fixes #4902 PiperOrigin-RevId: 364822872
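	A minimal sketch of this encapsulation pattern follows. The field names and the Fields method are hypothetical simplifications; only the overall shape (a named private routeInfo member, getters on Route, and an exported RouteInfo that still exposes the fields) is taken from the commit message.

```go
package main

import "fmt"

// routeInfo holds fields that never change after a Route is created.
// The fields shown here are illustrative.
type routeInfo struct {
	LocalAddress  string
	RemoteAddress string
}

// Route keeps routeInfo as a named private member, so callers in other
// packages cannot reach the fields directly and must use the getters.
type Route struct {
	routeInfo routeInfo
}

// LocalAddress returns the route's local address.
func (r *Route) LocalAddress() string { return r.routeInfo.LocalAddress }

// RemoteAddress returns the route's remote address.
func (r *Route) RemoteAddress() string { return r.routeInfo.RemoteAddress }

// RouteInfo is an exported copy of the immutable fields. Embedding promotes
// the exported fields of routeInfo, so they stay directly accessible here.
type RouteInfo struct {
	routeInfo
}

// Fields returns a snapshot of the route's immutable fields.
func (r *Route) Fields() RouteInfo { return RouteInfo{routeInfo: r.routeInfo} }

func main() {
	r := &Route{routeInfo: routeInfo{LocalAddress: "10.0.0.1", RemoteAddress: "10.0.0.2"}}
	fmt.Println(r.LocalAddress(), r.RemoteAddress())
	fmt.Println(r.Fields().LocalAddress)
}
```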
2021-03-23	Merge pull request #5677 from avagin:kvm-mmio	gVisor bot
	PiperOrigin-RevId: 364728696
2021-03-23	Move the code that manages floating-point state to a separate package	Andrei Vagin
	This change is inspired by Adin's cl/355256448. PiperOrigin-RevId: 364695931
2021-03-23	Add --file-access-mounts flag	Fabricio Voznika
	--file-access-mounts flag is similar to --file-access, but controls non-root mounts that were previously mounted in shared mode only. This gives more flexibility to control how mounts are shared within a container. PiperOrigin-RevId: 364669882
2021-03-23	setgid directory support in goferfs	Kevin Krakauer
	Also adds support for clearing the setuid bit when appropriate (writing, truncating, changing size, changing UID, or changing GID). VFS2 only. PiperOrigin-RevId: 364661835
2021-03-23	Skip checklocks analysis for stateify generated code.	Rahat Mahmood
	Stateify methods are always called without holding the appropriate locks. The system is paused and we know there will be no mutations when we call Save/Load, so this is perfectly safe. However, checklocks can't know about this, and it will always complain. Mark stateify generated methods that touch struct fields as "checklocksignore" to avoid this. PiperOrigin-RevId: 364610241
2021-03-23	Allow FSETXATTR/FGETXATTR host calls for Verity	Chong Cai
	These host calls are needed for Verity fs to generate/verify hashes. PiperOrigin-RevId: 364598180
2021-03-23	Use constant (TestInitialSequenceNumber) instead of integer (789) in tests.	Nayana Bidari
	PiperOrigin-RevId: 364596526
2021-03-23	Split fio read/write and randread/randwrite operations	Zach Koopmans
	The fio benchmark was changed to a fixed-size read/write amount because the timed benchmark was overwhelming machine memory on tmpfs mounts. Now rand(read|write) operations are prohibitively long, leading to timeouts. Split the benchmarks as they were in python bm-tools: the read/write as fixed-size (1GB) and the rand(read|write) as timed operations (15s). PiperOrigin-RevId: 364584436
2021-03-23	Explicitly allow martian loopback packets	Ghanan Gowripalan
	...instead of opting out of them. Loopback traffic should be stack-local, but gVisor has some clients that depend on the ability to receive loopback traffic that originated from outside of the stack. Because of this, we guard this change behind IP protocol options. A previous change provided the facility to deny these martian loopback packets; this change instead requires clients to opt in to accepting them, since martian loopback packets are not meant to be accepted, as per RFC 1122 section 3.2.1.3.g: (g) { 127, <any> } Internal host loopback address. Addresses of this form MUST NOT appear outside a host. PiperOrigin-RevId: 364581174
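	The opt-in check can be illustrated with a small sketch. The options struct, its allowExternalLoopbackTraffic field, and acceptPacket are hypothetical names chosen for this example; the sketch uses the standard net/netip package rather than netstack's own address types.

```go
package main

import (
	"fmt"
	"net/netip"
)

// options is a hypothetical stand-in for per-protocol IP options.
type options struct {
	// allowExternalLoopbackTraffic must be set explicitly to accept packets
	// with a loopback source that did not originate inside the stack.
	allowExternalLoopbackTraffic bool
}

// acceptPacket reports whether a packet with source address src should be
// accepted. fromLoopbackNIC says whether the packet stayed stack-local.
// Unparseable addresses are dropped.
func acceptPacket(opts options, src string, fromLoopbackNIC bool) bool {
	addr, err := netip.ParseAddr(src)
	if err != nil {
		return false
	}
	// A martian loopback packet carries a loopback source address (127/8 or
	// ::1) but arrived from outside the stack.
	martian := addr.IsLoopback() && !fromLoopbackNIC
	return !martian || opts.allowExternalLoopbackTraffic
}

func main() {
	strict := options{}
	lenient := options{allowExternalLoopbackTraffic: true}
	fmt.Println(acceptPacket(strict, "127.0.0.1", false))
	fmt.Println(acceptPacket(lenient, "127.0.0.1", false))
	fmt.Println(acceptPacket(strict, "127.0.0.1", true))
	fmt.Println(acceptPacket(strict, "192.0.2.1", false))
}
```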
2021-03-22	Update apt repository to limit to supported architectures.	Adin Scannell
	Fixes #5703 PiperOrigin-RevId: 364492235
2021-03-22	[lisa] Support dynamic types for all types.	Ayush Ranjan
	We were only supporting dynamic struct types. With this change, users can make any type dynamic. The tool (correctly) blindly just generates the remaining methods needed to implement Marshallable using the 3 methods defined by the user on the dynamic type. This is helpful in situations like: type StringArray []string Added a test for such a use case. PiperOrigin-RevId: 364463164
2021-03-22	Fix logs for packetimpact tests cleanup	Zeling Feng
	- Don't clean up containers in Network.Cleanup, otherwise containers will be killed and removed several times.
	- Don't set AutoRemove for containers. This will prevent the confusing 'removal already in progress' messages.
	Fixes #3795
	PiperOrigin-RevId: 364404414
2021-03-22	Return tcpip.Error from (*Stack).GetMainNICAddress	Ghanan Gowripalan
	PiperOrigin-RevId: 364381970
2021-03-22	Emit comment about build tags in gomarshal generated files.	Rahat Mahmood
	This may be useful for tracking down where build tags come from and understanding tag import issues in generated files. PiperOrigin-RevId: 364374931
2021-03-22	Avoid calling sync on each write in writethrough mode.	Nicolas Lacasse
	PiperOrigin-RevId: 364370595
2021-03-22	Fix and merge tcp_{outside_the_window,tcp_unacc_seq_ack}_closing	Zeling Feng
	The tests were not using the correct windowSize so the testing segments were actually within the window for seqNumOffset=0 tests. The issue is already fixed by #5674. PiperOrigin-RevId: 364252630
2021-03-18	Translate syserror when validating partial IO errors	Fabricio Voznika
	syserror allows packages to register translators for errors. These translators should be called prior to checking if the error is valid, otherwise it may not account for possible errors that can be returned from different packages, e.g. safecopy.BusError => syserror.EFAULT. Second attempt, it passes tests now :-) PiperOrigin-RevId: 363714508
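	The ordering issue described above (translate first, validate second) can be sketched as follows. All names are hypothetical: errBusError stands in for safecopy.BusError, errEFAULT for syserror.EFAULT, and the translator registry is a simplification of what syserror provides.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errBusError = errors.New("bus error")   // stands in for safecopy.BusError
	errEFAULT   = errors.New("bad address") // stands in for syserror.EFAULT
)

// translator maps a package-specific error to a canonical one.
type translator func(error) (error, bool)

// translators is a simplified registry of error translators.
var translators = []translator{
	func(err error) (error, bool) {
		if errors.Is(err, errBusError) {
			return errEFAULT, true
		}
		return nil, false
	},
}

// translate returns the canonical form of err, or err itself if no
// registered translator recognizes it.
func translate(err error) error {
	for _, t := range translators {
		if out, ok := t(err); ok {
			return out
		}
	}
	return err
}

// isExpectedPartialIOError reports whether err is acceptable after a
// partial read or write. Only the canonical error is recognized.
func isExpectedPartialIOError(err error) bool {
	return errors.Is(err, errEFAULT)
}

// handlePartialIO translates err before validating it, so errors from other
// packages are judged in their canonical form.
func handlePartialIO(err error) error {
	err = translate(err)
	if !isExpectedPartialIOError(err) {
		return fmt.Errorf("unexpected partial IO error: %w", err)
	}
	return nil
}

func main() {
	fmt.Println(isExpectedPartialIOError(errBusError)) // validated too early
	fmt.Println(handlePartialIO(errBusError))          // translated first
	fmt.Println(handlePartialIO(errors.New("some other failure")))
}
```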
2021-03-18	Address post submit comments for fs benchmarks.	Zach Koopmans
	Also, drop fio total reads/writes to 1GB as 10GB is prohibitively slow. PiperOrigin-RevId: 363714060
2021-03-18	Skip /dev submount hack on VFS2.	Jamie Liu
	containerd usually configures both /dev and /dev/shm as tmpfs mounts, e.g.:
	```
	"mounts": [
	  ...
	  {
	    "destination": "/dev",
	    "type": "tmpfs",
	    "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/tmpfs",
	    "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ]
	  },
	  ...
	  {
	    "destination": "/dev/shm",
	    "type": "tmpfs",
	    "source": "/run/containerd/io.containerd.runtime.v2.task/moby/10eedbd6a0e7937ddfcab90f2c25bd9a9968b734c4ae361318142165d445e67e/shm",
	    "options": [ "nosuid", "noexec", "nodev", "mode=1777", "size=67108864" ]
	  },
	  ...
	```
	(This is mostly consistent with how Linux is usually configured, except that /dev is conventionally devtmpfs, not regular tmpfs. runc/libcontainer implements OCI-runtime-spec-undocumented behavior to create /dev/{ptmx,fd,stdin,stdout,stderr} in non-bind /dev mounts. runsc silently switches /dev to devtmpfs. In VFS1, this is necessary to get device files like /dev/null at all, since VFS1 doesn't support real device special files, only what is hardcoded in devfs. VFS2 does support device special files, but using devtmpfs is the easiest way to get pre-created files in /dev.)
	runsc ignores many /dev submounts in the spec, including /dev/shm. In VFS1, this appears to be to avoid introducing a submount overlay for /dev, and is mostly fine since the typical mode for the /dev/shm mount is ~consistent with the mode of the /dev/shm directory provided by devfs (modulo the sticky bit). In VFS2, this is vestigial (VFS2 does not use submount overlays), and devtmpfs' /dev/shm mode is correct for the mount point but not the mount. So turn off this behavior for VFS2.
	After this change:
	```
	$ docker run --rm -it ubuntu:focal ls -lah /dev/shm
	total 0
	drwxrwxrwt 2 root root 40 Mar 18 00:16 .
	drwxr-xr-x 5 root root 360 Mar 18 00:16 ..
	$ docker run --runtime=runsc --rm -it ubuntu:focal ls -lah /dev/shm
	total 0
	drwxrwxrwx 1 root root 0 Mar 18 00:16 .
	dr-xr-xr-x 1 root root 0 Mar 18 00:16 ..
	$ docker run --runtime=runsc-vfs2 --rm -it ubuntu:focal ls -lah /dev/shm
	total 0
	drwxrwxrwt 2 root root 40 Mar 18 00:16 .
	drwxr-xr-x 5 root root 320 Mar 18 00:16 ..
	```
	Fixes #5687
	PiperOrigin-RevId: 363699385
2021-03-17	Do not use martian loopback packets in tests	Ghanan Gowripalan
	Transport demuxer and UDP tests should not use a loopback address as the source address for packets injected into the stack as martian loopback packets will be dropped in a later change. PiperOrigin-RevId: 363479681
2021-03-17	Drop loopback traffic from outside of the stack	Ghanan Gowripalan
	Loopback traffic should be stack-local but gVisor has some clients that depend on the ability to receive loopback traffic that originated from outside of the stack. Because of this, we guard this change behind IP protocol options. Test: integration_test.TestExternalLoopbackTraffic PiperOrigin-RevId: 363461242
2021-03-16	kvm: prefault a floating point state before restoring it	Andrei Vagin
	If the physical pages of a memory region are not mapped yet, the kernel will trigger KVM_EXIT_MMIO and we will map the physical pages in bluepillHandler(). The instruction that triggered the fault is not re-executed; it is emulated in the kernel, but the kernel cannot emulate complex instructions such as xsave and xrstor. We can touch the memory with simple instructions to work around this problem.
2021-03-16	Fix tcp_fin_retransmission_netstack_test	Zeling Feng
	Netstack does not check ACK number for FIN-ACK packets and goes into TIMEWAIT unconditionally. Fixing the state machine will give us back the retransmission of FIN. PiperOrigin-RevId: 363301883
2021-03-16	Fix a race with synRcvdCount and accept	Mithun Iyer
	There is a race in handling new incoming connections on a listening endpoint that causes the endpoint to reply to more incoming SYNs than what is permitted by the listen backlog. The race occurs when there is a successful passive connection handshake and the synRcvdCount counter is decremented, followed by the endpoint being delivered to the accept queue. In the window of time between synRcvdCount decrementing and the endpoint being enqueued for accept, new incoming SYNs can be handled without honoring the listen backlog value, because the backlog may appear not to be full. Fixes #5637 PiperOrigin-RevId: 363279372
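	One way to close such a window, sketched under simplifying assumptions, is to decrement the counter and enqueue the endpoint inside the same critical section that SYN admission uses. This is an illustration of the race, not necessarily the fix that was applied; the listener type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// listener is a hypothetical, simplified listening endpoint.
type listener struct {
	mu      sync.Mutex
	backlog int
	synRcvd int   // handshakes in progress
	acceptQ []int // IDs of completed connections awaiting accept
}

// onSYN reports whether a new SYN may be answered. In-progress handshakes
// and queued connections are counted together under one lock.
func (l *listener) onSYN() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.synRcvd+len(l.acceptQ) >= l.backlog {
		return false
	}
	l.synRcvd++
	return true
}

// onHandshakeDone moves a connection from "handshaking" to the accept queue.
// Both updates happen in one critical section, so onSYN can never observe
// the counter decremented while the connection is not yet enqueued.
func (l *listener) onHandshakeDone(id int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.synRcvd--
	l.acceptQ = append(l.acceptQ, id)
}

// accept removes and returns the oldest queued connection, if any.
func (l *listener) accept() (int, bool) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.acceptQ) == 0 {
		return 0, false
	}
	id := l.acceptQ[0]
	l.acceptQ = l.acceptQ[1:]
	return id, true
}

func main() {
	l := &listener{backlog: 2}
	fmt.Println(l.onSYN(), l.onSYN(), l.onSYN())
	l.onHandshakeDone(1)
	fmt.Println(l.onSYN())
	l.accept()
	fmt.Println(l.onSYN())
}
```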
2021-03-16	setgid directory support in overlayfs	Kevin Krakauer
	PiperOrigin-RevId: 363276495
2021-03-16	Unexport methods on NDPOption	Ghanan Gowripalan
	They are not used outside of the header package. PiperOrigin-RevId: 363237708
2021-03-16	Detect looped-back NDP DAD messages	Ghanan Gowripalan
	...as per RFC 7527. If a looped-back DAD message is received, do not fail DAD since our own DAD message does not indicate that a neighbor has the address assigned. Test: ndp_test.TestDADResolveLoopback PiperOrigin-RevId: 363224288
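	RFC 7527 detects looped-back DAD messages by attaching a random nonce to each Neighbor Solicitation sent for DAD and comparing the nonce of any received solicitation against the ones sent. A minimal sketch of that bookkeeping follows; the nonce and dadState types and their methods are hypothetical and do not mirror netstack's code.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// nonce is the random value carried in a DAD Neighbor Solicitation.
// RFC 7527 requires at least 6 bytes.
type nonce [6]byte

// dadState remembers the nonces sent for one tentative address.
type dadState struct {
	sent map[nonce]struct{}
}

// newNonce generates and records a nonce for an outgoing DAD message.
func (d *dadState) newNonce() (nonce, error) {
	var n nonce
	if _, err := rand.Read(n[:]); err != nil {
		return n, err
	}
	d.sent[n] = struct{}{}
	return n, nil
}

// isLoopedBack reports whether n matches a nonce we sent ourselves.
func (d *dadState) isLoopedBack(n nonce) bool {
	_, ok := d.sent[n]
	return ok
}

// conflictDetected decides whether a received DAD message for the tentative
// address indicates a real duplicate. A message carrying one of our own
// nonces is our own message looped back and must not fail DAD.
func (d *dadState) conflictDetected(hasNonce bool, n nonce) bool {
	return !(hasNonce && d.isLoopedBack(n))
}

func main() {
	d := &dadState{sent: make(map[nonce]struct{})}
	n, err := d.newNonce()
	if err != nil {
		panic(err)
	}
	fmt.Println(d.conflictDetected(true, n))        // our own message
	fmt.Println(d.conflictDetected(false, nonce{})) // no nonce option present
}
```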
2021-03-16	Do not call into Stack from LinkAddressRequest	Ghanan Gowripalan
	Calling into the stack from LinkAddressRequest is not needed as we already have a reference to the network endpoint (IPv6) or network interface (IPv4/ARP). PiperOrigin-RevId: 363213973
2021-03-15	Turn sys_thread constants into variables.	Etienne Perot
	PiperOrigin-RevId: 363092268
2021-03-15	Move `MaxIovs` back to a variable in `iovec.go`.	Etienne Perot
	PiperOrigin-RevId: 363091954
2021-03-15	Deflake proc_test_native	Fabricio Voznika
	Terminating tasks from other tests can interfere with the task list of the current test. Tests were changed to look for added/removed tasks, ignoring other tasks that may exist while the test is running. PiperOrigin-RevId: 363084261
2021-03-15	Make netstack (//pkg/tcpip) buildable for 32 bit	Kevin Krakauer
	Doing so involved breaking dependencies between //pkg/tcpip and the rest of gVisor, which are discouraged anyways. Tested on the Go branch via: gvisor.dev/gvisor/pkg/tcpip/... Addresses #1446. PiperOrigin-RevId: 363081778
2021-03-15	[op] Make gofer client handle return partial write length when err is nil.	Ayush Ranjan
	If there was a partial write (when not using the host FD) which did not generate an error, we were incorrectly returning the number of bytes attempted to write instead of the number of bytes actually written. PiperOrigin-RevId: 363058989
2021-03-15	Merge pull request #5618 from iangudger:unix-transport-race	gVisor bot
	PiperOrigin-RevId: 362999220
2021-03-15	Packetimpact test for ACK to OTW Seq segments behavior in CLOSING	Zeling Feng
	TCP, in CLOSING state, MUST send an ACK with next expected SEQ number after receiving any segment with OTW SEQ number and remain in the same state. While I am here, I also changed shutdown to behave the same as other calls in posix_server. PiperOrigin-RevId: 362976955
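	The behavior under test can be sketched with wrap-around sequence arithmetic. The seqnum type, inWindow, and onSegmentInClosing are hypothetical helpers; the window check is simplified to a zero-length segment and a non-zero receive window, and in-window processing is omitted.

```go
package main

import "fmt"

// seqnum is a TCP sequence number; comparisons must tolerate wrap-around.
type seqnum uint32

// lessThan reports whether v precedes w in modulo-2^32 sequence space.
func (v seqnum) lessThan(w seqnum) bool {
	return int32(v-w) < 0
}

// inWindow reports whether seq lies in [rcvNxt, rcvNxt+rcvWnd).
func inWindow(seq, rcvNxt seqnum, rcvWnd uint32) bool {
	return !seq.lessThan(rcvNxt) && seq.lessThan(rcvNxt+seqnum(rcvWnd))
}

// onSegmentInClosing models only the out-of-window (OTW) case for an
// endpoint in CLOSING: reply with an ACK carrying the next expected
// sequence number and stay in the same state.
func onSegmentInClosing(seq, rcvNxt seqnum, rcvWnd uint32) (ack seqnum, sendACK bool) {
	if !inWindow(seq, rcvNxt, rcvWnd) {
		return rcvNxt, true
	}
	return 0, false // in-window handling is outside this sketch
}

func main() {
	ack, send := onSegmentInClosing(5000, 1000, 100)
	fmt.Println(ack, send)
	fmt.Println(inWindow(2, 0xFFFFFFFA, 10)) // window spans the wrap-around
}
```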
2021-03-14	Fix race in tcp_retransmits_test	Mithun Iyer
	The test queries for RTO via TCP_INFO and applies that to the rest of the test. The RTO is estimated by processing incoming ACK. There is a race in the test where we may query for RTO before the incoming ACK was processed. Fix the race in the test by letting the DUT complete a payload receive, thus estimating RTO before proceeding to query the RTO. Bump up the time correction to reduce flakes. PiperOrigin-RevId: 362865904