gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2020-10-23	Support getsockopt for SO_ACCEPTCONN.	Nayana Bidari
	The SO_ACCEPTCONN option is used only on getsockopt(). When this option is specified, getsockopt() indicates whether socket listening is enabled for the socket. A value of zero indicates that socket listening is disabled; non-zero that it is enabled. PiperOrigin-RevId: 338703206
2020-10-23	Decrement e.synRcvdCount once handshake is complete.	Bhasker Hariharan
	Earlier the count was dropped only after calling e.deliverAccepted. This lead to an issue where there were no connections in SYN-RCVD state for the listening endpoint but e.synRcvdCount would not be zero because it was being reduced only when handleSynSegment returned after deliverAccepted returned. This issue is seen when the Nth SYN for a listen backlog of size N which would cause the listen backlog to be full gets dropped occasionally. This happens when the new SYN comes at when the previous completed endpoint has been delivered to the accept queue but the synRcvdCount hasn't yet been decremented because the goroutine running handleSynSegment has not yet completed. PiperOrigin-RevId: 338690646
2020-10-23	Rewrite reference leak checker without finalizers.	Dean Deng
	Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. - Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
2020-10-19	Fix runsc tests on VFS2 overlay.	Jamie Liu
	- Check the sticky bit in overlay.filesystem.UnlinkAt(). Fixes StickyTest.StickyBitPermDenied. - When configuring a VFS2 overlay in runsc, copy the lower layer's root owner/group/mode to the upper layer's root (as in the VFS1 equivalent, boot.addOverlay()). This makes the overlay root owned by UID/GID 65534 with mode 0755 rather than owned by UID/GID 0 with mode 01777. Fixes CreateTest.CreateFailsOnUnpermittedDir, which assumes that the test cannot create files in /. - MknodTest.UnimplementedTypesReturnError assumes that the creation of device special files is not supported. However, while the VFS2 gofer client still doesn't support device special files, VFS2 tmpfs does, and in the overlay test dimension mknod() targets a tmpfs upper layer. The test initially has all capabilities, including CAP_MKNOD, so its creation of these files succeeds. Constrain these tests to VFS1. - Rename overlay.nonDirectoryFD to overlay.regularFileFD and only use it for regular files, using the original FD for pipes and device special files. This is more consistent with Linux (which gets the original inode_operations, and therefore file_operations, for these file types from ovl_fill_inode() => init_special_inode()) and fixes remaining mknod and pipe tests. - Read/write 1KB at a time in PipeTest.Streaming, rather than 4 bytes. This isn't strictly necessary, but it makes the test less obnoxiously slow on ptrace. Fixes #4407 PiperOrigin-RevId: 337971042
2020-10-16	Use POSIX interval timers in flock test.	Dean Deng
	ualarm(2) is obsolete. Move IntervalTimer into a test util, where it can be used by flock tests. These tests were flaky with TSAN, probably because it slowed the tests down enough that the alarm was expiring before flock() was called. Use an interval timer so that even if we miss the first alarm (or more), flock() is still guaranteed to be interrupted. PiperOrigin-RevId: 337578751
2020-10-15	sockets: ignore io.EOF from view.ReadAt	Andrei Vagin
	Reported-by: syzbot+5466463b7604c2902875@syzkaller.appspotmail.com PiperOrigin-RevId: 337451896
2020-10-09	TCP Receive window advertisement fixes.	Bhasker Hariharan
	The fix in commit 028e045da93b7c1c26417e80e4b4e388b86a713d was incorrect as it can cause the right edge of the window to shrink when we announce a zero window due to receive buffer being full as its done before the check for seeing if the window is being shrunk because of the selected window. Further the window was calculated purely on available space but in cases where we are getting full sized segments it makes more sense to use the actual bytes being held. This CL changes to use the lower of the total available space vs the available space in the maximal window we could advertise minus the actual payload bytes being held. This change also cleans up the code so that the window selection logic is not duplicated between getSendParams() and windowCrossedACKThresholdLocked. PiperOrigin-RevId: 336404827
2020-10-09	test/syscall/iptables: don't use designated initializers	Andrei Vagin
	test/syscalls/linux/iptables.cc:130:3: error: C99 designator 'name' outside aggregate initializer 130 \| }; \| PiperOrigin-RevId: 336331738
2020-10-06	Implement membarrier(2) commands other than *_SYNC_CORE.	Jamie Liu
	Updates #267 PiperOrigin-RevId: 335713923
2020-10-03	Fix kcov enabling and disabling procedures.	Dean Deng
	- When the KCOV_ENABLE_TRACE ioctl is called with the trace kind KCOV_TRACE_PC, the kcov mode should be set to KCOV_MODE_TRACE_PC. - When the owning task of kcov exits, the memory mapping should not be cleared so it can be used by other tasks. - Add more tests (also tested on native Linux kcov). PiperOrigin-RevId: 335202585
2020-09-30	ip6tables: redirect support	Kevin Krakauer
	Adds support for the IPv6-compatible redirect target. Redirection is a limited form of DNAT, where the destination is always the localhost. Updates #3549. PiperOrigin-RevId: 334698344
2020-09-29	Add /proc/[pid]/cwd	Fabricio Voznika
	PiperOrigin-RevId: 334478850
2020-09-29	iptables: refactor to make targets extendable	Kevin Krakauer
	Like matchers, targets should use a module-like register/lookup system. This replaces the brittle switch statements we had before. The only behavior change is supporing IPT_GET_REVISION_TARGET. This makes it much easier to add IPv6 redirect in the next change. Updates #3549. PiperOrigin-RevId: 334469418
2020-09-29	Migrates uses of deprecated map types to recommended types.	gVisor bot
	PiperOrigin-RevId: 334419854
2020-09-28	Fix lingering of TCP socket in the initial state.	Nayana Bidari
	When the socket is set with SO_LINGER and close()'d in the initial state, it should not linger and return immediately. PiperOrigin-RevId: 334263149
2020-09-27	Clean up kcov.	Dean Deng
	Previously, we did not check the kcov mode when performing task work. As a result, disabling kcov did not do anything. Also avoid expensive atomic RMW when consuming coverage data. We don't need the swap if the value is already zero (which is most of the time), and it is ok if there are slight inconsistencies due to a race between coverage data generation (incrementing the value) and consumption (reading a nonzero value and writing zero). PiperOrigin-RevId: 334049207
2020-09-24	test/syscall/mknod: Don't use a hard-coded file name	Andrei Vagin
	PiperOrigin-RevId: 333461380
2020-09-23	Clean up inotify tests.	Dean Deng
	Mostly simplifies SKIP_IF statements and adds some more documentation. Also, mknod is now supported by gofer fs, so remove SKIP_IFs related to this. PiperOrigin-RevId: 333449932
2020-09-21	Receive ACK when deleting address in syscall tests	Ghanan Gowripalan
	PiperOrigin-RevId: 332961666
2020-09-21	Fix socket_ipv4_udp_unbound_test_native in opensource.	Zach Koopmans
	Calls to recv sometimes fail with EAGAIN, so call select beforehand. PiperOrigin-RevId: 332943156
2020-09-21	Fix proc_net_test_native for native tests.	Zach Koopmans
	"DefaultValueEqZero" is only valid if the test is in a sandbox. Our CI VMs often have "/proc/sys/net/ipv4/ip_forward" set to 1. PiperOrigin-RevId: 332910859
2020-09-21	Add ftruncate test for writeable fd but no write permissions.	Dean Deng
	PiperOrigin-RevId: 332907453
2020-09-21	Fix flakes in UdpSocketTest	Zach Koopmans
	`recv` calls with MSG_DONTWAIT can fail with EAGAIN randomly in tests. Fix this by calling `select` on sockets with a timeout prior to attempting a `recv`. PiperOrigin-RevId: 332873735
2020-09-20	Merge pull request #3651 from ianlewis:ip-forwarding	gVisor bot
	PiperOrigin-RevId: 332760843
2020-09-18	Disable vdso_clock_gettime on KVM.	Jamie Liu
	Unfortunately, I think TSC misalignment means that we can't really expect any consistent correspondence between a TSC-based VDSO and the sentry's view of time on the KVM platform. PiperOrigin-RevId: 332576147
2020-09-18	Deflake stat_test with save/restore enabled.	Nicolas Lacasse
	PiperOrigin-RevId: 332546659
2020-09-18	Implement fsimpl/overlay.filesystem.RenameAt.	Jamie Liu
	Updates #1199 PiperOrigin-RevId: 332539197
2020-09-18	Use a tmpfs file for shared anonymous and /dev/zero mmap on VFS2.	Jamie Liu
	This is more consistent with Linux (see comment on MM.NewSharedAnonMappable()). We don't do the same thing on VFS1 for reasons documented by the updated comment. PiperOrigin-RevId: 332514849
2020-09-18	Remove SKIP_IF for now-supported features.	Kevin Krakauer
	Updates #3549. PiperOrigin-RevId: 332501660
2020-09-17	{Set,Get} SO_LINGER on all endpoints.	Nayana Bidari
	SO_LINGER is a socket level option and should be stored on all endpoints even though it is used to linger only for TCP endpoints. PiperOrigin-RevId: 332369252
2020-09-17	Complete vfs2 implementation of fallocate.	Dean Deng
	This change includes overlay, special regular gofer files, and hostfs. Fixes #3589. PiperOrigin-RevId: 332330860
2020-09-17	Deflake vdso_clock_gettime test.	Jamie Liu
	PiperOrigin-RevId: 332281930
2020-09-16	Automated rollback of changelist 329526153	Nayana Bidari
	PiperOrigin-RevId: 332097286
2020-09-16	Receive broadcast packets on interested endpoints	Ghanan Gowripalan
	When a broadcast packet is received by the stack, the packet should be delivered to each endpoint that may be interested in the packet. This includes all any address and specified broadcast address listeners. Test: integration_test.TestReuseAddrAndBroadcast PiperOrigin-RevId: 332060652
2020-09-15	Enable automated marshalling for the syscall package.	Rahat Mahmood
	PiperOrigin-RevId: 331940975
2020-09-15	Support setting STATX_SIZE for kernfs.InodeAttrs.	Dean Deng
	Make setting STATX_SIZE a no-op, if it is valid for the given permissions and file type. Also update proc tests, which were overfitted before. Fixes #3842. Updates #1193. PiperOrigin-RevId: 331861087
2020-09-15	Merge pull request #3895 from btw616:fix/issue-3894	gVisor bot
	PiperOrigin-RevId: 331824411
2020-09-15	Fix proc.(*fdDir).IterDirents for VFS2	Tiwei Bie
	Currently the returned offset is an index, and we can't use it to find the next fd to serialize, because getdents should iterate correctly despite mutation of fds. Instead, we can return the next fd to serialize plus 2 (which accounts for "." and "..") as the offset. Fixes: #3894 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-11	Check that we have access to the trusted.* xattr namespace directly.	Nicolas Lacasse
	These operations require CAP_SYS_ADMIN in the root user namespace. There's no easy way to check that other than trying the operation and seeing what happens. PiperOrigin-RevId: 331242256
2020-09-10	[vfs] Disable inode number equality check for overlayfs.	Ayush Ranjan
	Overlayfs does not persist a directory's inode number even while it is mounted. See fs/overlayfs/inode.c:ovl_map_dev_ino(). VFS2 generates a new inode number for directories everytime in lookup. PiperOrigin-RevId: 331045037
2020-09-10	[vfs] Disable nlink tests for overlayfs.	Ayush Ranjan
	Overlayfs intentionally does not compute nlink for directories (because it can be really expensive). Linux returns 1, VFS2 returns 2 and VFS1 actually calculates the correct value. PiperOrigin-RevId: 330967139
2020-09-08	[vfs] overlayfs: Fix socket tests.	Ayush Ranjan
	- BindSocketThenOpen test was expecting the incorrect error when opening a socket. Fixed that. - VirtualFilesystem.BindEndpointAt should not require pop.Path.Begin.Ok() because the filesystem implementations do not need to walk to the parent dentry. This check also exists for MknodAt, MkdirAt, RmdirAt, SymlinkAt and UnlinkAt but those filesystem implementations also need to walk to the parent denty. So that check is valid. Added some syscall tests to test this. PiperOrigin-RevId: 330625220
2020-09-08	[vfs] Capitalize x in the {Get/Set/Remove/List}xattr functions.	Ayush Ranjan
	PiperOrigin-RevId: 330554450
2020-09-02	Fix Accept to not return error for sockets in accept queue.	Bhasker Hariharan
	Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-09-01	Fix statfs test for opensource.	Zach Koopmans
	PiperOrigin-RevId: 329638946
2020-09-01	Test opening file handles with different permissions.	Dean Deng
	These were problematic for vfs2 gofers before correctly implementing separate read/write handles. PiperOrigin-RevId: 329613261
2020-09-01	Refactor tty codebase to use master-replica terminology.	Ayush Ranjan
	Updates #2972 PiperOrigin-RevId: 329584905
2020-09-01	Automated rollback of changelist 328350576	Nayana Bidari
	PiperOrigin-RevId: 329526153
2020-08-31	Remove __fuchsia__ defines	Tamir Duberstein
	These mostly guard linux-only headers; check for linux instead. PiperOrigin-RevId: 329362762
2020-08-28	Don't bind loopback to all IPs in an IPv6 subnet	Ghanan Gowripalan
	An earlier change considered the loopback bound to all addresses in an assigned subnet. This should have only be done for IPv4 to maintain compatability with Linux: ``` $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 0 received, 100% packet loss, time 3062ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes ^C --- 2001:db8::2 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2030ms $ sudo ip addr add 2001:db8::1/64 dev lo $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes 64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=0.055 ms 64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.074 ms 64 bytes from 2001:db8::1: icmp_seq=3 ttl=64 time=0.073 ms 64 bytes from 2001:db8::1: icmp_seq=4 ttl=64 time=0.071 ms ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3075ms rtt min/avg/max/mdev = 0.055/0.068/0.074/0.007 ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes From 2001:db8::1 icmp_seq=1 Destination unreachable: No route From 2001:db8::1 icmp_seq=2 Destination unreachable: No route From 2001:db8::1 icmp_seq=3 Destination unreachable: No route From 2001:db8::1 icmp_seq=4 Destination unreachable: No route ^C --- 2001:db8::2 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3070ms ``` Test: integration_test.TestLoopbackAcceptAllInSubnet PiperOrigin-RevId: 329011566