gvisor - Container Runtime Sandbox

Age	Commit message	Author
2019-11-07	Add support for TIME_WAIT timeout.	Bhasker Hariharan
	This change adds explicit support for honoring the 2MSL timeout for sockets in TIME_WAIT state. It also adds support for the TCP_LINGER2 option, which allows modification of the FIN_WAIT2 state timeout duration for a given socket. It also adds an option to modify the Stack-wide TIME_WAIT timeout, but this is only for testing; on Linux this is fixed at 60s.
	Further, we now correctly process RSTs in CLOSE_WAIT and close the socket, as Linux does, without moving it to the error state. We also now handle a SYN in ESTABLISHED state as per RFC5961#section-4.1. Earlier we would just drop these SYNs, which could cause some tests that pass on Linux to fail on gVisor.
	Netstack now honors TIME_WAIT correctly and handles the following cases correctly:
	- TCP RSTs in TIME_WAIT are ignored.
	- A duplicate TCP FIN during TIME_WAIT extends the TIME_WAIT, and a dup ACK is sent in response to the FIN, as the dup FIN indicates potential loss of the original final ACK.
	- An out-of-order segment during TIME_WAIT generates a dup ACK.
	- A new SYN with a sequence number > the highest sequence number in the previous connection closes the TIME_WAIT early and opens a new connection.
	Further, to make the SYN case work correctly, the ISN (Initial Sequence Number) generation for Netstack has been updated to follow the RFC. It's not a pure random number anymore and follows the recommendation in https://tools.ietf.org/html/rfc6528#page-3. The current hash used is not a cryptographically secure hash function. A separate change will update the hash function to SipHash, similar to what is used in Linux.
	PiperOrigin-RevId: 279106406
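RFC 6528, referenced in the commit above, computes the initial sequence number as ISN = M + F(localip, localport, remoteip, remoteport, secretkey), where M is a timer that advances every 4 microseconds. The following is a minimal Python sketch of that formula only. The name `generate_isn`, the explicit `now_us` parameter, and the keyed BLAKE2s hash are illustrative assumptions; this is not Netstack's code, which according to the commit message used a non-cryptographic hash at this point.

```python
import hashlib

ISN_MASK = 0xFFFFFFFF  # TCP sequence numbers are 32 bits wide


def generate_isn(local_ip, local_port, remote_ip, remote_port, secret, now_us):
    """Return an RFC 6528 style ISN: M + F(4-tuple, secret), modulo 2**32."""
    # M: a timer that advances once every 4 microseconds.
    m = (now_us // 4) & ISN_MASK
    # F: keyed hash of the connection 4-tuple (hash choice is an assumption).
    four_tuple = f"{local_ip}:{local_port}-{remote_ip}:{remote_port}".encode()
    digest = hashlib.blake2s(four_tuple, key=secret, digest_size=4).digest()
    f = int.from_bytes(digest, "big")
    return (m + f) & ISN_MASK
```

For a fixed 4-tuple and secret, the value grows with time (modulo wraparound), which is the property the TIME_WAIT SYN case above relies on: a new SYN can carry a sequence number higher than anything seen on the previous connection.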
2019-11-06	Fix yet another data race.	Bhasker Hariharan
	Fixes #1140 PiperOrigin-RevId: 279020846
2019-11-06	Fix data race in syscall_test_runner.go	Bhasker Hariharan
	Fixes #1140 PiperOrigin-RevId: 279012793
2019-11-06	Use PacketBuffers, rather than VectorisedViews, in netstack.	Kevin Krakauer
	PacketBuffers are analogous to Linux's sk_buff. They hold all information about a packet, headers, and payload. This is important for:
	* iptables to access various headers of packets
	* Preventing the clutter of passing different net and link headers along with VectorisedViews to packet handling functions.
	This change only affects the incoming packet path, and a future change will change the outgoing path.
	Benchmark               Regular       PacketBufferPtr  PacketBufferConcrete
	--------------------------------------------------------------------------------
	BM_Recvmsg              400.715MB/s   373.676MB/s      396.276MB/s
	BM_Sendmsg              361.832MB/s   333.003MB/s      335.571MB/s
	BM_Recvfrom             453.336MB/s   393.321MB/s      381.650MB/s
	BM_Sendto               378.052MB/s   372.134MB/s      341.342MB/s
	BM_SendmsgTCP/0/1k      353.711MB/s   316.216MB/s      322.747MB/s
	BM_SendmsgTCP/0/2k      600.681MB/s   588.776MB/s      565.050MB/s
	BM_SendmsgTCP/0/4k      995.301MB/s   888.808MB/s      941.888MB/s
	BM_SendmsgTCP/0/8k      1.517GB/s     1.274GB/s        1.345GB/s
	BM_SendmsgTCP/0/16k     1.872GB/s     1.586GB/s        1.698GB/s
	BM_SendmsgTCP/0/32k     1.017GB/s     1.020GB/s        1.133GB/s
	BM_SendmsgTCP/0/64k     475.626MB/s   584.587MB/s      627.027MB/s
	BM_SendmsgTCP/0/128k    416.371MB/s   503.434MB/s      409.850MB/s
	BM_SendmsgTCP/0/256k    323.449MB/s   449.599MB/s      388.852MB/s
	BM_SendmsgTCP/0/512k    243.992MB/s   267.676MB/s      314.474MB/s
	BM_SendmsgTCP/0/1M      95.138MB/s    95.874MB/s       95.417MB/s
	BM_SendmsgTCP/0/2M      96.261MB/s    94.977MB/s       96.005MB/s
	BM_SendmsgTCP/0/4M      96.512MB/s    95.978MB/s       95.370MB/s
	BM_SendmsgTCP/0/8M      95.603MB/s    95.541MB/s       94.935MB/s
	BM_SendmsgTCP/0/16M     94.598MB/s    94.696MB/s       94.521MB/s
	BM_SendmsgTCP/0/32M     94.006MB/s    94.671MB/s       94.768MB/s
	BM_SendmsgTCP/0/64M     94.133MB/s    94.333MB/s       94.746MB/s
	BM_SendmsgTCP/0/128M    93.615MB/s    93.497MB/s       93.573MB/s
	BM_SendmsgTCP/0/256M    93.241MB/s    95.100MB/s       93.272MB/s
	BM_SendmsgTCP/1/1k      303.644MB/s   316.074MB/s      308.430MB/s
	BM_SendmsgTCP/1/2k      537.093MB/s   584.962MB/s      529.020MB/s
	BM_SendmsgTCP/1/4k      882.362MB/s   939.087MB/s      892.285MB/s
	BM_SendmsgTCP/1/8k      1.272GB/s     1.394GB/s        1.296GB/s
	BM_SendmsgTCP/1/16k     1.802GB/s     2.019GB/s        1.830GB/s
	BM_SendmsgTCP/1/32k     2.084GB/s     2.173GB/s        2.156GB/s
	BM_SendmsgTCP/1/64k     2.515GB/s     2.463GB/s        2.473GB/s
	BM_SendmsgTCP/1/128k    2.811GB/s     3.004GB/s        2.946GB/s
	BM_SendmsgTCP/1/256k    3.008GB/s     3.159GB/s        3.171GB/s
	BM_SendmsgTCP/1/512k    2.980GB/s     3.150GB/s        3.126GB/s
	BM_SendmsgTCP/1/1M      2.165GB/s     2.233GB/s        2.163GB/s
	BM_SendmsgTCP/1/2M      2.370GB/s     2.219GB/s        2.453GB/s
	BM_SendmsgTCP/1/4M      2.005GB/s     2.091GB/s        2.214GB/s
	BM_SendmsgTCP/1/8M      2.111GB/s     2.013GB/s        2.109GB/s
	BM_SendmsgTCP/1/16M     1.902GB/s     1.868GB/s        1.897GB/s
	BM_SendmsgTCP/1/32M     1.655GB/s     1.665GB/s        1.635GB/s
	BM_SendmsgTCP/1/64M     1.575GB/s     1.547GB/s        1.575GB/s
	BM_SendmsgTCP/1/128M    1.524GB/s     1.584GB/s        1.580GB/s
	BM_SendmsgTCP/1/256M    1.579GB/s     1.607GB/s        1.593GB/s
	PiperOrigin-RevId: 278940079
2019-11-04	kokoro: run KVM syscall tests	Andrei Vagin
	We don't know how stable they are, so let's start with warning. PiperOrigin-RevId: 278484186
2019-11-04	Add NETLINK_KOBJECT_UEVENT socket support	Michael Pratt
	NETLINK_KOBJECT_UEVENT sockets send udev-style messages for device events. gVisor doesn't have any device events, so our sockets don't need to do anything once created. systemd's device manager needs to be able to create one of these sockets. It also wants to install a BPF filter on the socket. Since we'll never send any messages, the filter would never be invoked, thus we just fake it out. Fixes #1117 Updates #1119 PiperOrigin-RevId: 278405893
2019-11-01	Add SO_PASSCRED support to netlink sockets	Michael Pratt
	Since we only support sending messages from the kernel, the peer is always the kernel, simplifying handling. There are currently no known users of SO_PASSCRED that would actually receive messages from gVisor, but adding full support is barely more work than stubbing out fake support. Updates #1117 Fixes #1119 PiperOrigin-RevId: 277981465
2019-11-01	tests: don't use ASSERT_THAT after fork	Andrei Vagin
	PiperOrigin-RevId: 277965624
2019-10-30	Store endpoints inside multiPortEndpoint in a sorted order	Andrei Vagin
	It is required to guarantee the same order of endpoints after save/restore. PiperOrigin-RevId: 277598665
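One way to get an iteration order that survives save/restore is to keep the collection sorted by a stable key at insertion time, so the order never depends on when each endpoint was added. The sketch below illustrates that idea only; the class `SortedEndpoints` and the notion of a caller-supplied `sort_key` are hypothetical and do not mirror the multiPortEndpoint code.

```python
import bisect


class SortedEndpoints:
    """Keeps endpoints ordered by a stable key rather than by insertion time."""

    def __init__(self):
        self._items = []  # list of (sort_key, endpoint), always sorted by key

    def add(self, sort_key, endpoint):
        # Find the insertion point among the existing keys and insert there.
        keys = [key for key, _ in self._items]
        index = bisect.bisect_right(keys, sort_key)
        self._items.insert(index, (sort_key, endpoint))

    def endpoints(self):
        return [endpoint for _, endpoint in self._items]
```

Two instances populated in different orders expose the same sequence, which is the property needed when state is rebuilt after a restore.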
2019-10-30	Clean up typos in test names.	Dean Deng
	PiperOrigin-RevId: 277572791
2019-10-29	Update symlink traversal limit when resolving interpreter path.	Dean Deng
	When execveat is called on an interpreter script, the symlink count for resolving the script path should be separate from the count for resolving the corresponding interpreter. An ELOOP error should not occur if we do not hit the symlink limit along any individual path, even if the total number of symlinks encountered exceeds the limit. Closes #574 PiperOrigin-RevId: 277358474
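The rule above can be illustrated with a toy resolver in which each path resolution keeps its own symlink counter. This is a sketch under simplifying assumptions: the filesystem is a plain dict mapping a symlink to its target, and the helper names `resolve` and `resolve_script_and_interpreter` are hypothetical. The default limit of 40 matches Linux's MAXSYMLINKS.

```python
import errno

MAX_SYMLINKS = 40  # Linux's limit on symlinks followed in one path resolution


def resolve(path, symlinks, limit=MAX_SYMLINKS):
    """Follow symlinks from path; raise ELOOP if more than limit are traversed."""
    followed = 0
    while path in symlinks:
        followed += 1
        if followed > limit:
            raise OSError(errno.ELOOP, "too many levels of symbolic links")
        path = symlinks[path]
    return path


def resolve_script_and_interpreter(script, interpreter, symlinks, limit=MAX_SYMLINKS):
    # Each resolve() call starts its own count, so the script path and the
    # interpreter path do not share a single symlink budget.
    return resolve(script, symlinks, limit), resolve(interpreter, symlinks, limit)
```

With a limit of 4, a script behind 3 symlinks and an interpreter behind 3 more both resolve, even though 6 symlinks are followed in total.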
2019-10-29	Fix PollWithFullBufferBlocks.	Bhasker Hariharan
	Set the snd/rcv buffer sizes so that the test is deterministic and runs in a reasonable amount of time. It also ensures that we disable any auto-tuning of the send/receive buffer which may happen. PiperOrigin-RevId: 277337232
2019-10-29	Disallow execveat on interpreter scripts with fd opened with O_CLOEXEC.	Dean Deng
	When an interpreter script is opened with O_CLOEXEC and the resulting fd is passed into execveat, an ENOENT error should occur (the script would otherwise be inaccessible to the interpreter). This matches the actual behavior of Linux's execveat. PiperOrigin-RevId: 277306680
2019-10-25	test/syscall: Remove duplicated gtest/gtest.h.	Haibo
	Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I05a7ec69b98b88931ba4a8adb3e8a7b822006001 COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/1023 from xiaobo55x:syscall_test d44a8b1f827ed4081997af96cd58ba7449e0a9e1 PiperOrigin-RevId: 276740442
2019-10-24	Handle AT_SYMLINK_NOFOLLOW flag for execveat.	Dean Deng
	PiperOrigin-RevId: 276441249
2019-10-23	Handle AT_EMPTY_PATH flag in execveat.	Dean Deng
	PiperOrigin-RevId: 276419967
2019-10-23	Add check for proper settings to AF_PACKET tests.	Kevin Krakauer
	As in packet_socket_raw.cc, we should check that certain proc files are set correctly. PiperOrigin-RevId: 276384534
2019-10-23	Merge pull request #641 from tanjianfeng:master	gVisor bot
	PiperOrigin-RevId: 276380008
2019-10-23	Remove comparison between signed and unsigned int	Michael Pratt
	Some compilers don't like the comparison between int and size_t. Remove it. The other changes are minor style cleanups. PiperOrigin-RevId: 276333450
2019-10-21	Add basic implementation of execveat syscall and associated tests.	Dean Deng
	Allow file descriptors of directories as well as AT_FDCWD. PiperOrigin-RevId: 275929668
2019-10-21	AF_PACKET support for netstack (aka epsocket).	Kevin Krakauer
	Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With runsc, you'll need to pass `--net-raw=true` to enable them. Binding isn't supported yet. PiperOrigin-RevId: 275909366
2019-10-18	Cleanup host UDS support	Michael Pratt
	This change fixes several issues with the fsgofer host UDS support. Notably, it adds support for SOCK_SEQPACKET and SOCK_DGRAM sockets [1]. It also fixes unsafe use of unet.Socket, which could cause a panic if Socket.FD is called when err != nil, and calls to Socket.FD with nothing to prevent the garbage collector from destroying and closing the socket. A set of tests is added to exercise host UDS access. This required extracting most of the syscall test runner into a library that can be used by custom tests. Updates #235 Updates #1003 [1] N.B. SOCK_DGRAM sockets are likely not particularly useful, as a server can only reply to a client that binds first. We don't allow bind, so these are unlikely to be used. PiperOrigin-RevId: 275558502
2019-10-18	test: use a bigger buffer to fill a socket	Andrei Vagin
	Otherwise we need to do a lot of system calls and cooperative_save tests run slowly. PiperOrigin-RevId: 275536957
2019-10-16	Merge pull request #736 from tanjianfeng:fix-unix	gVisor bot
	PiperOrigin-RevId: 275114157
2019-10-16	Remove death from exec test names	Michael Pratt
	These aren't actually death tests in the GUnit sense. i.e., they don't call EXPECT_EXIT or EXPECT_DEATH. PiperOrigin-RevId: 275099957
2019-10-15	epsocket: support /proc/net/snmp	Jianfeng Tan
	Netstack has its own stats; we use these to fill /proc/net/snmp. Note that some metrics are not recorded in Netstack, which will be shown as 0 in the proc file. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: Ie0089184507d16f49bc0057b4b0482094417ebe1
2019-10-15	support /proc/net/snmp	Jianfeng Tan
	This proc file contains statistics according to [1]. [1] https://tools.ietf.org/html/rfc2013 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: I9662132085edd8a7783d356ce4237d7ac0800d94
2019-10-14	Internal change.	gVisor bot
	PiperOrigin-RevId: 274700093
2019-10-10	Allow for zero byte iovec with MSG_PEEK | MSG_TRUNC in recvmsg.	Ian Lewis
	This allows for peeking at the length of the next message on a netlink socket without pulling it off the socket's buffer/queue, allowing tools like 'ip' to work. This CL also fixes an issue where dump_done_errno was not included in the NLMSG_DONE message's payload. Issue #769 PiperOrigin-RevId: 274068637
2019-10-10	Fix signalfd polling.	Adin Scannell
	The signalfd descriptors otherwise always show as available. This can lead programs to spin, assuming they are looking to see what signals are pending. Updates #139 PiperOrigin-RevId: 274017890
2019-10-07	Implement IP_TTL.	Ian Gudger
	Also change the default TTL to 64 to match Linux. PiperOrigin-RevId: 273430341
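From an application's point of view, IP_TTL is read and written with getsockopt/setsockopt at the IPPROTO_IP level. The snippet below is a small usage sketch with Python's standard socket module; the helper name `ttl_before_and_after` is hypothetical, and the default value it reports depends on the host (64 is the usual Linux default, per the commit message above).

```python
import socket


def ttl_before_and_after(new_ttl):
    """Return (default TTL, TTL after setting new_ttl) for a fresh UDP socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        default_ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, new_ttl)
        updated_ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TTL)
    return default_ttl, updated_ttl


default_ttl, updated_ttl = ttl_before_and_after(10)
print("default:", default_ttl, "after setsockopt:", updated_ttl)
```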
2019-10-03	Implement proper local broadcast behavior	Chris Kuiper
	The behavior for sending and receiving local broadcast (255.255.255.255) traffic is as follows:
	Outgoing
	--------
	* A broadcast packet sent on a socket that is bound to an interface goes out that interface
	* A broadcast packet sent on an unbound socket follows the route table to select the outgoing interface
	  + if an explicit route entry exists for 255.255.255.255/32, use that one
	  + else use the default route
	* Broadcast packets are looped back and delivered following the rules for incoming packets (see next). This is the same behavior as for multicast packets, except that it cannot be disabled via sockopt.
	Incoming
	--------
	* Sockets wishing to receive broadcast packets must bind to either INADDR_ANY (0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives broadcast packets.
	* Broadcast packets are multiplexed to all sockets matching it. This is the same behavior as for multicast packets.
	* A socket can bind to 255.255.255.255:<port> and then receive its own broadcast packets sent to 255.255.255.255:<port>
	In addition, this change implicitly fixes an issue with multicast reception. If two sockets want to receive a given multicast stream and one is bound to ANY while the other is bound to the multicast address, only one of them will receive the traffic.
	PiperOrigin-RevId: 272792377
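The outgoing and incoming rules listed in the commit above can be restated as two small decision functions. This is a sketch of the described behavior, not Netstack code: the function names, the route table as a dict keyed by prefix (with a `"default"` entry), and sockets as `(name, bound_addr, bound_port)` tuples are all hypothetical simplifications.

```python
INADDR_ANY = "0.0.0.0"
INADDR_BROADCAST = "255.255.255.255"


def outgoing_interface(bound_interface, routes):
    """Pick the interface a local broadcast packet leaves on.

    routes maps a destination prefix to an interface name; the key
    "default" stands for the default route.
    """
    if bound_interface is not None:
        return bound_interface  # a bound socket always uses its own interface
    if "255.255.255.255/32" in routes:
        return routes["255.255.255.255/32"]  # explicit broadcast route wins
    return routes["default"]


def broadcast_receivers(bound_sockets, dst_port):
    """Return the names of all sockets that get a copy of the packet."""
    return [
        name
        for name, addr, port in bound_sockets
        if port == dst_port and addr in (INADDR_ANY, INADDR_BROADCAST)
    ]
```

Every matching socket is returned, reflecting that broadcast packets are multiplexed to all matching sockets rather than delivered to just one.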
2019-10-03	Don't report partialResult errors from sendfile	Andrei Vagin
	The input file descriptor is always a regular file, so sendfile can't lose any data if it is unable to write it to the output file descriptor. Reported-by: syzbot+22d22330a35fa1c02155@syzkaller.appspotmail.com PiperOrigin-RevId: 272730357
2019-10-02	Increase itimer test timeout	Michael Pratt
	https://github.com/google/gvisor/commit/dd69b49ed1103bab82a6b2ac95221b89b46f3376 makes this test take longer. PiperOrigin-RevId: 272535892
2019-10-02	Merge pull request #865 from tanjianfeng:fix-829	gVisor bot
	PiperOrigin-RevId: 272522508
2019-10-02	Sanity test that open(2) on a UDS fails	Michael Pratt
	Spoiler alert: it doesn't. PiperOrigin-RevId: 272513529
2019-10-01	Include AT_SECURE in the aux vector	Michael Pratt
	gVisor does not currently implement the functionality that would result in AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we should do the same. PiperOrigin-RevId: 272311488
2019-10-01	Support new interpreter requirements in test	Michael Pratt
	Refactoring in 0036d1f7eb95bcc52977f15507f00dd07018e7e2 (v4.10) caused Linux to start unconditionally zeroing the remainder of the last page in the interpreter. Previously it did not do so if filesz == memsz, and it still does not do so when filesz == memsz for loading binaries, only for the interpreter. This inconsistency is not worth replicating in gVisor, as it is arguably a bug, but our tests must ensure we create interpreter ELFs compatible with this new requirement. PiperOrigin-RevId: 272266401
2019-10-01	Disable cpuClockTicker when app is idle	Michael Pratt
	Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to track their CPU usage. This improves latency in the syscall path by avoiding expensive monotonic clock calls on every syscall entry/exit. However, this timer fires every 10ms. Thus, when all tasks are idle (i.e., blocked or stopped), this forces a sentry wakeup every 10ms, when we may otherwise be able to sleep until the next app-relevant event. These wakeups cause the sentry to utilize approximately 2% CPU when the application is otherwise idle. Updates to the clock are not strictly necessary when the app is idle, as there are no readers of cpuClock. This commit reduces idle CPU by disabling the timer when tasks are completely idle, and computing its effects at the next wakeup. Rather than disabling the timer as soon as the app goes idle, we wait until the next tick, which provides a window for short sleeps to sleep and wake up without doing the (relatively) expensive work of disabling and enabling the timer. PiperOrigin-RevId: 272265822
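The idle optimization described above can be modeled with a toy ticker that stops itself at a tick boundary when the app is idle and credits the missed ticks on the next wakeup. This is a simplified simulation, not the sentry's implementation: the class `CpuClockTicker`, its method names, and the millisecond timestamps passed in by the caller are hypothetical, and "computing its effects" is assumed here to mean adding the ticks that would have fired while the timer was off.

```python
TICK_MS = 10  # the timer period stated in the commit message


class CpuClockTicker:
    def __init__(self):
        self.cpu_clock = 0     # number of ticks accounted for so far
        self.enabled = True    # whether the periodic timer is armed
        self.last_tick_ms = 0  # time of the last tick accounted for

    def on_tick(self, now_ms, app_idle):
        """Called by the periodic timer while it is armed."""
        self.cpu_clock += 1
        self.last_tick_ms = now_ms
        if app_idle:
            # Disable only at a tick boundary, so short sleeps never pay
            # the cost of stopping and restarting the timer.
            self.enabled = False

    def on_wakeup(self, now_ms):
        """Called when a task becomes runnable again."""
        if not self.enabled:
            # Credit the ticks that would have fired while the timer was off.
            missed = (now_ms - self.last_tick_ms) // TICK_MS
            self.cpu_clock += missed
            self.last_tick_ms += missed * TICK_MS
            self.enabled = True
```

A sleep that ends before the next tick never reaches the disable branch, which is the window for short sleeps that the commit message mentions.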
2019-10-01	Honor X bit on extra anon pages in PT_LOAD segments	Michael Pratt
	Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58 (v4.11). Previously, extra pages were always mapped RW. Now, those pages will be executable if the segment specified PF_X. They still must be writeable. PiperOrigin-RevId: 272256280
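The post-v4.11 rule can be written as a one-line mapping from ELF segment flags to the protection of the extra anonymous pages. This is a sketch of the rule as stated, reading "they still must be writeable" as the extra pages always being mapped read-write; the function name `extra_anon_page_prot` is hypothetical, while the flag values are the standard ELF and Linux mmap constants.

```python
# ELF program header flags.
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4
# Linux mmap protection bits.
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4


def extra_anon_page_prot(segment_flags):
    """Protection for anonymous pages beyond the file-backed part of a PT_LOAD."""
    prot = PROT_READ | PROT_WRITE  # extra pages are always readable and writable
    if segment_flags & PF_X:
        prot |= PROT_EXEC          # executable only if the segment asked for it
    return prot
```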
2019-09-30	De-flake SetForegroundProcessGroupDifferentSession.	Kevin Krakauer
	PiperOrigin-RevId: 272059043
2019-09-30	Only copy out remaining time on nanosleep success	Michael Pratt
	It looks like the old code attempted to do this, but didn't realize that err != nil even in the happy case. PiperOrigin-RevId: 272005887
2019-09-27	Automated rollback of changelist 256276198	Adin Scannell
	PiperOrigin-RevId: 271665517
2019-09-27	Merge pull request #864 from tanjianfeng:fix-861	gVisor bot
	PiperOrigin-RevId: 271649711
2019-09-27	Implement SO_BINDTODEVICE sockopt	gVisor bot
	PiperOrigin-RevId: 271644926
2019-09-26	Make raw socket tests pass in environments with or without CAP_NET_RAW.	Kevin Krakauer
	PiperOrigin-RevId: 271442321
2019-09-24	test: don't use designated initializers	Andrei Vagin
	This change fixes compile errors: pty.cc:1460:7: error: expected primary-expression before '.' token ... PiperOrigin-RevId: 271033729
2019-09-24	Stub out readahead implementation.	Adin Scannell
	Closes #261 PiperOrigin-RevId: 270973347
2019-09-23	Fix bug in RstCausesPollHUP.	Bhasker Hariharan
	The test is checking the wrong poll_fd for POLLHUP. The only reason it passed till now was because it was also checking for POLLIN which was always true on the other fd from the previous poll! PiperOrigin-RevId: 270780401
2019-09-20	fix set hostname	Jianfeng Tan
	Previously, when we set hostname:
	  $ strace hostname abc
	  ...
	  sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long)
	  ...
	According to man 2 sethostname: "The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)"
	We wrongly used CopyStringIn(), which checks for a terminating zero byte, in the implementation of the sethostname syscall. To fix this, we use CopyInBytes() instead.
	Fixes: #861
	Reported-by: chenglang.hy <chenglang.hy@antfin.com>
	Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
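The difference between the two copy strategies can be shown with Python stand-ins operating on a bytes object that plays the role of user memory. These helpers, `copy_string_in` and `copy_in_bytes`, are hypothetical analogs written for illustration; they are not the sentry's CopyStringIn() or CopyInBytes() and do not share their signatures.

```python
import errno


def copy_string_in(memory, addr, max_len):
    """Read a NUL-terminated string; fail if no NUL appears within max_len bytes."""
    chunk = memory[addr:addr + max_len]
    end = chunk.find(b"\x00")
    if end < 0:
        raise OSError(errno.ENAMETOOLONG, "no terminating NUL found")
    return chunk[:end]


def copy_in_bytes(memory, addr, length):
    """Read exactly length bytes; no terminator is expected or required."""
    return memory[addr:addr + length]
```

With user memory holding `abc` followed by unrelated bytes and `len` = 3, the string-based copy fails because there is no NUL within the first 3 bytes, while the length-based copy returns `abc`, matching the man page's statement that the name needs no terminating null byte.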