gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2019-10-16	Merge pull request #736 from tanjianfeng:fix-unix	gVisor bot
	PiperOrigin-RevId: 275114157
2019-10-16	Remove death from exec test names	Michael Pratt
	These aren't actually death tests in the GUnit sense. i.e., they don't call EXPECT_EXIT or EXPECT_DEATH. PiperOrigin-RevId: 275099957
2019-10-14	Internal change.	gVisor bot
	PiperOrigin-RevId: 274700093
2019-10-10	Allow for zero byte iovec with MSG_PEEK \| MSG_TRUNC in recvmsg.	Ian Lewis
	This allows for peeking at the length of the next message on a netlink socket without pulling it off the socket's buffer/queue, allowing tools like 'ip' to work. This CL also fixes an issue where dump_done_errno was not included in the NLMSG_DONE messages payload. Issue #769 PiperOrigin-RevId: 274068637
2019-10-10	Fix signalfd polling.	Adin Scannell
	The signalfd descriptors otherwise always show as available. This can lead programs to spin, assuming they are looking to see what signals are pending. Updates #139 PiperOrigin-RevId: 274017890
2019-10-07	Implement IP_TTL.	Ian Gudger
	Also change the default TTL to 64 to match Linux. PiperOrigin-RevId: 273430341
2019-10-03	Implement proper local broadcast behavior	Chris Kuiper
	The behavior for sending and receiving local broadcast (255.255.255.255) traffic is as follows: Outgoing -------- * A broadcast packet sent on a socket that is bound to an interface goes out that interface * A broadcast packet sent on an unbound socket follows the route table to select the outgoing interface + if an explicit route entry exists for 255.255.255.255/32, use that one + else use the default route * Broadcast packets are looped back and delivered following the rules for incoming packets (see next). This is the same behavior as for multicast packets, except that it cannot be disabled via sockopt. Incoming -------- * Sockets wishing to receive broadcast packets must bind to either INADDR_ANY (0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives broadcast packets. * Broadcast packets are multiplexed to all sockets matching it. This is the same behavior as for multicast packets. * A socket can bind to 255.255.255.255:<port> and then receive its own broadcast packets sent to 255.255.255.255:<port> In addition, this change implicitly fixes an issue with multicast reception. If two sockets want to receive a given multicast stream and one is bound to ANY while the other is bound to the multicast address, only one of them will receive the traffic. PiperOrigin-RevId: 272792377
2019-10-03	Don't report partialResult errors from sendfile	Andrei Vagin
	The input file descriptor is always a regular file, so sendfile can't lose any data if it will not be able to write them to the output file descriptor. Reported-by: syzbot+22d22330a35fa1c02155@syzkaller.appspotmail.com PiperOrigin-RevId: 272730357
2019-10-02	Merge pull request #865 from tanjianfeng:fix-829	gVisor bot
	PiperOrigin-RevId: 272522508
2019-10-02	Sanity test that open(2) on a UDS fails	Michael Pratt
	Spoiler alert: it doesn't. PiperOrigin-RevId: 272513529
2019-10-01	Include AT_SECURE in the aux vector	Michael Pratt
	gVisor does not currently implement the functionality that would result in AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we should do the same. PiperOrigin-RevId: 272311488
2019-10-01	Support new interpreter requirements in test	Michael Pratt
	Refactoring in 0036d1f7eb95bcc52977f15507f00dd07018e7e2 (v4.10) caused Linux to start unconditionally zeroing the remainder of the last page in the interpreter. Previously it did not due so if filesz == memsz, and still does not do so when filesz == memsz for loading binaries, only interpreter. This inconsistency is not worth replicating in gVisor, as it is arguably a bug, but our tests must ensure we create interpreter ELFs compatible with this new requirement. PiperOrigin-RevId: 272266401
2019-10-01	Disable cpuClockTicker when app is idle	Michael Pratt
	Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to track their CPU usage. This improves latency in the syscall path by avoid expensive monotonic clock calls on every syscall entry/exit. However, this timer fires every 10ms. Thus, when all tasks are idle (i.e., blocked or stopped), this forces a sentry wakeup every 10ms, when we may otherwise be able to sleep until the next app-relevant event. These wakeups cause the sentry to utilize approximately 2% CPU when the application is otherwise idle. Updates to clock are not strictly necessary when the app is idle, as there are no readers of cpuClock. This commit reduces idle CPU by disabling the timer when tasks are completely idle, and computing its effects at the next wakeup. Rather than disabling the timer as soon as the app goes idle, we wait until the next tick, which provides a window for short sleeps to sleep and wakeup without doing the (relatively) expensive work of disabling and enabling the timer. PiperOrigin-RevId: 272265822
2019-10-01	Honor X bit on extra anon pages in PT_LOAD segments	Michael Pratt
	Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58 (v4.11). Previously, extra pages were always mapped RW. Now, those pages will be executable if the segment specified PF_X. They still must be writeable. PiperOrigin-RevId: 272256280
2019-09-30	De-flake SetForegroundProcessGroupDifferentSession.	Kevin Krakauer
	PiperOrigin-RevId: 272059043
2019-09-30	Only copy out remaining time on nanosleep success	Michael Pratt
	It looks like the old code attempted to do this, but didn't realize that err != nil even in the happy case. PiperOrigin-RevId: 272005887
2019-09-27	Automated rollback of changelist 256276198	Adin Scannell
	PiperOrigin-RevId: 271665517
2019-09-27	Merge pull request #864 from tanjianfeng:fix-861	gVisor bot
	PiperOrigin-RevId: 271649711
2019-09-27	Implement SO_BINDTODEVICE sockopt	gVisor bot
	PiperOrigin-RevId: 271644926
2019-09-26	Make raw socket tests pass in environments with or without CAP_NET_RAW.	Kevin Krakauer
	PiperOrigin-RevId: 271442321
2019-09-24	test: don't use designated initializers	Andrei Vagin
	This change fixes compile errors: pty.cc:1460:7: error: expected primary-expression before '.' token ... PiperOrigin-RevId: 271033729
2019-09-24	Stub out readahead implementation.	Adin Scannell
	Closes #261 PiperOrigin-RevId: 270973347
2019-09-23	Fix bug in RstCausesPollHUP.	Bhasker Hariharan
	The test is checking the wrong poll_fd for POLLHUP. The only reason it passed till now was because it was also checking for POLLIN which was always true on the other fd from the previous poll! PiperOrigin-RevId: 270780401
2019-09-20	fix set hostname	Jianfeng Tan
	Previously, when we set hostname: $ strace hostname abc ... sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long) ... According to man 2 sethostname: "The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)" We wrongly use the CopyStringIn() to check terminating zero byte in the implementation of sethostname syscall. To fix this, we use CopyInBytes() instead. Fixes: #861 Reported-by: chenglang.hy <chenglang.hy@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-09-20	Implement /proc/net/tcp6	Jianfeng Tan
	Fixes: #829 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Signed-off-by: Jielong Zhou <jielong.zjl@antfin.com>
2019-09-19	Job control: controlling TTYs and foreground process groups.	Kevin Krakauer
	Adresses a deadlock with the rolled back change: https://github.com/google/gvisor/commit/b6a5b950d28e0b474fdad160b88bc15314cf9259 Creating a session from an orphaned process group was causing a lock to be acquired twice by a single goroutine. This behavior is addressed, and a test (OrphanRegression) has been added to pty.cc. Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 270088599
2019-09-18	Signalfd support	Adin Scannell
	Note that the exact semantics for these signalfds are slightly different from Linux. These signalfds are bound to the process at creation time. Reads, polls, etc. are all associated with signals directed at that task. In Linux, all signalfd operations are associated with current, regardless of where the signalfd originated. In practice, this should not be an issue given how signalfds are used. In order to fix this however, we will need to plumb the context through all the event APIs. This gets complicated really quickly, because the waiter APIs are all netstack-specific, and not generally exposed to the context. Probably not worthwhile fixing immediately. PiperOrigin-RevId: 269901749
2019-09-16	Migrate from gflags to absl flags	Michael Pratt
	absl flags are more modern and we can easily depend on them directly. The repo now successfully builds with --incompatible_load_cc_rules_from_bzl. PiperOrigin-RevId: 269387081
2019-09-13	gvisor: return ENOTDIR from the unlink syscall	Andrei Vagin
	ENOTDIR has to be returned when a component used as a directory in pathname is not, in fact, a directory. PiperOrigin-RevId: 269037893
2019-09-12	Implement splice methods for pipes and sockets.	Adin Scannell
	This also allows the tee(2) implementation to be enabled, since dup can now be properly supported via WriteTo. Note that this change necessitated some minor restructoring with the fs.FileOperations splice methods. If the *fs.File is passed through directly, then only public API methods are accessible, which will deadlock immediately since the locking is already done by fs.Splice. Instead, we pass through an abstract io.Reader or io.Writer, which elide locks and use the underlying fs.FileOperations directly. PiperOrigin-RevId: 268805207
2019-09-10	Fix minor Kokoro issues.	Adin Scannell
	A recent Kokoro change pointed to go_tests.cfg (in line with the other configurations), which unfortunately broke the presubmits. This change also enabled the KVM tests, which were still using a remote execution strategy. This fixes both of these issues and allows presubmits to pass. One additional test was caught with this case, which seems to have been broken. It's unclear why this was not being caught. PiperOrigin-RevId: 268166291
2019-09-06	Load C++ rules from @rules_cc	Michael Pratt
	See https://github.com/bazelbuild/bazel/issues/8743. This will be required in Bazel 1.0. Protobuf was updated in https://github.com/protocolbuffers/protobuf/commit/bf0c69e1302fe9568fbe310cc54b37d20a9d16a3#diff-96239ee297e0a92ac6ff96a6bc434ef0. GoogleTest was updated in https://github.com/google/googletest/commit/6fd262ecf787d0dc2a91696fd4bf1d3ee1ebfa14. gflags has not yet been updated, so the repo still won't build with --incompatible_load_cc_rules_from_bzl. Tested with buildifier -warnings=native-cc -lint=warn **/BUILD. PiperOrigin-RevId: 267638515
2019-09-05	Fix bug in proc_test.	Bhasker Hariharan
	TestNoDuplicates is racy as it tries to read the /proc file system while the test is running. But it's possible that from the time a directory entries are read and each entry processed something could change and in some cases the entry being processed could have been deleted. In such cases we should not fail the test but just ignore the error and move on. PiperOrigin-RevId: 267483094
2019-09-05	Deflake aio_test.	Jamie Liu
	- Most AIO tests call io_setup(nr_events = 128). sizeof(struct io_event) (12832 = 4096). However, the actual size of the mapping created by io_setup() is determined by: (from fs/aio.c:ioctx_alloc()) / * We keep track of the number of available ringbuffer slots, to prevent * overflow (reqs_available), and we also use percpu counters for this. * * So since up to half the slots might be on other cpu's percpu counters * and unavailable, double nr_events so userspace sees what they * expected: additionally, we move req_batch slots to/from percpu * counters at a time, so make sure that isn't 0: / nr_events = max(nr_events, num_possible_cpus() 4); nr_events = 2; (from fs/aio.c:aio_setup_ring()) / Compensate for the ring buffer's head/tail overlap entry / nr_events += 2; / 1 is required, 2 for good luck / size = sizeof(struct aio_ring); size += sizeof(struct io_event) nr_events; nr_pages = PFN_UP(size); When we mremap() only the first page of a multi-page AIO ring buffer mapping, fs/aio.c:aio_ring_mremap() updates struct kioctx::mmap_base - but struct kioctx::mmap_size is untouched, so sys_io_destroy() => kill_ioctx() vm_unmaps() the mremapped page, plus some number of pages after it. Just get the actual size of the mapping from /proc/self/maps. - Delete test case MremapOver; while it is correct that Linux will not complain if you overwrite the AIO ring buffer with another mapping, it won't actually work in the sense that AIO events will not be written to the new mapping, because Linux stores the struct pages of the ring buffer in struct kioctx::ring_pages and writes to those through kmap() rather than using userspace addresses. - Don't munmap() after mremap(MREMAP_FIXED) returns EFAULT; see new comment in factored-out test case MremapExpansion. PiperOrigin-RevId: 267482903
2019-09-04	Run proc_net tests.	Ian Gudger
	PiperOrigin-RevId: 267280086
2019-08-30	Automated rollback of changelist 261387276	Bhasker Hariharan
	PiperOrigin-RevId: 266491264
2019-08-30	Fix async-signal-unsafety in MlockallTest_Future.	Jamie Liu
	PiperOrigin-RevId: 266491246
2019-08-30	Return correct buffer size for ioctl(socket, FIONREAD)	Fabricio Voznika
	Ioctl was returning just the buffer size from epsocket.endpoint and it was not considering data from epsocket.SocketOperations that was read from the endpoint, but not yet sent to the caller. PiperOrigin-RevId: 266485461
2019-08-30	Add C++ toolchain and fix compile issues.	Adin Scannell
	This was accidentally introduced in 31f05d5d4f62c4cd4fe3b95b333d0130aae4b2c1. Fixes #788. PiperOrigin-RevId: 266462843
2019-08-29	Handle new representation of abstract UDS paths.	Rahat Mahmood
	When abstract unix domain socket paths are displayed in /proc/net/unix, Linux historically emitted null bytes as padding at the end of the path. Newer versions of Linux (v4.9, e7947ea770d0de434d38a0f823e660d3fd4bebb5) display these as '@' characters. Update proc_net_unix test to handle both version of the padding. PiperOrigin-RevId: 266230200
2019-08-29	Implement /proc/net/udp.	Rahat Mahmood
	PiperOrigin-RevId: 266229756
2019-08-29	Internal change.	gVisor bot
	PiperOrigin-RevId: 266199211
2019-08-27	Fix pwritev2 flaky test.	Zach Koopmans
	Fix a uninitialized memory bug in pwritev2 test. PiperOrigin-RevId: 265772176
2019-08-27	Fix sendfile(2) error code	Fabricio Voznika
	When output file is in append mode, sendfile(2) should fail with EINVAL and not EBADF. Closes #721 PiperOrigin-RevId: 265718958
2019-08-26	Internal change.	gVisor bot
	PiperOrigin-RevId: 265535438
2019-08-23	Second try at flaky futex test.	Zach Koopmans
	The flake had the call to futex_unlock_pi() returning EINVAL with the FUTEX_OWNER_DIED set. In this case, userspace has to clean up stale state. So instead of calling FUTEX_UNLOCK_PI outright, we'll use the advised atomic compare_exchange as advised in the man page. PiperOrigin-RevId: 265163920
2019-08-22	Remove ASSERT from fork child	Michael Pratt
	The gunit macros are not safe to use in the child. PiperOrigin-RevId: 264904348
2019-08-22	unix: return ECONNRESET if peer closed with data not read	Jianfeng Tan
	For SOCK_STREAM type unix socket, we shall return ECONNRESET if peer is closed with data not read. We explictly set a flag when closing one end, to differentiate from just shutdown (where zero shall be returned). Fixes: #735 Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-08-22	unix: return zero if peer is closed	Jianfeng Tan
	Previously, recvmsg() on a unix stream socket with its peer closed will never return, with goroutine call trace like this: ... 2 in gvisor.dev/gvisor/pkg/sentry/kernel.(Task).block at pkg/sentry/kernel/task_block.go:124 3 in gvisor.dev/gvisor/pkg/sentry/kernel.(Task).BlockWithDeadline at pkg/sentry/kernel/task_block.go:69 4 in gvisor.dev/gvisor/pkg/sentry/socket/unix.(SocketOperations).RecvMsg at pkg/sentry/socket/unix/unix.go:612 5 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.recvFrom at pkg/sentry/syscalls/linux/sys_socket.go:885 6 in gvisor.dev/gvisor/pkg/sentry/syscalls/linux.RecvFrom at pkg/sentry/syscalls/linux/sys_socket.go:910 ... The issue is caused by that ErrClosedForReceive returned by unix/transport.queue is turned into nil in unix.(EndpointReader).ReadToBlocks(): err.ToError() As a result, in unix.(*SocketOperations).RecvMsg(): n == 0 and err == nil We shall differentiate it from another case - no data to read where ErrWouldBlock shall be returned; and return 0 immediately. Fixes: #734 Reported-by: chenglang.hy <chenglang.hy@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-08-21	Support binding to multicast and broadcast addresses	Chris Kuiper
	This fixes the issue of not being able to bind to either a multicast or broadcast address as well as to send and receive data from it. The way to solve this is to treat these addresses similar to the ANY address and register their transport endpoint ID with the global stack's demuxer rather than the NIC's. That way there is no need to require an endpoint with that multicast or broadcast address. The stack's demuxer is in fact the only correct one to use, because neither broadcast- nor multicast-bound sockets care which NIC a packet was received on (for multicast a join is still needed to receive packets on a NIC). I also took the liberty of refactoring udp_test.go to consolidate a lot of duplicate code and make it easier to create repetitive tests that test the same feature for a variety of packet and socket types. For this purpose I created a "flowType" that represents two things: 1) the type of packet being sent or received and 2) the type of socket used for the test. E.g., a "multicastV4in6" flow represents a V4-mapped multicast packet run through a V6-dual socket. This allows writing significantly simpler tests. A nice example is testTTL(). PiperOrigin-RevId: 264766909