gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2019-10-02	fs/proc: report PID-s from a pid namespace of the proc mount	Andrei Vagin
	Right now, we can find more than one process with the 1 PID in /proc. $ for i in `seq 10`; do > unshare -fp sleep 1000 & > done $ ls /proc 1 1 1 1 12 18 24 29 6 loadavg net sys version 1 1 1 1 16 20 26 32 cpuinfo meminfo self thread-self 1 1 1 1 17 21 28 36 filesystems mounts stat uptime PiperOrigin-RevId: 272506593
2019-10-01	Include AT_SECURE in the aux vector	Michael Pratt
	gVisor does not currently implement the functionality that would result in AT_SECURE = 1, but Linux includes AT_SECURE = 0 in the normal case, so we should do the same. PiperOrigin-RevId: 272311488
2019-10-01	Disable cpuClockTicker when app is idle	Michael Pratt
	Kernel.cpuClockTicker increments kernel.cpuClock, which tasks use as a clock to track their CPU usage. This improves latency in the syscall path by avoid expensive monotonic clock calls on every syscall entry/exit. However, this timer fires every 10ms. Thus, when all tasks are idle (i.e., blocked or stopped), this forces a sentry wakeup every 10ms, when we may otherwise be able to sleep until the next app-relevant event. These wakeups cause the sentry to utilize approximately 2% CPU when the application is otherwise idle. Updates to clock are not strictly necessary when the app is idle, as there are no readers of cpuClock. This commit reduces idle CPU by disabling the timer when tasks are completely idle, and computing its effects at the next wakeup. Rather than disabling the timer as soon as the app goes idle, we wait until the next tick, which provides a window for short sleeps to sleep and wakeup without doing the (relatively) expensive work of disabling and enabling the timer. PiperOrigin-RevId: 272265822
2019-10-01	Honor X bit on extra anon pages in PT_LOAD segments	Michael Pratt
	Linux changed this behavior in 16e72e9b30986ee15f17fbb68189ca842c32af58 (v4.11). Previously, extra pages were always mapped RW. Now, those pages will be executable if the segment specified PF_X. They still must be writeable. PiperOrigin-RevId: 272256280
2019-09-30	splice: try another fallback option only if the previous one isn't supported	Andrei Vagin
	Reported-by: syzbot+bb5ed342be51d39b0cbb@syzkaller.appspotmail.com PiperOrigin-RevId: 272110815
2019-09-30	splice: compare inode numbers only if both ends are pipes	Andrei Vagin
	It isn't allowed to splice data from and into the same pipe. But right now this check is broken, because we don't check that both ends are pipes. PiperOrigin-RevId: 272107022
2019-09-30	Update FIXME bug with GitHub issue.	Adin Scannell
	PiperOrigin-RevId: 272101930
2019-09-30	Add a Stringer implementation to PacketDispatchMode	Bhasker Hariharan
	PiperOrigin-RevId: 272083936
2019-09-30	Fix bugs in PickEphemeralPort for TCP.	Bhasker Hariharan
	Netstack always picks a random start point everytime PickEphemeralPort is called. While this is required for UDP so that DNS requests go out through a randomized set of ports it is not required for TCP. Infact Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret initialized at start of the application to get a random offset. But to ensure it doesn't start from the same point on every scan it uses a static hint that is incremented by 2 in every call to pick ephemeral ports. The reason for 2 is Linux seems to split the port ranges where active connects seem to use even ones while odd ones are used by listening sockets. This CL implements a similar strategy where we use a hash + hint to generate the offset to start the search for a free Ephemeral port. This ensures that we cycle through the available port space in order for repeated connects to the same destination and significantly reduces the chance of picking a recently released port. PiperOrigin-RevId: 272058370
2019-09-30	Force timestamps to update when set via InodeOperations.SetTimestamps.	Nicolas Lacasse
	The gofer's CachingInodeOperations implementation contains an optimization for the common open-read-close pattern when we have a host FD. In this case, the host kernel will update the timestamp for us to a reasonably close time, so we don't need an extra RPC to the gofer. However, when the app explicitly sets the timestamps (via futimes or similar) then we actually DO need to update the timestamps, because the host kernel won't do it for us. To fix this, a new boolean `forceSetTimestamps` was added to CachineInodeOperations.SetMaskedAttributes. It is only set by gofer.InodeOperations.SetTimestamps. PiperOrigin-RevId: 272048146
2019-09-30	Only copy out remaining time on nanosleep success	Michael Pratt
	It looks like the old code attempted to do this, but didn't realize that err != nil even in the happy case. PiperOrigin-RevId: 272005887
2019-09-27	Merge pull request #882 from DarcySail:darcy_faster_CopyStringIn	gVisor bot
	PiperOrigin-RevId: 271675009
2019-09-27	Merge pull request #864 from tanjianfeng:fix-861	gVisor bot
	PiperOrigin-RevId: 271649711
2019-09-27	Implement SO_BINDTODEVICE sockopt	gVisor bot
	PiperOrigin-RevId: 271644926
2019-09-26	Make raw socket tests pass in environments with or without CAP_NET_RAW.	Kevin Krakauer
	PiperOrigin-RevId: 271442321
2019-09-25	Merge pull request #765 from trailofbits:uds_support	gVisor bot
	PiperOrigin-RevId: 271235134
2019-09-25	Remove centralized registration of protocols.	Kevin Krakauer
	Also removes the need for protocol names. PiperOrigin-RevId: 271186030
2019-09-25	Merge pull request #863 from tanjianfeng:fix-862	gVisor bot
	PiperOrigin-RevId: 271168948
2019-09-24	gvisor: change syscall.RawSyscall to syscall.RawSyscall6 where required	gVisor bot
	Before https://golang.org/cl/173160 syscall.RawSyscall would zero out the last three register arguments to the system call. That no longer happens. For system calls that take more than three arguments, use RawSyscall6 to ensure that we pass zero, not random data, for the additional arguments. PiperOrigin-RevId: 271062527
2019-09-24	Stub out readahead implementation.	Adin Scannell
	Closes #261 PiperOrigin-RevId: 270973347
2019-09-24	Return only primary addresses in Stack.NICInfo()	Chris Kuiper
	Non-primary addresses are used for endpoints created to accept multicast and broadcast packets, as well as "helper" endpoints (0.0.0.0) that allow sending packets when no proper address has been assigned yet (e.g., for DHCP). These addresses are not real addresses from a user point of view and should not be part of the NICInfo() value. Also see b/127321246 for more info. This switches NICInfo() to call a new NIC.PrimaryAddresses() function. To still allow an option to get all addresses (mostly for testing) I added Stack.GetAllAddresses() and NIC.AllAddresses(). In addition, the return value for GetMainNICAddress() was changed for the case where the NIC has no primary address. Instead of returning an error here, it now returns an empty AddressWithPrefix() value. The rational for this change is that it is a valid case for a NIC to have no primary addresses. Lastly, I refactored the code based on the new additions. PiperOrigin-RevId: 270971764
2019-09-24	Simplify ICMPRateLimiter	Tamir Duberstein
	https://github.com/golang/time/commit/c4c64ca added SetBurst upstream. PiperOrigin-RevId: 270925077
2019-09-24	tty: fix sending SIGTTOU on tty write	henry.tjf
	How to reproduce: $ echo "timeout 10 ls" > foo.sh $ chmod +x foo.sh $ ./foo.sh (will hang here for 10 secs, and the output of ls does not show) When "ls" process writes to stdout, it receives SIGTTOU signal, and hangs there. Until "timeout" process timeouts, and kills "ls" process. The expected result is: "ls" writes its output into tty, and terminates immdedately, then "timeout" process receives SIGCHLD and terminates. The reason for this failure is that we missed the check for TOSTOP (if set, background processes will receive the SIGTTOU signal when they do write). We use drivers/tty/n_tty.c:n_tty_write() as a reference. Fixes: #862 Reported-by: chris.zn <chris.zn@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Signed-off-by: chenglang.hy <chenglang.hy@antfin.com>
2019-09-23	Add test for concurrent reads and writes.	Adin Scannell
	PiperOrigin-RevId: 270789146
2019-09-23	netstack: convert more socket options to {Set,Get}SockOptInt	Andrei Vagin
	PiperOrigin-RevId: 270763208
2019-09-23	internal BUILD file cleanup.	gVisor bot
	PiperOrigin-RevId: 270680704
2019-09-20	Change vfs.Dirent.Off to NextOff.	Jamie Liu
	"d_off is the distance from the start of the directory to the start of the next linux_dirent." - getdents(2). PiperOrigin-RevId: 270349685
2019-09-20	Allow waiting for LinkEndpoint worker goroutines to finish.	Ian Gudger
	Previously, the only safe way to use an fdbased endpoint was to leak the FD. This change makes it possible to safely close the FD. This is the first step towards having stoppable stacks. Updates #837 PiperOrigin-RevId: 270346582
2019-09-20	fix set hostname	Jianfeng Tan
	Previously, when we set hostname: $ strace hostname abc ... sethostname("abc", 3) = -1 ENAMETOOLONG (File name too long) ... According to man 2 sethostname: "The len argument specifies the number of bytes in name. (Thus, name does not require a terminating null byte.)" We wrongly use the CopyStringIn() to check terminating zero byte in the implementation of sethostname syscall. To fix this, we use CopyInBytes() instead. Fixes: #861 Reported-by: chenglang.hy <chenglang.hy@antfin.com> Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-09-19	Fix p9 integration of flipcall.	Jamie Liu
	- Do not call Rread.SetPayload(flipcall packet window) in p9.channel.recv(). - Ignore EINTR from ppoll() in p9.Client.watch(). - Clean up handling of client socket FD lifetimes so that p9.Client.watch() never ppoll()s a closed FD. - Make p9test.Harness.Finish() call clientSocket.Shutdown() instead of clientSocket.Close() for the same reason. - Rework channel reuse to avoid leaking channels in the following case (suppose we have two channels): sendRecvChannel len(channels) == 2 => idx = 1 inuse[1] = ch0 sendRecvChannel len(channels) == 1 => idx = 0 inuse[0] = ch1 inuse[1] = nil sendRecvChannel len(channels) == 1 => idx = 0 inuse[0] = ch0 inuse[0] = nil inuse[0] == nil => ch0 leaked - Avoid deadlocking p9.Client.watch() by calling channelsWg.Wait() without holding channelsMu. - Bump p9test:client_test size to medium. PiperOrigin-RevId: 270200314
2019-09-19	Fix documentation, clean up seccomp filter installation, rename helpers.	Robert Tonic
	Filter installation has been streamlined and functions renamed. Documentation has been fixed to be standards compliant, and missing documentation added. gofmt has also been applied to modified files.
2019-09-19	Remove defer from hot path and ensure Atomic is applied consistently.	Adin Scannell
	PiperOrigin-RevId: 270114317
2019-09-19	Merge pull request #876 from xiaobo55x:hostcpu	gVisor bot
	PiperOrigin-RevId: 270094324
2019-09-19	Job control: controlling TTYs and foreground process groups.	Kevin Krakauer
	Adresses a deadlock with the rolled back change: https://github.com/google/gvisor/commit/b6a5b950d28e0b474fdad160b88bc15314cf9259 Creating a session from an orphaned process group was causing a lock to be acquired twice by a single goroutine. This behavior is addressed, and a test (OrphanRegression) has been added to pty.cc. Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 270088599
2019-09-19	Accelerate byte lookup in string with `bytealg/indexbyte`	Hang Su
	`bytealg/indexbyte` will use AVX or SSE instruction set, if possible, which could accelerate `CopyStringIn` function by 28%. In worst case(CPU doesn't support SSE), `bytealg/indexbyte` will degenerate to traversal lookup. When dealing with short strings, `bytealg/indexbyte` has the same performance level as before. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Signed-off-by: Hang Su <darcy.sh@antfin.com>
2019-09-18	Enable pkg/sentry/hostcpu support on arm64.	Haibo Xu
	Signed-off-by: Haibo Xu haibo.xu@arm.com Change-Id: I333872da9bdf56ddfa8ab2f034dfc1f36a7d3132
2019-09-18	Signalfd support	Adin Scannell
	Note that the exact semantics for these signalfds are slightly different from Linux. These signalfds are bound to the process at creation time. Reads, polls, etc. are all associated with signals directed at that task. In Linux, all signalfd operations are associated with current, regardless of where the signalfd originated. In practice, this should not be an issue given how signalfds are used. In order to fix this however, we will need to plumb the context through all the event APIs. This gets complicated really quickly, because the waiter APIs are all netstack-specific, and not generally exposed to the context. Probably not worthwhile fixing immediately. PiperOrigin-RevId: 269901749
2019-09-17	Automated rollback of changelist 268047073	Ghanan Gowripalan
	PiperOrigin-RevId: 269658971
2019-09-17	platform/ptrace: log exit code for stub processes	Andrei Vagin
	PiperOrigin-RevId: 269631877
2019-09-17	Update remaining users of LinkEndpoints to not refer to them as an ID.	Ian Gudger
	PiperOrigin-RevId: 269614517
2019-09-13	gvisor: return ENOTDIR from the unlink syscall	Andrei Vagin
	ENOTDIR has to be returned when a component used as a directory in pathname is not, in fact, a directory. PiperOrigin-RevId: 269037893
2019-09-12	Update p9 to support flipcall.	Adin Scannell
	PiperOrigin-RevId: 268845090
2019-09-12	Implement splice methods for pipes and sockets.	Adin Scannell
	This also allows the tee(2) implementation to be enabled, since dup can now be properly supported via WriteTo. Note that this change necessitated some minor restructoring with the fs.FileOperations splice methods. If the *fs.File is passed through directly, then only public API methods are accessible, which will deadlock immediately since the locking is already done by fs.Splice. Instead, we pass through an abstract io.Reader or io.Writer, which elide locks and use the underlying fs.FileOperations directly. PiperOrigin-RevId: 268805207
2019-09-12	Remove go_test from go_stateify and go_marshal	Michael Pratt
	They are no-ops, so the standard rule works fine. PiperOrigin-RevId: 268776264
2019-09-12	Automated rollback of changelist 268047073	Ghanan Gowripalan
	PiperOrigin-RevId: 268757842
2019-09-09	Fix ephemeral port leak.	Ian Gudger
	Fix a bug where udp.(endpoint).Disconnect [accessible in gVisor via epsocket.(SocketOperations).Connect with AF_UNSPEC] would leak a port reservation if the socket/endpoint had an ephemeral port assigned to it. glibc's getaddrinfo uses connect with AF_UNSPEC, causing each call of getaddrinfo to leak a port. Call getaddrinfo too many times and you run out of ports (shows up as connect returning EAGAIN and getaddrinfo returning EAI_NONAME "Name or service not known"). PiperOrigin-RevId: 268071160
2019-09-09	go_marshal: Implement automatic generation of ABI marshalling code.	Rahat Mahmood
	This CL implements go_marshal, a code generation utility for automatically serializing and deserializing ABI structs. The go_marshal tool automatically generates implementations of the new marshal interface. Unlike binary.Marshal/Unmarshal, the generated interface implementations use no runtime reflection, and translates to a single memcpy for most structs. See go_marshal/README.md for details. PiperOrigin-RevId: 268065475
2019-09-09	Join IPv6 all-nodes and solicited-node multicast addresses where appropriate.	Ghanan Gowripalan
	The IPv6 all-nodes multicast address will be joined on NIC enable, and the appropriate IPv6 solicited-node multicast address will be joined when IPv6 addresses are added. Tests: Test receiving packets destined to the IPv6 link-local all-nodes multicast address and the IPv6 solicted node address of an added IPv6 address. PiperOrigin-RevId: 268047073
2019-09-06	Remove reundant global tcpip.LinkEndpointID.	Ian Gudger
	PiperOrigin-RevId: 267709597
2019-09-06	Indicate flipcall synchronization to the Go race detector.	Jamie Liu
	Since each Endpoint has a distinct mapping of the packet window, the Go race detector does not recognize accesses by connected Endpoints to be related. This means that this change isn't necessary for the Go race detector to accept accesses of flipcall.Endpoint.Data(), but it is necessary for it to accept accesses to shared variables outside the scope of flipcall that are synchronized by flipcall.Endpoint state; see updated test for an example. RaceReleaseMerge is needed (instead of RaceRelease) because calls to raceBecomeInactive() from unrelated Endpoints can occur in any order. (DowngradableRWMutex.RUnlock() has a similar property: calls to RUnlock() on the same DowngradableRWMutex from different goroutines can occur in any order. Remove the TODO asking to explain this now that this is understood.) PiperOrigin-RevId: 267705325