gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2020-09-04	Simplify FD handling for container start/exec	Fabricio Voznika
	VFS1 and VFS2 host FDs have different dupping behavior, making error prone to code for both. Change the contract so that FDs are released as they are used, so the caller can simple defer a block that closes all remaining files. This also addresses handling of partial failures. With this fix, more VFS2 tests can be enabled. Updates #1487 PiperOrigin-RevId: 330112266
2020-09-03	Adjust input file offset when sendfile only completes a partial write.	Dean Deng
	Fixes #3779. PiperOrigin-RevId: 330057268
2020-09-03	Use fine-grained mutex for stack.cleanupEndpoints.	Bhasker Hariharan
	stack.cleanupEndpoints is protected by the stack.mu but that can cause contention as the stack mutex is already acquired in a lot of hot paths during new endpoint creation /cleanup etc. Moving this to a fine grained mutex should reduce contention on the stack.mu. PiperOrigin-RevId: 330026151
2020-09-03	Use atomic.Value for Stack.tcpProbeFunc.	Jamie Liu
	b/166980357#comment56 shows: - 837 goroutines blocked in: gvisor/pkg/sync/sync.(RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(Stack).StartTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop - 695 goroutines blocked in: gvisor/pkg/sync/sync.(RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(Stack).CompleteTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop - 3882 goroutines blocked in: gvisor/pkg/sync/sync.(RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(Stack).GetTCPProbe gvisor/pkg/tcpip/transport/tcp/tcp.newEndpoint gvisor/pkg/tcpip/transport/tcp/tcp.(protocol).NewEndpoint gvisor/pkg/tcpip/stack/stack.(Stack).NewEndpoint All of these are contending on Stack.mu. Stack.StartTransportEndpointCleanup() and Stack.CompleteTransportEndpointCleanup() insert/delete TransportEndpoints in a map (Stack.cleanupEndpoints), and the former also does endpoint unregistration while holding Stack.mu, so it's not immediately clear how feasible it is to replace the map with a mutex-less implementation or how much doing so would help. However, Stack.GetTCPProbe() just reads a function object (Stack.tcpProbeFunc) that is almost always nil (as far as I can tell, Stack.AddTCPProbe() is only called in tests), and it's called for every new TCP endpoint. So converting it to an atomic.Value should significantly reduce contention on Stack.mu, improving TCP endpoint creation latency and allowing TCP endpoint cleanup to proceed. PiperOrigin-RevId: 330004140
2020-09-02	Add support to run packetimpact tests against Fuchsia	Zeling Feng
	blaze test <test_name>_fuchsia_test will run the corresponding packetimpact test against fuchsia. PiperOrigin-RevId: 329835290
2020-09-02	Fix Accept to not return error for sockets in accept queue.	Bhasker Hariharan
	Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-09-02	[vfs] Implement xattr for overlayfs.	Ayush Ranjan
	PiperOrigin-RevId: 329825497
2020-09-02	[vfs] Fix error handling in overlayfs OpenAt.	Ayush Ranjan
	Updates #1199 PiperOrigin-RevId: 329802274
2020-09-02	Update Go version constraint on sync/spin_unsafe.go.	Jamie Liu
	PiperOrigin-RevId: 329801584
2020-09-02	Improve sync.SeqCount performance.	Jamie Liu
	- Make sync.SeqCountEpoch not a struct. This allows sync.SeqCount.BeginRead() to be inlined. - Mark sync.SeqAtomicLoad<T> nosplit to mitigate the Go compiler's refusal to inline it. (Best I could get was "cost 92 exceeds budget 80".) - Use runtime-guided spinning in SeqCount.BeginRead(). Benchmarks: name old time/op new time/op delta pkg:pkg/sync/sync goos:linux goarch:amd64 SeqCountWriteUncontended-12 8.24ns ± 0% 11.40ns ± 0% +38.35% (p=0.000 n=10+10) SeqCountReadUncontended-12 0.33ns ± 0% 0.14ns ± 3% -57.77% (p=0.000 n=7+8) pkg:pkg/sync/seqatomictest/seqatomic goos:linux goarch:amd64 SeqAtomicLoadIntUncontended-12 0.64ns ± 1% 0.41ns ± 1% -36.40% (p=0.000 n=10+8) SeqAtomicTryLoadIntUncontended-12 0.18ns ± 4% 0.18ns ± 1% ~ (p=0.206 n=10+8) AtomicValueLoadIntUncontended-12 0.27ns ± 3% 0.27ns ± 0% -1.77% (p=0.000 n=10+8) (atomic.Value.Load is, of course, inlined. We would expect an uncontended inline SeqAtomicLoad<int> to perform identically to SeqAtomicTryLoad<int>.) The "regression" in BenchmarkSeqCountWriteUncontended, despite this CL changing nothing in that path, is attributed to microarchitectural subtlety; the benchmark loop is unchanged except for its address: Before this CL: :0 0x4e62d1 48ffc2 INCQ DX :0 0x4e62d4 48399110010000 CMPQ DX, 0x110(CX) :0 0x4e62db 7e26 JLE 0x4e6303 :0 0x4e62dd 90 NOPL :0 0x4e62de bb01000000 MOVL $0x1, BX :0 0x4e62e3 f00fc118 LOCK XADDL BX, 0(AX) :0 0x4e62e7 ffc3 INCL BX :0 0x4e62e9 0fbae300 BTL $0x0, BX :0 0x4e62ed 733a JAE 0x4e6329 :0 0x4e62ef 90 NOPL :0 0x4e62f0 bb01000000 MOVL $0x1, BX :0 0x4e62f5 f00fc118 LOCK XADDL BX, 0(AX) :0 0x4e62f9 ffc3 INCL BX :0 0x4e62fb 0fbae300 BTL $0x0, BX :0 0x4e62ff 73d0 JAE 0x4e62d1 After this CL: :0 0x4e6361 48ffc2 INCQ DX :0 0x4e6364 48399110010000 CMPQ DX, 0x110(CX) :0 0x4e636b 7e26 JLE 0x4e6393 :0 0x4e636d 90 NOPL :0 0x4e636e bb01000000 MOVL $0x1, BX :0 0x4e6373 f00fc118 LOCK XADDL BX, 0(AX) :0 0x4e6377 ffc3 INCL BX :0 0x4e6379 0fbae300 BTL $0x0, BX :0 0x4e637d 733a JAE 0x4e63b9 :0 0x4e637f 90 NOPL :0 0x4e6380 bb01000000 MOVL $0x1, BX :0 0x4e6385 f00fc118 LOCK XADDL BX, 0(AX) :0 0x4e6389 ffc3 INCL BX :0 0x4e638b 0fbae300 BTL $0x0, BX :0 0x4e638f 73d0 JAE 0x4e6361 PiperOrigin-RevId: 329754148
2020-09-01	Implement setattr+clunk in 9P	Fabricio Voznika
	This is to cover the common pattern: open->read/write->close, where SetAttr needs to be called to update atime/mtime before the file is closed. Benchmark results: BM_OpenReadClose/10240 CPU setattr+clunk: 63783 ns VFS2: 68109 ns VFS1: 72507 ns Updates #1198 PiperOrigin-RevId: 329628461
2020-09-01	Fix handling of unacceptable ACKs during close.	Mithun Iyer
	On receiving an ACK with unacceptable ACK number, in a closing state, TCP, needs to reply back with an ACK with correct seq and ack numbers and remain in same state. This change is as per RFC793 page 37, but with a difference that it does not apply to ESTABLISHED state, just as in Linux. Also add more tests to check for OTW sequence number and unacceptable ack numbers in these states. Fixes #3785 PiperOrigin-RevId: 329616283
2020-09-01	Refactor tty codebase to use master-replica terminology.	Ayush Ranjan
	Updates #2972 PiperOrigin-RevId: 329584905
2020-09-01	Fix panic when calling dup2().	Nayana Bidari
	PiperOrigin-RevId: 329572337
2020-09-01	[go-marshal] Enable auto-marshalling for fs/tty.	Ayush Ranjan
	PiperOrigin-RevId: 329564614
2020-09-01	Automated rollback of changelist 328350576	Nayana Bidari
	PiperOrigin-RevId: 329526153
2020-08-31	Don't use read-only host FD for writable gofer dentries in VFS2.	Jamie Liu
	As documented for gofer.dentry.hostFD. PiperOrigin-RevId: 329372319
2020-08-31	Implement walk in gvisor verity fs	gVisor bot
	Implement walk directories in gvisor verity file system. For each step, the child dentry is verified against a verified parent root hash. PiperOrigin-RevId: 329358747
2020-08-31	stateify: Bring back struct field and type names in pretty print	Ting-Yu Wang
	PiperOrigin-RevId: 329349158
2020-08-28	Fix kernfs.Dentry reference leak.	Nicolas Lacasse
	PiperOrigin-RevId: 329036994
2020-08-28	Don't bind loopback to all IPs in an IPv6 subnet	Ghanan Gowripalan
	An earlier change considered the loopback bound to all addresses in an assigned subnet. This should have only be done for IPv4 to maintain compatability with Linux: ``` $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 0 received, 100% packet loss, time 3062ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes ^C --- 2001:db8::2 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2030ms $ sudo ip addr add 2001:db8::1/64 dev lo $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes 64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=0.055 ms 64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.074 ms 64 bytes from 2001:db8::1: icmp_seq=3 ttl=64 time=0.073 ms 64 bytes from 2001:db8::1: icmp_seq=4 ttl=64 time=0.071 ms ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3075ms rtt min/avg/max/mdev = 0.055/0.068/0.074/0.007 ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes From 2001:db8::1 icmp_seq=1 Destination unreachable: No route From 2001:db8::1 icmp_seq=2 Destination unreachable: No route From 2001:db8::1 icmp_seq=3 Destination unreachable: No route From 2001:db8::1 icmp_seq=4 Destination unreachable: No route ^C --- 2001:db8::2 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3070ms ``` Test: integration_test.TestLoopbackAcceptAllInSubnet PiperOrigin-RevId: 329011566
2020-08-28	Implement StatFS for various VFS2 filesystems.	Rahat Mahmood
	This mainly involved enabling kernfs' client filesystems to provide a StatFS implementation. Fixes #3411, #3515. PiperOrigin-RevId: 329009864
2020-08-28	Improve type safety for network protocol options	Ghanan Gowripalan
	The existing implementation for NetworkProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for network protocol options that may be set or queried which network protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. PiperOrigin-RevId: 328980359
2020-08-28	Fix EOF handling for splice.	Dean Deng
	Also, add corresponding EOF tests for splice/sendfile. Discovered by syzkaller. PiperOrigin-RevId: 328975990
2020-08-28	fix panic when calling SO_ORIGINAL_DST without initializing iptables	Kevin Krakauer
	Reported-by: syzbot+074ec22c42305725b79f@syzkaller.appspotmail.com PiperOrigin-RevId: 328963899
2020-08-28	Use a single NetworkEndpoint per address	Ghanan Gowripalan
	This change was already done as of https://github.com/google/gvisor/commit/1736b2208f but https://github.com/google/gvisor/commit/a174aa7597 conflicted with that change and it was missed in reviews. This change fixes the conflict. PiperOrigin-RevId: 328920372
2020-08-27	[go-marshal] Enable auto-marshalling for tundev.	Ayush Ranjan
	PiperOrigin-RevId: 328863725
2020-08-27	Fix vfs2 pipe behavior when splicing to a non-pipe.	Dean Deng
	Fixes *.sh Java runtime tests, where splice()-ing from a pipe to /dev/zero would not actually empty the pipe. There was no guarantee that the data would actually be consumed on a splice operation unless the output file's implementation of Write/PWrite actually called VFSPipeFD.CopyIn. Now, whatever bytes are "written" are consumed regardless of whether CopyIn is called or not. Furthermore, the number of bytes in the IOSequence for reads is now capped at the amount of data actually available. Before, splicing to /dev/zero would always return the requested splice size without taking the actual available data into account. This change also refactors the case where an input file is spliced into an output pipe so that it follows a similar pattern, which is arguably cleaner anyway. Updates #3576. PiperOrigin-RevId: 328843954
2020-08-27	unix: return ECONNREFUSE if a socket file exists but a socket isn't bound to it	Andrei Vagin
	PiperOrigin-RevId: 328843560
2020-08-27	[go-marshal] Support for usermem.IOOpts.	Ayush Ranjan
	PiperOrigin-RevId: 328839759
2020-08-27	Improve type safety for socket options	Ghanan Gowripalan
	The existing implementation for {G,S}etSockOpt take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for socket options that may be set or queried which socket option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. Fixes #3714. RELNOTES: n/a PiperOrigin-RevId: 328832161
2020-08-27	Add function to get error from a tcpip.Endpoint	Ghanan Gowripalan
	In an upcoming CL, socket option types are made to implement a marker interface with pointer receivers. Since this results in calling methods of an interface with a pointer, we incur an allocation when attempting to get an Endpoint's last error with the current implementation. When calling the method of an interface, the compiler is unable to determine what the interface implementation does with the pointer (since calling a method on an interface uses virtual dispatch at runtime so the compiler does not know what the interface method will do) so it allocates on the heap to be safe incase an implementation continues to hold the pointer after the functioon returns (the reference escapes the scope of the object). In the example below, the compiler does not know what b.foo does with the reference to a it allocates a on the heap as the reference to a may escape the scope of a. ``` var a int var b someInterface b.foo(&a) ``` This change removes the opportunity for that allocation. RELNOTES: n/a PiperOrigin-RevId: 328796559
2020-08-27	ip6tables: (de)serialize ip6tables structs	Kevin Krakauer
	More implementation+testing to follow. #3549. PiperOrigin-RevId: 328770160
2020-08-26	Make flag propagation automatic	Fabricio Voznika
	Use reflection and tags to provide automatic conversion from Config to flags. This makes adding new flags less error-prone, skips flags using default values (easier to read), and makes tests correctly use default flag values for test Configs. Updates #3494 PiperOrigin-RevId: 328662070
2020-08-26	Merge pull request #3742 from lubinszARM:pr_n1_1	gVisor bot
	PiperOrigin-RevId: 328639254
2020-08-26	Support stdlib analyzers with nogo.	Adin Scannell
	This immediately revealed an escape analysis violation (!), where the sync.Map was being used in a context that escapes were not allowed. This is a relatively minor fix and is included. PiperOrigin-RevId: 328611237
2020-08-26	Remove spurious fd.IncRef().	Nicolas Lacasse
	PiperOrigin-RevId: 328583461
2020-08-26	tmpfs: Allow xattrs in the trusted namespace if creds has CAP_SYS_ADMIN.	Nicolas Lacasse
	This is needed to support the overlay opaque attribute. PiperOrigin-RevId: 328552985
2020-08-25	Use new reference count utility throughout gvisor.	Dean Deng
	This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377
2020-08-25	Return non-zero size for tmpfs statfs(2).	Jamie Liu
	This does not implement accepting or enforcing any size limit, which will be more complex and has performance implications; it just returns a fixed non-zero size. Updates #1936 PiperOrigin-RevId: 328428588
2020-08-25	Expose basic coverage information to userspace through kcov interface.	Dean Deng
	In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block, which updates the memory-mapped coverage information. The Go coverage tool does not allow us to inject arbitrary instructions into basic blocks, but it does provide data that we can convert to a kcov-like format and transfer them to userspace through a memory mapping. Note that this is not a strict implementation of kcov, which is especially tricky to do because we do not have the same coverage tools available in Go that that are available for the actual Linux kernel. In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block to write program counters to the kcov memory mapping. In Go, however, coverage tools only give us a count of basic blocks as they are executed. Every time we return to userspace, we collect the coverage information and write out PCs for each block that was executed, providing userspace with the illusion that the kcov data is always up to date. For convenience, we also generate a unique synthetic PC for each block instead of using actual PCs. Finally, we do not provide thread-specific coverage data (each kcov instance only contains PCs executed by the thread owning it); instead, we will supply data for any file specified by -- instrumentation_filter. Also, fix issue in nogo that was causing pkg/coverage:coverage_nogo compilation to fail. PiperOrigin-RevId: 328426526
2020-08-25	Only send an ICMP error message if UDP checksum is valid.	Toshi Kikuchi
	Test: - TestV4UnknownDestination - TestV6UnknownDestination PiperOrigin-RevId: 328424137
2020-08-25	[go-marshal] Enable auto-marshalling for host tty.	Ayush Ranjan
	PiperOrigin-RevId: 328415633
2020-08-25	overlay: clonePrivateMount must pass a Dentry reference to MakeVirtualDentry.	Nicolas Lacasse
	PiperOrigin-RevId: 328410065
2020-08-25	Clarify comment on NetworkProtocolNumber.	Bhasker Hariharan
	The actual values used for this field in Netstack are actually EtherType values of the protocol in an Ethernet frame. Eg. header.IPv4ProtocolNumber is 0x0800 and not the number of the IPv4 Protocol Number itself which is 4. Similarly header.IPv6ProtocolNumber is set to 0x86DD whereas the IPv6 protocol number is 41. See: - https://www.iana.org/assignments/ieee-802-numbers/ieee-802-numbers.xhtml (For EtherType) - https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml (For ProtocolNumbers) PiperOrigin-RevId: 328407293
2020-08-25	remove iptables sockopt special cases	Kevin Krakauer
	iptables sockopts were kludged into an unnecessary check, this properly relegates them to the {get,set}SockOptIP functions. PiperOrigin-RevId: 328395135
2020-08-25	Add nogo support to go_binary and go_test targets.	Adin Scannell
	Updates #3374 PiperOrigin-RevId: 328378700
2020-08-25	Change "Fd" member to "FD" according to convension	gVisor bot
	PiperOrigin-RevId: 328374775
2020-08-25	Add option to replace linkAddrCache with neighborCache	Sam Balana
	This change adds an option to replace the current implementation of ARP through linkAddrCache, with an implementation of NUD through neighborCache. Switching to using NUD for both ARP and NDP is beneficial for the reasons described by RFC 4861 Section 3.1: "[Using NUD] significantly improves the robustness of packet delivery in the presence of failing routers, partially failing or partitioned links, or nodes that change their link-layer addresses. For instance, mobile nodes can move off-link without losing any connectivity due to stale ARP caches." "Unlike ARP, Neighbor Unreachability Detection detects half-link failures and avoids sending traffic to neighbors with which two-way connectivity is absent." Along with these changes exposes the API for querying and operating the neighbor cache. Operations include: - Create a static entry - List all entries - Delete all entries - Remove an entry by address This also exposes the API to change the NUD protocol constants on a per-NIC basis to allow Neighbor Discovery to operate over links with widely varying performance characteristics. See [RFC 4861 Section 10][1] for the list of constants. Finally, an API for subscribing to NUD state changes is exposed through NUDDispatcher. See [RFC 4861 Appendix C][3] for the list of edges. Tests: pkg/tcpip/network/arp:arp_test + TestDirectRequest pkg/tcpip/network/ipv6:ipv6_test + TestLinkResolution + TestNDPValidation + TestNeighorAdvertisementWithTargetLinkLayerOption + TestNeighorSolicitationResponse + TestNeighorSolicitationWithSourceLinkLayerOption + TestRouterAdvertValidation pkg/tcpip/stack:stack_test + TestCacheWaker + TestForwardingWithFakeResolver + TestForwardingWithFakeResolverManyPackets + TestForwardingWithFakeResolverManyResolutions + TestForwardingWithFakeResolverPartialTimeout + TestForwardingWithFakeResolverTwoPackets + TestIPv6SourceAddressSelectionScopeAndSameAddress [1]: https://tools.ietf.org/html/rfc4861#section-10 [2]: https://tools.ietf.org/html/rfc4861#appendix-C Fixes #1889 Fixes #1894 Fixes #1895 Fixes #1947 Fixes #1948 Fixes #1949 Fixes #1950 PiperOrigin-RevId: 328365034
2020-08-25	Support SO_LINGER socket option.	Nayana Bidari
	When SO_LINGER option is enabled, the close will not return until all the queued messages are sent and acknowledged for the socket or linger timeout is reached. If the option is not set, close will return immediately. This option is mainly supported for connection oriented protocols such as TCP. PiperOrigin-RevId: 328350576