gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2020-09-29	iptables: remove unused min/max NAT range fields	Kevin Krakauer
	PiperOrigin-RevId: 334531794
2020-09-29	Return permanent addresses when NIC is down	Ghanan Gowripalan
	Test: stack_test.TestGetMainNICAddressWhenNICDisabled PiperOrigin-RevId: 334513286
2020-09-29	Don't allow broadcast/multicast source address	Ghanan Gowripalan
	As per relevant IP RFCS (see code comments), broadcast (for IPv4) and multicast addresses are not allowed. Currently checks for these are done at the transport layer, but since it is explicitly forbidden at the IP layers, check for them there. This change also removes the UDP.InvalidSourceAddress stat since there is no longer a need for it. Test: ip_test.TestSourceAddressValidation PiperOrigin-RevId: 334490971
2020-09-29	iptables: refactor to make targets extendable	Kevin Krakauer
	Like matchers, targets should use a module-like register/lookup system. This replaces the brittle switch statements we had before. The only behavior change is supporing IPT_GET_REVISION_TARGET. This makes it much easier to add IPv6 redirect in the next change. Updates #3549. PiperOrigin-RevId: 334469418
2020-09-29	Don't generate link-local IPv6 for loopback	Ghanan Gowripalan
	Linux doesn't generate a link-local address for the loopback interface. Test: integration_test.TestInitialLoopbackAddresses PiperOrigin-RevId: 334453182
2020-09-29	Discard IP fragments as soon as it expires	Toshi Kikuchi
	Currently expired IP fragments are discarded only if another fragment for the same IP datagram is received after timeout or the total size of the fragment queue exceeded a predefined value. Test: fragmentation.TestReassemblingTimeout Fixes #3960 PiperOrigin-RevId: 334423710
2020-09-29	Trim Network/Transport Endpoint/Protocol	Ghanan Gowripalan
	* Remove Capabilities and NICID methods from NetworkEndpoint. * Remove linkEP and stack parameters from NetworkProtocol.NewEndpoint. The LinkEndpoint can be fetched from the NetworkInterface. The stack is passed to the NetworkProtocol when it is created so the NetworkEndpoint can get it from its protocol. * Remove stack parameter from TransportProtocol.NewEndpoint. Like the NetworkProtocol/Endpoint, the stack is passed to the TransportProtocol when it is created. PiperOrigin-RevId: 334332721
2020-09-29	Move IP state from NIC to NetworkEndpoint/Protocol	Ghanan Gowripalan
	* Add network address to network endpoints. Hold network-specific state in the NetworkEndpoint instead of the stack. This results in the stack no longer needing to "know" about the network endpoints and special case certain work for various endpoints (e.g. IPv6 DAD). * Provide NetworkEndpoints with an NetworkInterface interface. Instead of just passing the NIC ID of a NIC, pass an interface so the network endpoint may query other information about the NIC such as whether or not it is a loopback device. * Move NDP code and state to the IPv6 package. NDP is IPv6 specific so there is no need for it to live in the stack. * Control forwarding through NetworkProtocols instead of Stack Forwarding should be controlled on a per-network protocol basis so forwarding configurations are now controlled through network protocols. * Remove stack.referencedNetworkEndpoint. Now that addresses are exposed via AddressEndpoint and only one NetworkEndpoint is created per interface, there is no need for a referenced NetworkEndpoint. * Assume network teardown methods are infallible. Fixes #3871, #3916 PiperOrigin-RevId: 334319433
2020-09-28	Fix 1 zero window advertisement bug and a TCP test flake.	Bhasker Hariharan
	In TestReceiveBufferAutoTuning we now send a keep-alive packet to measure the current window rather than a 1 byte segment as the returned window value in the latter case is reduced due to the 1 byte segment now being held in the receive buffer and can cause the test to flake if the segment overheads were to change. In getSendParams in rcv.go we were advertising a non-zero window even if available window space was zero after we received the previous segment. In such a case newWnd and curWnd will be the same and we end up advertising a tiny but non-zero window and this can cause the next segment to be dropped. PiperOrigin-RevId: 334314070
2020-09-28	Fix lingering of TCP socket in the initial state.	Nayana Bidari
	When the socket is set with SO_LINGER and close()'d in the initial state, it should not linger and return immediately. PiperOrigin-RevId: 334263149
2020-09-28	Support creating protocol instances with Stack ref	Ghanan Gowripalan
	Network or transport protocols may want to reach the stack. Support this by letting the stack create the protocol instances so it can pass a reference to itself at protocol creation time. Note, protocols do not yet use the stack in this CL but later CLs will make use of the stack from protocols. PiperOrigin-RevId: 334260210
2020-09-26	Remove generic ICMP errors	Ghanan Gowripalan
	Generic ICMP errors were required because the transport dispatcher was given the responsibility of sending ICMP errors in response to transport packet delivery failures. Instead, the transport dispatcher should let network layer know it failed to deliver a packet (and why) and let the network layer make the decision as to what error to send (if any). Fixes #4068 PiperOrigin-RevId: 333962333
2020-09-24	Remove useless endpoint construction	Tamir Duberstein
	PiperOrigin-RevId: 333591566
2020-09-24	Change segment/pending queue to use receive buffer limits.	Bhasker Hariharan
	segment_queue today has its own standalone limit of MaxUnprocessedSegments but this can be a problem in UnlockUser() we do not release the lock till there are segments to be processed. What can happen is as handleSegments dequeues packets more keep getting queued and we will never release the lock. This can keep happening even if the receive buffer is full because nothing can read() till we release the lock. Further having a separate limit for pending segments makes it harder to track memory usage etc. Unifying the limits makes it easier to reason about memory in use and makes the overall buffer behaviour more consistent. PiperOrigin-RevId: 333508122
2020-09-23	Remove unused field from neighborEntry	Ghanan Gowripalan
	PiperOrigin-RevId: 333405169
2020-09-23	Extract ICMP error sender from UDP	Julian Elischer
	Store transport protocol number on packet buffers for use in ICMP error generation. Updates #2211. PiperOrigin-RevId: 333252762
2020-09-22	Refactor testutil.TestEndpoint and use it instead of limitedEP	Arthur Sfez
	The new testutil.MockLinkEndpoint implementation is not composed by channel.Channel anymore because none of its features were used. PiperOrigin-RevId: 333167753
2020-09-22	Move stack.fakeClock into a separate package	Toshi Kikuchi
	PiperOrigin-RevId: 333138701
2020-09-20	Merge pull request #3651 from ianlewis:ip-forwarding	gVisor bot
	PiperOrigin-RevId: 332760843
2020-09-18	Count packets dropped by iptables in IPStats	Kevin Krakauer
	PiperOrigin-RevId: 332486383
2020-09-18	Enqueue TCP sends arriving in SYN_SENT state.	Mithun Iyer
	TCP needs to enqueue any send requests arriving when the connection is in SYN_SENT state. The data should be sent out soon after completion of the connection handshake. Fixes #3995 PiperOrigin-RevId: 332482041
2020-09-18	Use common parsing utilities when sniffing	Ghanan Gowripalan
	Extract parsing utilities so they can be used by the sniffer. Fixes #3930 PiperOrigin-RevId: 332401880
2020-09-17	Test IPv4 WritePackets stats	Kevin Krakauer
	IPv6 tests will be added in another CL along with ip6tables. PiperOrigin-RevId: 332389102
2020-09-17	ip6tables: filter table support	Kevin Krakauer
	`ip6tables -t filter` is now usable. NAT support will come in a future CL. #3549 PiperOrigin-RevId: 332381801
2020-09-17	{Set,Get} SO_LINGER on all endpoints.	Nayana Bidari
	SO_LINGER is a socket level option and should be stored on all endpoints even though it is used to linger only for TCP endpoints. PiperOrigin-RevId: 332369252
2020-09-16	Automated rollback of changelist 329526153	Nayana Bidari
	PiperOrigin-RevId: 332097286
2020-09-16	Bind loopback subnets' lifetime to perm address	Ghanan Gowripalan
	The lifetime of addreses in a loopback interface's associated subnets should be bound to their respective permanent addresses. This change also fixes a race when the stack attempts to get an IPv4 rereferencedNetworkEndpoint for an address in an associated subnet on a loopback interface. Before this change, the stack would only check if an IPv4 address is contained in an associated subnet while holding a read lock but wouldn't do this same check after releasing the read lock for a write lock to create a temporary address. This may cause the stack to bind the lifetime of the address to a new (temporary) endpoint instead of the associated subnet's permanent address. Test: integration_test.TestLoopbackSubnetLifetimeBoundToAddr PiperOrigin-RevId: 332094719
2020-09-16	Gracefully translate unknown errno.	Ting-Yu Wang
	Neither POSIX.1 nor Linux defines an upperbound for errno. PiperOrigin-RevId: 332085017
2020-09-16	Receive broadcast packets on interested endpoints	Ghanan Gowripalan
	When a broadcast packet is received by the stack, the packet should be delivered to each endpoint that may be interested in the packet. This includes all any address and specified broadcast address listeners. Test: integration_test.TestReuseAddrAndBroadcast PiperOrigin-RevId: 332060652
2020-09-15	Move reusable IPv4 test code into a testutil module and refactor it	Arthur Sfez
	The refactor aims to simplify the package, by replacing the Go channel with a PacketBuffer slice. This code will be reused by tests for IPv6 fragmentation. PiperOrigin-RevId: 331860411
2020-09-15	Don't conclude broadcast from route destination	Ghanan Gowripalan
	The routing table (in its current) form should not be used to make decisions about whether a remote address is a broadcast address or not (for IPv4). Note, a destination subnet does not always map to a network. E.g. RouterA may have a route to 192.168.0.0/22 through RouterB, but RouterB may be configured with 4x /24 subnets on 4 different interfaces. See https://github.com/google/gvisor/issues/3938. PiperOrigin-RevId: 331819868
2020-09-14	Store multicast memberships in a set	Tamir Duberstein
	This is simpler and more performant. PiperOrigin-RevId: 331639978
2020-09-14	Test RST handling in TIME_WAIT.	Mithun Iyer
	gVisor stack ignores RSTs when in TIME_WAIT which is not the default Linux behavior. Add a packetimpact test to test the same. Also update code comments to reflect the rationale for the current gVisor behavior. PiperOrigin-RevId: 331629879
2020-09-12	Cap reassembled IPv6 packets at 65535 octets	Toshi Kikuchi
	IPv4 can accept 65536-octet reassembled packets. Test: - ipv4_test.TestInvalidFragments - ipv4_test.TestReceiveFragments - ipv6.TestInvalidIPv6Fragments - ipv6.TestReceiveIPv6Fragments Fixes #3770 PiperOrigin-RevId: 331382977
2020-09-08	Increase resolution timeout for TestCacheResolution	Sam Balana
	Fixes pkg/tcpip/stack:stack_test flake experienced while running TestCacheResolution with gotsan. This occurs when the test-runner takes longer than the resolution timeout to call linkAddrCache.get. In this test we don't care about the resolution timeout, so set it to the maximum and rely on test-runner timeouts to avoid deadlocks. PiperOrigin-RevId: 330566250
2020-09-08	Fix data race in tcp.GetSockOpt.	Bhasker Hariharan
	e.ID can't be read without holding e.mu. GetSockOpt was reading e.ID when looking up OriginalDst without holding e.mu. PiperOrigin-RevId: 330562293
2020-09-08	Improve type safety for transport protocol options	Ghanan Gowripalan
	The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-03	Use fine-grained mutex for stack.cleanupEndpoints.	Bhasker Hariharan
	stack.cleanupEndpoints is protected by the stack.mu but that can cause contention as the stack mutex is already acquired in a lot of hot paths during new endpoint creation /cleanup etc. Moving this to a fine grained mutex should reduce contention on the stack.mu. PiperOrigin-RevId: 330026151
2020-09-03	Use atomic.Value for Stack.tcpProbeFunc.	Jamie Liu
	b/166980357#comment56 shows: - 837 goroutines blocked in: gvisor/pkg/sync/sync.(RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(Stack).StartTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop - 695 goroutines blocked in: gvisor/pkg/sync/sync.(RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(Stack).CompleteTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(endpoint).protocolMainLoop - 3882 goroutines blocked in: gvisor/pkg/sync/sync.(RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(Stack).GetTCPProbe gvisor/pkg/tcpip/transport/tcp/tcp.newEndpoint gvisor/pkg/tcpip/transport/tcp/tcp.(protocol).NewEndpoint gvisor/pkg/tcpip/stack/stack.(Stack).NewEndpoint All of these are contending on Stack.mu. Stack.StartTransportEndpointCleanup() and Stack.CompleteTransportEndpointCleanup() insert/delete TransportEndpoints in a map (Stack.cleanupEndpoints), and the former also does endpoint unregistration while holding Stack.mu, so it's not immediately clear how feasible it is to replace the map with a mutex-less implementation or how much doing so would help. However, Stack.GetTCPProbe() just reads a function object (Stack.tcpProbeFunc) that is almost always nil (as far as I can tell, Stack.AddTCPProbe() is only called in tests), and it's called for every new TCP endpoint. So converting it to an atomic.Value should significantly reduce contention on Stack.mu, improving TCP endpoint creation latency and allowing TCP endpoint cleanup to proceed. PiperOrigin-RevId: 330004140
2020-09-02	Fix Accept to not return error for sockets in accept queue.	Bhasker Hariharan
	Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-09-01	Fix handling of unacceptable ACKs during close.	Mithun Iyer
	On receiving an ACK with unacceptable ACK number, in a closing state, TCP, needs to reply back with an ACK with correct seq and ack numbers and remain in same state. This change is as per RFC793 page 37, but with a difference that it does not apply to ESTABLISHED state, just as in Linux. Also add more tests to check for OTW sequence number and unacceptable ack numbers in these states. Fixes #3785 PiperOrigin-RevId: 329616283
2020-09-01	Automated rollback of changelist 328350576	Nayana Bidari
	PiperOrigin-RevId: 329526153
2020-08-28	Don't bind loopback to all IPs in an IPv6 subnet	Ghanan Gowripalan
	An earlier change considered the loopback bound to all addresses in an assigned subnet. This should have only be done for IPv4 to maintain compatability with Linux: ``` $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 0 received, 100% packet loss, time 3062ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes ^C --- 2001:db8::2 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2030ms $ sudo ip addr add 2001:db8::1/64 dev lo $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes 64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=0.055 ms 64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.074 ms 64 bytes from 2001:db8::1: icmp_seq=3 ttl=64 time=0.073 ms 64 bytes from 2001:db8::1: icmp_seq=4 ttl=64 time=0.071 ms ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3075ms rtt min/avg/max/mdev = 0.055/0.068/0.074/0.007 ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes From 2001:db8::1 icmp_seq=1 Destination unreachable: No route From 2001:db8::1 icmp_seq=2 Destination unreachable: No route From 2001:db8::1 icmp_seq=3 Destination unreachable: No route From 2001:db8::1 icmp_seq=4 Destination unreachable: No route ^C --- 2001:db8::2 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3070ms ``` Test: integration_test.TestLoopbackAcceptAllInSubnet PiperOrigin-RevId: 329011566
2020-08-28	Improve type safety for network protocol options	Ghanan Gowripalan
	The existing implementation for NetworkProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for network protocol options that may be set or queried which network protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. PiperOrigin-RevId: 328980359
2020-08-28	fix panic when calling SO_ORIGINAL_DST without initializing iptables	Kevin Krakauer
	Reported-by: syzbot+074ec22c42305725b79f@syzkaller.appspotmail.com PiperOrigin-RevId: 328963899
2020-08-28	Use a single NetworkEndpoint per address	Ghanan Gowripalan
	This change was already done as of https://github.com/google/gvisor/commit/1736b2208f but https://github.com/google/gvisor/commit/a174aa7597 conflicted with that change and it was missed in reviews. This change fixes the conflict. PiperOrigin-RevId: 328920372
2020-08-27	Improve type safety for socket options	Ghanan Gowripalan
	The existing implementation for {G,S}etSockOpt take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for socket options that may be set or queried which socket option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. Fixes #3714. RELNOTES: n/a PiperOrigin-RevId: 328832161
2020-08-27	Add function to get error from a tcpip.Endpoint	Ghanan Gowripalan
	In an upcoming CL, socket option types are made to implement a marker interface with pointer receivers. Since this results in calling methods of an interface with a pointer, we incur an allocation when attempting to get an Endpoint's last error with the current implementation. When calling the method of an interface, the compiler is unable to determine what the interface implementation does with the pointer (since calling a method on an interface uses virtual dispatch at runtime so the compiler does not know what the interface method will do) so it allocates on the heap to be safe incase an implementation continues to hold the pointer after the functioon returns (the reference escapes the scope of the object). In the example below, the compiler does not know what b.foo does with the reference to a it allocates a on the heap as the reference to a may escape the scope of a. ``` var a int var b someInterface b.foo(&a) ``` This change removes the opportunity for that allocation. RELNOTES: n/a PiperOrigin-RevId: 328796559
2020-08-27	ip6tables: (de)serialize ip6tables structs	Kevin Krakauer
	More implementation+testing to follow. #3549. PiperOrigin-RevId: 328770160
2020-08-25	Use new reference count utility throughout gvisor.	Dean Deng
	This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377