gvisor - Container Runtime Sandbox

Age	Commit message	Author
2020-08-04	Update variables for implementation of RACK in TCP	Nayana Bidari
	RACK (Recent Acknowledgement) is a new loss detection algorithm in TCP. These are the fields which should be stored on connections to implement RACK algorithm. PiperOrigin-RevId: 324948703
2020-08-04	Use 1 fragmentation component per IP stack	Ghanan Gowripalan
	This will help manage memory consumption by IP reassembly when receiving IP fragments on multiple network endpoints. Previously, each endpoint would cap memory consumption at 4MB, but with this change, each IP stack will cap memory consumption at 4MB. No behaviour changes. PiperOrigin-RevId: 324913904
2020-08-03	Plumbing context.Context to DecRef() and Release().	Nayana Bidari
	context is passed to DecRef() and Release() which is needed for SO_LINGER implementation. PiperOrigin-RevId: 324672584
2020-07-31	Support fragments from different sources	Ghanan Gowripalan
	Prevent fragments with different source-destination pairs from conflicting with each other. Test: - ipv6_test.TestReceiveIPv6Fragments - ipv4_test.TestReceiveIPv6Fragments PiperOrigin-RevId: 324283246
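One way to keep fragments from different source-destination pairs apart is to make the addresses part of the reassembly key. The following is a minimal sketch of that idea, not the netstack implementation: `fragmentKey`, `reassembler`, and the use of the IP identification value as the third key field are illustrative assumptions.

```go
package main

import "fmt"

// fragmentKey is a hypothetical reassembly key. Because the source and
// destination are part of the key, two senders that reuse the same
// identification value never share a reassembly slot.
type fragmentKey struct {
	src, dst string
	id       uint32
}

// reassembler is a hypothetical holder of fragments awaiting reassembly.
type reassembler struct {
	pending map[fragmentKey][][]byte
}

func newReassembler() *reassembler {
	return &reassembler{pending: make(map[fragmentKey][][]byte)}
}

// add stores a fragment and returns how many fragments are held for its key.
func (r *reassembler) add(k fragmentKey, payload []byte) int {
	r.pending[k] = append(r.pending[k], payload)
	return len(r.pending[k])
}

func main() {
	r := newReassembler()
	a := fragmentKey{src: "10.0.0.1", dst: "10.0.0.9", id: 7}
	b := fragmentKey{src: "10.0.0.2", dst: "10.0.0.9", id: 7}
	// Same id, different sources: the fragments are tracked separately.
	fmt.Println(r.add(a, []byte("x")), r.add(b, []byte("y")), r.add(a, []byte("z")))
}
```

Struct keys are comparable in Go, so they can be used as map keys directly without hashing the fields by hand.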
2020-07-31	iptables: support SO_ORIGINAL_DST	Kevin Krakauer
	Envoy (#170) uses this to get the original destination of redirected packets.
2020-07-30	Fix TCP CurrentConnected counter updates.	Mithun Iyer
	CurrentConnected counter is incorrectly decremented on close of an endpoint which is still not connected. Fixes #3443 PiperOrigin-RevId: 324155171
2020-07-30	Revert change to default buffer size.	Bhasker Hariharan
	In https://github.com/google/gvisor/commit/ca6bded95dbce07f9683904b4b768dfc2d4a09b2 we reduced the default buffer size to 32KB. This mostly works fine, except at high throughput, where we hit zero window very quickly and TCP receive buffer moderation is not able to grow the window. This can be seen in the benchmarks: with a 32KB buffer and 100 connections downloading a 10MB file we get about 30 requests/s, whereas a 1MB buffer gives about 53 requests/s. A proper fix requires a few changes to when we send a zero window as well as when we decide to send a zero window update. Today we consider available space below 1 MSS as zero and send an update when it crosses 1 MSS of available space. This threshold is far too low and results in the window staying very small once we hit a zero window condition, as we keep sending updates with a size barely over 1 MSS. Linux and BSD are smarter about this and use different thresholds. We should separately update our logic to match Linux or BSD so that we neither send window updates that are really tiny nor wait until we drop below 1 MSS to advertise a zero window. PiperOrigin-RevId: 324087019
2020-07-30	Enforce fragment block size and validate args	Ghanan Gowripalan
	Allow configuring fragmentation.Fragmentation with a fragment block size which will be enforced when processing fragments. Also validate arguments when processing fragments. Test: - fragmentation.TestErrors - ipv6_test.TestReceiveIPv6Fragments - ipv4_test.TestReceiveIPv6Fragments PiperOrigin-RevId: 324081521
2020-07-30	Implement neighbor unreachability detection for ARP and NDP.	Sam Balana
	This change implements the Neighbor Unreachability Detection (NUD) state machine, as per RFC 4861 [1]. The state machine operates on a single neighbor in the local network. This requires the state machine to be implemented on each entry of the neighbor table.
	This change also adds, but does not expose, several APIs. The first API is for performing basic operations on the neighbor table:
	- Create a static entry
	- List all entries
	- Delete all entries
	- Remove an entry by address
	The second API is used for changing the NUD protocol constants on a per-NIC basis to allow Neighbor Discovery to operate over links with widely varying performance characteristics. See [RFC 4861 Section 10][2] for the list of constants.
	Finally, the last API is for allowing users to subscribe to NUD state changes. See [RFC 4861 Appendix C][3] for the list of edges.
	[1]: https://tools.ietf.org/html/rfc4861
	[2]: https://tools.ietf.org/html/rfc4861#section-10
	[3]: https://tools.ietf.org/html/rfc4861#appendix-C
	Tests:
	pkg/tcpip/stack:stack_test
	- TestNeighborCacheAddStaticEntryThenOverflow
	- TestNeighborCacheClear
	- TestNeighborCacheClearThenOverflow
	- TestNeighborCacheConcurrent
	- TestNeighborCacheDuplicateStaticEntryWithDifferentLinkAddress
	- TestNeighborCacheDuplicateStaticEntryWithSameLinkAddress
	- TestNeighborCacheEntry
	- TestNeighborCacheEntryNoLinkAddress
	- TestNeighborCacheGetConfig
	- TestNeighborCacheKeepFrequentlyUsed
	- TestNeighborCacheNotifiesWaker
	- TestNeighborCacheOverflow
	- TestNeighborCacheOverwriteWithStaticEntryThenOverflow
	- TestNeighborCacheRemoveEntry
	- TestNeighborCacheRemoveEntryThenOverflow
	- TestNeighborCacheRemoveStaticEntry
	- TestNeighborCacheRemoveStaticEntryThenOverflow
	- TestNeighborCacheRemoveWaker
	- TestNeighborCacheReplace
	- TestNeighborCacheResolutionFailed
	- TestNeighborCacheResolutionTimeout
	- TestNeighborCacheSetConfig
	- TestNeighborCacheStaticResolution
	- TestEntryAddsAndClearsWakers
	- TestEntryDelayToProbe
	- TestEntryDelayToReachableWhenSolicitedOverrideConfirmation
	- TestEntryDelayToReachableWhenUpperLevelConfirmation
	- TestEntryDelayToStaleWhenConfirmationWithDifferentAddress
	- TestEntryDelayToStaleWhenProbeWithDifferentAddress
	- TestEntryFailedGetsDeleted
	- TestEntryIncompleteToFailed
	- TestEntryIncompleteToIncompleteDoesNotChangeUpdatedAt
	- TestEntryIncompleteToReachable
	- TestEntryIncompleteToReachableWithRouterFlag
	- TestEntryIncompleteToStale
	- TestEntryInitiallyUnknown
	- TestEntryProbeToFailed
	- TestEntryProbeToReachableWhenSolicitedConfirmationWithSameAddress
	- TestEntryProbeToReachableWhenSolicitedOverrideConfirmation
	- TestEntryProbeToStaleWhenConfirmationWithDifferentAddress
	- TestEntryProbeToStaleWhenProbeWithDifferentAddress
	- TestEntryReachableToStaleWhenConfirmationWithDifferentAddress
	- TestEntryReachableToStaleWhenConfirmationWithDifferentAddressAndOverride
	- TestEntryReachableToStaleWhenProbeWithDifferentAddress
	- TestEntryReachableToStaleWhenTimeout
	- TestEntryStaleToDelay
	- TestEntryStaleToReachableWhenSolicitedOverrideConfirmation
	- TestEntryStaleToStaleWhenOverrideConfirmation
	- TestEntryStaleToStaleWhenProbeUpdateAddress
	- TestEntryStaysDelayWhenOverrideConfirmationWithSameAddress
	- TestEntryStaysProbeWhenOverrideConfirmationWithSameAddress
	- TestEntryStaysReachableWhenConfirmationWithRouterFlag
	- TestEntryStaysReachableWhenProbeWithSameAddress
	- TestEntryStaysStaleWhenProbeWithSameAddress
	- TestEntryUnknownToIncomplete
	- TestEntryUnknownToStale
	- TestEntryUnknownToUnknownWhenConfirmationWithUnknownAddress
	pkg/tcpip/stack:stack_x_test
	- TestDefaultNUDConfigurations
	- TestNUDConfigurationFailsForNotSupported
	- TestNUDConfigurationsBaseReachableTime
	- TestNUDConfigurationsDelayFirstProbeTime
	- TestNUDConfigurationsMaxMulticastProbes
	- TestNUDConfigurationsMaxRandomFactor
	- TestNUDConfigurationsMaxUnicastProbes
	- TestNUDConfigurationsMinRandomFactor
	- TestNUDConfigurationsRetransmitTimer
	- TestNUDConfigurationsUnreachableTime
	- TestNUDStateReachableTime
	- TestNUDStateRecomputeReachableTime
	- TestSetNUDConfigurationFailsForBadNICID
	- TestSetNUDConfigurationFailsForNotSupported
	Updates #1889 Updates #1894 Updates #1895 Updates #1947 Updates #1948 Updates #1949 Updates #1950 PiperOrigin-RevId: 324070795
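A per-entry NUD state machine of the kind described above can be sketched as a pure transition function. This is a heavily reduced illustration, not the code added by the commit: the type names, the event set, and the handful of edges shown (loosely following RFC 4861 and the test names listed in this commit) are assumptions, and details such as override flags and differing link addresses are ignored.

```go
package main

import "fmt"

// nudState is a hypothetical enumeration of neighbor entry states.
type nudState int

const (
	unknown nudState = iota
	incomplete
	reachable
	stale
	delay
	probe
	failed
)

// nudEvent is a hypothetical, reduced set of inputs to the state machine.
type nudEvent int

const (
	resolutionStarted     nudEvent = iota // link address resolution begins
	solicitedConfirmation                 // solicited reply for the cached address
	reachableTimeExpired                  // no confirmation within the reachable time
	packetQueued                          // traffic sent to a stale neighbor
	delayTimerExpired                     // delay period passed without confirmation
	probesExhausted                       // all probes went unanswered
)

// next returns the state that follows s when event e occurs. Events that do
// not apply to the current state leave it unchanged.
func next(s nudState, e nudEvent) nudState {
	switch e {
	case resolutionStarted:
		if s == unknown {
			return incomplete
		}
	case solicitedConfirmation:
		if s == incomplete || s == stale || s == delay || s == probe {
			return reachable
		}
	case reachableTimeExpired:
		if s == reachable {
			return stale
		}
	case packetQueued:
		if s == stale {
			return delay
		}
	case delayTimerExpired:
		if s == delay {
			return probe
		}
	case probesExhausted:
		if s == incomplete || s == probe {
			return failed
		}
	}
	return s
}

func main() {
	s := unknown
	events := []nudEvent{
		resolutionStarted, solicitedConfirmation, reachableTimeExpired,
		packetQueued, delayTimerExpired, probesExhausted,
	}
	for _, e := range events {
		s = next(s, e)
	}
	fmt.Println(s == failed)
}
```

Keeping the transitions in one side-effect-free function makes each edge easy to test in isolation, which matches the one-test-per-edge naming seen in the test list.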
2020-07-30	Use broadcast MAC for broadcast IPv4 packets	Ghanan Gowripalan
	When sending packets to a known network's broadcast address, use the broadcast MAC address. Test: - stack_test.TestOutgoingSubnetBroadcast - udp_test.TestOutgoingSubnetBroadcast PiperOrigin-RevId: 324062407
2020-07-28	Redirect TODO to GitHub issues	Fabricio Voznika
	PiperOrigin-RevId: 323715260
2020-07-27	Add ability to send unicast ARP requests and Neighbor Solicitations	Sam Balana
	The previous implementation of LinkAddressRequest only supported sending broadcast ARP requests and multicast Neighbor Solicitations. The ability to send these packets as unicast is required for Neighbor Unreachability Detection. Tests: pkg/tcpip/network/arp:arp_test - TestLinkAddressRequest pkg/tcpip/network/ipv6:ipv6_test - TestLinkAddressRequest Updates #1889 Updates #1894 Updates #1895 Updates #1947 Updates #1948 Updates #1949 Updates #1950 PiperOrigin-RevId: 323451569
2020-07-27	Fix memory accounting in TCP pending segment queue.	Bhasker Hariharan
	TCP now tracks the overhead of the segment structure itself in its out-of-order queue (pending). This is required to ensure that a malicious sender sending 1-byte out-of-order segments cannot queue thousands of segments and bloat memory usage. We also reduce the default receive window to 32KB. With TCP moderation there is no need to keep this window at 1MB, which means that for new connections the default out-of-order queue will be small unless the application actually reads the data that is being sent. This prevents a sender from maliciously filling up the pending buffer with lots of tiny out-of-order segments. PiperOrigin-RevId: 323450913
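The effect of charging per-segment overhead can be illustrated with a small sketch. The names below and the 64-byte overhead are assumptions chosen for illustration only; the real structure size and accounting code in netstack differ.

```go
package main

import "fmt"

// segmentOverhead is a hypothetical fixed cost charged for every queued
// segment, standing in for the size of the segment structure itself.
const segmentOverhead = 64

// pendingQueue is a hypothetical out-of-order queue with a byte budget.
type pendingQueue struct {
	used, limit int
}

// tryAdd charges payload plus overhead and rejects the segment when the
// budget would be exceeded.
func (q *pendingQueue) tryAdd(payloadLen int) bool {
	cost := payloadLen + segmentOverhead
	if q.used+cost > q.limit {
		return false
	}
	q.used += cost
	return true
}

// countAccepted reports how many segments of the given payload size fit.
func countAccepted(limit, payloadLen int) int {
	q := pendingQueue{limit: limit}
	n := 0
	for q.tryAdd(payloadLen) {
		n++
	}
	return n
}

func main() {
	// With a 32KB budget, 1-byte segments are limited by overhead, not payload.
	fmt.Println(countAccepted(32<<10, 1)) // prints 504
}
```

Without the overhead term the same budget would admit 32768 one-byte segments, each still carrying the full cost of its bookkeeping structure.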
2020-07-23	Add AfterFunc to tcpip.Clock	Sam Balana
	Changes the API of tcpip.Clock to also provide a method for scheduling and rescheduling work after a specified duration. This change also implements the AfterFunc method for existing implementations of tcpip.Clock. This is the groundwork required to mock time within tests. All references to CancellableTimer have been replaced with the tcpip.Job interface, allowing for custom implementations of scheduling work. This is a BREAKING CHANGE for clients that implement their own tcpip.Clock or use tcpip.CancellableTimer.
	Migration plan:
	1. Add AfterFunc(d, f) to tcpip.Clock
	2. Replace references of tcpip.CancellableTimer with tcpip.Job
	3. Replace calls to tcpip.CancellableTimer#StopLocked with tcpip.Job#Cancel
	4. Replace calls to tcpip.CancellableTimer#Reset with tcpip.Job#Schedule
	5. Replace calls to tcpip.NewCancellableTimer with tcpip.NewJob.
	PiperOrigin-RevId: 322906897
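To show why a clock that can schedule work is useful for mocking time, here is a minimal sketch of a manually advanced clock. The `Clock` interface below is a simplified, hypothetical stand-in and does not reproduce the actual tcpip.Clock signature; `manualClock` and `Advance` are likewise illustrative names.

```go
package main

import (
	"fmt"
	"time"
)

// Clock is a simplified, hypothetical interface with an AfterFunc method.
type Clock interface {
	AfterFunc(d time.Duration, f func())
}

type scheduled struct {
	at time.Duration
	f  func()
}

// manualClock is a fake clock for tests. Time moves only when Advance is
// called. It is not safe for concurrent use.
type manualClock struct {
	now     time.Duration
	pending []scheduled
}

var _ Clock = (*manualClock)(nil)

// AfterFunc records f to run once the clock has advanced by d.
func (c *manualClock) AfterFunc(d time.Duration, f func()) {
	c.pending = append(c.pending, scheduled{at: c.now + d, f: f})
}

// Advance moves time forward by d and runs every callback that became due.
func (c *manualClock) Advance(d time.Duration) {
	c.now += d
	var due, keep []scheduled
	for _, s := range c.pending {
		if s.at <= c.now {
			due = append(due, s)
		} else {
			keep = append(keep, s)
		}
	}
	c.pending = keep
	for _, s := range due {
		s.f()
	}
}

func main() {
	c := &manualClock{}
	fired := 0
	c.AfterFunc(5*time.Second, func() { fired++ })
	c.Advance(3 * time.Second)
	fmt.Println(fired) // prints 0: only 3s have elapsed
	c.Advance(2 * time.Second)
	fmt.Println(fired) // prints 1: the 5s deadline has been reached
}
```

A test built on such a clock can exercise timeouts deterministically and without sleeping.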
2020-07-23	iptables: use keyed array literals	Kevin Krakauer
	PiperOrigin-RevId: 322882426
2020-07-23	Merge pull request #3207 from kevinGC:icmp-connect	gVisor bot
	PiperOrigin-RevId: 322853192
2020-07-23	Fix wildcard bind for raw socket.	Bhasker Hariharan
	Fixes #3334 PiperOrigin-RevId: 322846384
2020-07-22	make connect(2) fail when dest is unreachable	Kevin Krakauer
	Previously, ICMP destination unreachable datagrams were ignored by TCP endpoints. This caused connect to hang when an intermediate router couldn't find a route to the host. This manifested as a Kokoro error when Docker IPv6 was enabled. The Ruby image test would try to install the sinatra gem and hang indefinitely attempting to use an IPv6 address. Fixes #3079.
2020-07-22	iptables: don't NAT existing connections	Kevin Krakauer
	Fixes a NAT bug that manifested as:
	- A SYN was sent from gVisor to another host, unaffected by iptables.
	- The corresponding SYN/ACK was NATted by a PREROUTING REDIRECT rule despite being part of the existing connection.
	- The socket that sent the SYN never received the SYN/ACK and thus a connection could not be established.
	We handle this (as Linux does) by tracking all connections, inserting a no-op conntrack rule for new connections with no rules of their own. Needed for Istio support (#170).
2020-07-22	iptables: replace maps with arrays	Kevin Krakauer
	For iptables users, Check() is a hot path called for every packet one or more times. Let's avoid a bunch of map lookups. PiperOrigin-RevId: 322678699
2020-07-22	Support for receiving outbound packets in AF_PACKET.	Bhasker Hariharan
	Updates #173 PiperOrigin-RevId: 322665518
2020-07-16	Add support to return protocol in recvmsg for AF_PACKET.	Bhasker Hariharan
	Updates #173 PiperOrigin-RevId: 321690756
2020-07-16	Add ethernet broadcast address constant	Ghanan Gowripalan
	PiperOrigin-RevId: 321620517
2020-07-15	iptables: remove check for NetworkHeader	Kevin Krakauer
	This is no longer necessary, as we always set NetworkHeader before calling iptables.Check. PiperOrigin-RevId: 321461978
2020-07-15	fdbased: Vectorized write for packet; relax writev syscall filter.	Ting-Yu Wang
	Now it calls pkt.Data.ToView() when writing the packet. This may require copying when the packet is large, which makes the worst case even worse. This is sent out as a separate preparation change because it requires syscall filter changes. It will be followed by the change that adopts the new PacketHeader API. PiperOrigin-RevId: 321447003
2020-07-15	Add support for SO_ERROR to packet sockets.	Bhasker Hariharan
	Packet sockets also seem to allow double binding and do not return an error on Linux. This was tested by running the syscall test in a Linux namespace as root: the current DoubleBind test fails at HEAD and passes after this change. Updates #173 PiperOrigin-RevId: 321445137
2020-07-15	Fix minor bugs in a couple of interface IOCTLs.	Bhasker Hariharan
	gVisor returns the wrong ARP type for SIOCGIFHWADDR. This breaks tcpdump, as it then interprets the packets incorrectly. Similarly, SIOCETHTOOL is used by tcpdump to query interface properties, which fails with EINVAL since we don't implement it. For now, change it to return EOPNOTSUPP to indicate that we don't support the query rather than returning EINVAL. NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities and NIC flags. In Linux all three exist: ARPHRD types are stored in the dev->type field; NIC capabilities are more like device features, which can be queried using SIOCETHTOOL but not modified; and NIC flags are fields that can be modified from user space, e.g. NIC status (UP/DOWN/MULTICAST/BROADCAST). Updates #2746 PiperOrigin-RevId: 321436525
2020-07-13	Merge pull request #2672 from amscanne:shim-integrated	gVisor bot
	PiperOrigin-RevId: 321053634
2020-07-13	Fix recvMMsgDispatcher not slicing link header correctly.	Ting-Yu Wang
	PiperOrigin-RevId: 321035635
2020-07-13	garbage collect connections	Kevin Krakauer
	As in Linux, we must periodically clean up unused connections. PiperOrigin-RevId: 321003353
2020-07-12	Do not copy sleep.Waker	Ghanan Gowripalan
	sleep.Waker's fields are modified as values. PiperOrigin-RevId: 320873451
2020-07-11	Stub out SO_DETACH_FILTER.	Bhasker Hariharan
	Updates #2746 PiperOrigin-RevId: 320757963
2020-07-09	Discard multicast UDP source address.	gVisor bot
	RFC 1122 (and others) specify that UDP should not receive datagrams that have a source address that is a multicast address. Packets should never be received FROM a multicast address. See also RFC 768, 'User Datagram Protocol', J. Postel, ISI, 28 August 1980: a UDP datagram received with an invalid IP source address (e.g., a broadcast or multicast address) must be discarded by UDP or by the IP layer (see RFC 1122 Section 3.2.1.3). This CL does not address TCP or broadcast, which are more complicated. Also adds a test for both IPv6 and IPv4 UDP. Fixes #3154 PiperOrigin-RevId: 320547674
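The source-address check described above can be sketched with the Go standard library. This is an illustration only: `acceptSource` is a hypothetical helper, not the function changed by the commit, and, like the commit, it does not attempt to detect broadcast sources.

```go
package main

import (
	"fmt"
	"net"
)

// acceptSource reports whether a UDP datagram with the given source address
// should be delivered. Multicast sources (IPv4 or IPv6) and unparseable
// addresses are rejected.
func acceptSource(src string) bool {
	ip := net.ParseIP(src)
	return ip != nil && !ip.IsMulticast()
}

func main() {
	for _, src := range []string{"192.0.2.1", "224.0.0.1", "2001:db8::1", "ff02::1"} {
		fmt.Println(src, acceptSource(src))
	}
}
```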
2020-07-09	Add support for IP_HDRINCL IP option for raw sockets.	Bhasker Hariharan
	Updates #2746 Fixes #3158 PiperOrigin-RevId: 320497190
2020-07-08	Avoid accidental zero-checksum	Tamir Duberstein
	PiperOrigin-RevId: 320250773
2020-07-07	Set IPv4 ID on all non-atomic datagrams	Tony Gong
	RFC 6864 imposes various restrictions on the uniqueness of the IPv4 Identification field for non-atomic datagrams, defined as an IP datagram that either can be fragmented (DF=0) or is already a fragment (MF=1 or positive fragment offset). In order to be compliant, the ID field is assigned for all non-atomic datagrams. Add a TCP unit test that induces retransmissions and checks that the IPv4 ID field is unique every time. Add basic handling of the IP_MTU_DISCOVER socket option so that the option can be used to disable PMTU discovery, effectively setting DF=0. Attempting to set the sockopt to anything other than disabled will fail because PMTU discovery is currently not implemented, and the default behavior matches that of disabled. PiperOrigin-RevId: 320081842
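The definition of a non-atomic datagram given above translates directly into a predicate. The sketch below is illustrative: `isAtomic`, `nextID`, the shared counter, and the choice to leave the ID at zero for atomic datagrams are assumptions, not the netstack implementation.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// isAtomic reports whether a datagram is atomic in the RFC 6864 sense: it
// cannot be fragmented (DF=1) and is not itself a fragment (MF=0, offset 0).
func isAtomic(df, mf bool, fragOffset uint16) bool {
	return df && !mf && fragOffset == 0
}

// idCounter is a hypothetical stack-wide source of identification values.
var idCounter uint32

// nextID assigns an identification value to every non-atomic datagram.
// Atomic datagrams get zero here, which is an assumption of this sketch.
// The counter wraps after 65536 values, so uniqueness holds only within
// that window.
func nextID(df, mf bool, fragOffset uint16) uint16 {
	if isAtomic(df, mf, fragOffset) {
		return 0
	}
	return uint16(atomic.AddUint32(&idCounter, 1))
}

func main() {
	fmt.Println(isAtomic(true, false, 0))  // DF set, not a fragment
	fmt.Println(isAtomic(false, false, 0)) // DF clear, so it may be fragmented
	first, second := nextID(false, false, 0), nextID(false, false, 0)
	fmt.Println(first != second)
}
```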
2020-07-07	icmp: When setting TransportHeader, remove from the Data portion.	Ting-Yu Wang
	The current convention is when a header is set to pkt.XxxHeader field, it gets removed from pkt.Data. ICMP does not currently follow this convention. PiperOrigin-RevId: 320078606
2020-07-06	Add support for SO_RCVBUF/SO_SNDBUF for AF_PACKET sockets.	Bhasker Hariharan
	Updates #2746 PiperOrigin-RevId: 319887810
2020-07-06	Fix NonBlockingWrite3 not writing b3 if b2 is zero-length.	Ting-Yu Wang
	PiperOrigin-RevId: 319882171
2020-07-06	Shard some slow tests.	Ting-Yu Wang
	stack_x_test: 2m -> 20s
	tcp_x_test: 80s -> 25s
	PiperOrigin-RevId: 319828101
2020-07-06	Remove dependency on pkg/binary	Tamir Duberstein
	PiperOrigin-RevId: 319770124
2020-07-05	Add wakers synchronously	Tamir Duberstein
	Avoid a race where an arbitrary goroutine scheduling delay can cause the processor to miss events and hang indefinitely. Reduce allocations by storing processors by-value in the dispatcher, and by using a single WaitGroup rather than one per processor. PiperOrigin-RevId: 319665861
2020-07-01	TCP receive should block when in SYN-SENT state.	Mithun Iyer
	The application can choose to initiate a non-blocking connect and later block on a read, when the endpoint is still in SYN-SENT state. PiperOrigin-RevId: 319311016
2020-06-30	Fix two bugs in TCP sender.	Bhasker Hariharan
	a) When GSO is in use, we should not cap the segment to maxPayloadSize in sender.maybeSendSegment, as the GSO logic will cap the segment to the correct size. Without this, host GSO is not used, because we end up breaking large segments into small MSS-sized segments before writing the packets to the host.
	b) The check to not split a segment due to it not fitting in the receiver window when there are pending segments is incorrect, as segments in writeList can be really large: we just take the write call's buffer size and create a single large segment. So a write of, say, 128KB will be just 1 segment in the writeList. The Linux code checks whether one MSS-sized segment fits in the receiver's window and, if not, does not split the current segment. gVisor's check was incorrect in that it checked whether the whole segment, which could be much larger than 1 MSS, would fit in the receiver's window. This caused us to prematurely stop sending and fall back to the retransmit timer/probe from the other end to send data. This was seen when running HTTPD benchmarks, where at HEAD the benchmark took forever to run when sending large files.
	The tcp_splitseg_mss_test.go is being deleted, as the test as written doesn't correctly test what is intended. This is because GSO is enabled by default, and the reason the MSS+1 sized segment is sent is that GSO is in use. A proper test will require disabling GSO on Linux and netstack, which is going to take a bit of work in packetimpact to do correctly. Separately, a new test should probably be written that verifies that a segment > availableWindow is not split if the availableWindow is < 1 MSS.
	Fixes #3107 PiperOrigin-RevId: 319172089
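The corrected split decision described in (b) can be summarized in a few lines. This is a simplified sketch under stated assumptions: `shouldSplit` and its parameters are hypothetical, and the real sender logic takes further state (such as GSO and pending segments) into account.

```go
package main

import "fmt"

// shouldSplit reports whether a segment that does not fit in the receiver's
// available window should be split so that part of it can be sent now.
// The decision compares the window against one MSS, not against the whole
// segment, which may be far larger than one MSS.
func shouldSplit(segLen, availableWindow, mss int) bool {
	if segLen <= availableWindow {
		return false // the segment fits as-is, so no split is needed
	}
	return availableWindow >= mss
}

func main() {
	const mss = 1460
	fmt.Println(shouldSplit(128<<10, 4000, mss)) // window holds at least one MSS
	fmt.Println(shouldSplit(128<<10, 1000, mss)) // window is below one MSS
	fmt.Println(shouldSplit(1000, 4000, mss))    // segment already fits
}
```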
2020-06-30	Avoid multiple atomic loads	Tamir Duberstein
	...by calling (tcp.endpoint).EndpointState only once when possible. Avoid wrapping (sleep.Waker).Assert in a useless func while I'm here. PiperOrigin-RevId: 319074149
2020-06-26	IPv6 raw sockets. Needed for ip6tables.	Kevin Krakauer
	IPv6 raw sockets never include the IPv6 header. PiperOrigin-RevId: 318582989
2020-06-26	Implement SO_NO_CHECK socket option.	gVisor bot
	SO_NO_CHECK is used to skip the UDP checksum generation on a TX socket (UDP checksum is optional on IPv4). Test: - TestNoChecksum - SoNoCheckOffByDefault (UdpSocketTest) - SoNoCheck (UdpSocketTest) Fixes #3055 PiperOrigin-RevId: 318575215
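How a sender might choose the value of the UDP checksum field over IPv4 can be sketched as follows. `checksumField` and its parameters are hypothetical names. The convention itself comes from RFC 768: a transmitted checksum of zero means that no checksum was generated, so a computed checksum of zero is sent as all ones.

```go
package main

import "fmt"

// checksumField returns the value to place in the UDP header checksum field
// for an IPv4 datagram. computed is the one's-complement checksum over the
// pseudo-header and payload; noCheck models a socket with SO_NO_CHECK set.
func checksumField(computed uint16, noCheck bool) uint16 {
	if noCheck {
		return 0 // zero on the wire means "no checksum generated"
	}
	if computed == 0 {
		return 0xffff // avoid signalling "no checksum" by accident
	}
	return computed
}

func main() {
	fmt.Printf("%#04x\n", checksumField(0x1234, false))
	fmt.Printf("%#04x\n", checksumField(0x0000, false))
	fmt.Printf("%#04x\n", checksumField(0x1234, true))
}
```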
2020-06-25	conntrack refactor, no behavior changes	Kevin Krakauer
	- Split connTrackForPacket into 2 functions instead of switching on flag
	- Replace hash with struct keys
	- Remove prefixes where possible
	- Remove unused connStatus, timeout
	- Flatten ConnTrack struct a bit: some intermediate structs had no meaning outside of the context of their parent
	- Protect conn.tcb with a mutex
	- Remove redundant error checking (e.g. when is pkt.NetworkHeader valid)
	- Clarify that HandlePacket and CreateConnFor are the expected entrypoints for ConnTrack
	PiperOrigin-RevId: 318407168
2020-06-24	Add support for Stack level options.	Bhasker Hariharan
	Linux controls socket send/receive buffers using a few sysctl variables:
	- net.core.rmem_default
	- net.core.rmem_max
	- net.core.wmem_max
	- net.core.wmem_default
	- net.ipv4.tcp_rmem
	- net.ipv4.tcp_wmem
	The first 4 control the default socket buffer sizes for all sockets (raw/packet/tcp/udp) and also the maximum permitted socket buffer that can be specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...). The last two control the TCP auto-tuning limits and override the default specified in rmem_default/wmem_default as well as the max limits.
	Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses them to limit the maximum size in setsockopt() as well as for raw/udp sockets. This changelist introduces the other 4 and updates the udp/raw sockets to use the newly introduced variables. The values for min/max match the current tcp_rmem/wmem values, and the default buffer size for UDP/RAW sockets is updated to match the Linux value of 212KiB, up from the really low current value of 32KiB.
	Updates #3043 Fixes #3043 PiperOrigin-RevId: 318089805
2020-06-23	Add support for SO_REUSEADDR to TCP sockets/endpoints.	Ian Gudger
	For TCP sockets, SO_REUSEADDR relaxes the rules for binding addresses. gVisor/netstack already supported a behavior similar to SO_REUSEADDR, but did not allow disabling it. This change brings the SO_REUSEADDR behavior closer to the behavior implemented by Linux and adds a new SO_REUSEADDR disabled behavior. Like Linux, SO_REUSEADDR is now disabled by default. PiperOrigin-RevId: 317984380