summaryrefslogtreecommitdiffhomepage
path: root/pkg/tcpip/transport
AgeCommit message (Collapse)Author
2020-09-24Change segment/pending queue to use receive buffer limits.Bhasker Hariharan
segment_queue today has its own standalone limit of MaxUnprocessedSegments but this can be a problem in UnlockUser() we do not release the lock till there are segments to be processed. What can happen is as handleSegments dequeues packets more keep getting queued and we will never release the lock. This can keep happening even if the receive buffer is full because nothing can read() till we release the lock. Further having a separate limit for pending segments makes it harder to track memory usage etc. Unifying the limits makes it easier to reason about memory in use and makes the overall buffer behaviour more consistent. PiperOrigin-RevId: 333508122
2020-09-23Extract ICMP error sender from UDPJulian Elischer
Store transport protocol number on packet buffers for use in ICMP error generation. Updates #2211. PiperOrigin-RevId: 333252762
2020-09-18Enqueue TCP sends arriving in SYN_SENT state.Mithun Iyer
TCP needs to enqueue any send requests arriving when the connection is in SYN_SENT state. The data should be sent out soon after completion of the connection handshake. Fixes #3995 PiperOrigin-RevId: 332482041
2020-09-18Use common parsing utilities when sniffingGhanan Gowripalan
Extract parsing utilities so they can be used by the sniffer. Fixes #3930 PiperOrigin-RevId: 332401880
2020-09-17{Set,Get} SO_LINGER on all endpoints.Nayana Bidari
SO_LINGER is a socket level option and should be stored on all endpoints even though it is used to linger only for TCP endpoints. PiperOrigin-RevId: 332369252
2020-09-16Automated rollback of changelist 329526153Nayana Bidari
PiperOrigin-RevId: 332097286
2020-09-16Receive broadcast packets on interested endpointsGhanan Gowripalan
When a broadcast packet is received by the stack, the packet should be delivered to each endpoint that may be interested in the packet. This includes all any address and specified broadcast address listeners. Test: integration_test.TestReuseAddrAndBroadcast PiperOrigin-RevId: 332060652
2020-09-15Don't conclude broadcast from route destinationGhanan Gowripalan
The routing table (in its current) form should not be used to make decisions about whether a remote address is a broadcast address or not (for IPv4). Note, a destination subnet does not always map to a network. E.g. RouterA may have a route to 192.168.0.0/22 through RouterB, but RouterB may be configured with 4x /24 subnets on 4 different interfaces. See https://github.com/google/gvisor/issues/3938. PiperOrigin-RevId: 331819868
2020-09-14Store multicast memberships in a setTamir Duberstein
This is simpler and more performant. PiperOrigin-RevId: 331639978
2020-09-14Test RST handling in TIME_WAIT.Mithun Iyer
gVisor stack ignores RSTs when in TIME_WAIT which is not the default Linux behavior. Add a packetimpact test to test the same. Also update code comments to reflect the rationale for the current gVisor behavior. PiperOrigin-RevId: 331629879
2020-09-08Fix data race in tcp.GetSockOpt.Bhasker Hariharan
e.ID can't be read without holding e.mu. GetSockOpt was reading e.ID when looking up OriginalDst without holding e.mu. PiperOrigin-RevId: 330562293
2020-09-08Improve type safety for transport protocol optionsGhanan Gowripalan
The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-02Fix Accept to not return error for sockets in accept queue.Bhasker Hariharan
Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-09-01Fix handling of unacceptable ACKs during close.Mithun Iyer
On receiving an ACK with unacceptable ACK number, in a closing state, TCP, needs to reply back with an ACK with correct seq and ack numbers and remain in same state. This change is as per RFC793 page 37, but with a difference that it does not apply to ESTABLISHED state, just as in Linux. Also add more tests to check for OTW sequence number and unacceptable ack numbers in these states. Fixes #3785 PiperOrigin-RevId: 329616283
2020-09-01Automated rollback of changelist 328350576Nayana Bidari
PiperOrigin-RevId: 329526153
2020-08-27Improve type safety for socket optionsGhanan Gowripalan
The existing implementation for {G,S}etSockOpt take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for socket options that may be set or queried which socket option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. Fixes #3714. RELNOTES: n/a PiperOrigin-RevId: 328832161
2020-08-27Add function to get error from a tcpip.EndpointGhanan Gowripalan
In an upcoming CL, socket option types are made to implement a marker interface with pointer receivers. Since this results in calling methods of an interface with a pointer, we incur an allocation when attempting to get an Endpoint's last error with the current implementation. When calling the method of an interface, the compiler is unable to determine what the interface implementation does with the pointer (since calling a method on an interface uses virtual dispatch at runtime so the compiler does not know what the interface method will do) so it allocates on the heap to be safe incase an implementation continues to hold the pointer after the functioon returns (the reference escapes the scope of the object). In the example below, the compiler does not know what b.foo does with the reference to a it allocates a on the heap as the reference to a may escape the scope of a. ``` var a int var b someInterface b.foo(&a) ``` This change removes the opportunity for that allocation. RELNOTES: n/a PiperOrigin-RevId: 328796559
2020-08-25Only send an ICMP error message if UDP checksum is valid.Toshi Kikuchi
Test: - TestV4UnknownDestination - TestV6UnknownDestination PiperOrigin-RevId: 328424137
2020-08-25Add option to replace linkAddrCache with neighborCacheSam Balana
This change adds an option to replace the current implementation of ARP through linkAddrCache, with an implementation of NUD through neighborCache. Switching to using NUD for both ARP and NDP is beneficial for the reasons described by RFC 4861 Section 3.1: "[Using NUD] significantly improves the robustness of packet delivery in the presence of failing routers, partially failing or partitioned links, or nodes that change their link-layer addresses. For instance, mobile nodes can move off-link without losing any connectivity due to stale ARP caches." "Unlike ARP, Neighbor Unreachability Detection detects half-link failures and avoids sending traffic to neighbors with which two-way connectivity is absent." Along with these changes exposes the API for querying and operating the neighbor cache. Operations include: - Create a static entry - List all entries - Delete all entries - Remove an entry by address This also exposes the API to change the NUD protocol constants on a per-NIC basis to allow Neighbor Discovery to operate over links with widely varying performance characteristics. See [RFC 4861 Section 10][1] for the list of constants. Finally, an API for subscribing to NUD state changes is exposed through NUDDispatcher. See [RFC 4861 Appendix C][3] for the list of edges. Tests: pkg/tcpip/network/arp:arp_test + TestDirectRequest pkg/tcpip/network/ipv6:ipv6_test + TestLinkResolution + TestNDPValidation + TestNeighorAdvertisementWithTargetLinkLayerOption + TestNeighorSolicitationResponse + TestNeighorSolicitationWithSourceLinkLayerOption + TestRouterAdvertValidation pkg/tcpip/stack:stack_test + TestCacheWaker + TestForwardingWithFakeResolver + TestForwardingWithFakeResolverManyPackets + TestForwardingWithFakeResolverManyResolutions + TestForwardingWithFakeResolverPartialTimeout + TestForwardingWithFakeResolverTwoPackets + TestIPv6SourceAddressSelectionScopeAndSameAddress [1]: https://tools.ietf.org/html/rfc4861#section-10 [2]: https://tools.ietf.org/html/rfc4861#appendix-C Fixes #1889 Fixes #1894 Fixes #1895 Fixes #1947 Fixes #1948 Fixes #1949 Fixes #1950 PiperOrigin-RevId: 328365034
2020-08-25Support SO_LINGER socket option.Nayana Bidari
When SO_LINGER option is enabled, the close will not return until all the queued messages are sent and acknowledged for the socket or linger timeout is reached. If the option is not set, close will return immediately. This option is mainly supported for connection oriented protocols such as TCP. PiperOrigin-RevId: 328350576
2020-08-25Fix TCP_LINGER2 behavior to match linux.Bhasker Hariharan
We still deviate a bit from linux in how long we will actually wait in FIN-WAIT-2. Linux seems to cap it with TIME_WAIT_LEN and it's not completely obvious as to why it's done that way. For now I think we can ignore that and fix it if it really is an issue. PiperOrigin-RevId: 328324922
2020-08-24Automated rollback of changelist 327325153Ghanan Gowripalan
PiperOrigin-RevId: 328259353
2020-08-20Skip listening TCP ports when trying to bind a free port.Bhasker Hariharan
PiperOrigin-RevId: 327686558
2020-08-18RACK: Create a new list for segments.Nayana Bidari
RACK requires the segments to be in the order of their transmission or retransmission times. This cl creates a new list and moves the retransmitted segments to the end of the list. PiperOrigin-RevId: 327325153
2020-08-14Use a single NetworkEndpoint per NIC per protocolGhanan Gowripalan
The NetworkEndpoint does not need to be created for each address. Most of the work the NetworkEndpoint does is address agnostic. PiperOrigin-RevId: 326759605
2020-08-14Give the ICMP Code its own typeJulian Elischer
This is a preparatory commit for a larger commit working on ICMP generation in error cases. This is removal of technical debt and cleanup in the gvisor code as part of gvisor issue 2211. Updates #2211. PiperOrigin-RevId: 326615389
2020-08-13Use the user supplied MSS for accepted connectionsGhanan Gowripalan
This change supports using the user supplied MSS (TCP_MAXSEG socket option) for new socket connections created from a listening TCP socket. Note that the user supplied MSS will only be used if it is not greater than the maximum possible MSS for a TCP connection's route. If it is greater than the maximum possible MSS, the MSS will be capped at that maximum value. Test: tcp_test.TestUserSuppliedMSSOnListenAccept PiperOrigin-RevId: 326567442
2020-08-13Migrate to PacketHeader API for PacketBuffer.Ting-Yu Wang
Formerly, when a packet is constructed or parsed, all headers are set by the client code. This almost always involved prepending to pk.Header buffer or trimming pk.Data portion. This is known to prone to bugs, due to the complexity and number of the invariants assumed across netstack to maintain. In the new PacketHeader API, client will call Push()/Consume() method to construct/parse an outgoing/incoming packet. All invariants, such as slicing and trimming, are maintained by the API itself. NewPacketBuffer() is introduced to create new PacketBuffer. Zero value is no longer valid. PacketBuffer now assumes the packet is a concatenation of following portions: * LinkHeader * NetworkHeader * TransportHeader * Data Any of them could be empty, or zero-length. PiperOrigin-RevId: 326507688
2020-08-13Ensure TCP TIME-WAIT is not terminated prematurely.Bhasker Hariharan
Netstack's TIME-WAIT state for a TCP socket could be terminated prematurely if the socket entered TIME-WAIT using shutdown(..., SHUT_RDWR) and then was closed using close(). This fixes that bug and updates the tests to verify that Netstack correctly honors TIME-WAIT under such conditions. Fixes #3106 PiperOrigin-RevId: 326456443
2020-08-10Populate IPPacketInfo with destination addressGhanan Gowripalan
IPPacketInfo.DestinationAddr should hold the destination of the IP packet, not the source. This change fixes that bug. PiperOrigin-RevId: 325910766
2020-08-08Use unicast source for ICMP echo repliesGhanan Gowripalan
Packets MUST NOT use a non-unicast source address for ICMP Echo Replies. Test: integration_test.TestPingMulticastBroadcast PiperOrigin-RevId: 325634380
2020-08-07tcp: change the limit of TCP_LINGER2Andrei Vagin
It was changed in the Linux kernel: commit f0628c524fd188c3f9418e12478dfdfadacba815 Date: Fri Apr 24 16:06:16 2020 +0800 net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX PiperOrigin-RevId: 325493859
2020-08-05Add loss recovery option for TCP.Nayana Bidari
/proc/sys/net/ipv4/tcp_recovery is used to enable RACK loss recovery in TCP. PiperOrigin-RevId: 325157807
2020-08-04Update variables for implementation of RACK in TCPNayana Bidari
RACK (Recent Acknowledgement) is a new loss detection algorithm in TCP. These are the fields which should be stored on connections to implement RACK algorithm. PiperOrigin-RevId: 324948703
2020-07-31iptables: support SO_ORIGINAL_DSTKevin Krakauer
Envoy (#170) uses this to get the original destination of redirected packets.
2020-07-30Fix TCP CurrentConnected counter updates.Mithun Iyer
CurrentConnected counter is incorrectly decremented on close of an endpoint which is still not connected. Fixes #3443 PiperOrigin-RevId: 324155171
2020-07-30Revert change to default buffer size.Bhasker Hariharan
In https://github.com/google/gvisor/commit/ca6bded95dbce07f9683904b4b768dfc2d4a09b2 we reduced the default buffer size to 32KB. This mostly works fine except at high throughput where we hit zero window very quickly and the TCP receive buffer moderation is not able to grow the window. This can be seen in the benchmarks where with a 32KB buffer and 100 connections downloading a 10MB file we get about 30 requests/s vs the 1MB buffer gives us about 53 requests/s. A proper fix requires a few changes to when we send a zero window as well as when we decide to send a zero window update. Today we consider available space below 1MSS as zero and send an update when it crosses 1MSS of available space. This is way too low and results in the window staying very small once we hit a zero window condition as we keep sending updates with size barely over 1MSS. Linux and BSD are smarter about this and use different thresholds. We should separately update our logic to match linux or BSD so that we don't send window updates that are really tiny or wait until we drop below 1MSS to advertise a zero window. PiperOrigin-RevId: 324087019
2020-07-30Use brodcast MAC for broadcast IPv4 packetsGhanan Gowripalan
When sending packets to a known network's broadcast address, use the broadcast MAC address. Test: - stack_test.TestOutgoingSubnetBroadcast - udp_test.TestOutgoingSubnetBroadcast PiperOrigin-RevId: 324062407
2020-07-28Redirect TODO to GitHub issuesFabricio Voznika
PiperOrigin-RevId: 323715260
2020-07-27Fix memory accounting in TCP pending segment queue.Bhasker Hariharan
TCP now tracks the overhead of the segment structure itself in it's out-of-order queue (pending). This is required to ensure that a malicious sender sending 1 byte out-of-order segments cannot queue like 1000's of segments which bloat up memory usage. We also reduce the default receive window to 32KB. With TCP moderation there is no need to keep this window at 1MB which means that for new connections the default out-of-order queue will be small unless the application actually reads the data that is being sent. This prevents a sender from just maliciously filling up pending buf with lots of tiny out-of-order segments. PiperOrigin-RevId: 323450913
2020-07-23Add AfterFunc to tcpip.ClockSam Balana
Changes the API of tcpip.Clock to also provide a method for scheduling and rescheduling work after a specified duration. This change also implements the AfterFunc method for existing implementations of tcpip.Clock. This is the groundwork required to mock time within tests. All references to CancellableTimer has been replaced with the tcpip.Job interface, allowing for custom implementations of scheduling work. This is a BREAKING CHANGE for clients that implement their own tcpip.Clock or use tcpip.CancellableTimer. Migration plan: 1. Add AfterFunc(d, f) to tcpip.Clock 2. Replace references of tcpip.CancellableTimer with tcpip.Job 3. Replace calls to tcpip.CancellableTimer#StopLocked with tcpip.Job#Cancel 4. Replace calls to tcpip.CancellableTimer#Reset with tcpip.Job#Schedule 5. Replace calls to tcpip.NewCancellableTimer with tcpip.NewJob. PiperOrigin-RevId: 322906897
2020-07-23Merge pull request #3207 from kevinGC:icmp-connectgVisor bot
PiperOrigin-RevId: 322853192
2020-07-23Fix wildcard bind for raw socket.Bhasker Hariharan
Fixes #3334 PiperOrigin-RevId: 322846384
2020-07-22make connect(2) fail when dest is unreachableKevin Krakauer
Previously, ICMP destination unreachable datagrams were ignored by TCP endpoints. This caused connect to hang when an intermediate router couldn't find a route to the host. This manifested as a Kokoro error when Docker IPv6 was enabled. The Ruby image test would try to install the sinatra gem and hang indefinitely attempting to use an IPv6 address. Fixes #3079.
2020-07-22Support for receiving outbound packets in AF_PACKET.Bhasker Hariharan
Updates #173 PiperOrigin-RevId: 322665518
2020-07-16Add support to return protocol in recvmsg for AF_PACKET.Bhasker Hariharan
Updates #173 PiperOrigin-RevId: 321690756
2020-07-15Add support for SO_ERROR to packet sockets.Bhasker Hariharan
Packet sockets also seem to allow double binding and do not return an error on linux. This was tested by running the syscall test in a linux namespace as root and the current test DoubleBind fails@HEAD. Passes after this change. Updates #173 PiperOrigin-RevId: 321445137
2020-07-13garbage collect connectionsKevin Krakauer
As in Linux, we must periodically clean up unused connections. PiperOrigin-RevId: 321003353
2020-07-11Stub out SO_DETACH_FILTER.Bhasker Hariharan
Updates #2746 PiperOrigin-RevId: 320757963
2020-07-09Discard multicast UDP source address.gVisor bot
RFC-1122 (and others) specify that UDP should not receive datagrams that have a source address that is a multicast address. Packets should never be received FROM a multicast address. See also, RFC 768: 'User Datagram Protocol' J. Postel, ISI, 28 August 1980 A UDP datagram received with an invalid IP source address (e.g., a broadcast or multicast address) must be discarded by UDP or by the IP layer (see rfc 1122 Section 3.2.1.3). This CL does not address TCP or broadcast which is more complicated. Also adds a test for both ipv6 and ipv4 UDP. Fixes #3154 PiperOrigin-RevId: 320547674