summaryrefslogtreecommitdiffhomepage
path: root/pkg/tcpip/transport
AgeCommit message (Collapse)Author
2019-11-22Add segment dequeue check while emptying segment queue.Mithun Iyer
PiperOrigin-RevId: 282023891
2019-11-15Handle in-flight TCP segments when moving to CLOSE.Mithun Iyer
As we move to CLOSE state from LAST-ACK or TIME-WAIT, ensure that we re-match all in-flight segments to any listening endpoint. Also fix LISTEN state handling of any ACK segments as per RFC793. Fixes #1153 PiperOrigin-RevId: 280703556
2019-11-14Use PacketBuffers for outgoing packets.Kevin Krakauer
PiperOrigin-RevId: 280455453
2019-11-13Fix flaky behaviour during S/R.Bhasker Hariharan
PiperOrigin-RevId: 280280156
2019-11-12Do not handle TCP packets that include a non-unicast IP addressGhanan Gowripalan
This change drops TCP packets with a non-unicast IP address as the source or destination address as TCP is meant for communication between two endpoints. Test: Make sure that if the source or destination address contains a non-unicast address, no TCP packet is sent in response and the packet is dropped. PiperOrigin-RevId: 280073731
2019-11-12Add tests for SO_REUSEADDR and SO_REUSEPORT.Ian Gudger
* Basic tests for the SO_REUSEADDR and SO_REUSEPORT options. * SO_REUSEADDR functional tests for TCP and UDP. * SO_REUSEADDR and SO_REUSEPORT interaction tests for UDP. * Stubbed support for UDP getsockopt(SO_REUSEADDR). PiperOrigin-RevId: 280049265
2019-11-11Make `connect` on socket returned by `accept` correctly error out with EISCONNgVisor bot
PiperOrigin-RevId: 279814493
2019-11-07Add support for TIME_WAIT timeout.Bhasker Hariharan
This change adds explicit support for honoring the 2MSL timeout for sockets in TIME_WAIT state. It also adds support for the TCP_LINGER2 option that allows modification of the FIN_WAIT2 state timeout duration for a given socket. It also adds an option to modify the Stack wide TIME_WAIT timeout but this is only for testing. On Linux this is fixed at 60s. Further, we also now correctly process RST's in CLOSE_WAIT and close the socket similar to linux without moving it to error state. We also now handle SYN in ESTABLISHED state as per RFC5961#section-4.1. Earlier we would just drop these SYNs. Which can result in some tests that pass on linux to fail on gVisor. Netstack now honors TIME_WAIT correctly as well as handles the following cases correctly. - TCP RSTs in TIME_WAIT are ignored. - A duplicate TCP FIN during TIME_WAIT extends the TIME_WAIT and a dup ACK is sent in response to the FIN as the dup FIN indicates potential loss of the original final ACK. - An out of order segment during TIME_WAIT generates a dup ACK. - A new SYN w/ a sequence number > the highest sequence number in the previous connection closes the TIME_WAIT early and opens a new connection. Further to make the SYN case work correctly the ISN (Initial Sequence Number) generation for Netstack has been updated to be as per RFC. Its not a pure random number anymore and follows the recommendation in https://tools.ietf.org/html/rfc6528#page-3. The current hash used is not a cryptographically secure hash function. A separate change will update the hash function used to Siphash similar to what is used in Linux. PiperOrigin-RevId: 279106406
2019-11-06Rename nicid to nicID to follow go-readability initialismsGhanan Gowripalan
https://github.com/golang/go/wiki/CodeReviewComments#initialisms This change does not introduce any new functionality. It just renames variables from `nicid` to `nicID`. PiperOrigin-RevId: 278992966
2019-11-06Internal change.gVisor bot
PiperOrigin-RevId: 278979065
2019-11-06Use PacketBuffers, rather than VectorisedViews, in netstack.Kevin Krakauer
PacketBuffers are analogous to Linux's sk_buff. They hold all information about a packet, headers, and payload. This is important for: * iptables to access various headers of packets * Preventing the clutter of passing different net and link headers along with VectorisedViews to packet handling functions. This change only affects the incoming packet path, and a future change will change the outgoing path. Benchmark Regular PacketBufferPtr PacketBufferConcrete -------------------------------------------------------------------------------- BM_Recvmsg 400.715MB/s 373.676MB/s 396.276MB/s BM_Sendmsg 361.832MB/s 333.003MB/s 335.571MB/s BM_Recvfrom 453.336MB/s 393.321MB/s 381.650MB/s BM_Sendto 378.052MB/s 372.134MB/s 341.342MB/s BM_SendmsgTCP/0/1k 353.711MB/s 316.216MB/s 322.747MB/s BM_SendmsgTCP/0/2k 600.681MB/s 588.776MB/s 565.050MB/s BM_SendmsgTCP/0/4k 995.301MB/s 888.808MB/s 941.888MB/s BM_SendmsgTCP/0/8k 1.517GB/s 1.274GB/s 1.345GB/s BM_SendmsgTCP/0/16k 1.872GB/s 1.586GB/s 1.698GB/s BM_SendmsgTCP/0/32k 1.017GB/s 1.020GB/s 1.133GB/s BM_SendmsgTCP/0/64k 475.626MB/s 584.587MB/s 627.027MB/s BM_SendmsgTCP/0/128k 416.371MB/s 503.434MB/s 409.850MB/s BM_SendmsgTCP/0/256k 323.449MB/s 449.599MB/s 388.852MB/s BM_SendmsgTCP/0/512k 243.992MB/s 267.676MB/s 314.474MB/s BM_SendmsgTCP/0/1M 95.138MB/s 95.874MB/s 95.417MB/s BM_SendmsgTCP/0/2M 96.261MB/s 94.977MB/s 96.005MB/s BM_SendmsgTCP/0/4M 96.512MB/s 95.978MB/s 95.370MB/s BM_SendmsgTCP/0/8M 95.603MB/s 95.541MB/s 94.935MB/s BM_SendmsgTCP/0/16M 94.598MB/s 94.696MB/s 94.521MB/s BM_SendmsgTCP/0/32M 94.006MB/s 94.671MB/s 94.768MB/s BM_SendmsgTCP/0/64M 94.133MB/s 94.333MB/s 94.746MB/s BM_SendmsgTCP/0/128M 93.615MB/s 93.497MB/s 93.573MB/s BM_SendmsgTCP/0/256M 93.241MB/s 95.100MB/s 93.272MB/s BM_SendmsgTCP/1/1k 303.644MB/s 316.074MB/s 308.430MB/s BM_SendmsgTCP/1/2k 537.093MB/s 584.962MB/s 529.020MB/s BM_SendmsgTCP/1/4k 882.362MB/s 939.087MB/s 892.285MB/s BM_SendmsgTCP/1/8k 1.272GB/s 1.394GB/s 1.296GB/s BM_SendmsgTCP/1/16k 1.802GB/s 2.019GB/s 1.830GB/s BM_SendmsgTCP/1/32k 2.084GB/s 2.173GB/s 2.156GB/s BM_SendmsgTCP/1/64k 2.515GB/s 2.463GB/s 2.473GB/s BM_SendmsgTCP/1/128k 2.811GB/s 3.004GB/s 2.946GB/s BM_SendmsgTCP/1/256k 3.008GB/s 3.159GB/s 3.171GB/s BM_SendmsgTCP/1/512k 2.980GB/s 3.150GB/s 3.126GB/s BM_SendmsgTCP/1/1M 2.165GB/s 2.233GB/s 2.163GB/s BM_SendmsgTCP/1/2M 2.370GB/s 2.219GB/s 2.453GB/s BM_SendmsgTCP/1/4M 2.005GB/s 2.091GB/s 2.214GB/s BM_SendmsgTCP/1/8M 2.111GB/s 2.013GB/s 2.109GB/s BM_SendmsgTCP/1/16M 1.902GB/s 1.868GB/s 1.897GB/s BM_SendmsgTCP/1/32M 1.655GB/s 1.665GB/s 1.635GB/s BM_SendmsgTCP/1/64M 1.575GB/s 1.547GB/s 1.575GB/s BM_SendmsgTCP/1/128M 1.524GB/s 1.584GB/s 1.580GB/s BM_SendmsgTCP/1/256M 1.579GB/s 1.607GB/s 1.593GB/s PiperOrigin-RevId: 278940079
2019-11-06Send a TCP RST in response to a TCP SYN-ACK on a listening endpointGhanan Gowripalan
This change better follows what is outlined in RFC 793 section 3.4 figure 12 where a listening socket should not accept a SYN-ACK segment in response to a (potentially) old SYN segment. Tests: Test that checks the TCP RST segment sent in response to a TCP SYN-ACK segment received on a listening TCP endpoint. PiperOrigin-RevId: 278893114
2019-10-30Deep copy dispatcher views.Kevin Krakauer
When VectorisedViews were passed up the stack from packet_dispatchers, we were passing a sub-slice of the dispatcher's views fields. The dispatchers then immediately set those views to nil. This wasn't caught before because every implementer copied the data in these views before returning. PiperOrigin-RevId: 277615351
2019-10-30Store endpoints inside multiPortEndpoint in a sorted orderAndrei Vagin
It is required to guarantee the same order of endpoints after save/restore. PiperOrigin-RevId: 277598665
2019-10-29Add endpoint tracking to the stack.Ian Gudger
In the future this will replace DanglingEndpoints. DanglingEndpoints must be kept for now due to issues with save/restore. This is arguably a cleaner design and allows the stack to know which transport endpoints might still be using its link endpoints. Updates #837 PiperOrigin-RevId: 277386633
2019-10-29Allow waiting for Endpoint worker goroutines to finish.Ian Gudger
Updates #837 PiperOrigin-RevId: 277325162
2019-10-28Use the user supplied TCP MSS when creating a new active socketGhanan Gowripalan
This change supports using a user supplied TCP MSS for new active TCP connections. Note, the user supplied MSS must be less than or equal to the maximum possible MSS for a TCP connection's route. If it is greater than the maximum possible MSS, the maximum possible MSS will be used as the connection's MSS instead. This change does not use this user supplied MSS for connections accepted from listening sockets - that will come in a later change. Test: Test that outgoing TCP SYN segments contain a TCP MSS option with the user supplied MSS if it is not greater than the maximum possible MSS for the route. PiperOrigin-RevId: 277185125
2019-10-25Validate the checksum for incoming ICMPv6 packetsGhanan Gowripalan
This change validates the ICMPv6 checksum field before further processing an ICMPv6 packet. Tests: Unittests to make sure that only ICMPv6 packets with a valid checksum are accepted/processed. Existing tests using checker.ICMPv6 now also check the ICMPv6 checksum field. PiperOrigin-RevId: 276779148
2019-10-25Convert DelayOption to the newer/faster SockOpt int type.Ian Gudger
DelayOption is set on all new endpoints in gVisor. PiperOrigin-RevId: 276746791
2019-10-24Remove the amss field from tcpip.tcp.handshake as it was unusedGhanan Gowripalan
The amss field in the tcpip.tcp.handshake was not used anywhere. Removed it to not cause confusion with the amss field in the tcpip.tcp.endpoint struct, which was documented to be used (and is actually being used) for the same purpose. PiperOrigin-RevId: 276577088
2019-10-23Merge pull request #641 from tanjianfeng:mastergVisor bot
PiperOrigin-RevId: 276380008
2019-10-22netstack/tcp: software segmentation offloadAndrei Vagin
Right now, we send each tcp packet separately, we call one system call per-packet. This patch allows to generate multiple tcp packets and send them by sendmmsg. The arguable part of this CL is a way how to handle multiple headers. This CL adds the next field to the Prepandable buffer. Nginx test results: Server Software: nginx/1.15.9 Server Hostname: 10.138.0.2 Server Port: 8080 Document Path: /10m.txt Document Length: 10485760 bytes w/o gso: Concurrency Level: 5 Time taken for tests: 5.491 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 18.21 [#/sec] (mean) Time per request: 274.525 [ms] (mean) Time per request: 54.905 [ms] (mean, across all concurrent requests) Transfer rate: 186508.03 [Kbytes/sec] received sw-gso: Concurrency Level: 5 Time taken for tests: 3.852 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 25.96 [#/sec] (mean) Time per request: 192.576 [ms] (mean) Time per request: 38.515 [ms] (mean, across all concurrent requests) Transfer rate: 265874.92 [Kbytes/sec] received w/o gso: $ ./tcp_benchmark --client --duration 15 --ideal [SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec software gso: $ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso [SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec PiperOrigin-RevId: 276112677
2019-10-21AF_PACKET support for netstack (aka epsocket).Kevin Krakauer
Like (AF_INET, SOCK_RAW) sockets, AF_PACKET sockets require CAP_NET_RAW. With runsc, you'll need to pass `--net-raw=true` to enable them. Binding isn't supported yet. PiperOrigin-RevId: 275909366
2019-10-18Fix typo while initializing protocol for UDP endpoints.Mithun Iyer
Fixes #763 PiperOrigin-RevId: 275563222
2019-10-15epsocket: support /proc/net/snmpJianfeng Tan
Netstack has its own stats, we use this to fill /proc/net/snmp. Note that some metrics are not recorded in Netstack, which will be shown as 0 in the proc file. Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com> Change-Id: Ie0089184507d16f49bc0057b4b0482094417ebe1
2019-10-15netstack: add counters for tcp CurrEstab and EstabResetsJianfeng Tan
Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-10-14Internal change.gVisor bot
PiperOrigin-RevId: 274700093
2019-10-14Reorder BUILD license and load functions in netstack.Kevin Krakauer
PiperOrigin-RevId: 274672346
2019-10-09Internal change.gVisor bot
PiperOrigin-RevId: 273861936
2019-10-07Implement IP_TTL.Ian Gudger
Also change the default TTL to 64 to match Linux. PiperOrigin-RevId: 273430341
2019-10-03Implement proper local broadcast behaviorChris Kuiper
The behavior for sending and receiving local broadcast (255.255.255.255) traffic is as follows: Outgoing -------- * A broadcast packet sent on a socket that is bound to an interface goes out that interface * A broadcast packet sent on an unbound socket follows the route table to select the outgoing interface + if an explicit route entry exists for 255.255.255.255/32, use that one + else use the default route * Broadcast packets are looped back and delivered following the rules for incoming packets (see next). This is the same behavior as for multicast packets, except that it cannot be disabled via sockopt. Incoming -------- * Sockets wishing to receive broadcast packets must bind to either INADDR_ANY (0.0.0.0) or INADDR_BROADCAST (255.255.255.255). No other socket receives broadcast packets. * Broadcast packets are multiplexed to all sockets matching it. This is the same behavior as for multicast packets. * A socket can bind to 255.255.255.255:<port> and then receive its own broadcast packets sent to 255.255.255.255:<port> In addition, this change implicitly fixes an issue with multicast reception. If two sockets want to receive a given multicast stream and one is bound to ANY while the other is bound to the multicast address, only one of them will receive the traffic. PiperOrigin-RevId: 272792377
2019-09-30Fix bugs in PickEphemeralPort for TCP.Bhasker Hariharan
Netstack always picks a random start point everytime PickEphemeralPort is called. While this is required for UDP so that DNS requests go out through a randomized set of ports it is not required for TCP. Infact Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret initialized at start of the application to get a random offset. But to ensure it doesn't start from the same point on every scan it uses a static hint that is incremented by 2 in every call to pick ephemeral ports. The reason for 2 is Linux seems to split the port ranges where active connects seem to use even ones while odd ones are used by listening sockets. This CL implements a similar strategy where we use a hash + hint to generate the offset to start the search for a free Ephemeral port. This ensures that we cycle through the available port space in order for repeated connects to the same destination and significantly reduces the chance of picking a recently released port. PiperOrigin-RevId: 272058370
2019-09-27Implement SO_BINDTODEVICE sockoptgVisor bot
PiperOrigin-RevId: 271644926
2019-09-25Remove centralized registration of protocols.Kevin Krakauer
Also removes the need for protocol names. PiperOrigin-RevId: 271186030
2019-09-23netstack: convert more socket options to {Set,Get}SockOptIntAndrei Vagin
PiperOrigin-RevId: 270763208
2019-09-12Implement splice methods for pipes and sockets.Adin Scannell
This also allows the tee(2) implementation to be enabled, since dup can now be properly supported via WriteTo. Note that this change necessitated some minor restructoring with the fs.FileOperations splice methods. If the *fs.File is passed through directly, then only public API methods are accessible, which will deadlock immediately since the locking is already done by fs.Splice. Instead, we pass through an abstract io.Reader or io.Writer, which elide locks and use the underlying fs.FileOperations directly. PiperOrigin-RevId: 268805207
2019-09-12Remove go_test from go_stateify and go_marshalMichael Pratt
They are no-ops, so the standard rule works fine. PiperOrigin-RevId: 268776264
2019-09-09Fix ephemeral port leak.Ian Gudger
Fix a bug where udp.(*endpoint).Disconnect [accessible in gVisor via epsocket.(*SocketOperations).Connect with AF_UNSPEC] would leak a port reservation if the socket/endpoint had an ephemeral port assigned to it. glibc's getaddrinfo uses connect with AF_UNSPEC, causing each call of getaddrinfo to leak a port. Call getaddrinfo too many times and you run out of ports (shows up as connect returning EAGAIN and getaddrinfo returning EAI_NONAME "Name or service not known"). PiperOrigin-RevId: 268071160
2019-09-06Remove reundant global tcpip.LinkEndpointID.Ian Gudger
PiperOrigin-RevId: 267709597
2019-09-04Fix RST generation bugs.Bhasker Hariharan
There are a few cases addressed by this change - We no longer generate a RST in response to a RST packet. - When we receive a RST we cleanup and release all reservations immediately as the connection is now aborted. - An ACK received by a listening socket generates a RST when SYN cookies are not in-use. The only reason an ACK should land at the listening socket is if we are using SYN cookies otherwise the goroutine for the handshake in progress should have gotten the packet and it should never have arrived at the listening endpoint. - Also fixes the error returned when a connection times out due to a Keepalive timer expiration from ECONNRESET to a ETIMEDOUT. PiperOrigin-RevId: 267238427
2019-09-03Make UDP traceroute work.Bhasker Hariharan
Adds support to generate Port Unreachable messages for UDP datagrams received on a port for which there is no valid endpoint. Fixes #703 PiperOrigin-RevId: 267034418
2019-08-29Implement /proc/net/udp.Rahat Mahmood
PiperOrigin-RevId: 266229756
2019-08-26netstack/tcp: Add LastAck transition.Rahat Mahmood
Add missing state transition to LastAck, which should happen when the endpoint has already recieved a FIN from the remote side, and is sending its own FIN. PiperOrigin-RevId: 265568314
2019-08-21Support binding to multicast and broadcast addressesChris Kuiper
This fixes the issue of not being able to bind to either a multicast or broadcast address as well as to send and receive data from it. The way to solve this is to treat these addresses similar to the ANY address and register their transport endpoint ID with the global stack's demuxer rather than the NIC's. That way there is no need to require an endpoint with that multicast or broadcast address. The stack's demuxer is in fact the only correct one to use, because neither broadcast- nor multicast-bound sockets care which NIC a packet was received on (for multicast a join is still needed to receive packets on a NIC). I also took the liberty of refactoring udp_test.go to consolidate a lot of duplicate code and make it easier to create repetitive tests that test the same feature for a variety of packet and socket types. For this purpose I created a "flowType" that represents two things: 1) the type of packet being sent or received and 2) the type of socket used for the test. E.g., a "multicastV4in6" flow represents a V4-mapped multicast packet run through a V6-dual socket. This allows writing significantly simpler tests. A nice example is testTTL(). PiperOrigin-RevId: 264766909
2019-08-21Use tcpip.Subnet in tcpip.RouteTamir Duberstein
This is the first step in replacing some of the redundant types with the standard library equivalents. PiperOrigin-RevId: 264706552
2019-08-16netstack: disconnect an unix socket only if the address family is AF_UNSPECAndrei Vagin
Linux allows to call connect for ANY and the zero port. PiperOrigin-RevId: 263892534
2019-08-15Don't dereference errors passed to panic()Tamir Duberstein
These errors are always pointers; there's no sense in dereferencing them in the panic call. Changed one false positive for clarity. PiperOrigin-RevId: 263611579
2019-08-15netstack: move resumption logic into *_state.goTamir Duberstein
13a98df rearranged some of this code in a way that broke compilation of the netstack-only export at github.com/google/netstack because *_state.go files are not included in that export. This commit moves resumption logic back into *_state.go, fixing the compilation breakage. PiperOrigin-RevId: 263601629
2019-08-14Replace uinptr with int64 when returning lengthsTamir Duberstein
This is in accordance with newer parts of the standard library. PiperOrigin-RevId: 263449916
2019-08-14Improve SendMsg performance.Bhasker Hariharan
SendMsg before this change would copy all the data over into a new slice even if the underlying socket could only accept a small amount of data. This is really inefficient with non-blocking sockets and under high throughput where large writes could get ErrWouldBlock or if there was say a timeout associated with the sendmsg() syscall. With this change we delay copying bytes in till they are needed and only copy what can be potentially sent/held in the socket buffer. Reducing the need to repeatedly copy data over. Also a minor fix to change state FIN-WAIT-1 when shutdown(..., SHUT_WR) is called instead of when we transmit the actual FIN. Otherwise the socket could remain in CONNECTED state even though the user has called shutdown() on the socket. Updates #627 PiperOrigin-RevId: 263430505