Age | Commit message (Collapse) | Author |
|
The change to introduce worker goroutines can cause the endpoint
to transition to StateError and we should terminate the loop rather
than let the endpoint transition to a CLOSED state as we do
in case the endpoint enters TIME-WAIT/CLOSED. Moving to a closed
state would cause the actual error to not be propagated to
any read() calls etc.
PiperOrigin-RevId: 289923568
|
|
All inbound segments for connections in ESTABLISHED state are delivered to the
endpoint's queue but for every segment delivered we also queue the endpoint for
processing to a selected processor. This ensures that when there are a large
number of connections in ESTABLISHED state the inbound packets are all handled
by a small number of goroutines and significantly reduces the amount of work the
goscheduler has to perform.
We let connections in other states follow the current path where the
endpoint's goroutine directly handles the segments.
Updates #231
PiperOrigin-RevId: 289728325
|
|
PiperOrigin-RevId: 289718534
|
|
Fixes #1490
Fixes #1495
PiperOrigin-RevId: 289523250
|
|
This is a band-aid fix for now to prevent panics.
PiperOrigin-RevId: 289078453
|
|
* Rename syncutil to sync.
* Add aliases to sync types.
* Replace existing usage of standard library sync package.
This will make it easier to swap out synchronization primitives. For example,
this will allow us to use primitives from github.com/sasha-s/go-deadlock to
check for lock ordering violations.
Updates #1472
PiperOrigin-RevId: 289033387
|
|
PiperOrigin-RevId: 289019953
|
|
This makes it possible to call the sockopt from go even when the NIC has no
name.
PiperOrigin-RevId: 288955236
|
|
...and port V6OnlyOption to it.
PiperOrigin-RevId: 288789451
|
|
PiperOrigin-RevId: 288779416
|
|
PiperOrigin-RevId: 288772878
|
|
Before, each of small read()'s that raises window either from zero
or above threshold of aMSS, would generate an ACK. In a classic
silly-window-syndrome scenario, we can imagine a pessimistic case
when small read()'s generate a stream of ACKs.
This PR fixes that, essentially treating window size < aMSS as zero.
We send ACK exactly in a moment when window increases to >= aMSS
or half of receive buffer size (whichever smaller).
|
|
When receiving data, netstack avoids sending spurious acks. When
user does recv() should netstack send ack telling the sender that
the window was increased? It depends. Before this patch, netstack
_will_ send the ack in the case when window was zero or window >>
scale was zero. Basically - when recv space increased from zero.
This is not working right with silly-window-avoidance on the sender
side. Some network stacks refuse to transmit segments, that will fill
the window but are below MSS. Before this patch, this confuses
netstack. On one hand if the window was like 3 bytes, netstack
will _not_ send ack if the window increases. On the other hand
sending party will refuse to transmit 3-byte packet.
This patch changes that, making netstack will send an ACK when
the available buffer size increases to or above 1*MSS. This will
inform other party buffer is large enough, and hopefully uncork it.
Signed-off-by: Marek Majkowski <marek@cloudflare.com>
|
|
PiperOrigin-RevId: 287217899
|
|
Added the ability to get/set the IP_RECVTOS socket option on UDP endpoints. If
enabled, TOS from the incoming Network Header passed as ancillary data in the
ControlMessages.
Test:
* Added unit test to udp_test.go that tests getting/setting as well as
verifying that we receive expected TOS from incoming packet.
* Added a syscall test
PiperOrigin-RevId: 287029703
|
|
When listen(2) is called on an unbound socket, the socket is
automatically bound to a random free port with the local address
set to INADDR_ANY.
PiperOrigin-RevId: 286305906
|
|
PiperOrigin-RevId: 286003946
|
|
The implementation follows the linux behavior where specifying
a TCP_USER_TIMEOUT will cause the resend timer to honor the
user specified timeout rather than the default rto based timeout.
Further it alters when connections are timedout due to keepalive
failures. It does not alter the behavior of when keepalives are
sent. This is as per the linux behavior.
PiperOrigin-RevId: 285099795
|
|
Next steps include adding support to the transport demuxer and the UDP endpoint.
PiperOrigin-RevId: 284652151
|
|
Fix bugs in updates to TCP CurrentEstablished stat.
Fixes #1277
PiperOrigin-RevId: 284292459
|
|
This change marks the socket as ESTABLISHED and creates the receiver and sender
the moment we send the final ACK in case of an active TCP handshake or when we
receive the final ACK for a passive TCP handshake. Before this change there was
a short window in which an ACK can be received and processed but the state on
the socket is not yet ESTABLISHED.
This can be seen in TestConnectBindToDevice which is flaky because sometimes
the socket is in SYN-SENT and not ESTABLISHED even though the other side has
already received the final ACK of the handshake.
PiperOrigin-RevId: 284277713
|
|
If the socket is bound to ANY and connected to a loopback address,
getsockname() has to return the loopback address. Without this fix,
getsockname() returns ANY.
PiperOrigin-RevId: 283647781
|
|
The code in rcv.consumeSegment incorrectly transitions to
CLOSED state from LAST-ACK before the final ACK for the FIN.
Further if receiving a segment changes a socket to a closed state
then we should not invoke the sender as the socket is now closed
and sending any segments is incorrect.
PiperOrigin-RevId: 283625300
|
|
This change does not introduce any new features, or modify existing ones.
This change tests handling TCP segments right away for connections that were
completed from a listening endpoint.
PiperOrigin-RevId: 282986457
|
|
These are necessary for iptables to read and parse headers for packet filtering.
PiperOrigin-RevId: 282372811
|
|
PiperOrigin-RevId: 282194656
|
|
PiperOrigin-RevId: 282068093
|
|
This allows us to ensure that the correct port reservation is released.
Fixes #1217
PiperOrigin-RevId: 282048155
|
|
PiperOrigin-RevId: 282045221
|
|
PiperOrigin-RevId: 282023891
|
|
As we move to CLOSE state from LAST-ACK or TIME-WAIT,
ensure that we re-match all in-flight segments to any
listening endpoint.
Also fix LISTEN state handling of any ACK segments as per RFC793.
Fixes #1153
PiperOrigin-RevId: 280703556
|
|
PiperOrigin-RevId: 280455453
|
|
PiperOrigin-RevId: 280280156
|
|
This change drops TCP packets with a non-unicast IP address as the source or
destination address as TCP is meant for communication between two endpoints.
Test: Make sure that if the source or destination address contains a non-unicast
address, no TCP packet is sent in response and the packet is dropped.
PiperOrigin-RevId: 280073731
|
|
* Basic tests for the SO_REUSEADDR and SO_REUSEPORT options.
* SO_REUSEADDR functional tests for TCP and UDP.
* SO_REUSEADDR and SO_REUSEPORT interaction tests for UDP.
* Stubbed support for UDP getsockopt(SO_REUSEADDR).
PiperOrigin-RevId: 280049265
|
|
PiperOrigin-RevId: 279814493
|
|
This change adds explicit support for honoring the 2MSL timeout
for sockets in TIME_WAIT state. It also adds support for the
TCP_LINGER2 option that allows modification of the FIN_WAIT2
state timeout duration for a given socket.
It also adds an option to modify the Stack wide TIME_WAIT timeout
but this is only for testing. On Linux this is fixed at 60s.
Further, we also now correctly process RST's in CLOSE_WAIT and
close the socket similar to linux without moving it to error
state.
We also now handle SYN in ESTABLISHED state as per
RFC5961#section-4.1. Earlier we would just drop these SYNs.
Which can result in some tests that pass on linux to fail on
gVisor.
Netstack now honors TIME_WAIT correctly as well as handles the
following cases correctly.
- TCP RSTs in TIME_WAIT are ignored.
- A duplicate TCP FIN during TIME_WAIT extends the TIME_WAIT
and a dup ACK is sent in response to the FIN as the dup FIN
indicates potential loss of the original final ACK.
- An out of order segment during TIME_WAIT generates a dup ACK.
- A new SYN w/ a sequence number > the highest sequence number
in the previous connection closes the TIME_WAIT early and
opens a new connection.
Further to make the SYN case work correctly the ISN (Initial
Sequence Number) generation for Netstack has been updated to
be as per RFC. Its not a pure random number anymore and follows
the recommendation in https://tools.ietf.org/html/rfc6528#page-3.
The current hash used is not a cryptographically secure hash
function. A separate change will update the hash function used
to Siphash similar to what is used in Linux.
PiperOrigin-RevId: 279106406
|
|
https://github.com/golang/go/wiki/CodeReviewComments#initialisms
This change does not introduce any new functionality. It just renames variables
from `nicid` to `nicID`.
PiperOrigin-RevId: 278992966
|
|
PiperOrigin-RevId: 278979065
|
|
PacketBuffers are analogous to Linux's sk_buff. They hold all information about
a packet, headers, and payload. This is important for:
* iptables to access various headers of packets
* Preventing the clutter of passing different net and link headers along with
VectorisedViews to packet handling functions.
This change only affects the incoming packet path, and a future change will
change the outgoing path.
Benchmark Regular PacketBufferPtr PacketBufferConcrete
--------------------------------------------------------------------------------
BM_Recvmsg 400.715MB/s 373.676MB/s 396.276MB/s
BM_Sendmsg 361.832MB/s 333.003MB/s 335.571MB/s
BM_Recvfrom 453.336MB/s 393.321MB/s 381.650MB/s
BM_Sendto 378.052MB/s 372.134MB/s 341.342MB/s
BM_SendmsgTCP/0/1k 353.711MB/s 316.216MB/s 322.747MB/s
BM_SendmsgTCP/0/2k 600.681MB/s 588.776MB/s 565.050MB/s
BM_SendmsgTCP/0/4k 995.301MB/s 888.808MB/s 941.888MB/s
BM_SendmsgTCP/0/8k 1.517GB/s 1.274GB/s 1.345GB/s
BM_SendmsgTCP/0/16k 1.872GB/s 1.586GB/s 1.698GB/s
BM_SendmsgTCP/0/32k 1.017GB/s 1.020GB/s 1.133GB/s
BM_SendmsgTCP/0/64k 475.626MB/s 584.587MB/s 627.027MB/s
BM_SendmsgTCP/0/128k 416.371MB/s 503.434MB/s 409.850MB/s
BM_SendmsgTCP/0/256k 323.449MB/s 449.599MB/s 388.852MB/s
BM_SendmsgTCP/0/512k 243.992MB/s 267.676MB/s 314.474MB/s
BM_SendmsgTCP/0/1M 95.138MB/s 95.874MB/s 95.417MB/s
BM_SendmsgTCP/0/2M 96.261MB/s 94.977MB/s 96.005MB/s
BM_SendmsgTCP/0/4M 96.512MB/s 95.978MB/s 95.370MB/s
BM_SendmsgTCP/0/8M 95.603MB/s 95.541MB/s 94.935MB/s
BM_SendmsgTCP/0/16M 94.598MB/s 94.696MB/s 94.521MB/s
BM_SendmsgTCP/0/32M 94.006MB/s 94.671MB/s 94.768MB/s
BM_SendmsgTCP/0/64M 94.133MB/s 94.333MB/s 94.746MB/s
BM_SendmsgTCP/0/128M 93.615MB/s 93.497MB/s 93.573MB/s
BM_SendmsgTCP/0/256M 93.241MB/s 95.100MB/s 93.272MB/s
BM_SendmsgTCP/1/1k 303.644MB/s 316.074MB/s 308.430MB/s
BM_SendmsgTCP/1/2k 537.093MB/s 584.962MB/s 529.020MB/s
BM_SendmsgTCP/1/4k 882.362MB/s 939.087MB/s 892.285MB/s
BM_SendmsgTCP/1/8k 1.272GB/s 1.394GB/s 1.296GB/s
BM_SendmsgTCP/1/16k 1.802GB/s 2.019GB/s 1.830GB/s
BM_SendmsgTCP/1/32k 2.084GB/s 2.173GB/s 2.156GB/s
BM_SendmsgTCP/1/64k 2.515GB/s 2.463GB/s 2.473GB/s
BM_SendmsgTCP/1/128k 2.811GB/s 3.004GB/s 2.946GB/s
BM_SendmsgTCP/1/256k 3.008GB/s 3.159GB/s 3.171GB/s
BM_SendmsgTCP/1/512k 2.980GB/s 3.150GB/s 3.126GB/s
BM_SendmsgTCP/1/1M 2.165GB/s 2.233GB/s 2.163GB/s
BM_SendmsgTCP/1/2M 2.370GB/s 2.219GB/s 2.453GB/s
BM_SendmsgTCP/1/4M 2.005GB/s 2.091GB/s 2.214GB/s
BM_SendmsgTCP/1/8M 2.111GB/s 2.013GB/s 2.109GB/s
BM_SendmsgTCP/1/16M 1.902GB/s 1.868GB/s 1.897GB/s
BM_SendmsgTCP/1/32M 1.655GB/s 1.665GB/s 1.635GB/s
BM_SendmsgTCP/1/64M 1.575GB/s 1.547GB/s 1.575GB/s
BM_SendmsgTCP/1/128M 1.524GB/s 1.584GB/s 1.580GB/s
BM_SendmsgTCP/1/256M 1.579GB/s 1.607GB/s 1.593GB/s
PiperOrigin-RevId: 278940079
|
|
This change better follows what is outlined in RFC 793 section 3.4 figure 12
where a listening socket should not accept a SYN-ACK segment in response to a
(potentially) old SYN segment.
Tests: Test that checks the TCP RST segment sent in response to a TCP SYN-ACK
segment received on a listening TCP endpoint.
PiperOrigin-RevId: 278893114
|
|
When VectorisedViews were passed up the stack from packet_dispatchers, we were
passing a sub-slice of the dispatcher's views fields. The dispatchers then
immediately set those views to nil.
This wasn't caught before because every implementer copied the data in these
views before returning.
PiperOrigin-RevId: 277615351
|
|
It is required to guarantee the same order of endpoints after save/restore.
PiperOrigin-RevId: 277598665
|
|
In the future this will replace DanglingEndpoints. DanglingEndpoints must be
kept for now due to issues with save/restore.
This is arguably a cleaner design and allows the stack to know which transport
endpoints might still be using its link endpoints.
Updates #837
PiperOrigin-RevId: 277386633
|
|
Updates #837
PiperOrigin-RevId: 277325162
|
|
This change supports using a user supplied TCP MSS for new active TCP
connections. Note, the user supplied MSS must be less than or equal to the
maximum possible MSS for a TCP connection's route. If it is greater than the
maximum possible MSS, the maximum possible MSS will be used as the connection's
MSS instead.
This change does not use this user supplied MSS for connections accepted from
listening sockets - that will come in a later change.
Test: Test that outgoing TCP SYN segments contain a TCP MSS option with the user
supplied MSS if it is not greater than the maximum possible MSS for the route.
PiperOrigin-RevId: 277185125
|
|
This change validates the ICMPv6 checksum field before further processing an
ICMPv6 packet.
Tests: Unittests to make sure that only ICMPv6 packets with a valid checksum
are accepted/processed. Existing tests using checker.ICMPv6 now also check the
ICMPv6 checksum field.
PiperOrigin-RevId: 276779148
|
|
DelayOption is set on all new endpoints in gVisor.
PiperOrigin-RevId: 276746791
|
|
The amss field in the tcpip.tcp.handshake was not used anywhere. Removed it to
not cause confusion with the amss field in the tcpip.tcp.endpoint struct, which
was documented to be used (and is actually being used) for the same purpose.
PiperOrigin-RevId: 276577088
|
|
PiperOrigin-RevId: 276380008
|