Age | Commit message (Collapse) | Author |
|
segment_queue today has its own standalone limit of MaxUnprocessedSegments but
this can be a problem in UnlockUser() we do not release the lock till there are
segments to be processed. What can happen is as handleSegments dequeues packets
more keep getting queued and we will never release the lock. This can keep
happening even if the receive buffer is full because nothing can read() till we
release the lock.
Further having a separate limit for pending segments makes it harder to track
memory usage etc. Unifying the limits makes it easier to reason about memory in
use and makes the overall buffer behaviour more consistent.
PiperOrigin-RevId: 333508122
|
|
Store transport protocol number on packet buffers for use in ICMP error
generation.
Updates #2211.
PiperOrigin-RevId: 333252762
|
|
TCP needs to enqueue any send requests arriving when the connection is in
SYN_SENT state. The data should be sent out soon after completion of the
connection handshake.
Fixes #3995
PiperOrigin-RevId: 332482041
|
|
Extract parsing utilities so they can be used by the sniffer.
Fixes #3930
PiperOrigin-RevId: 332401880
|
|
SO_LINGER is a socket level option and should be stored on all endpoints even
though it is used to linger only for TCP endpoints.
PiperOrigin-RevId: 332369252
|
|
PiperOrigin-RevId: 332097286
|
|
When a broadcast packet is received by the stack, the packet should be
delivered to each endpoint that may be interested in the packet. This
includes all any address and specified broadcast address listeners.
Test: integration_test.TestReuseAddrAndBroadcast
PiperOrigin-RevId: 332060652
|
|
The routing table (in its current) form should not be used to make
decisions about whether a remote address is a broadcast address or
not (for IPv4).
Note, a destination subnet does not always map to a network.
E.g. RouterA may have a route to 192.168.0.0/22 through RouterB,
but RouterB may be configured with 4x /24 subnets on 4 different
interfaces.
See https://github.com/google/gvisor/issues/3938.
PiperOrigin-RevId: 331819868
|
|
This is simpler and more performant.
PiperOrigin-RevId: 331639978
|
|
gVisor stack ignores RSTs when in TIME_WAIT which is not the default
Linux behavior. Add a packetimpact test to test the same.
Also update code comments to reflect the rationale for the current
gVisor behavior.
PiperOrigin-RevId: 331629879
|
|
e.ID can't be read without holding e.mu. GetSockOpt was reading e.ID
when looking up OriginalDst without holding e.mu.
PiperOrigin-RevId: 330562293
|
|
The existing implementation for TransportProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.
This change introduces marker interfaces for transport protocol options
that may be set or queried which transport protocol option types
implement to ensure that invalid types are caught at compile time.
Different interfaces are used to allow the compiler to enforce read-only
or set-only socket options.
RELNOTES: n/a
PiperOrigin-RevId: 330559811
|
|
Accept on gVisor will return an error if a socket in the accept queue was closed
before Accept() was called. Linux will return the new fd even if the returned
socket is already closed by the peer say due to a RST being sent by the peer.
This seems to be intentional in linux more details on the github issue.
Fixes #3780
PiperOrigin-RevId: 329828404
|
|
On receiving an ACK with unacceptable ACK number, in a closing state,
TCP, needs to reply back with an ACK with correct seq and ack numbers and
remain in same state. This change is as per RFC793 page 37, but with a
difference that it does not apply to ESTABLISHED state, just as in Linux.
Also add more tests to check for OTW sequence number and unacceptable
ack numbers in these states.
Fixes #3785
PiperOrigin-RevId: 329616283
|
|
PiperOrigin-RevId: 329526153
|
|
The existing implementation for {G,S}etSockOpt take arguments of an
empty interface type which all types (implicitly) implement; any
type may be passed to the functions.
This change introduces marker interfaces for socket options that may be
set or queried which socket option types implement to ensure that invalid
types are caught at compile time. Different interfaces are used to allow
the compiler to enforce read-only or set-only socket options.
Fixes #3714.
RELNOTES: n/a
PiperOrigin-RevId: 328832161
|
|
In an upcoming CL, socket option types are made to implement a marker
interface with pointer receivers. Since this results in calling methods
of an interface with a pointer, we incur an allocation when attempting
to get an Endpoint's last error with the current implementation.
When calling the method of an interface, the compiler is unable to
determine what the interface implementation does with the pointer
(since calling a method on an interface uses virtual dispatch at runtime
so the compiler does not know what the interface method will do) so it
allocates on the heap to be safe incase an implementation continues to
hold the pointer after the functioon returns (the reference escapes the
scope of the object).
In the example below, the compiler does not know what b.foo does with
the reference to a it allocates a on the heap as the reference to a may
escape the scope of a.
```
var a int
var b someInterface
b.foo(&a)
```
This change removes the opportunity for that allocation.
RELNOTES: n/a
PiperOrigin-RevId: 328796559
|
|
Test:
- TestV4UnknownDestination
- TestV6UnknownDestination
PiperOrigin-RevId: 328424137
|
|
This change adds an option to replace the current implementation of ARP through
linkAddrCache, with an implementation of NUD through neighborCache. Switching
to using NUD for both ARP and NDP is beneficial for the reasons described by
RFC 4861 Section 3.1:
"[Using NUD] significantly improves the robustness of packet delivery in the
presence of failing routers, partially failing or partitioned links, or nodes
that change their link-layer addresses. For instance, mobile nodes can move
off-link without losing any connectivity due to stale ARP caches."
"Unlike ARP, Neighbor Unreachability Detection detects half-link failures and
avoids sending traffic to neighbors with which two-way connectivity is
absent."
Along with these changes exposes the API for querying and operating the
neighbor cache. Operations include:
- Create a static entry
- List all entries
- Delete all entries
- Remove an entry by address
This also exposes the API to change the NUD protocol constants on a per-NIC
basis to allow Neighbor Discovery to operate over links with widely varying
performance characteristics. See [RFC 4861 Section 10][1] for the list of
constants.
Finally, an API for subscribing to NUD state changes is exposed through
NUDDispatcher. See [RFC 4861 Appendix C][3] for the list of edges.
Tests:
pkg/tcpip/network/arp:arp_test
+ TestDirectRequest
pkg/tcpip/network/ipv6:ipv6_test
+ TestLinkResolution
+ TestNDPValidation
+ TestNeighorAdvertisementWithTargetLinkLayerOption
+ TestNeighorSolicitationResponse
+ TestNeighorSolicitationWithSourceLinkLayerOption
+ TestRouterAdvertValidation
pkg/tcpip/stack:stack_test
+ TestCacheWaker
+ TestForwardingWithFakeResolver
+ TestForwardingWithFakeResolverManyPackets
+ TestForwardingWithFakeResolverManyResolutions
+ TestForwardingWithFakeResolverPartialTimeout
+ TestForwardingWithFakeResolverTwoPackets
+ TestIPv6SourceAddressSelectionScopeAndSameAddress
[1]: https://tools.ietf.org/html/rfc4861#section-10
[2]: https://tools.ietf.org/html/rfc4861#appendix-C
Fixes #1889
Fixes #1894
Fixes #1895
Fixes #1947
Fixes #1948
Fixes #1949
Fixes #1950
PiperOrigin-RevId: 328365034
|
|
When SO_LINGER option is enabled, the close will not return until all the
queued messages are sent and acknowledged for the socket or linger timeout is
reached. If the option is not set, close will return immediately. This option
is mainly supported for connection oriented protocols such as TCP.
PiperOrigin-RevId: 328350576
|
|
We still deviate a bit from linux in how long we will actually wait in
FIN-WAIT-2. Linux seems to cap it with TIME_WAIT_LEN and it's not completely
obvious as to why it's done that way. For now I think we can ignore that and
fix it if it really is an issue.
PiperOrigin-RevId: 328324922
|
|
PiperOrigin-RevId: 328259353
|
|
PiperOrigin-RevId: 327686558
|
|
RACK requires the segments to be in the order of their transmission
or retransmission times. This cl creates a new list and moves the
retransmitted segments to the end of the list.
PiperOrigin-RevId: 327325153
|
|
The NetworkEndpoint does not need to be created for each address.
Most of the work the NetworkEndpoint does is address agnostic.
PiperOrigin-RevId: 326759605
|
|
This is a preparatory commit for a larger commit working on
ICMP generation in error cases.
This is removal of technical debt and cleanup in the gvisor code
as part of gvisor issue 2211.
Updates #2211.
PiperOrigin-RevId: 326615389
|
|
This change supports using the user supplied MSS (TCP_MAXSEG socket
option) for new socket connections created from a listening TCP socket.
Note that the user supplied MSS will only be used if it is not greater
than the maximum possible MSS for a TCP connection's route. If it is
greater than the maximum possible MSS, the MSS will be capped at that
maximum value.
Test: tcp_test.TestUserSuppliedMSSOnListenAccept
PiperOrigin-RevId: 326567442
|
|
Formerly, when a packet is constructed or parsed, all headers are set by the
client code. This almost always involved prepending to pk.Header buffer or
trimming pk.Data portion. This is known to prone to bugs, due to the complexity
and number of the invariants assumed across netstack to maintain.
In the new PacketHeader API, client will call Push()/Consume() method to
construct/parse an outgoing/incoming packet. All invariants, such as slicing
and trimming, are maintained by the API itself.
NewPacketBuffer() is introduced to create new PacketBuffer. Zero value is no
longer valid.
PacketBuffer now assumes the packet is a concatenation of following portions:
* LinkHeader
* NetworkHeader
* TransportHeader
* Data
Any of them could be empty, or zero-length.
PiperOrigin-RevId: 326507688
|
|
Netstack's TIME-WAIT state for a TCP socket could be terminated prematurely if
the socket entered TIME-WAIT using shutdown(..., SHUT_RDWR) and then was closed
using close(). This fixes that bug and updates the tests to verify that Netstack
correctly honors TIME-WAIT under such conditions.
Fixes #3106
PiperOrigin-RevId: 326456443
|
|
IPPacketInfo.DestinationAddr should hold the destination of the IP
packet, not the source. This change fixes that bug.
PiperOrigin-RevId: 325910766
|
|
Packets MUST NOT use a non-unicast source address for ICMP
Echo Replies.
Test: integration_test.TestPingMulticastBroadcast
PiperOrigin-RevId: 325634380
|
|
It was changed in the Linux kernel:
commit f0628c524fd188c3f9418e12478dfdfadacba815
Date: Fri Apr 24 16:06:16 2020 +0800
net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
PiperOrigin-RevId: 325493859
|
|
/proc/sys/net/ipv4/tcp_recovery is used to enable RACK loss
recovery in TCP.
PiperOrigin-RevId: 325157807
|
|
RACK (Recent Acknowledgement) is a new loss detection
algorithm in TCP. These are the fields which should be
stored on connections to implement RACK algorithm.
PiperOrigin-RevId: 324948703
|
|
Envoy (#170) uses this to get the original destination of redirected
packets.
|
|
CurrentConnected counter is incorrectly decremented on close of an
endpoint which is still not connected.
Fixes #3443
PiperOrigin-RevId: 324155171
|
|
In
https://github.com/google/gvisor/commit/ca6bded95dbce07f9683904b4b768dfc2d4a09b2
we reduced the default buffer size to 32KB. This mostly works fine except at
high throughput where we hit zero window very quickly and the TCP receive
buffer moderation is not able to grow the window. This can be seen in the
benchmarks where with a 32KB buffer and 100 connections downloading a 10MB
file we get about 30 requests/s vs the 1MB buffer gives us about 53 requests/s.
A proper fix requires a few changes to when we send a zero window as well as
when we decide to send a zero window update. Today we consider available space
below 1MSS as zero and send an update when it crosses 1MSS of available space.
This is way too low and results in the window staying very small once we hit
a zero window condition as we keep sending updates with size barely over 1MSS.
Linux and BSD are smarter about this and use different thresholds. We should
separately update our logic to match linux or BSD so that we don't send
window updates that are really tiny or wait until we drop below 1MSS to
advertise a zero window.
PiperOrigin-RevId: 324087019
|
|
When sending packets to a known network's broadcast address, use the
broadcast MAC address.
Test:
- stack_test.TestOutgoingSubnetBroadcast
- udp_test.TestOutgoingSubnetBroadcast
PiperOrigin-RevId: 324062407
|
|
PiperOrigin-RevId: 323715260
|
|
TCP now tracks the overhead of the segment structure itself in it's out-of-order
queue (pending). This is required to ensure that a malicious sender sending 1
byte out-of-order segments cannot queue like 1000's of segments which bloat up
memory usage.
We also reduce the default receive window to 32KB. With TCP moderation there is
no need to keep this window at 1MB which means that for new connections the
default out-of-order queue will be small unless the application actually reads
the data that is being sent. This prevents a sender from just maliciously
filling up pending buf with lots of tiny out-of-order segments.
PiperOrigin-RevId: 323450913
|
|
Changes the API of tcpip.Clock to also provide a method for scheduling and
rescheduling work after a specified duration. This change also implements the
AfterFunc method for existing implementations of tcpip.Clock.
This is the groundwork required to mock time within tests. All references to
CancellableTimer has been replaced with the tcpip.Job interface, allowing for
custom implementations of scheduling work.
This is a BREAKING CHANGE for clients that implement their own tcpip.Clock or
use tcpip.CancellableTimer. Migration plan:
1. Add AfterFunc(d, f) to tcpip.Clock
2. Replace references of tcpip.CancellableTimer with tcpip.Job
3. Replace calls to tcpip.CancellableTimer#StopLocked with tcpip.Job#Cancel
4. Replace calls to tcpip.CancellableTimer#Reset with tcpip.Job#Schedule
5. Replace calls to tcpip.NewCancellableTimer with tcpip.NewJob.
PiperOrigin-RevId: 322906897
|
|
PiperOrigin-RevId: 322853192
|
|
Fixes #3334
PiperOrigin-RevId: 322846384
|
|
Previously, ICMP destination unreachable datagrams were ignored by TCP
endpoints. This caused connect to hang when an intermediate router
couldn't find a route to the host.
This manifested as a Kokoro error when Docker IPv6 was enabled. The Ruby
image test would try to install the sinatra gem and hang indefinitely
attempting to use an IPv6 address.
Fixes #3079.
|
|
Updates #173
PiperOrigin-RevId: 322665518
|
|
Updates #173
PiperOrigin-RevId: 321690756
|
|
Packet sockets also seem to allow double binding and do not return an error on
linux. This was tested by running the syscall test in a linux namespace as root
and the current test DoubleBind fails@HEAD.
Passes after this change.
Updates #173
PiperOrigin-RevId: 321445137
|
|
As in Linux, we must periodically clean up unused connections.
PiperOrigin-RevId: 321003353
|
|
Updates #2746
PiperOrigin-RevId: 320757963
|
|
RFC-1122 (and others) specify that UDP should not receive
datagrams that have a source address that is a multicast address.
Packets should never be received FROM a multicast address.
See also, RFC 768: 'User Datagram Protocol'
J. Postel, ISI, 28 August 1980
A UDP datagram received with an invalid IP source address
(e.g., a broadcast or multicast address) must be discarded
by UDP or by the IP layer (see rfc 1122 Section 3.2.1.3).
This CL does not address TCP or broadcast which is more complicated.
Also adds a test for both ipv6 and ipv4 UDP.
Fixes #3154
PiperOrigin-RevId: 320547674
|