Age | Commit message (Collapse) | Author |
|
RACK (Recent Acknowledgement) is a new loss detection
algorithm in TCP. These are the fields which should be
stored on connections to implement RACK algorithm.
PiperOrigin-RevId: 324948703
|
|
This will help manage memory consumption by IP reassembly when
receiving IP fragments on multiple network endpoints. Previously,
each endpoint would cap memory consumption at 4MB, but with this
change, each IP stack will cap memory consumption at 4MB.
No behaviour changes.
PiperOrigin-RevId: 324913904
|
|
context is passed to DecRef() and Release() which is
needed for SO_LINGER implementation.
PiperOrigin-RevId: 324672584
|
|
Prevent fragments with different source-destination pairs from
conflicting with each other.
Test:
- ipv6_test.TestReceiveIPv6Fragments
- ipv4_test.TestReceiveIPv6Fragments
PiperOrigin-RevId: 324283246
|
|
Envoy (#170) uses this to get the original destination of redirected
packets.
|
|
CurrentConnected counter is incorrectly decremented on close of an
endpoint which is still not connected.
Fixes #3443
PiperOrigin-RevId: 324155171
|
|
In
https://github.com/google/gvisor/commit/ca6bded95dbce07f9683904b4b768dfc2d4a09b2
we reduced the default buffer size to 32KB. This mostly works fine except at
high throughput where we hit zero window very quickly and the TCP receive
buffer moderation is not able to grow the window. This can be seen in the
benchmarks where with a 32KB buffer and 100 connections downloading a 10MB
file we get about 30 requests/s vs the 1MB buffer gives us about 53 requests/s.
A proper fix requires a few changes to when we send a zero window as well as
when we decide to send a zero window update. Today we consider available space
below 1MSS as zero and send an update when it crosses 1MSS of available space.
This is way too low and results in the window staying very small once we hit
a zero window condition as we keep sending updates with size barely over 1MSS.
Linux and BSD are smarter about this and use different thresholds. We should
separately update our logic to match linux or BSD so that we don't send
window updates that are really tiny or wait until we drop below 1MSS to
advertise a zero window.
PiperOrigin-RevId: 324087019
|
|
Allow configuring fragmentation.Fragmentation with a fragment
block size which will be enforced when processing fragments. Also
validate arguments when processing fragments.
Test:
- fragmentation.TestErrors
- ipv6_test.TestReceiveIPv6Fragments
- ipv4_test.TestReceiveIPv6Fragments
PiperOrigin-RevId: 324081521
|
|
This change implements the Neighbor Unreachability Detection (NUD) state
machine, as per RFC 4861 [1]. The state machine operates on a single neighbor
in the local network. This requires the state machine to be implemented on each
entry of the neighbor table.
This change also adds, but does not expose, several APIs. The first API is for
performing basic operations on the neighbor table:
- Create a static entry
- List all entries
- Delete all entries
- Remove an entry by address
The second API is used for changing the NUD protocol constants on a per-NIC
basis to allow Neighbor Discovery to operate over links with widely varying
performance characteristics. See [RFC 4861 Section 10][2] for the list of
constants.
Finally, the last API is for allowing users to subscribe to NUD state changes.
See [RFC 4861 Appendix C][3] for the list of edges.
[1]: https://tools.ietf.org/html/rfc4861
[2]: https://tools.ietf.org/html/rfc4861#section-10
[3]: https://tools.ietf.org/html/rfc4861#appendix-C
Tests:
pkg/tcpip/stack:stack_test
- TestNeighborCacheAddStaticEntryThenOverflow
- TestNeighborCacheClear
- TestNeighborCacheClearThenOverflow
- TestNeighborCacheConcurrent
- TestNeighborCacheDuplicateStaticEntryWithDifferentLinkAddress
- TestNeighborCacheDuplicateStaticEntryWithSameLinkAddress
- TestNeighborCacheEntry
- TestNeighborCacheEntryNoLinkAddress
- TestNeighborCacheGetConfig
- TestNeighborCacheKeepFrequentlyUsed
- TestNeighborCacheNotifiesWaker
- TestNeighborCacheOverflow
- TestNeighborCacheOverwriteWithStaticEntryThenOverflow
- TestNeighborCacheRemoveEntry
- TestNeighborCacheRemoveEntryThenOverflow
- TestNeighborCacheRemoveStaticEntry
- TestNeighborCacheRemoveStaticEntryThenOverflow
- TestNeighborCacheRemoveWaker
- TestNeighborCacheReplace
- TestNeighborCacheResolutionFailed
- TestNeighborCacheResolutionTimeout
- TestNeighborCacheSetConfig
- TestNeighborCacheStaticResolution
- TestEntryAddsAndClearsWakers
- TestEntryDelayToProbe
- TestEntryDelayToReachableWhenSolicitedOverrideConfirmation
- TestEntryDelayToReachableWhenUpperLevelConfirmation
- TestEntryDelayToStaleWhenConfirmationWithDifferentAddress
- TestEntryDelayToStaleWhenProbeWithDifferentAddress
- TestEntryFailedGetsDeleted
- TestEntryIncompleteToFailed
- TestEntryIncompleteToIncompleteDoesNotChangeUpdatedAt
- TestEntryIncompleteToReachable
- TestEntryIncompleteToReachableWithRouterFlag
- TestEntryIncompleteToStale
- TestEntryInitiallyUnknown
- TestEntryProbeToFailed
- TestEntryProbeToReachableWhenSolicitedConfirmationWithSameAddress
- TestEntryProbeToReachableWhenSolicitedOverrideConfirmation
- TestEntryProbeToStaleWhenConfirmationWithDifferentAddress
- TestEntryProbeToStaleWhenProbeWithDifferentAddress
- TestEntryReachableToStaleWhenConfirmationWithDifferentAddress
- TestEntryReachableToStaleWhenConfirmationWithDifferentAddressAndOverride
- TestEntryReachableToStaleWhenProbeWithDifferentAddress
- TestEntryReachableToStaleWhenTimeout
- TestEntryStaleToDelay
- TestEntryStaleToReachableWhenSolicitedOverrideConfirmation
- TestEntryStaleToStaleWhenOverrideConfirmation
- TestEntryStaleToStaleWhenProbeUpdateAddress
- TestEntryStaysDelayWhenOverrideConfirmationWithSameAddress
- TestEntryStaysProbeWhenOverrideConfirmationWithSameAddress
- TestEntryStaysReachableWhenConfirmationWithRouterFlag
- TestEntryStaysReachableWhenProbeWithSameAddress
- TestEntryStaysStaleWhenProbeWithSameAddress
- TestEntryUnknownToIncomplete
- TestEntryUnknownToStale
- TestEntryUnknownToUnknownWhenConfirmationWithUnknownAddress
pkg/tcpip/stack:stack_x_test
- TestDefaultNUDConfigurations
- TestNUDConfigurationFailsForNotSupported
- TestNUDConfigurationsBaseReachableTime
- TestNUDConfigurationsDelayFirstProbeTime
- TestNUDConfigurationsMaxMulticastProbes
- TestNUDConfigurationsMaxRandomFactor
- TestNUDConfigurationsMaxUnicastProbes
- TestNUDConfigurationsMinRandomFactor
- TestNUDConfigurationsRetransmitTimer
- TestNUDConfigurationsUnreachableTime
- TestNUDStateReachableTime
- TestNUDStateRecomputeReachableTime
- TestSetNUDConfigurationFailsForBadNICID
- TestSetNUDConfigurationFailsForNotSupported
[1]: https://tools.ietf.org/html/rfc4861
[2]: https://tools.ietf.org/html/rfc4861#section-10
[3]: https://tools.ietf.org/html/rfc4861#appendix-C
Updates #1889
Updates #1894
Updates #1895
Updates #1947
Updates #1948
Updates #1949
Updates #1950
PiperOrigin-RevId: 324070795
|
|
When sending packets to a known network's broadcast address, use the
broadcast MAC address.
Test:
- stack_test.TestOutgoingSubnetBroadcast
- udp_test.TestOutgoingSubnetBroadcast
PiperOrigin-RevId: 324062407
|
|
PiperOrigin-RevId: 323715260
|
|
The previous implementation of LinkAddressRequest only supported sending
broadcast ARP requests and multicast Neighbor Solicitations. The ability to
send these packets as unicast is required for Neighbor Unreachability
Detection.
Tests:
pkg/tcpip/network/arp:arp_test
- TestLinkAddressRequest
pkg/tcpip/network/ipv6:ipv6_test
- TestLinkAddressRequest
Updates #1889
Updates #1894
Updates #1895
Updates #1947
Updates #1948
Updates #1949
Updates #1950
PiperOrigin-RevId: 323451569
|
|
TCP now tracks the overhead of the segment structure itself in it's out-of-order
queue (pending). This is required to ensure that a malicious sender sending 1
byte out-of-order segments cannot queue like 1000's of segments which bloat up
memory usage.
We also reduce the default receive window to 32KB. With TCP moderation there is
no need to keep this window at 1MB which means that for new connections the
default out-of-order queue will be small unless the application actually reads
the data that is being sent. This prevents a sender from just maliciously
filling up pending buf with lots of tiny out-of-order segments.
PiperOrigin-RevId: 323450913
|
|
Changes the API of tcpip.Clock to also provide a method for scheduling and
rescheduling work after a specified duration. This change also implements the
AfterFunc method for existing implementations of tcpip.Clock.
This is the groundwork required to mock time within tests. All references to
CancellableTimer has been replaced with the tcpip.Job interface, allowing for
custom implementations of scheduling work.
This is a BREAKING CHANGE for clients that implement their own tcpip.Clock or
use tcpip.CancellableTimer. Migration plan:
1. Add AfterFunc(d, f) to tcpip.Clock
2. Replace references of tcpip.CancellableTimer with tcpip.Job
3. Replace calls to tcpip.CancellableTimer#StopLocked with tcpip.Job#Cancel
4. Replace calls to tcpip.CancellableTimer#Reset with tcpip.Job#Schedule
5. Replace calls to tcpip.NewCancellableTimer with tcpip.NewJob.
PiperOrigin-RevId: 322906897
|
|
PiperOrigin-RevId: 322882426
|
|
PiperOrigin-RevId: 322853192
|
|
Fixes #3334
PiperOrigin-RevId: 322846384
|
|
Previously, ICMP destination unreachable datagrams were ignored by TCP
endpoints. This caused connect to hang when an intermediate router
couldn't find a route to the host.
This manifested as a Kokoro error when Docker IPv6 was enabled. The Ruby
image test would try to install the sinatra gem and hang indefinitely
attempting to use an IPv6 address.
Fixes #3079.
|
|
Fixes a NAT bug that manifested as:
- A SYN was sent from gVisor to another host, unaffected by iptables.
- The corresponding SYN/ACK was NATted by a PREROUTING REDIRECT rule
despite being part of the existing connection.
- The socket that sent the SYN never received the SYN/ACK and thus a
connection could not be established.
We handle this (as Linux does) by tracking all connections, inserting a
no-op conntrack rule for new connections with no rules of their own.
Needed for istio support (#170).
|
|
For iptables users, Check() is a hot path called for every packet one or more
times. Let's avoid a bunch of map lookups.
PiperOrigin-RevId: 322678699
|
|
Updates #173
PiperOrigin-RevId: 322665518
|
|
Updates #173
PiperOrigin-RevId: 321690756
|
|
PiperOrigin-RevId: 321620517
|
|
This is no longer necessary, as we always set NetworkHeader before calling
iptables.Check.
PiperOrigin-RevId: 321461978
|
|
Now it calls pkt.Data.ToView() when writing the packet. This may require
copying when the packet is large, which puts the worse case in an even worse
situation.
This sent out in a separate preparation change as it requires syscall filter
changes. This change will be followed by the change for the adoption of the new
PacketHeader API.
PiperOrigin-RevId: 321447003
|
|
Packet sockets also seem to allow double binding and do not return an error on
linux. This was tested by running the syscall test in a linux namespace as root
and the current test DoubleBind fails@HEAD.
Passes after this change.
Updates #173
PiperOrigin-RevId: 321445137
|
|
gVisor incorrectly returns the wrong ARP type for SIOGIFHWADDR. This breaks
tcpdump as it tries to interpret the packets incorrectly.
Similarly, SIOCETHTOOL is used by tcpdump to query interface properties which
fails with an EINVAL since we don't implement it. For now change it to return
EOPNOTSUPP to indicate that we don't support the query rather than return
EINVAL.
NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities
and NIC flags. In Linux all 3 exist eg. ARPHRD types are stored in dev->type
field while NIC capabilities are more like the device features which can be
queried using SIOCETHTOOL but not modified and NIC Flags are fields that can
be modified from user space. eg. NIC status (UP/DOWN/MULTICAST/BROADCAST) etc.
Updates #2746
PiperOrigin-RevId: 321436525
|
|
PiperOrigin-RevId: 321053634
|
|
PiperOrigin-RevId: 321035635
|
|
As in Linux, we must periodically clean up unused connections.
PiperOrigin-RevId: 321003353
|
|
sleep.Waker's fields are modified as values.
PiperOrigin-RevId: 320873451
|
|
Updates #2746
PiperOrigin-RevId: 320757963
|
|
RFC-1122 (and others) specify that UDP should not receive
datagrams that have a source address that is a multicast address.
Packets should never be received FROM a multicast address.
See also, RFC 768: 'User Datagram Protocol'
J. Postel, ISI, 28 August 1980
A UDP datagram received with an invalid IP source address
(e.g., a broadcast or multicast address) must be discarded
by UDP or by the IP layer (see rfc 1122 Section 3.2.1.3).
This CL does not address TCP or broadcast which is more complicated.
Also adds a test for both ipv6 and ipv4 UDP.
Fixes #3154
PiperOrigin-RevId: 320547674
|
|
Updates #2746
Fixes #3158
PiperOrigin-RevId: 320497190
|
|
PiperOrigin-RevId: 320250773
|
|
RFC 6864 imposes various restrictions on the uniqueness of the IPv4
Identification field for non-atomic datagrams, defined as an IP datagram that
either can be fragmented (DF=0) or is already a fragment (MF=1 or positive
fragment offset). In order to be compliant, the ID field is assigned for all
non-atomic datagrams.
Add a TCP unit test that induces retransmissions and checks that the IPv4
ID field is unique every time. Add basic handling of the IP_MTU_DISCOVER
socket option so that the option can be used to disable PMTU discovery,
effectively setting DF=0. Attempting to set the sockopt to anything other
than disabled will fail because PMTU discovery is currently not implemented,
and the default behavior matches that of disabled.
PiperOrigin-RevId: 320081842
|
|
The current convention is when a header is set to pkt.XxxHeader field, it
gets removed from pkt.Data. ICMP does not currently follow this convention.
PiperOrigin-RevId: 320078606
|
|
Updates #2746
PiperOrigin-RevId: 319887810
|
|
PiperOrigin-RevId: 319882171
|
|
stack_x_test: 2m -> 20s
tcp_x_test: 80s -> 25s
PiperOrigin-RevId: 319828101
|
|
PiperOrigin-RevId: 319770124
|
|
Avoid a race where an arbitrary goroutine scheduling delay can cause the
processor to miss events and hang indefinitely.
Reduce allocations by storing processors by-value in the dispatcher, and
by using a single WaitGroup rather than one per processor.
PiperOrigin-RevId: 319665861
|
|
The application can choose to initiate a non-blocking connect and
later block on a read, when the endpoint is still in SYN-SENT state.
PiperOrigin-RevId: 319311016
|
|
a) When GSO is in use we should not cap the segment to maxPayloadSize in
sender.maybeSendSegment as the GSO logic will cap the segment to the correct
size. Without this the host GSO is not used as we end up breaking up large
segments into small MSS sized segments before writing the packets to the
host.
b) The check to not split a segment due to it not fitting in the receiver window
when there are pending segments is incorrect as segments in writeList can be
really large as we just take the write call's buffer size and create a single
large segment. So a write of say 128KB will just be 1 segment in the
writeList.
The linux code checks if 1 MSS sized segments fits in the receiver's window
and if not then does not split the current segment. gVisor's check was
incorrect that it was checking if the whole segment which could be >>> 1 MSS
would fit in the receiver's window. This was causing us to prematurely stop
sending and falling back to retransmit timer/probe from the other end to send
data.
This was seen when running HTTPD benchmarks where @ HEAD when sending large
files the benchmark was taking forever to run.
The tcp_splitseg_mss_test.go is being deleted as the test as written doesn't
test what is intended correctly. This is because GSO is enabled by default and
the reason the MSS+1 sized segment is sent is because GSO is in use. A proper
test will require disabling GSO on linux and netstack which is going to take a
bit of work in packetimpact to do it correctly.
Separately a new test probably should be written that verifies that a segment >
availableWindow is not split if the availableWindow is < 1 MSS.
Fixes #3107
PiperOrigin-RevId: 319172089
|
|
...by calling (*tcp.endpoint).EndpointState only once when possible.
Avoid wrapping (*sleep.Waker).Assert in a useless func while I'm here.
PiperOrigin-RevId: 319074149
|
|
IPv6 raw sockets never include the IPv6 header.
PiperOrigin-RevId: 318582989
|
|
SO_NO_CHECK is used to skip the UDP checksum generation on a TX socket
(UDP checksum is optional on IPv4).
Test:
- TestNoChecksum
- SoNoCheckOffByDefault (UdpSocketTest)
- SoNoCheck (UdpSocketTest)
Fixes #3055
PiperOrigin-RevId: 318575215
|
|
- Split connTrackForPacket into 2 functions instead of switching on flag
- Replace hash with struct keys.
- Remove prefixes where possible
- Remove unused connStatus, timeout
- Flatten ConnTrack struct a bit - some intermediate structs had no meaning
outside of the context of their parent.
- Protect conn.tcb with a mutex
- Remove redundant error checking (e.g. when is pkt.NetworkHeader valid)
- Clarify that HandlePacket and CreateConnFor are the expected entrypoints for
ConnTrack
PiperOrigin-RevId: 318407168
|
|
Linux controls socket send/receive buffers using a few sysctl variables
- net.core.rmem_default
- net.core.rmem_max
- net.core.wmem_max
- net.core.wmem_default
- net.ipv4.tcp_rmem
- net.ipv4.tcp_wmem
The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).
The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.
Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.
This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values and the default value buffers for UDP/RAW sockets is
updated to match the linux value of 212KiB up from the really low current value
of 32 KiB.
Updates #3043
Fixes #3043
PiperOrigin-RevId: 318089805
|
|
For TCP sockets, SO_REUSEADDR relaxes the rules for binding addresses.
gVisor/netstack already supported a behavior similar to SO_REUSEADDR, but did
not allow disabling it. This change brings the SO_REUSEADDR behavior closer to
the behavior implemented by Linux and adds a new SO_REUSEADDR disabled
behavior. Like Linux, SO_REUSEADDR is now disabled by default.
PiperOrigin-RevId: 317984380
|