Age | Commit message (Collapse) | Author |
|
Fixes a NAT bug that manifested as:
- A SYN was sent from gVisor to another host, unaffected by iptables.
- The corresponding SYN/ACK was NATted by a PREROUTING REDIRECT rule
despite being part of the existing connection.
- The socket that sent the SYN never received the SYN/ACK and thus a
connection could not be established.
We handle this (as Linux does) by tracking all connections, inserting a
no-op conntrack rule for new connections with no rules of their own.
Needed for istio support (#170).
|
|
For iptables users, Check() is a hot path called for every packet one or more
times. Let's avoid a bunch of map lookups.
PiperOrigin-RevId: 322678699
|
|
Updates #173
PiperOrigin-RevId: 322665518
|
|
Updates #173
PiperOrigin-RevId: 321690756
|
|
PiperOrigin-RevId: 321620517
|
|
This is no longer necessary, as we always set NetworkHeader before calling
iptables.Check.
PiperOrigin-RevId: 321461978
|
|
Now it calls pkt.Data.ToView() when writing the packet. This may require
copying when the packet is large, which puts the worse case in an even worse
situation.
This sent out in a separate preparation change as it requires syscall filter
changes. This change will be followed by the change for the adoption of the new
PacketHeader API.
PiperOrigin-RevId: 321447003
|
|
Packet sockets also seem to allow double binding and do not return an error on
linux. This was tested by running the syscall test in a linux namespace as root
and the current test DoubleBind fails@HEAD.
Passes after this change.
Updates #173
PiperOrigin-RevId: 321445137
|
|
gVisor incorrectly returns the wrong ARP type for SIOGIFHWADDR. This breaks
tcpdump as it tries to interpret the packets incorrectly.
Similarly, SIOCETHTOOL is used by tcpdump to query interface properties which
fails with an EINVAL since we don't implement it. For now change it to return
EOPNOTSUPP to indicate that we don't support the query rather than return
EINVAL.
NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities
and NIC flags. In Linux all 3 exist eg. ARPHRD types are stored in dev->type
field while NIC capabilities are more like the device features which can be
queried using SIOCETHTOOL but not modified and NIC Flags are fields that can
be modified from user space. eg. NIC status (UP/DOWN/MULTICAST/BROADCAST) etc.
Updates #2746
PiperOrigin-RevId: 321436525
|
|
PiperOrigin-RevId: 321053634
|
|
PiperOrigin-RevId: 321035635
|
|
As in Linux, we must periodically clean up unused connections.
PiperOrigin-RevId: 321003353
|
|
sleep.Waker's fields are modified as values.
PiperOrigin-RevId: 320873451
|
|
Updates #2746
PiperOrigin-RevId: 320757963
|
|
RFC-1122 (and others) specify that UDP should not receive
datagrams that have a source address that is a multicast address.
Packets should never be received FROM a multicast address.
See also, RFC 768: 'User Datagram Protocol'
J. Postel, ISI, 28 August 1980
A UDP datagram received with an invalid IP source address
(e.g., a broadcast or multicast address) must be discarded
by UDP or by the IP layer (see rfc 1122 Section 3.2.1.3).
This CL does not address TCP or broadcast which is more complicated.
Also adds a test for both ipv6 and ipv4 UDP.
Fixes #3154
PiperOrigin-RevId: 320547674
|
|
Updates #2746
Fixes #3158
PiperOrigin-RevId: 320497190
|
|
PiperOrigin-RevId: 320250773
|
|
RFC 6864 imposes various restrictions on the uniqueness of the IPv4
Identification field for non-atomic datagrams, defined as an IP datagram that
either can be fragmented (DF=0) or is already a fragment (MF=1 or positive
fragment offset). In order to be compliant, the ID field is assigned for all
non-atomic datagrams.
Add a TCP unit test that induces retransmissions and checks that the IPv4
ID field is unique every time. Add basic handling of the IP_MTU_DISCOVER
socket option so that the option can be used to disable PMTU discovery,
effectively setting DF=0. Attempting to set the sockopt to anything other
than disabled will fail because PMTU discovery is currently not implemented,
and the default behavior matches that of disabled.
PiperOrigin-RevId: 320081842
|
|
The current convention is when a header is set to pkt.XxxHeader field, it
gets removed from pkt.Data. ICMP does not currently follow this convention.
PiperOrigin-RevId: 320078606
|
|
Updates #2746
PiperOrigin-RevId: 319887810
|
|
PiperOrigin-RevId: 319882171
|
|
stack_x_test: 2m -> 20s
tcp_x_test: 80s -> 25s
PiperOrigin-RevId: 319828101
|
|
PiperOrigin-RevId: 319770124
|
|
Avoid a race where an arbitrary goroutine scheduling delay can cause the
processor to miss events and hang indefinitely.
Reduce allocations by storing processors by-value in the dispatcher, and
by using a single WaitGroup rather than one per processor.
PiperOrigin-RevId: 319665861
|
|
The application can choose to initiate a non-blocking connect and
later block on a read, when the endpoint is still in SYN-SENT state.
PiperOrigin-RevId: 319311016
|
|
a) When GSO is in use we should not cap the segment to maxPayloadSize in
sender.maybeSendSegment as the GSO logic will cap the segment to the correct
size. Without this the host GSO is not used as we end up breaking up large
segments into small MSS sized segments before writing the packets to the
host.
b) The check to not split a segment due to it not fitting in the receiver window
when there are pending segments is incorrect as segments in writeList can be
really large as we just take the write call's buffer size and create a single
large segment. So a write of say 128KB will just be 1 segment in the
writeList.
The linux code checks if 1 MSS sized segments fits in the receiver's window
and if not then does not split the current segment. gVisor's check was
incorrect that it was checking if the whole segment which could be >>> 1 MSS
would fit in the receiver's window. This was causing us to prematurely stop
sending and falling back to retransmit timer/probe from the other end to send
data.
This was seen when running HTTPD benchmarks where @ HEAD when sending large
files the benchmark was taking forever to run.
The tcp_splitseg_mss_test.go is being deleted as the test as written doesn't
test what is intended correctly. This is because GSO is enabled by default and
the reason the MSS+1 sized segment is sent is because GSO is in use. A proper
test will require disabling GSO on linux and netstack which is going to take a
bit of work in packetimpact to do it correctly.
Separately a new test probably should be written that verifies that a segment >
availableWindow is not split if the availableWindow is < 1 MSS.
Fixes #3107
PiperOrigin-RevId: 319172089
|
|
...by calling (*tcp.endpoint).EndpointState only once when possible.
Avoid wrapping (*sleep.Waker).Assert in a useless func while I'm here.
PiperOrigin-RevId: 319074149
|
|
IPv6 raw sockets never include the IPv6 header.
PiperOrigin-RevId: 318582989
|
|
SO_NO_CHECK is used to skip the UDP checksum generation on a TX socket
(UDP checksum is optional on IPv4).
Test:
- TestNoChecksum
- SoNoCheckOffByDefault (UdpSocketTest)
- SoNoCheck (UdpSocketTest)
Fixes #3055
PiperOrigin-RevId: 318575215
|
|
- Split connTrackForPacket into 2 functions instead of switching on flag
- Replace hash with struct keys.
- Remove prefixes where possible
- Remove unused connStatus, timeout
- Flatten ConnTrack struct a bit - some intermediate structs had no meaning
outside of the context of their parent.
- Protect conn.tcb with a mutex
- Remove redundant error checking (e.g. when is pkt.NetworkHeader valid)
- Clarify that HandlePacket and CreateConnFor are the expected entrypoints for
ConnTrack
PiperOrigin-RevId: 318407168
|
|
Linux controls socket send/receive buffers using a few sysctl variables
- net.core.rmem_default
- net.core.rmem_max
- net.core.wmem_max
- net.core.wmem_default
- net.ipv4.tcp_rmem
- net.ipv4.tcp_wmem
The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).
The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.
Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.
This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values and the default value buffers for UDP/RAW sockets is
updated to match the linux value of 212KiB up from the really low current value
of 32 KiB.
Updates #3043
Fixes #3043
PiperOrigin-RevId: 318089805
|
|
For TCP sockets, SO_REUSEADDR relaxes the rules for binding addresses.
gVisor/netstack already supported a behavior similar to SO_REUSEADDR, but did
not allow disabling it. This change brings the SO_REUSEADDR behavior closer to
the behavior implemented by Linux and adds a new SO_REUSEADDR disabled
behavior. Like Linux, SO_REUSEADDR is now disabled by default.
PiperOrigin-RevId: 317984380
|
|
... and unify logic for detached netsted endpoints.
sniffer.go caused crashes if a packet delivery is attempted when the dispatcher
is nil.
Extracted the endpoint nesting logic into a common composable type so it can be
used by the Fuchsia Netstack (the pattern is widespread there).
PiperOrigin-RevId: 317682842
|
|
Test:
- TestIncrementChecksumErrors
Fixes #2943
PiperOrigin-RevId: 317348158
|
|
Users that never set iptables rules shouldn't incur the iptables performance
cost. Suggested by Ian (@iangudger).
PiperOrigin-RevId: 317232921
|
|
Metadata was useful for debugging and safety, but enough tests exist that we
should see failures when (de)serialization is broken. It made stack
initialization more cumbersome and it's also getting in the way of ip6tables.
PiperOrigin-RevId: 317210653
|
|
Updates #173,#6
Fixes #2888
PiperOrigin-RevId: 317087652
|
|
When a tcp.timer or tcpip.Route is no longer used, clean up its
resources so that unused memory may be released.
PiperOrigin-RevId: 317046582
|
|
... to help reduce flakes.
When waiting for an event to occur, use a timeout of 10s. When waiting
for an event to not occur, use a timeout of 1s.
Test: Ran test locally w/ run count of 1000 with and without gotsan.
PiperOrigin-RevId: 316998128
|
|
Ensure that CurrentConnected stat is updated on any errors and cleanups
during connected state processing.
Fixes #2968
PiperOrigin-RevId: 316919426
|
|
PiperOrigin-RevId: 316767969
|
|
In passive open cases, we transition to Established state after
initializing endpoint's sender and receiver. With this we lose out
on any updates coming from the ACK that completes the handshake.
This change ensures that we uniformly transition to Established in all
cases and does minor cleanups.
Fixes #2938
PiperOrigin-RevId: 316567014
|
|
I am not really sure what the point of this is, but someone filed a bug about
it, so I assume something relies on it.
PiperOrigin-RevId: 316225127
|
|
Tentative addresses should not be used when finding a route. This change
fixes a bug where a tentative address may have been used.
Test: stack_test.TestDADResolve
PiperOrigin-RevId: 315997624
|
|
On UDP sockets, SO_REUSEADDR allows multiple sockets to bind to the same
address, but only delivers packets to the most recently bound socket. This
differs from the behavior of SO_REUSEADDR on TCP sockets. SO_REUSEADDR for TCP
sockets will likely need an almost completely independent implementation.
SO_REUSEADDR has some odd interactions with the similar SO_REUSEPORT. These
interactions are tested fairly extensively and all but one particularly odd
one (that honestly seems like a bug) behave the same on gVisor and Linux.
PiperOrigin-RevId: 315844832
|
|
Minimum header sizes are already checked in each `case` arm below. Worse, the
ICMP entries in transportProtocolMinSizes are incorrect, and produce false "raw
packet" logs.
PiperOrigin-RevId: 315730073
|
|
PiperOrigin-RevId: 315711208
|
|
After this change e.mu is only promoted to exclusively locked during
route.Resolve. It downgrades back to read-lock afterwards.
This prevents the second RLock() call gets stuck later in the stack.
https://syzkaller.appspot.com/bug?id=065b893bd8d1d04a4e0a1d53c578537cde1efe99
Syzkaller logs does not contain interesting stack traces.
The following stack trace is obtained by running repro locally.
goroutine 53 [semacquire, 3 minutes]:
runtime.gopark(0xfd4278, 0x1896320, 0xc000301912, 0x4)
GOROOT/src/runtime/proc.go:304 +0xe0 fp=0xc0000e25f8 sp=0xc0000e25d8 pc=0x437170
runtime.goparkunlock(...)
GOROOT/src/runtime/proc.go:310
runtime.semacquire1(0xc0001220b0, 0xc00000a300, 0x1, 0x0)
GOROOT/src/runtime/sema.go:144 +0x1c0 fp=0xc0000e2660 sp=0xc0000e25f8 pc=0x4484e0
sync.runtime_Semacquire(0xc0001220b0)
GOROOT/src/runtime/sema.go:56 +0x42 fp=0xc0000e2690 sp=0xc0000e2660 pc=0x448132
gvisor.dev/gvisor/pkg/sync.(*RWMutex).RLock(...)
pkg/sync/rwmutex_unsafe.go:76
gvisor.dev/gvisor/pkg/tcpip/transport/udp.(*endpoint).HandleControlPacket(0xc000122000, 0x7ee5, 0xc00053c16c, 0x4, 0x5e21, 0xc00053c224, 0x4, 0x1, 0x0, 0xc00007ed00)
pkg/tcpip/transport/udp/endpoint.go:1345 +0x169 fp=0xc0000e26d8 sp=0xc0000e2690 pc=0x9843f9
......
gvisor.dev/gvisor/pkg/tcpip/transport/udp.(*protocol).HandleUnknownDestinationPacket(0x18bb5a0, 0xc000556540, 0x5e21, 0xc00053c16c, 0x4, 0x7ee5, 0xc00053c1ec, 0x4, 0xc00007e680, 0x4)
pkg/tcpip/transport/udp/protocol.go:143 +0xb9a fp=0xc0000e8260 sp=0xc0000e7510 pc=0x9859ba
......
gvisor.dev/gvisor/pkg/tcpip/transport/udp.sendUDP(0xc0001220d0, 0xc00053ece0, 0x1, 0x1, 0x883, 0x1405e217ee5, 0x11100a0, 0xc000592000, 0xf88780)
pkg/tcpip/transport/udp/endpoint.go:924 +0x3b0 fp=0xc0000ed390 sp=0xc0000ec750 pc=0x981af0
gvisor.dev/gvisor/pkg/tcpip/transport/udp.(*endpoint).write(0xc000122000, 0x11104e0, 0xc00020a460, 0x0, 0x0, 0x0, 0x0, 0x0)
pkg/tcpip/transport/udp/endpoint.go:510 +0x4ad fp=0xc0000ed658 sp=0xc0000ed390 pc=0x97f2dd
PiperOrigin-RevId: 315590041
|
|
NDP packets are sent periodically from NDP timers. These timers do not
hold the NIC lock when sending packets as the packet write operation
may take some time. While the lock is not held, the NIC may be removed
by some other goroutine. This change handles that scenario gracefully.
Test: stack_test.TestRemoveNICWhileHandlingRSTimer
PiperOrigin-RevId: 315524143
|
|
Netstack has traditionally parsed headers on-demand as a packet moves up the
stack. This is conceptually simple and convenient, but incompatible with
iptables, where headers can be inspected and mangled before even a routing
decision is made.
This changes header parsing to happen early in the incoming packet path, as soon
as the NIC gets the packet from a link endpoint. Even if an invalid packet is
found (e.g. a TCP header of insufficient length), the packet is passed up the
stack for proper stats bookkeeping.
PiperOrigin-RevId: 315179302
|