summaryrefslogtreecommitdiffhomepage
path: root/pkg/tcpip
AgeCommit message (Collapse)Author
2020-09-29iptables: remove unused min/max NAT range fieldsKevin Krakauer
PiperOrigin-RevId: 334531794
2020-09-29Return permanent addresses when NIC is downGhanan Gowripalan
Test: stack_test.TestGetMainNICAddressWhenNICDisabled PiperOrigin-RevId: 334513286
2020-09-29Don't allow broadcast/multicast source addressGhanan Gowripalan
As per relevant IP RFCS (see code comments), broadcast (for IPv4) and multicast addresses are not allowed. Currently checks for these are done at the transport layer, but since it is explicitly forbidden at the IP layers, check for them there. This change also removes the UDP.InvalidSourceAddress stat since there is no longer a need for it. Test: ip_test.TestSourceAddressValidation PiperOrigin-RevId: 334490971
2020-09-29iptables: refactor to make targets extendableKevin Krakauer
Like matchers, targets should use a module-like register/lookup system. This replaces the brittle switch statements we had before. The only behavior change is supporing IPT_GET_REVISION_TARGET. This makes it much easier to add IPv6 redirect in the next change. Updates #3549. PiperOrigin-RevId: 334469418
2020-09-29Don't generate link-local IPv6 for loopbackGhanan Gowripalan
Linux doesn't generate a link-local address for the loopback interface. Test: integration_test.TestInitialLoopbackAddresses PiperOrigin-RevId: 334453182
2020-09-29Discard IP fragments as soon as it expiresToshi Kikuchi
Currently expired IP fragments are discarded only if another fragment for the same IP datagram is received after timeout or the total size of the fragment queue exceeded a predefined value. Test: fragmentation.TestReassemblingTimeout Fixes #3960 PiperOrigin-RevId: 334423710
2020-09-29Trim Network/Transport Endpoint/ProtocolGhanan Gowripalan
* Remove Capabilities and NICID methods from NetworkEndpoint. * Remove linkEP and stack parameters from NetworkProtocol.NewEndpoint. The LinkEndpoint can be fetched from the NetworkInterface. The stack is passed to the NetworkProtocol when it is created so the NetworkEndpoint can get it from its protocol. * Remove stack parameter from TransportProtocol.NewEndpoint. Like the NetworkProtocol/Endpoint, the stack is passed to the TransportProtocol when it is created. PiperOrigin-RevId: 334332721
2020-09-29Move IP state from NIC to NetworkEndpoint/ProtocolGhanan Gowripalan
* Add network address to network endpoints. Hold network-specific state in the NetworkEndpoint instead of the stack. This results in the stack no longer needing to "know" about the network endpoints and special case certain work for various endpoints (e.g. IPv6 DAD). * Provide NetworkEndpoints with an NetworkInterface interface. Instead of just passing the NIC ID of a NIC, pass an interface so the network endpoint may query other information about the NIC such as whether or not it is a loopback device. * Move NDP code and state to the IPv6 package. NDP is IPv6 specific so there is no need for it to live in the stack. * Control forwarding through NetworkProtocols instead of Stack Forwarding should be controlled on a per-network protocol basis so forwarding configurations are now controlled through network protocols. * Remove stack.referencedNetworkEndpoint. Now that addresses are exposed via AddressEndpoint and only one NetworkEndpoint is created per interface, there is no need for a referenced NetworkEndpoint. * Assume network teardown methods are infallible. Fixes #3871, #3916 PiperOrigin-RevId: 334319433
2020-09-28Fix 1 zero window advertisement bug and a TCP test flake.Bhasker Hariharan
In TestReceiveBufferAutoTuning we now send a keep-alive packet to measure the current window rather than a 1 byte segment as the returned window value in the latter case is reduced due to the 1 byte segment now being held in the receive buffer and can cause the test to flake if the segment overheads were to change. In getSendParams in rcv.go we were advertising a non-zero window even if available window space was zero after we received the previous segment. In such a case newWnd and curWnd will be the same and we end up advertising a tiny but non-zero window and this can cause the next segment to be dropped. PiperOrigin-RevId: 334314070
2020-09-28Fix lingering of TCP socket in the initial state.Nayana Bidari
When the socket is set with SO_LINGER and close()'d in the initial state, it should not linger and return immediately. PiperOrigin-RevId: 334263149
2020-09-28Support creating protocol instances with Stack refGhanan Gowripalan
Network or transport protocols may want to reach the stack. Support this by letting the stack create the protocol instances so it can pass a reference to itself at protocol creation time. Note, protocols do not yet use the stack in this CL but later CLs will make use of the stack from protocols. PiperOrigin-RevId: 334260210
2020-09-26Remove generic ICMP errorsGhanan Gowripalan
Generic ICMP errors were required because the transport dispatcher was given the responsibility of sending ICMP errors in response to transport packet delivery failures. Instead, the transport dispatcher should let network layer know it failed to deliver a packet (and why) and let the network layer make the decision as to what error to send (if any). Fixes #4068 PiperOrigin-RevId: 333962333
2020-09-24Remove useless endpoint constructionTamir Duberstein
PiperOrigin-RevId: 333591566
2020-09-24Change segment/pending queue to use receive buffer limits.Bhasker Hariharan
segment_queue today has its own standalone limit of MaxUnprocessedSegments but this can be a problem in UnlockUser() we do not release the lock till there are segments to be processed. What can happen is as handleSegments dequeues packets more keep getting queued and we will never release the lock. This can keep happening even if the receive buffer is full because nothing can read() till we release the lock. Further having a separate limit for pending segments makes it harder to track memory usage etc. Unifying the limits makes it easier to reason about memory in use and makes the overall buffer behaviour more consistent. PiperOrigin-RevId: 333508122
2020-09-23Remove unused field from neighborEntryGhanan Gowripalan
PiperOrigin-RevId: 333405169
2020-09-23Extract ICMP error sender from UDPJulian Elischer
Store transport protocol number on packet buffers for use in ICMP error generation. Updates #2211. PiperOrigin-RevId: 333252762
2020-09-22Refactor testutil.TestEndpoint and use it instead of limitedEPArthur Sfez
The new testutil.MockLinkEndpoint implementation is not composed by channel.Channel anymore because none of its features were used. PiperOrigin-RevId: 333167753
2020-09-22Move stack.fakeClock into a separate packageToshi Kikuchi
PiperOrigin-RevId: 333138701
2020-09-20Merge pull request #3651 from ianlewis:ip-forwardinggVisor bot
PiperOrigin-RevId: 332760843
2020-09-18Count packets dropped by iptables in IPStatsKevin Krakauer
PiperOrigin-RevId: 332486383
2020-09-18Enqueue TCP sends arriving in SYN_SENT state.Mithun Iyer
TCP needs to enqueue any send requests arriving when the connection is in SYN_SENT state. The data should be sent out soon after completion of the connection handshake. Fixes #3995 PiperOrigin-RevId: 332482041
2020-09-18Use common parsing utilities when sniffingGhanan Gowripalan
Extract parsing utilities so they can be used by the sniffer. Fixes #3930 PiperOrigin-RevId: 332401880
2020-09-17Test IPv4 WritePackets statsKevin Krakauer
IPv6 tests will be added in another CL along with ip6tables. PiperOrigin-RevId: 332389102
2020-09-17ip6tables: filter table supportKevin Krakauer
`ip6tables -t filter` is now usable. NAT support will come in a future CL. #3549 PiperOrigin-RevId: 332381801
2020-09-17{Set,Get} SO_LINGER on all endpoints.Nayana Bidari
SO_LINGER is a socket level option and should be stored on all endpoints even though it is used to linger only for TCP endpoints. PiperOrigin-RevId: 332369252
2020-09-16Automated rollback of changelist 329526153Nayana Bidari
PiperOrigin-RevId: 332097286
2020-09-16Bind loopback subnets' lifetime to perm addressGhanan Gowripalan
The lifetime of addreses in a loopback interface's associated subnets should be bound to their respective permanent addresses. This change also fixes a race when the stack attempts to get an IPv4 rereferencedNetworkEndpoint for an address in an associated subnet on a loopback interface. Before this change, the stack would only check if an IPv4 address is contained in an associated subnet while holding a read lock but wouldn't do this same check after releasing the read lock for a write lock to create a temporary address. This may cause the stack to bind the lifetime of the address to a new (temporary) endpoint instead of the associated subnet's permanent address. Test: integration_test.TestLoopbackSubnetLifetimeBoundToAddr PiperOrigin-RevId: 332094719
2020-09-16Gracefully translate unknown errno.Ting-Yu Wang
Neither POSIX.1 nor Linux defines an upperbound for errno. PiperOrigin-RevId: 332085017
2020-09-16Receive broadcast packets on interested endpointsGhanan Gowripalan
When a broadcast packet is received by the stack, the packet should be delivered to each endpoint that may be interested in the packet. This includes all any address and specified broadcast address listeners. Test: integration_test.TestReuseAddrAndBroadcast PiperOrigin-RevId: 332060652
2020-09-15Move reusable IPv4 test code into a testutil module and refactor itArthur Sfez
The refactor aims to simplify the package, by replacing the Go channel with a PacketBuffer slice. This code will be reused by tests for IPv6 fragmentation. PiperOrigin-RevId: 331860411
2020-09-15Don't conclude broadcast from route destinationGhanan Gowripalan
The routing table (in its current) form should not be used to make decisions about whether a remote address is a broadcast address or not (for IPv4). Note, a destination subnet does not always map to a network. E.g. RouterA may have a route to 192.168.0.0/22 through RouterB, but RouterB may be configured with 4x /24 subnets on 4 different interfaces. See https://github.com/google/gvisor/issues/3938. PiperOrigin-RevId: 331819868
2020-09-14Store multicast memberships in a setTamir Duberstein
This is simpler and more performant. PiperOrigin-RevId: 331639978
2020-09-14Test RST handling in TIME_WAIT.Mithun Iyer
gVisor stack ignores RSTs when in TIME_WAIT which is not the default Linux behavior. Add a packetimpact test to test the same. Also update code comments to reflect the rationale for the current gVisor behavior. PiperOrigin-RevId: 331629879
2020-09-12Cap reassembled IPv6 packets at 65535 octetsToshi Kikuchi
IPv4 can accept 65536-octet reassembled packets. Test: - ipv4_test.TestInvalidFragments - ipv4_test.TestReceiveFragments - ipv6.TestInvalidIPv6Fragments - ipv6.TestReceiveIPv6Fragments Fixes #3770 PiperOrigin-RevId: 331382977
2020-09-08Increase resolution timeout for TestCacheResolutionSam Balana
Fixes pkg/tcpip/stack:stack_test flake experienced while running TestCacheResolution with gotsan. This occurs when the test-runner takes longer than the resolution timeout to call linkAddrCache.get. In this test we don't care about the resolution timeout, so set it to the maximum and rely on test-runner timeouts to avoid deadlocks. PiperOrigin-RevId: 330566250
2020-09-08Fix data race in tcp.GetSockOpt.Bhasker Hariharan
e.ID can't be read without holding e.mu. GetSockOpt was reading e.ID when looking up OriginalDst without holding e.mu. PiperOrigin-RevId: 330562293
2020-09-08Improve type safety for transport protocol optionsGhanan Gowripalan
The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-03Use fine-grained mutex for stack.cleanupEndpoints.Bhasker Hariharan
stack.cleanupEndpoints is protected by the stack.mu but that can cause contention as the stack mutex is already acquired in a lot of hot paths during new endpoint creation /cleanup etc. Moving this to a fine grained mutex should reduce contention on the stack.mu. PiperOrigin-RevId: 330026151
2020-09-03Use atomic.Value for Stack.tcpProbeFunc.Jamie Liu
b/166980357#comment56 shows: - 837 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).StartTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop - 695 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).CompleteTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop - 3882 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).GetTCPProbe gvisor/pkg/tcpip/transport/tcp/tcp.newEndpoint gvisor/pkg/tcpip/transport/tcp/tcp.(*protocol).NewEndpoint gvisor/pkg/tcpip/stack/stack.(*Stack).NewEndpoint All of these are contending on Stack.mu. Stack.StartTransportEndpointCleanup() and Stack.CompleteTransportEndpointCleanup() insert/delete TransportEndpoints in a map (Stack.cleanupEndpoints), and the former also does endpoint unregistration while holding Stack.mu, so it's not immediately clear how feasible it is to replace the map with a mutex-less implementation or how much doing so would help. However, Stack.GetTCPProbe() just reads a function object (Stack.tcpProbeFunc) that is almost always nil (as far as I can tell, Stack.AddTCPProbe() is only called in tests), and it's called for every new TCP endpoint. So converting it to an atomic.Value should significantly reduce contention on Stack.mu, improving TCP endpoint creation latency and allowing TCP endpoint cleanup to proceed. PiperOrigin-RevId: 330004140
2020-09-02Fix Accept to not return error for sockets in accept queue.Bhasker Hariharan
Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-09-01Fix handling of unacceptable ACKs during close.Mithun Iyer
On receiving an ACK with unacceptable ACK number, in a closing state, TCP, needs to reply back with an ACK with correct seq and ack numbers and remain in same state. This change is as per RFC793 page 37, but with a difference that it does not apply to ESTABLISHED state, just as in Linux. Also add more tests to check for OTW sequence number and unacceptable ack numbers in these states. Fixes #3785 PiperOrigin-RevId: 329616283
2020-09-01Automated rollback of changelist 328350576Nayana Bidari
PiperOrigin-RevId: 329526153
2020-08-28Don't bind loopback to all IPs in an IPv6 subnetGhanan Gowripalan
An earlier change considered the loopback bound to all addresses in an assigned subnet. This should have only be done for IPv4 to maintain compatability with Linux: ``` $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 0 received, 100% packet loss, time 3062ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes ^C --- 2001:db8::2 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2030ms $ sudo ip addr add 2001:db8::1/64 dev lo $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes 64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=0.055 ms 64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.074 ms 64 bytes from 2001:db8::1: icmp_seq=3 ttl=64 time=0.073 ms 64 bytes from 2001:db8::1: icmp_seq=4 ttl=64 time=0.071 ms ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3075ms rtt min/avg/max/mdev = 0.055/0.068/0.074/0.007 ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes From 2001:db8::1 icmp_seq=1 Destination unreachable: No route From 2001:db8::1 icmp_seq=2 Destination unreachable: No route From 2001:db8::1 icmp_seq=3 Destination unreachable: No route From 2001:db8::1 icmp_seq=4 Destination unreachable: No route ^C --- 2001:db8::2 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3070ms ``` Test: integration_test.TestLoopbackAcceptAllInSubnet PiperOrigin-RevId: 329011566
2020-08-28Improve type safety for network protocol optionsGhanan Gowripalan
The existing implementation for NetworkProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for network protocol options that may be set or queried which network protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. PiperOrigin-RevId: 328980359
2020-08-28fix panic when calling SO_ORIGINAL_DST without initializing iptablesKevin Krakauer
Reported-by: syzbot+074ec22c42305725b79f@syzkaller.appspotmail.com PiperOrigin-RevId: 328963899
2020-08-28Use a single NetworkEndpoint per addressGhanan Gowripalan
This change was already done as of https://github.com/google/gvisor/commit/1736b2208f but https://github.com/google/gvisor/commit/a174aa7597 conflicted with that change and it was missed in reviews. This change fixes the conflict. PiperOrigin-RevId: 328920372
2020-08-27Improve type safety for socket optionsGhanan Gowripalan
The existing implementation for {G,S}etSockOpt take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for socket options that may be set or queried which socket option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. Fixes #3714. RELNOTES: n/a PiperOrigin-RevId: 328832161
2020-08-27Add function to get error from a tcpip.EndpointGhanan Gowripalan
In an upcoming CL, socket option types are made to implement a marker interface with pointer receivers. Since this results in calling methods of an interface with a pointer, we incur an allocation when attempting to get an Endpoint's last error with the current implementation. When calling the method of an interface, the compiler is unable to determine what the interface implementation does with the pointer (since calling a method on an interface uses virtual dispatch at runtime so the compiler does not know what the interface method will do) so it allocates on the heap to be safe incase an implementation continues to hold the pointer after the functioon returns (the reference escapes the scope of the object). In the example below, the compiler does not know what b.foo does with the reference to a it allocates a on the heap as the reference to a may escape the scope of a. ``` var a int var b someInterface b.foo(&a) ``` This change removes the opportunity for that allocation. RELNOTES: n/a PiperOrigin-RevId: 328796559
2020-08-27ip6tables: (de)serialize ip6tables structsKevin Krakauer
More implementation+testing to follow. #3549. PiperOrigin-RevId: 328770160
2020-08-25Use new reference count utility throughout gvisor.Dean Deng
This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377