summaryrefslogtreecommitdiffhomepage
path: root/pkg/tcpip/stack
AgeCommit message (Collapse)Author
2020-09-23Merge release-20200914.0-137-g99decaadd (automated)gVisor bot
2020-09-23Extract ICMP error sender from UDPJulian Elischer
Store transport protocol number on packet buffers for use in ICMP error generation. Updates #2211. PiperOrigin-RevId: 333252762
2020-09-22Move stack.fakeClock into a separate packageToshi Kikuchi
PiperOrigin-RevId: 333138701
2020-09-21Merge release-20200907.0-157-gca3087472 (automated)gVisor bot
2020-09-20Merge pull request #3651 from ianlewis:ip-forwardinggVisor bot
PiperOrigin-RevId: 332760843
2020-09-18Merge release-20200907.0-145-gbd69afdcd (automated)gVisor bot
2020-09-18Count packets dropped by iptables in IPStatsKevin Krakauer
PiperOrigin-RevId: 332486383
2020-09-18Merge release-20200907.0-135-g0b8d306e6 (automated)gVisor bot
2020-09-17ip6tables: filter table supportKevin Krakauer
`ip6tables -t filter` is now usable. NAT support will come in a future CL. #3549 PiperOrigin-RevId: 332381801
2020-09-16Merge release-20200907.0-66-g29ce0ad16 (automated)gVisor bot
2020-09-16Bind loopback subnets' lifetime to perm addressGhanan Gowripalan
The lifetime of addreses in a loopback interface's associated subnets should be bound to their respective permanent addresses. This change also fixes a race when the stack attempts to get an IPv4 rereferencedNetworkEndpoint for an address in an associated subnet on a loopback interface. Before this change, the stack would only check if an IPv4 address is contained in an associated subnet while holding a read lock but wouldn't do this same check after releasing the read lock for a write lock to create a temporary address. This may cause the stack to bind the lifetime of the address to a new (temporary) endpoint instead of the associated subnet's permanent address. Test: integration_test.TestLoopbackSubnetLifetimeBoundToAddr PiperOrigin-RevId: 332094719
2020-09-16Merge release-20200907.0-60-g87c5c0ad2 (automated)gVisor bot
2020-09-16Receive broadcast packets on interested endpointsGhanan Gowripalan
When a broadcast packet is received by the stack, the packet should be delivered to each endpoint that may be interested in the packet. This includes all any address and specified broadcast address listeners. Test: integration_test.TestReuseAddrAndBroadcast PiperOrigin-RevId: 332060652
2020-09-15Merge release-20200907.0-44-gd3880b76c (automated)gVisor bot
2020-09-15Don't conclude broadcast from route destinationGhanan Gowripalan
The routing table (in its current) form should not be used to make decisions about whether a remote address is a broadcast address or not (for IPv4). Note, a destination subnet does not always map to a network. E.g. RouterA may have a route to 192.168.0.0/22 through RouterB, but RouterB may be configured with 4x /24 subnets on 4 different interfaces. See https://github.com/google/gvisor/issues/3938. PiperOrigin-RevId: 331819868
2020-09-08Increase resolution timeout for TestCacheResolutionSam Balana
Fixes pkg/tcpip/stack:stack_test flake experienced while running TestCacheResolution with gotsan. This occurs when the test-runner takes longer than the resolution timeout to call linkAddrCache.get. In this test we don't care about the resolution timeout, so set it to the maximum and rely on test-runner timeouts to avoid deadlocks. PiperOrigin-RevId: 330566250
2020-09-08Merge release-20200818.0-127-gd35f07b36 (automated)gVisor bot
2020-09-08Improve type safety for transport protocol optionsGhanan Gowripalan
The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-04Merge release-20200818.0-121-g805861ca3 (automated)gVisor bot
2020-09-03Use fine-grained mutex for stack.cleanupEndpoints.Bhasker Hariharan
stack.cleanupEndpoints is protected by the stack.mu but that can cause contention as the stack mutex is already acquired in a lot of hot paths during new endpoint creation /cleanup etc. Moving this to a fine grained mutex should reduce contention on the stack.mu. PiperOrigin-RevId: 330026151
2020-09-03Merge release-20200818.0-120-g76e51c8b9 (automated)gVisor bot
2020-09-03Use atomic.Value for Stack.tcpProbeFunc.Jamie Liu
b/166980357#comment56 shows: - 837 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).StartTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop - 695 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).CompleteTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop - 3882 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).GetTCPProbe gvisor/pkg/tcpip/transport/tcp/tcp.newEndpoint gvisor/pkg/tcpip/transport/tcp/tcp.(*protocol).NewEndpoint gvisor/pkg/tcpip/stack/stack.(*Stack).NewEndpoint All of these are contending on Stack.mu. Stack.StartTransportEndpointCleanup() and Stack.CompleteTransportEndpointCleanup() insert/delete TransportEndpoints in a map (Stack.cleanupEndpoints), and the former also does endpoint unregistration while holding Stack.mu, so it's not immediately clear how feasible it is to replace the map with a mutex-less implementation or how much doing so would help. However, Stack.GetTCPProbe() just reads a function object (Stack.tcpProbeFunc) that is almost always nil (as far as I can tell, Stack.AddTCPProbe() is only called in tests), and it's called for every new TCP endpoint. So converting it to an atomic.Value should significantly reduce contention on Stack.mu, improving TCP endpoint creation latency and allowing TCP endpoint cleanup to proceed. PiperOrigin-RevId: 330004140
2020-09-02Fix Accept to not return error for sockets in accept queue.Bhasker Hariharan
Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-08-28Merge release-20200818.0-85-gd5787f628 (automated)gVisor bot
2020-08-28Don't bind loopback to all IPs in an IPv6 subnetGhanan Gowripalan
An earlier change considered the loopback bound to all addresses in an assigned subnet. This should have only be done for IPv4 to maintain compatability with Linux: ``` $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 0 received, 100% packet loss, time 3062ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes ^C --- 2001:db8::2 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2030ms $ sudo ip addr add 2001:db8::1/64 dev lo $ ping 2001:db8::1 PING 2001:db8::1(2001:db8::1) 56 data bytes 64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=0.055 ms 64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.074 ms 64 bytes from 2001:db8::1: icmp_seq=3 ttl=64 time=0.073 ms 64 bytes from 2001:db8::1: icmp_seq=4 ttl=64 time=0.071 ms ^C --- 2001:db8::1 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3075ms rtt min/avg/max/mdev = 0.055/0.068/0.074/0.007 ms $ ping 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes From 2001:db8::1 icmp_seq=1 Destination unreachable: No route From 2001:db8::1 icmp_seq=2 Destination unreachable: No route From 2001:db8::1 icmp_seq=3 Destination unreachable: No route From 2001:db8::1 icmp_seq=4 Destination unreachable: No route ^C --- 2001:db8::2 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3070ms ``` Test: integration_test.TestLoopbackAcceptAllInSubnet PiperOrigin-RevId: 329011566
2020-08-28Merge release-20200818.0-83-gbdd5996a7 (automated)gVisor bot
2020-08-28Improve type safety for network protocol optionsGhanan Gowripalan
The existing implementation for NetworkProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for network protocol options that may be set or queried which network protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. PiperOrigin-RevId: 328980359
2020-08-28Merge release-20200818.0-81-gb3ff31d04 (automated)gVisor bot
2020-08-28fix panic when calling SO_ORIGINAL_DST without initializing iptablesKevin Krakauer
Reported-by: syzbot+074ec22c42305725b79f@syzkaller.appspotmail.com PiperOrigin-RevId: 328963899
2020-08-28Merge release-20200818.0-79-g8ae0ab722 (automated)gVisor bot
2020-08-28Use a single NetworkEndpoint per addressGhanan Gowripalan
This change was already done as of https://github.com/google/gvisor/commit/1736b2208f but https://github.com/google/gvisor/commit/a174aa7597 conflicted with that change and it was missed in reviews. This change fixes the conflict. PiperOrigin-RevId: 328920372
2020-08-27Improve type safety for socket optionsGhanan Gowripalan
The existing implementation for {G,S}etSockOpt take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for socket options that may be set or queried which socket option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. Fixes #3714. RELNOTES: n/a PiperOrigin-RevId: 328832161
2020-08-27Add function to get error from a tcpip.EndpointGhanan Gowripalan
In an upcoming CL, socket option types are made to implement a marker interface with pointer receivers. Since this results in calling methods of an interface with a pointer, we incur an allocation when attempting to get an Endpoint's last error with the current implementation. When calling the method of an interface, the compiler is unable to determine what the interface implementation does with the pointer (since calling a method on an interface uses virtual dispatch at runtime so the compiler does not know what the interface method will do) so it allocates on the heap to be safe incase an implementation continues to hold the pointer after the functioon returns (the reference escapes the scope of the object). In the example below, the compiler does not know what b.foo does with the reference to a it allocates a on the heap as the reference to a may escape the scope of a. ``` var a int var b someInterface b.foo(&a) ``` This change removes the opportunity for that allocation. RELNOTES: n/a PiperOrigin-RevId: 328796559
2020-08-27Merge release-20200818.0-68-g01a35a2f1 (automated)gVisor bot
2020-08-27ip6tables: (de)serialize ip6tables structsKevin Krakauer
More implementation+testing to follow. #3549. PiperOrigin-RevId: 328770160
2020-08-25Merge release-20200818.0-43-ga174aa759 (automated)gVisor bot
2020-08-25Add option to replace linkAddrCache with neighborCacheSam Balana
This change adds an option to replace the current implementation of ARP through linkAddrCache, with an implementation of NUD through neighborCache. Switching to using NUD for both ARP and NDP is beneficial for the reasons described by RFC 4861 Section 3.1: "[Using NUD] significantly improves the robustness of packet delivery in the presence of failing routers, partially failing or partitioned links, or nodes that change their link-layer addresses. For instance, mobile nodes can move off-link without losing any connectivity due to stale ARP caches." "Unlike ARP, Neighbor Unreachability Detection detects half-link failures and avoids sending traffic to neighbors with which two-way connectivity is absent." Along with these changes exposes the API for querying and operating the neighbor cache. Operations include: - Create a static entry - List all entries - Delete all entries - Remove an entry by address This also exposes the API to change the NUD protocol constants on a per-NIC basis to allow Neighbor Discovery to operate over links with widely varying performance characteristics. See [RFC 4861 Section 10][1] for the list of constants. Finally, an API for subscribing to NUD state changes is exposed through NUDDispatcher. See [RFC 4861 Appendix C][3] for the list of edges. Tests: pkg/tcpip/network/arp:arp_test + TestDirectRequest pkg/tcpip/network/ipv6:ipv6_test + TestLinkResolution + TestNDPValidation + TestNeighorAdvertisementWithTargetLinkLayerOption + TestNeighorSolicitationResponse + TestNeighorSolicitationWithSourceLinkLayerOption + TestRouterAdvertValidation pkg/tcpip/stack:stack_test + TestCacheWaker + TestForwardingWithFakeResolver + TestForwardingWithFakeResolverManyPackets + TestForwardingWithFakeResolverManyResolutions + TestForwardingWithFakeResolverPartialTimeout + TestForwardingWithFakeResolverTwoPackets + TestIPv6SourceAddressSelectionScopeAndSameAddress [1]: https://tools.ietf.org/html/rfc4861#section-10 [2]: https://tools.ietf.org/html/rfc4861#appendix-C Fixes #1889 Fixes #1894 Fixes #1895 Fixes #1947 Fixes #1948 Fixes #1949 Fixes #1950 PiperOrigin-RevId: 328365034
2020-08-24Merge release-20200818.0-32-g339d266be (automated)gVisor bot
2020-08-24Consider loopback bound to all addresses in subnetGhanan Gowripalan
When a loopback interface is configurd with an address and associated subnet, the loopback should treat all addresses in that subnet as an address it owns. This is mimicking linux behaviour as seen below: ``` $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 192.0.2.1 PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data. ^C --- 192.0.2.1 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 1018ms $ ping 192.0.2.2 PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data. ^C --- 192.0.2.2 ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2039ms $ sudo ip addr add 192.0.2.1/24 dev lo $ ip addr show dev lo 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 inet 127.0.0.1/8 scope host lo valid_lft forever preferred_lft forever inet 192.0.2.1/24 scope global lo valid_lft forever preferred_lft forever inet6 ::1/128 scope host valid_lft forever preferred_lft forever $ ping 192.0.2.1 PING 192.0.2.1 (192.0.2.1) 56(84) bytes of data. 64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=0.131 ms 64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=0.046 ms 64 bytes from 192.0.2.1: icmp_seq=3 ttl=64 time=0.048 ms ^C --- 192.0.2.1 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2042ms rtt min/avg/max/mdev = 0.046/0.075/0.131/0.039 ms $ ping 192.0.2.2 PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data. 64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.131 ms 64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.069 ms 64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.049 ms 64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.035 ms ^C --- 192.0.2.2 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3049ms rtt min/avg/max/mdev = 0.035/0.071/0.131/0.036 ms ``` Test: integration_test.TestLoopbackAcceptAllInSubnet PiperOrigin-RevId: 328188546
2020-08-20Merge release-20200810.0-74-g129018ab3 (automated)gVisor bot
2020-08-20Consistent precondition formattingMichael Pratt
Our "Preconditions:" blocks are very useful to determine the input invariants, but they are bit inconsistent throughout the codebase, which makes them harder to read (particularly cases with 5+ conditions in a single paragraph). I've reformatted all of the cases to fit in simple rules: 1. Cases with a single condition are placed on a single line. 2. Cases with multiple conditions are placed in a bulleted list. This format has been added to the style guide. I've also mentioned "Postconditions:", though those are much less frequently used, and all uses already match this style. PiperOrigin-RevId: 327687465
2020-08-17Merge branch 'master' into ip-forwardingIan Lewis
- Merges aleksej-paschenko's with HEAD - Adds vfs2 support for ip_forward
2020-08-17Merge release-20200810.0-40-ge3c4bbd10 (automated)gVisor bot
2020-08-17Remove address range functionsGhanan Gowripalan
Should have been removed in cl/326791119 https://github.com/google/gvisor/commit/9a7b5830aa063895f67ca0fdf653a46906374613 PiperOrigin-RevId: 327074156
2020-08-15Merge release-20200810.0-36-g9a7b5830a (automated)gVisor bot
2020-08-15Don't support address rangesGhanan Gowripalan
Previously the netstack supported assignment of a range of addresses. This feature is not used so remove it. PiperOrigin-RevId: 326791119
2020-08-15Merge release-20200810.0-35-g1736b2208 (automated)gVisor bot
2020-08-14Use a single NetworkEndpoint per NIC per protocolGhanan Gowripalan
The NetworkEndpoint does not need to be created for each address. Most of the work the NetworkEndpoint does is address agnostic. PiperOrigin-RevId: 326759605
2020-08-13Merge release-20200810.0-23-g47515f475 (automated)gVisor bot
2020-08-13Migrate to PacketHeader API for PacketBuffer.Ting-Yu Wang
Formerly, when a packet is constructed or parsed, all headers are set by the client code. This almost always involved prepending to pk.Header buffer or trimming pk.Data portion. This is known to prone to bugs, due to the complexity and number of the invariants assumed across netstack to maintain. In the new PacketHeader API, client will call Push()/Consume() method to construct/parse an outgoing/incoming packet. All invariants, such as slicing and trimming, are maintained by the API itself. NewPacketBuffer() is introduced to create new PacketBuffer. Zero value is no longer valid. PacketBuffer now assumes the packet is a concatenation of following portions: * LinkHeader * NetworkHeader * TransportHeader * Data Any of them could be empty, or zero-length. PiperOrigin-RevId: 326507688