summaryrefslogtreecommitdiffhomepage
path: root/src/device.c
AgeCommit message (Collapse)Author
2021-02-18queueing: get rid of per-peer ring buffersJason A. Donenfeld
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18device: do not generate ICMP for non-IP packetsJason A. Donenfeld
If skb->protocol doesn't match the actual skb->data header, it's probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is expecting to reply to a valid IP packet. So this commit has that early mismatch case jump to a later error label. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-21global: avoid double unlikely() notation when using IS_ERR()Antonio Quartulli
The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-30device: implement header_ops->parse_protocol for AF_PACKETJason A. Donenfeld
WireGuard uses skb->protocol to determine packet type, and bails out if it's not set or set to something it's not expecting. For AF_PACKET injection, we need to support its call chain of: packet_sendmsg -> packet_snd -> packet_parse_headers -> dev_parse_header_protocol -> parse_protocol Without a valid parse_protocol, this returns zero, and wireguard then rejects the skb. So, this wires up the ip_tunnel handler for layer 3 packets for that case. Reported-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-22device: avoid circular netns referencesJason A. Donenfeld
Before, we took a reference to the creating netns if the new netns was different. This caused issues with circular references, with two wireguard interfaces swapping namespaces. The solution is to rather not take any extra references at all, but instead simply invalidate the creating netns pointer when that netns is deleted. In order to prevent this from happening again, this commit improves the rough object leak tracking by allowing it to account for created and destroyed interfaces, aside from just peers and keys. That then makes it possible to check for the object leak when having two interfaces take a reference to each others' namespaces. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-17wireguard: queueing: account for skb->protocol==0Jason A. Donenfeld
We carry out checks to the effect of: if (skb->protocol != wg_examine_packet_protocol(skb)) goto err; By having wg_skb_examine_untrusted_ip_hdr return 0 on failure, this means that the check above still passes in the case where skb->protocol is zero, which is possible to hit with AF_PACKET: struct sockaddr_pkt saddr = { .spkt_device = "wg0" }; unsigned char buffer[5] = { 0 }; sendto(socket(AF_PACKET, SOCK_PACKET, /* skb->protocol = */ 0), buffer, sizeof(buffer), 0, (const struct sockaddr *)&saddr, sizeof(saddr)); Additional checks mean that this isn't actually a problem in the code base, but I could imagine it becoming a problem later if the function is used more liberally. I would prefer to fix this by having wg_examine_packet_protocol return a 32-bit ~0 value on failure, which will never match any value of skb->protocol, which would simply change the generated code from a mov to a movzx. However, sparse complains, and adding __force casts doesn't seem like a good idea, so instead we just add a simple helper function to check for the zero return value. Since wg_examine_packet_protocol itself gets inlined, this winds up not adding an additional branch to the generated code, since the 0 return value already happens in a mergable branch. Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-14send: account for mtu=0 devicesJason A. Donenfeld
It turns out there's an easy way to get packets queued up while still having an MTU of zero, and that's via persistent keep alive. This commit makes sure that in whatever condition, we don't wind up dividing by zero. Note that an MTU of zero for a wireguard interface is something quasi-valid, so I don't think the correct fix is to limit it via min_mtu. This can be reproduced easily with: ip link add wg0 type wireguard ip link add wg1 type wireguard ip link set wg0 up mtu 0 ip link set wg1 up wg set wg0 private-key <(wg genkey) wg set wg1 listen-port 1 private-key <(wg genkey) peer $(wg show wg0 public-key) wg set wg0 peer $(wg show wg1 public-key) persistent-keepalive 1 endpoint 127.0.0.1:1 However, while min_mtu=0 seems fine, it makes sense to restrict the max_mtu. This commit also restricts the maximum MTU to the greatest number for which rounding up to the padding multiple won't overflow a signed integer. Packets this large were always rejected anyway eventually, due to checks deeper in, but it seems more sound not to even let the administrator configure something that won't work anyway. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-02-12netns: ensure that icmp src address is correct with natJason A. Donenfeld
This is a small test to ensure that icmp_ndo_send is actually doing the right with with regards to the source address. It tests this by ensuring that the error comes back along the right path. Also, backport the new ndo function for this. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-12-05device: prepare skb_list_walk_safe for upstreamingJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-07-11noise: immediately rekey all peers after changing device private keyJason A. Donenfeld
Reported-by: Derrick Pallas <derrick@pallas.us> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-06-25global: switch to coarse ktimeJason A. Donenfeld
Coarse ktime is broken until [1] in 5.2 and kernels without the backport, so we use fallback code there. The fallback code has also been improved significantly. It now only uses slower clocks on kernels < 3.17, at the expense of some accuracy we're not overly concerned about. [1] https://lore.kernel.org/lkml/tip-e3ff9c3678b4d80e22d2557b68726174578eaf52@git.kernel.org/ Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-03-25device: use skb accessor functions where possibleJason A. Donenfeld
Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-03-17global: the _bh variety of rcu helpers have been unifiedJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-25peer: only synchronize_rcu_bh and traverse trie once when removing all peersJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-03hashtables: decouple hashtable allocations from the main device allocationSultan Alsawaf
The hashtable allocations are quite large, and cause the device allocation in the net framework to stall sometimes while it tries to find a contiguous region that can fit the device struct: [<0000000000000000>] __switch_to+0x94/0xb8 [<0000000000000000>] __alloc_pages_nodemask+0x764/0x7e8 [<0000000000000000>] kmalloc_order+0x20/0x40 [<0000000000000000>] __kmalloc+0x144/0x1a0 [<0000000000000000>] alloc_netdev_mqs+0x5c/0x368 [<0000000000000000>] rtnl_create_link+0x48/0x180 [<0000000000000000>] rtnl_newlink+0x410/0x708 [<0000000000000000>] rtnetlink_rcv_msg+0x190/0x1f8 [<0000000000000000>] netlink_rcv_skb+0x4c/0xf8 [<0000000000000000>] rtnetlink_rcv+0x30/0x40 [<0000000000000000>] netlink_unicast+0x18c/0x208 [<0000000000000000>] netlink_sendmsg+0x19c/0x348 [<0000000000000000>] sock_sendmsg+0x3c/0x58 [<0000000000000000>] ___sys_sendmsg+0x290/0x2b0 [<0000000000000000>] __sys_sendmsg+0x58/0xa0 [<0000000000000000>] SyS_sendmsg+0x10/0x20 [<0000000000000000>] el0_svc_naked+0x34/0x38 [<0000000000000000>] 0xffffffffffffffff To fix the allocation stalls, decouple the hashtable allocations from the device allocation and allocate the hashtables with kvmalloc's implicit __GFP_NORETRY so that the allocations fall back to vmalloc with little resistance. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-03queueing: more reasonable allocator function conventionJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-01-07global: update copyrightJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27device: do not clear keys on sleep for PM_AUTOSLEEPJason A. Donenfeld
This way other devices that use Android style wakelocks will also have the same semantics. We also move this logic into the handler so that it's slightly cleaner and gives us some opportunity to leave a normal comment. Suggested-by: Theodore Ts'o <tytso@mit.edu> Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Sultan Alsawaf <sultanxda@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27timers: it is always reasonable to remove a timerJason A. Donenfeld
If struct timer_list has not been setup, it is zeroed, in which case timer_pending is false, so calling del_timer is safe. Calling del_timer is also safe on a timer that has already been del_timer'd. And calling del_timer is safe after a peer is dead, since the whole point of it being dead is that no more timers are created and all contexts eventually stop. Finally del_timer uses a lock, which means it's safe to call it concurrently. Therefore, we do not need any guards around calls to del_timer. While we're at it, we can get rid of the old lingering timers_enabled boolean which wasn't doing anything anyway anymore. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27send: consider dropped stage packets to be droppedJason A. Donenfeld
Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-09global: give if statements brackets and other cleanupsJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08device: use textual error labels alwaysJason A. Donenfeld
Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08global: more nitsJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08global: rename struct wireguard_ to struct wg_Jason A. Donenfeld
This required a bit of pruning of our christmas trees. Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08global: prefix functions used in callbacks with wg_Jason A. Donenfeld
Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-02global: prefix all functions with wg_Jason A. Donenfeld
I understand why this must be done, though I'm not so happy about having to do it. In some places, it puts us over 80 chars and we have to break lines up in further ugly ways. And in general, I think this makes things harder to read. Yet another thing we must do to please upstream. Maybe this can be replaced in the future by some kind of automatic module namespacing logic in the linker, or even combined with LTO and aggressive symbol stripping. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-20global: put SPDX identifier on its own lineJason A. Donenfeld
The kernel has very specific rules correlating file type with comment type, and also SPDX identifiers can't be merged with other comments. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-04global: prefer sizeof(*pointer) when possibleJason A. Donenfeld
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-03global: satisfy check_patch.pl errorsJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-28global: run through clang-formatJason A. Donenfeld
This is the worst commit in the whole repo, making the code much less readable, but so it goes with upstream maintainers. We are now woefully wrapped at 80 columns. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-04send: switch handshake stamp to an atomicJason A. Donenfeld
Rather than abusing the handshake lock, we're much better off just using a boring atomic64 for this. It's simpler and performs better. Also, while we're at it, we set the handshake stamp both before and after the calculations, in case the calculations block for a really long time waiting for the RNG to initialize. Otherwise it's possible that when the RNG finally initializes, two handshakes are sent back to back, which isn't sensible. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-31device: adjust commentJason A. Donenfeld
Suggested-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-18device: destroy workqueue before freeing queueJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-04device: print daddr not saddr in missing peer errorJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23global: use fast boottime instead of normal boottimeJason A. Donenfeld
Generally if we're inaccurate by a few nanoseconds, it doesn't matter. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23global: use ktime boottime instead of jiffiesJason A. Donenfeld
Since this is a network protocol, expirations need to be accounted for, even across system suspend. On real systems, this isn't a problem, since we're clearing all keys before suspend. But on Android, where we don't do that, this is something of a problem. So, we switch to using boottime instead of jiffies. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-05-31device: do not assume dst is always validJason A. Donenfeld
The new flow offloading feature at the moment does not set the dst. We have a patch pending to fix this upstream, but in the meantime, work around it here. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-05-18device: remove allowedips before individual peersJason A. Donenfeld
This avoids an O(n^2) traversal in favor of an O(n) one. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-04-15send: account for route-based MTUJason A. Donenfeld
It might be that a particular route has a different MTU than the interface, via `ip route add ... dev wg0 mtu 1281`, for example. In this case, it's important that we don't accidently pad beyond the end of the MTU. We accomplish that in this patch by carrying forward the MTU from the dst if it exists. We also add a unit test for this issue. Reported-by: Roman Mamedov <rm.wg@romanrm.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-01-30device: let udev know what kind of device we areJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-01-03global: year bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-12-09global: add SPDX tags to all filesGreg Kroah-Hartman
It's good to have SPDX identifiers in all files as the Linux kernel developers are working to add these identifiers to all files. Update all files with the correct SPDX license identifier based on the license text of the project or based on the license in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Modified-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-29device: clear last handshake timer on ifdownJason A. Donenfeld
Otherwise new handshakes might not occur immediately when the interface goes up and down. Also initialize peers to having a proper zeroed handshake jiffies. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-28device: do not clear keys during sleep on AndroidJason A. Donenfeld
This makes sense for the security model of laptops, but not for clicking phones on and off, where we actually want to be able to handle incoming packets. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-17device: uninitialize socket first in destructionJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-17socket: only free socket after successful creation of newJason A. Donenfeld
When an interface is down, the socket port can change freely. A socket will be allocated when the interface comes up, and if a socket can't be allocated, the interface doesn't come up. However, a socket port can change while the interface is up. In this case, if a new socket with a new port cannot be allocated, it's important to keep the interface in a consistent state. The choices are either to bring down the interface or to preserve the old socket. This patch implements the latter. Reported-by: Marc-Antoine Perennou <keruspe@exherbo.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-10allowedips: rename from routingtableJason A. Donenfeld
Makes it more clear that this _not_ a routing table replacement. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-10device: wait for all peers to be freed before destroyingJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-03device: please lockdepJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31device: expand scope of destruct lockJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>