path: root/src/queueing.h
Age | Commit message | Author
2021-02-18 | queueing: get rid of per-peer ring buffers | Jason A. Donenfeld
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is a 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory.

In order to get rid of these per-peer allocations, this commit switches to a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1].

Performance-wise, ordinarily list-based queues aren't preferable to ring buffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check whether it was full with head == tail, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real-world deployments, without taking much of a performance hit.

[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
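For illustration, the shape of Vyukov's algorithm [1] is roughly the following. This is a self-contained userspace sketch using C11 atomics; the names (mpsc_queue, node, mpsc_*) are illustrative, not the actual queueing.h code, which threads the links through skb->prev and uses kernel atomics instead:

    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
            _Atomic(struct node *) next;
    };

    struct mpsc_queue {
            _Atomic(struct node *) head; /* producers swing this forward */
            struct node *tail;           /* only the single consumer touches this */
            struct node stub;            /* sentinel, so the list is never empty */
    };

    static void mpsc_init(struct mpsc_queue *q)
    {
            atomic_init(&q->stub.next, NULL);
            atomic_init(&q->head, &q->stub);
            q->tail = &q->stub;
    }

    /* Multi-producer enqueue. The window between the exchange and the
     * store below is the tight race described above: a concurrent
     * consumer sees prev->next == NULL and terminates early. */
    static void mpsc_enqueue(struct mpsc_queue *q, struct node *n)
    {
            struct node *prev;

            atomic_store(&n->next, NULL);
            prev = atomic_exchange(&q->head, n);
            atomic_store(&prev->next, n);
    }

    /* Single-consumer dequeue. Returns NULL when the queue is empty or
     * when a producer is caught mid-enqueue (the "stop early" case). */
    static struct node *mpsc_dequeue(struct mpsc_queue *q)
    {
            struct node *tail = q->tail;
            struct node *next = atomic_load(&tail->next);

            if (tail == &q->stub) {
                    if (!next)
                            return NULL; /* queue is empty */
                    q->tail = next;
                    tail = next;
                    next = atomic_load(&tail->next);
            }
            if (next) {
                    q->tail = next;
                    return tail;
            }
            if (tail != atomic_load(&q->head))
                    return NULL; /* producer mid-enqueue: stop early */
            /* tail is the last real node; re-insert the stub behind it so
             * we can hand tail back while keeping the list non-empty. */
            mpsc_enqueue(q, &q->stub);
            next = atomic_load(&tail->next);
            if (next) {
                    q->tail = next;
                    return tail;
            }
            return NULL;
    }

In the WireGuard case, the "stop early" return is harmless precisely because the producer re-queues the per-peer workqueue item after completing its add, so the consumer always gets another chance to drain.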
2020-06-30 | queueing: make use of ip_tunnel_parse_protocol | Jason A. Donenfeld
Now that wg_examine_packet_protocol has been added for general consumption as ip_tunnel_parse_protocol, it's possible to remove wg_examine_packet_protocol and simply use the new ip_tunnel_parse_protocol function directly.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-20 | compat: backport renamed/missing skb hash members | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-19 | queueing: preserve flow hash across packet scrubbing | Jason A. Donenfeld
It's important that we clear most header fields during encapsulation and decapsulation, because the packet is substantially changed, and we don't want any info leak or logic bug due to an accidental correlation. But, for encapsulation, it's wrong to clear skb->hash, since it's used by fq_codel and flow dissection in general. Without it, classification does not proceed as usual. This change might make it easier to estimate the number of inner flows by examining clustering of out-of-order packets, but this shouldn't open up anything that can't already be inferred otherwise (e.g. syn packet size inference), and fq_codel can be disabled anyway.

Furthermore, it might be the case that the hash isn't used or queried at all until after wireguard transmits the encrypted UDP packet, which means skb->hash might still be zero at this point, and thus no hash taken over the inner packet data. In order to address this situation, we force a calculation of skb->hash before encrypting packet data. Of course this means that fq_codel might transmit packets slightly more out of order than usual. Toke did some testing on beefy machines with high quantities of parallel flows and found that increasing the replay-attack counter to 8192 takes care of the most pathological cases pretty well.

Reported-by: Dave Taht <dave.taht@gmail.com>
Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
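In sketch form, the two halves of the fix look roughly like this. The saved fields are real struct sk_buff members; the surrounding function body is condensed and the exact call sites are illustrative, not a verbatim copy of the patch:

    static inline void wg_reset_packet(struct sk_buff *skb, bool encapsulating)
    {
            u8 l4_hash = skb->l4_hash;
            u8 sw_hash = skb->sw_hash;
            u32 hash = skb->hash;

            skb_scrub_packet(skb, true);
            /* ... clear the remaining header state as before ... */
            if (encapsulating) {
                    /* Restore the flow hash only for the encapsulation
                     * direction, so fq_codel can still classify. */
                    skb->l4_hash = l4_hash;
                    skb->sw_hash = sw_hash;
                    skb->hash = hash;
            }
    }

    /* And on the transmit side, before encryption, force the hash to be
     * computed so there is actually something to preserve: */
    skb_get_hash(skb);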
2020-03-28 | queueing: backport skb_reset_redirect change from 5.6 | Jason A. Donenfeld
This backports upstream commit 2c64605b590edadb3fb46d1ec6badb49e940b479. It makes no difference for us, but it's nice to keep this code in sync with upstream as much as possible.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-17 | wireguard: queueing: account for skb->protocol==0 | Jason A. Donenfeld
We carry out checks to the effect of:

    if (skb->protocol != wg_examine_packet_protocol(skb))
            goto err;

By having wg_skb_examine_untrusted_ip_hdr return 0 on failure, this means that the check above still passes in the case where skb->protocol is zero, which is possible to hit with AF_PACKET:

    struct sockaddr_pkt saddr = { .spkt_device = "wg0" };
    unsigned char buffer[5] = { 0 };
    sendto(socket(AF_PACKET, SOCK_PACKET, /* skb->protocol = */ 0),
           buffer, sizeof(buffer), 0,
           (const struct sockaddr *)&saddr, sizeof(saddr));

Additional checks mean that this isn't actually a problem in the code base, but I could imagine it becoming a problem later if the function is used more liberally. I would prefer to fix this by having wg_examine_packet_protocol return a 32-bit ~0 value on failure, which will never match any value of skb->protocol, and which would simply change the generated code from a mov to a movzx. However, sparse complains, and adding __force casts doesn't seem like a good idea, so instead we just add a simple helper function to check for the zero return value. Since wg_examine_packet_protocol itself gets inlined, this winds up not adding an additional branch to the generated code, since the 0 return value already happens in a mergeable branch.

Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
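The helper added here is essentially the following sketch, wrapping the existing examine function so a zero return can never masquerade as a match:

    static inline bool wg_check_packet_protocol(struct sk_buff *skb)
    {
            __be16 real_protocol = wg_examine_packet_protocol(skb);
            return real_protocol && skb->protocol == real_protocol;
    }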
2019-07-02 | receive: queue dead packets to napi queue instead of empty rx_queue | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-27 | queueing: net-next has changed signature of skb_probe_transport_header | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-03 | queueing: more reasonable allocator function convention | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-01-07 | global: update copyright | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27 | send: consider dropped stage packets to be dropped | Jason A. Donenfeld
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08 | global: more nits | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08 | global: rename struct wireguard_ to struct wg_ | Jason A. Donenfeld
This required a bit of pruning of our christmas trees.

Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-02 | global: prefix all functions with wg_ | Jason A. Donenfeld
I understand why this must be done, though I'm not so happy about having to do it. In some places, it puts us over 80 chars and we have to break lines up in further ugly ways. And in general, I think this makes things harder to read. Yet another thing we must do to please upstream.

Maybe this can be replaced in the future by some kind of automatic module namespacing logic in the linker, or even combined with LTO and aggressive symbol stripping.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-20 | global: put SPDX identifier on its own line | Jason A. Donenfeld
The kernel has very specific rules correlating file type with comment type, and also SPDX identifiers can't be merged with other comments.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
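For a C header like queueing.h, the resulting convention looks roughly like this, with the identifier first, alone in its own comment; the exact copyright year range shown is illustrative:

    /* SPDX-License-Identifier: GPL-2.0 */
    /*
     * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
     */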
2018-08-28 | global: run through clang-format | Jason A. Donenfeld
This is the worst commit in the whole repo, making the code much less readable, but so it goes with upstream maintainers. We are now woefully wrapped at 80 columns.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-02 | queueing: ensure strictly ordered loads and stores | Jason A. Donenfeld
We don't want a consumer to read plaintext when it's supposed to be reading ciphertext, which means we need to synchronize across cores.

Suggested-by: Jann Horn <jann@thejh.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
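The ordering in question is the classic publish/consume pairing on the per-packet state. A condensed sketch using the kernel's release/acquire atomics; PACKET_CB and the PACKET_STATE_* names appear in the tree, while the wrapper functions here are purely illustrative:

    /* Encryption worker: the release forbids the ciphertext stores from
     * sinking below the state change that publishes them. */
    static void publish_encrypted(struct sk_buff *skb)
    {
            atomic_set_release(&PACKET_CB(skb)->state, PACKET_STATE_CRYPTED);
    }

    /* Per-peer tx worker: the acquire pairs with the release above, so
     * once CRYPTED is observed, reads of the payload see ciphertext. */
    static bool ready_to_send(struct sk_buff *skb)
    {
            return atomic_read_acquire(&PACKET_CB(skb)->state) ==
                   PACKET_STATE_CRYPTED;
    }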
2018-08-01 | queueing: document double-adding and reference conditions | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-01 | queueing: keep reference to peer after setting atomic state bit | Jason A. Donenfeld
After we atomic_set, the peer is allowed to be freed, which means if we want to continue to reference it, we need to bump the reference count. This was introduced a few commits ago by b713ab0e when implementing some simplification suggestions.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
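A condensed sketch of the fixed enqueue path, using the later wg_-prefixed names; treat the exact fields and arguments as illustrative rather than a verbatim copy:

    static inline void wg_queue_enqueue_per_peer(struct crypt_queue *queue,
                                                 struct sk_buff *skb,
                                                 enum packet_state state)
    {
            /* Take the reference first: as soon as the state is set, the
             * consumer may run and drop the last reference to the peer. */
            struct wg_peer *peer = wg_peer_get(PACKET_PEER(skb));

            atomic_set(&PACKET_CB(skb)->state, state);
            queue_work_on(wg_cpumask_choose_online(&peer->serial_work_cpu,
                                                   peer->internal_id),
                          peer->device->packet_crypt_wq, &queue->work);
            wg_peer_put(peer);
    }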
2018-07-31 | peer: simplify rcu reference counts | Jason A. Donenfeld
Use RCU reference counts only when we must, and otherwise use a more reasonably named function.

Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
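The distinction amounts to two getters, sketched here with the later wg_-prefixed names and condensed kref-based bodies; assume the details differ slightly in the actual peer.h of this era:

    /* Under rcu_read_lock(), the refcount may already have hit zero, so
     * the get must be allowed to fail. */
    static inline struct wg_peer *wg_peer_get_maybe_zero(struct wg_peer *peer)
    {
            if (unlikely(!peer || !kref_get_unless_zero(&peer->refcount)))
                    return NULL;
            return peer;
    }

    /* A caller that already holds a reference can use the more
     * reasonably named, infallible version. */
    static inline struct wg_peer *wg_peer_get(struct wg_peer *peer)
    {
            kref_get(&peer->refcount);
            return peer;
    }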
2018-07-08 | receive: use NAPI on the receive path | Jonathan Neuschäfer
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
[Jason: fixed up the flushing of the rx_queue in peer_remove]
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-05-13 | queueing: preserve pfmemalloc header bit | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-04-15 | send: account for route-based MTU | Jason A. Donenfeld
It might be that a particular route has a different MTU than the interface, via `ip route add ... dev wg0 mtu 1281`, for example. In this case, it's important that we don't accidentally pad beyond the end of the MTU. We accomplish that in this patch by carrying forward the MTU from the dst if it exists. We also add a unit test for this issue.

Reported-by: Roman Mamedov <rm.wg@romanrm.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
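Condensed, the xmit path picks the effective MTU along these lines (a sketch; the real code also caches the value per-packet, and the exact dst validity check may differ):

    /* Prefer a route-specific MTU (e.g. one set with `ip route add ...
     * dev wg0 mtu 1281`) over the device MTU as the padding bound. */
    unsigned int mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;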
2018-02-20 | queueing: skb_reset: mark as xnet | Jason A. Donenfeld
This was avoided for a long time, because I wanted the packet to be charged to the original socket for as long as possible. However, this broke net_cls, which looks at skb->sk for additional late-stage routing decisions. So we had no choice but to ensure that skb->sk is NULL by the time of xmit, and this means calling the skb destructor.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
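Mechanically, this is just flipping the xnet argument of the scrub call (a one-line sketch of the call in question):

    /* With xnet true, skb_scrub_packet() also calls skb_orphan(), which
     * runs the destructor and leaves skb->sk NULL before xmit. */
    skb_scrub_packet(skb, true /* xnet */);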
2018-01-03 | global: year bump | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-12-09 | global: add SPDX tags to all files | Greg Kroah-Hartman
It's good to have SPDX identifiers in all files as the Linux kernel developers are working to add these identifiers to all files. Update all files with the correct SPDX license identifier based on the license text of the project or based on the license in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boilerplate text.

Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Modified-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31 | global: style nits | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31 | global: accept decent check_patch.pl suggestions | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-05 | queueing: cleanup skb_padding | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-05 | queueing: move from ctx to cb | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-05 | receive: do not store endpoint in ctx | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-05 | queueing: use ptr_ring instead of linked lists | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-03 | global: add space around variable declarations | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-03 | global: use _WG prefix for include guards | Jason A. Donenfeld
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
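Applied to this file, the pattern named in the subject looks like:

    #ifndef _WG_QUEUEING_H
    #define _WG_QUEUEING_H

    /* ... declarations ... */

    #endif /* _WG_QUEUEING_H */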
2017-09-25 | queueing: more standard init/uninit names | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-19 | queueing: rename cpumask function | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-19 | queueing: clean up worthless helper | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-19 | queueing: no need to memzero struct | Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-18 | queue: entirely rework parallel system | Jason A. Donenfeld
This removes our dependency on padata and moves to a different mode of multiprocessing that is more efficient. This began as Samuel Holland's GSoC project and was gradually reworked/redesigned/rebased into this present commit, which is a combination of his initial contribution and my subsequent rewriting and redesigning.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>