wireguard-linux-compat - WireGuard Linux compat

Age	Commit message	Author
2021-02-18	queueing: get rid of per-peer ring buffers	Jason A. Donenfeld
	Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is a 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we already have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
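The queue design described in this message can be sketched in user space with C11 atomics. This is a rough illustration of the intrusive MPSC approach from [1], not the code from the commit: `struct node`, `mpsc_init`, `mpsc_push`, and `mpsc_pop` are hypothetical names, `next` stands in for the skb->prev link, and the atomic element count mentioned above is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for an skb; `next` plays the role of skb->prev. */
struct node {
	_Atomic(struct node *) next;
	int value;
};

struct mpsc_queue {
	_Atomic(struct node *) head; /* producers swap themselves in here */
	struct node *tail;           /* touched only by the single consumer */
	struct node stub;            /* dummy node so the list is never empty */
};

static void mpsc_init(struct mpsc_queue *q)
{
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->head, &q->stub);
	q->tail = &q->stub;
}

/* May be called from many threads at once; never blocks or spins. */
static void mpsc_push(struct mpsc_queue *q, struct node *n)
{
	struct node *prev;

	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->head, n, memory_order_acq_rel);
	/* Until this store lands, the consumer cannot reach n (a "partial add"). */
	atomic_store_explicit(&prev->next, n, memory_order_release);
}

/* Single consumer only. NULL means empty, or that a producer is mid-push. */
static struct node *mpsc_pop(struct mpsc_queue *q)
{
	struct node *tail = q->tail;
	struct node *next = atomic_load_explicit(&tail->next, memory_order_acquire);

	if (tail == &q->stub) {
		if (!next)
			return NULL;
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->next, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
		return NULL; /* partial add in flight: terminate early */
	/* tail is the last real node; re-insert the stub so it can be handed out. */
	mpsc_push(q, &q->stub);
	next = atomic_load_explicit(&tail->next, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

The early `return NULL` in `mpsc_pop` corresponds to the reader terminating early on a partial add; in this sketch the caller would simply try again later, much as the message describes the worker being queued again.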
2020-06-30	queueing: make use of ip_tunnel_parse_protocol	Jason A. Donenfeld
	Now that wg_examine_packet_protocol has been added for general consumption as ip_tunnel_parse_protocol, it's possible to remove wg_examine_packet_protocol and simply use the new ip_tunnel_parse_protocol function directly. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-06-29	receive: account for napi_gro_receive never returning GRO_DROP	Jason A. Donenfeld
	The napi_gro_receive function no longer returns GRO_DROP ever, making handling GRO_DROP dead code. This commit removes that dead code. Further, it's not even clear that device drivers have any business in taking action after passing off received packets; that's arguably out of their hands. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-19	noise: separate receive counter from send counter	Jason A. Donenfeld
	In "queueing: preserve flow hash across packet scrubbing", we were required to slightly increase the size of the receive replay counter to something still fairly small, but an increase nonetheless. It turns out that we can recoup some of the additional memory overhead by splitting up the prior union type into two distinct types. Before, we used the same "noise_counter" union for both sending and receiving, with sending just using a simple atomic64_t, while receiving used the full replay counter checker. This meant that most of the memory being allocated for the sending counter was being wasted. Since the old "noise_counter" type increased in size in the prior commit, now is a good time to split up that union type into a distinct "noise_replay_counter" for receiving and a boring atomic64_t for sending, each using neither more nor less memory than required. Also, since sometimes the replay counter is accessed without necessitating additional accesses to the bitmap, we can reduce cache misses by hoisting the always-necessary lock above the bitmap in the struct layout. We also change a "noise_replay_counter" stack allocation to kmalloc in a -DDEBUG selftest so that KASAN doesn't trigger a stack frame warning. All in all, removing a bit of abstraction in this commit makes the code simpler and smaller, in addition to the motivating memory usage recuperation. For example, passing around raw "noise_symmetric_key" structs is something that really only makes sense within noise.c, in the one place where the sending and receiving keys can safely be thought of as the same type of object; subsequent to that, it's important that we uniformly access these through keypair->{sending,receiving}, where their distinct roles are always made explicit. So this patch allows us to draw that distinction clearly as well. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
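To make the split concrete, here is a simplified user-space sketch of the two types: a plain atomic 64-bit counter for sending and a sliding-window replay checker for receiving. The type and function names (`send_counter_t`, `struct replay_counter`, `replay_counter_validate`) are hypothetical, the 8192-bit window is taken from the neighbouring commit, and the lock and message-limit checks of the real code are omitted, so treat this as an illustration rather than the kernel implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BITS_TOTAL  8192u
#define WORD_BITS   64u
#define WORDS       (BITS_TOTAL / WORD_BITS)
#define WINDOW_SIZE (BITS_TOTAL - WORD_BITS)

/* Sending needs nothing more than an atomic 64-bit counter. */
typedef atomic_uint_fast64_t send_counter_t;

/* Receiving needs the highest counter seen plus a bitmap of recent ones. */
struct replay_counter {
	uint64_t counter;          /* highest accepted counter, plus one */
	/* In kernel code a lock would sit here, above the bitmap. */
	uint64_t backtrack[WORDS]; /* one bit per counter inside the window */
};

/* Returns true if `theirs` is fresh, false if it is a replay or too old. */
static bool replay_counter_validate(struct replay_counter *rc, uint64_t theirs)
{
	uint64_t index, index_current, top, i, bit;

	++theirs; /* shift by one so that counter value 0 is representable */
	if (WINDOW_SIZE + theirs < rc->counter)
		return false; /* fell out of the back of the window */

	index = theirs / WORD_BITS;
	if (theirs > rc->counter) {
		/* The window moves forward: clear the words it now covers. */
		index_current = rc->counter / WORD_BITS;
		top = index - index_current;
		if (top > WORDS)
			top = WORDS;
		for (i = 1; i <= top; ++i)
			rc->backtrack[(i + index_current) % WORDS] = 0;
		rc->counter = theirs;
	}

	bit = UINT64_C(1) << (theirs % WORD_BITS);
	if (rc->backtrack[index % WORDS] & bit)
		return false; /* already seen */
	rc->backtrack[index % WORDS] |= bit;
	return true;
}
```

With this layout a sender pays for 8 bytes while only the receiver carries the roughly 1 KiB bitmap, which is the saving the message is after.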
2020-05-19	queueing: preserve flow hash across packet scrubbing	Jason A. Donenfeld
	It's important that we clear most header fields during encapsulation and decapsulation, because the packet is substantially changed, and we don't want any info leak or logic bug due to an accidental correlation. But, for encapsulation, it's wrong to clear skb->hash, since it's used by fq_codel and flow dissection in general. Without it, classification does not proceed as usual. This change might make it easier to estimate the number of inner flows by examining clustering of out of order packets, but this shouldn't open up anything that can't already be inferred otherwise (e.g. syn packet size inference), and fq_codel can be disabled anyway. Furthermore, it might be the case that the hash isn't used or queried at all until after wireguard transmits the encrypted UDP packet, which means skb->hash might still be zero at this point, and thus no hash taken over the inner packet data. In order to address this situation, we force a calculation of skb->hash before encrypting packet data. Of course this means that fq_codel might transmit packets slightly more out of order than usual. Toke did some testing on beefy machines with high quantities of parallel flows and found that increasing the replay-attack counter to 8192 takes care of the most pathological cases pretty well. Reported-by: Dave Taht <dave.taht@gmail.com> Reviewed-and-tested-by: Toke Høiland-Jørgensen <toke@toke.dk> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-05-05	send/receive: use explicit unlikely branch instead of implicit coalescing	Jason A. Donenfeld
	It's very unlikely that send will become true. It's nearly always false between 0 and 120 seconds of a session, and in most cases becomes true only between 120 and 121 seconds before becoming false again. So, unlikely(send) is clearly the right option here. What happened before was that we had this complex boolean expression with multiple likely and unlikely clauses nested. Since this is evaluated left-to-right anyway, the whole thing got converted to unlikely. So, we can clean this up to better represent what's going on. The generated code is the same. Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
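The shape of the cleanup can be sketched as follows: compute the condition as a plain boolean, then apply a single branch hint at the point of use. All names are hypothetical; the 120-second threshold comes from the message above, while the message-count limit is an assumed placeholder. `__builtin_expect` is a GCC/Clang builtin, used here to define `unlikely` outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define unlikely(x) __builtin_expect(!!(x), 0)

#define REKEY_AFTER_TIME_SECS 120u                /* from the message above */
#define REKEY_AFTER_MESSAGES  (UINT64_C(1) << 60) /* assumed for illustration */

static unsigned int handshakes_started;

/* Hypothetical stand-in for queueing a handshake initiation. */
static void start_handshake(void)
{
	++handshakes_started;
}

static void keep_key_fresh(bool i_am_the_initiator, uint64_t key_age_secs,
			   uint64_t sending_counter)
{
	/* No likely/unlikely nested inside the expression itself. */
	bool send = sending_counter > REKEY_AFTER_MESSAGES ||
		    (i_am_the_initiator &&
		     key_age_secs >= REKEY_AFTER_TIME_SECS);

	/* One explicit hint, where the decision is actually taken. */
	if (unlikely(send))
		start_handshake();
}
```

The hint only influences code layout; the function behaves the same with or without it.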
2020-04-28	receive: use tunnel helpers for decapsulating ECN markings	Toke Høiland-Jørgensen
	WireGuard currently only propagates ECN markings on tunnel decap according to the old RFC3168 specification. However, the spec has since been updated in RFC6040 to recommend slightly different decapsulation semantics. This was implemented in the kernel as a set of common helpers for ECN decapsulation, so let's just switch over WireGuard to using those, so it can benefit from this enhancement and any future tweaks. We do not drop packets with invalid ECN marking combinations, because WireGuard is frequently used to work around broken ISPs, which could be doing that. Reported-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Cc: Dave Taht <dave.taht@gmail.com> Cc: Rodney W. Grimes <ietf@gndrsh.dnsmgr.net> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
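For readers unfamiliar with the RFC 6040 decapsulation rules, the following stand-alone sketch encodes the outcome table as I understand it. It does not reproduce the kernel's helpers; `ecn_decapsulate` and the enum names are hypothetical. In line with the message above, the one combination that RFC 6040 says should be dropped (CE on the outer header with a Not-ECT inner header) is only flagged here, not dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two-bit ECN field values as defined by RFC 3168. */
enum ecn {
	ECN_NOT_ECT = 0,
	ECN_ECT_1   = 1,
	ECN_ECT_0   = 2,
	ECN_CE      = 3,
};

/*
 * Returns the ECN value to place on the decapsulated inner packet.
 * *invalid is set for the outer=CE / inner=Not-ECT combination.
 */
static uint8_t ecn_decapsulate(uint8_t outer, uint8_t inner, bool *invalid)
{
	*invalid = false;

	if (inner == ECN_NOT_ECT) {
		/* The inner flow is not ECN-capable, so it must stay unmarked. */
		*invalid = (outer == ECN_CE);
		return ECN_NOT_ECT;
	}
	if (outer == ECN_CE)
		return ECN_CE;    /* congestion seen in the tunnel propagates */
	if (outer == ECN_ECT_1 && inner == ECN_ECT_0)
		return ECN_ECT_1; /* the RFC 6040 change relative to RFC 3168 */
	return inner;             /* otherwise the inner marking is kept */
}
```

A caller would write the returned value into the inner header's ECN bits and, under the policy described above, forward the packet even when `*invalid` is set.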
2020-03-17	receive: remove dead code from default packet type case	Jason A. Donenfeld
	The situation in which we wind up hitting the default case here indicates a major bug in earlier parsing code. It is not a usual thing that should ever happen, which means a "friendly" message for it doesn't make sense. Rather, replace this with a WARN_ON, just like we do earlier in the file for a similar situation, so that somebody sends us a bug report and we can fix it. Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-17	wireguard: queueing: account for skb->protocol==0	Jason A. Donenfeld
	We carry out checks to the effect of: if (skb->protocol != wg_examine_packet_protocol(skb)) goto err; By having wg_skb_examine_untrusted_ip_hdr return 0 on failure, this means that the check above still passes in the case where skb->protocol is zero, which is possible to hit with AF_PACKET: struct sockaddr_pkt saddr = { .spkt_device = "wg0" }; unsigned char buffer[5] = { 0 }; sendto(socket(AF_PACKET, SOCK_PACKET, /* skb->protocol = */ 0), buffer, sizeof(buffer), 0, (const struct sockaddr *)&saddr, sizeof(saddr)); Additional checks mean that this isn't actually a problem in the code base, but I could imagine it becoming a problem later if the function is used more liberally. I would prefer to fix this by having wg_examine_packet_protocol return a 32-bit ~0 value on failure, which will never match any value of skb->protocol, which would simply change the generated code from a mov to a movzx. However, sparse complains, and adding __force casts doesn't seem like a good idea, so instead we just add a simple helper function to check for the zero return value. Since wg_examine_packet_protocol itself gets inlined, this winds up not adding an additional branch to the generated code, since the 0 return value already happens in a mergeable branch. Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
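The pitfall and the helper-based fix can be illustrated in user space. The functions below are hypothetical simplifications: the examiner looks only at the IP version nibble and a minimum length, and protocol values are compared in host byte order, which the kernel code does not do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PROTO_IPV4 0x0800u /* same numeric values as ETH_P_IP / ETH_P_IPV6 */
#define PROTO_IPV6 0x86DDu

/* Returns the protocol implied by the packet contents, or 0 on failure. */
static uint16_t examine_packet_protocol(const uint8_t *data, size_t len)
{
	if (len >= 20 && (data[0] >> 4) == 4)
		return PROTO_IPV4;
	if (len >= 40 && (data[0] >> 4) == 6)
		return PROTO_IPV6;
	return 0;
}

/* Naive comparison: wrongly succeeds when both sides are 0. */
static bool naive_check(uint16_t claimed, const uint8_t *data, size_t len)
{
	return claimed == examine_packet_protocol(data, len);
}

/* Helper that also rejects the 0 "could not parse" result. */
static bool check_packet_protocol(uint16_t claimed, const uint8_t *data,
				  size_t len)
{
	uint16_t real = examine_packet_protocol(data, len);

	return real && claimed == real;
}
```

With a five-byte all-zero buffer and a claimed protocol of 0, as in the AF_PACKET example above, `naive_check` passes while `check_packet_protocol` rejects the packet.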
2020-02-13	receive: reset last_under_load to zero	Jason A. Donenfeld
	This is a small optimization that prevents more expensive comparisons from happening when they are no longer necessary, by clearing the last_under_load variable whenever we wind up in a state where we were under load but we no longer are. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Suggested-by: Matt Dunwoodie <ncon@noconroy.net>
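A small sketch of the idea, with hypothetical names and an assumed one-second grace period: once the grace period has lapsed, the timestamp is cleared, so later calls skip the time comparison entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

struct load_state {
	uint64_t last_under_load_ns; /* 0 means "not recently under load" */
};

/*
 * queue_is_busy is the cheap instantaneous signal. The result stays true
 * for one second (assumed) after the last busy observation.
 */
static bool is_under_load(struct load_state *st, bool queue_is_busy,
			  uint64_t now_ns)
{
	bool under_load = queue_is_busy;

	if (under_load) {
		st->last_under_load_ns = now_ns;
	} else if (st->last_under_load_ns) {
		under_load = now_ns - st->last_under_load_ns < NSEC_PER_SEC;
		if (!under_load)
			st->last_under_load_ns = 0; /* skip this branch next time */
	}
	return under_load;
}
```

After the reset, the common idle case costs one test of `last_under_load_ns` against zero instead of a timestamp subtraction and comparison.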
2019-12-12	global: fix up spelling	Josh Soref
	Signed-off-by: Josh Soref <jsoref@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-12-05	chacha20poly1305: port to sgmitter for 5.5	Jason A. Donenfeld
	I'm not totally comfortable with these changes yet, and it'll require some more scrutiny. But it's a start. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-07-02	receive: queue dead packets to napi queue instead of empty rx_queue	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-06-25	global: switch to coarse ktime	Jason A. Donenfeld
	Coarse ktime is broken until [1] in 5.2 and kernels without the backport, so we use fallback code there. The fallback code has also been improved significantly. It now only uses slower clocks on kernels < 3.17, at the expense of some accuracy we're not overly concerned about. [1] https://lore.kernel.org/lkml/tip-e3ff9c3678b4d80e22d2557b68726174578eaf52@git.kernel.org/ Suggested-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-03	hashtables: decouple hashtable allocations from the main device allocation	Sultan Alsawaf
	The hashtable allocations are quite large, and cause the device allocation in the net framework to stall sometimes while it tries to find a contiguous region that can fit the device struct: [<0000000000000000>] __switch_to+0x94/0xb8 [<0000000000000000>] __alloc_pages_nodemask+0x764/0x7e8 [<0000000000000000>] kmalloc_order+0x20/0x40 [<0000000000000000>] __kmalloc+0x144/0x1a0 [<0000000000000000>] alloc_netdev_mqs+0x5c/0x368 [<0000000000000000>] rtnl_create_link+0x48/0x180 [<0000000000000000>] rtnl_newlink+0x410/0x708 [<0000000000000000>] rtnetlink_rcv_msg+0x190/0x1f8 [<0000000000000000>] netlink_rcv_skb+0x4c/0xf8 [<0000000000000000>] rtnetlink_rcv+0x30/0x40 [<0000000000000000>] netlink_unicast+0x18c/0x208 [<0000000000000000>] netlink_sendmsg+0x19c/0x348 [<0000000000000000>] sock_sendmsg+0x3c/0x58 [<0000000000000000>] ___sys_sendmsg+0x290/0x2b0 [<0000000000000000>] __sys_sendmsg+0x58/0xa0 [<0000000000000000>] SyS_sendmsg+0x10/0x20 [<0000000000000000>] el0_svc_naked+0x34/0x38 [<0000000000000000>] 0xffffffffffffffff To fix the allocation stalls, decouple the hashtable allocations from the device allocation and allocate the hashtables with kvmalloc's implicit __GFP_NORETRY so that the allocations fall back to vmalloc with little resistance. Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-01-07	global: update copyright	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-11-13	global: various formatting tweaks	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-11-05	compat: csum_levels is new in 3.18 but backported to RHEL	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27	receive: assume all levels have been checksummed, not just outer	Jason A. Donenfeld
	This means we do less computation on encapsulated payloads. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-25	global: do not allow compiler to reorder is_valid or is_dead	Jason A. Donenfeld
	Suggested-by: Jann Horn <jann@thejh.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-09	global: give if statements brackets and other cleanups	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08	global: more nits	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08	global: rename struct wireguard_ to struct wg_	Jason A. Donenfeld
	This required a bit of pruning of our christmas trees. Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08	global: prefix functions used in callbacks with wg_	Jason A. Donenfeld
	Suggested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-07	global: style nits	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-06	global: rename include'd C files to be .c	Jason A. Donenfeld
	This is done by 259 other files in the kernel tree: linux $ rg '#include.*\.c' -l | wc -l 259 Suggested-by: Sultan Alsawaf <sultanxda@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-02	global: prefix all functions with wg_	Jason A. Donenfeld
	I understand why this must be done, though I'm not so happy about having to do it. In some places, it puts us over 80 chars and we have to break lines up in further ugly ways. And in general, I think this makes things harder to read. Yet another thing we must do to please upstream. Maybe this can be replaced in the future by some kind of automatic module namespacing logic in the linker, or even combined with LTO and aggressive symbol stripping. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-20	global: put SPDX identifier on its own line	Jason A. Donenfeld
	The kernel has very specific rules correlating file type with comment type, and also SPDX identifiers can't be merged with other comments. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-17	crypto: pass simd by reference	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-16	global: remove non-essential inline annotations	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-16	send/receive: reduce number of sg entries	Jason A. Donenfeld
	This reduces stack usage to quell warnings on powerpc. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-03	crypto: import zinc	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-28	global: run through clang-format	Jason A. Donenfeld
	This is the worst commit in the whole repo, making the code much less readable, but so it goes with upstream maintainers. We are now woefully wrapped at 80 columns. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-06	crypto: move simd context to specific type	Jason A. Donenfeld
	Suggested-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-03	peer: ensure destruction doesn't race	Jason A. Donenfeld
	Completely rework peer removal to ensure peers don't jump between contexts and create races. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-02	queueing: ensure strictly ordered loads and stores	Jason A. Donenfeld
	We don't want a consumer to read plaintext when it's supposed to be reading ciphertext, which means we need to synchronize across cores. Suggested-by: Jann Horn <jann@thejh.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
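The ordering requirement can be sketched with C11 release/acquire atomics: the producer publishes the state change only after the buffer has been transformed, and the consumer reads the buffer only after observing that state. The names and the XOR "encryption" are purely illustrative and are not taken from the commit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum packet_state { PACKET_STATE_UNCRYPTED, PACKET_STATE_CRYPTED };

struct packet {
	atomic_int state;
	uint8_t data[16];
};

/* Producer: transform the payload, then publish with release semantics. */
static void producer_encrypt(struct packet *p, uint8_t key)
{
	for (size_t i = 0; i < sizeof(p->data); ++i)
		p->data[i] ^= key; /* toy stand-in for real encryption */
	atomic_store_explicit(&p->state, PACKET_STATE_CRYPTED,
			      memory_order_release);
}

/* Consumer: touch the payload only after an acquire load sees CRYPTED. */
static bool consumer_try_take(struct packet *p, uint8_t out[16])
{
	if (atomic_load_explicit(&p->state, memory_order_acquire) !=
	    PACKET_STATE_CRYPTED)
		return false;
	memcpy(out, p->data, sizeof(p->data));
	return true;
}
```

The release store pairs with the acquire load, so a consumer on another core that sees `PACKET_STATE_CRYPTED` is guaranteed to also see the transformed bytes rather than the plaintext.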
2018-07-31	peer: simplify rcu reference counts	Jason A. Donenfeld
	Use RCU reference counts only when we must, and otherwise use a more reasonably named function. Reported-by: Jann Horn <jann@thejh.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-24	receive: check against proper return value type	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-12	receive: use gro call instead of plain call	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-11	receive: account for zero or negative budget	Jason A. Donenfeld
	Suggested-by: Thomas Gschwantner <tharre3@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-08	receive: use NAPI on the receive path	Jonathan Neuschäfer
	Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> [Jason: fixed up the flushing of the rx_queue in peer_remove] Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-04	receive: style	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23	global: use fast boottime instead of normal boottime	Jason A. Donenfeld
	Generally if we're inaccurate by a few nanoseconds, it doesn't matter. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23	global: use ktime boottime instead of jiffies	Jason A. Donenfeld
	Since this is a network protocol, expirations need to be accounted for, even across system suspend. On real systems, this isn't a problem, since we're clearing all keys before suspend. But on Android, where we don't do that, this is something of a problem. So, we switch to using boottime instead of jiffies. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
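As a user-space analogue, Linux exposes a clock that keeps counting across suspend as `CLOCK_BOOTTIME`. The sketch below uses it for an expiry check; `boottime_ns` and `birthdate_has_expired` are hypothetical helper names, and the current time is passed in explicitly so that the comparison can be exercised on its own.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for CLOCK_BOOTTIME with glibc */
#endif
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

/* Nanoseconds since boot, including time spent suspended (Linux only). */
static uint64_t boottime_ns(void)
{
	struct timespec ts;

	if (clock_gettime(CLOCK_BOOTTIME, &ts) != 0)
		return 0;
	return (uint64_t)ts.tv_sec * NSEC_PER_SEC + (uint64_t)ts.tv_nsec;
}

/* True once expiration_secs have elapsed since birthday_ns. */
static bool birthdate_has_expired(uint64_t birthday_ns,
				  uint64_t expiration_secs, uint64_t now_ns)
{
	return birthday_ns + expiration_secs * NSEC_PER_SEC <= now_ns;
}
```

A key stamped with `boottime_ns()` at creation would then be checked with `birthdate_has_expired(stamp, limit, boottime_ns())`, and time spent in suspend counts toward the limit, which a jiffies-based stamp would miss.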
2018-06-22	receive: don't toggle bh	Jason A. Donenfeld
	This had a bad performance impact. We'll probably need to revisit this later, but for now, let's not introduce a regression. Reported-by: Lonnie Abelbeck <lonnie@abelbeck.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-19	receive: drop handshake packets if rng is not initialized	Jason A. Donenfeld
	Otherwise it's too easy to trigger cookie reply messages. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-17	simd: encapsulate fpu amortization into nice functions	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-16	queueing: re-enable preemption periodically to lower latency	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-16	queueing: remove useless spinlocks on sc	Jason A. Donenfeld
	Since these are the only consumers, there's no need for locking. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-01-03	global: year bump	Jason A. Donenfeld
	Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>