path: root/src/peer.c
2021-06-04  peer: allocate in kmem_cache  (Jason A. Donenfeld)
With deployments having upwards of 600k peers now, this somewhat heavy structure could benefit from more fine-grained allocations. Specifically, instead of using a 2048-byte slab for a 1544-byte object, we can now use 1544-byte objects directly, thus saving almost 25% per-peer, or with 600k peers, that's a savings of 303 MiB. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
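A minimal kernel-style sketch of the idea, assuming a struct wg_peer defined elsewhere (as in peer.h); the helper names are illustrative, not necessarily the module's own:

    /* Dedicated slab cache sized exactly to the object, instead of the
     * generic kmalloc-2048 slab that a ~1544-byte kzalloc() would land in.
     * Sketch only; struct wg_peer is assumed to be defined elsewhere. */
    #include <linux/slab.h>

    static struct kmem_cache *peer_cache;

    static int peer_cache_init(void)
    {
        peer_cache = KMEM_CACHE(wg_peer, 0); /* packed at sizeof(struct wg_peer) */
        return peer_cache ? 0 : -ENOMEM;
    }

    static struct wg_peer *peer_alloc(void)
    {
        return kmem_cache_zalloc(peer_cache, GFP_KERNEL);
    }

    static void peer_free(struct wg_peer *peer)
    {
        kmem_cache_free(peer_cache, peer);
    }

    static void peer_cache_exit(void)
    {
        kmem_cache_destroy(peer_cache);
    }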
2021-06-02  global: use synchronize_net rather than synchronize_rcu  (Jason A. Donenfeld)
Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
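For reference, this is roughly what synchronize_net() does in net/core/dev.c (exact details vary by kernel version), which is why it is the right helper for grace-period waits that may run under rtnl:

    /* Approximate shape of the upstream helper: pick the expedited RCU
     * flavor when the caller already holds rtnl, so waiting out a grace
     * period does not stall other configuration changes. */
    void synchronize_net(void)
    {
        might_sleep();
        if (rtnl_is_locked())
            synchronize_rcu_expedited();
        else
            synchronize_rcu();
    }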
2021-02-18  queueing: get rid of per-peer ring buffers  (Jason A. Donenfeld)
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is a 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory.

In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1].

Performance-wise, ordinarily list-based queues aren't preferable to ring buffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash.

Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check whether it was full with a simple head==tail comparison, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit.

[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
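A self-contained C11 sketch of the intrusive MPSC queue described in [1]; WireGuard's actual implementation threads the links through skb->prev rather than a dedicated node struct, and uses kernel atomics rather than <stdatomic.h>, so treat the names below as illustrative only:

    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        _Atomic(struct node *) next;
    };

    struct mpsc_queue {
        _Atomic(struct node *) head;  /* swung atomically by any producer */
        struct node *tail;            /* touched only by the single consumer */
        struct node stub;             /* sentinel so the queue is never empty */
    };

    static void mpsc_init(struct mpsc_queue *q)
    {
        atomic_store(&q->stub.next, NULL);
        atomic_store(&q->head, &q->stub);
        q->tail = &q->stub;
    }

    /* Multi-producer push. The window between the exchange and the store to
     * prev->next is the tight race from the commit message above: a consumer
     * arriving in that window sees an unlinked tail and simply stops early. */
    static void mpsc_push(struct mpsc_queue *q, struct node *n)
    {
        struct node *prev;

        atomic_store(&n->next, NULL);
        prev = atomic_exchange(&q->head, n);
        atomic_store(&prev->next, n);
    }

    /* Single-consumer pop; returns NULL when empty or when it ran into a
     * producer mid-push, in which case the caller just comes back later
     * (in WireGuard, the re-queued work item provides that retry). */
    static struct node *mpsc_pop(struct mpsc_queue *q)
    {
        struct node *tail = q->tail, *head, *next;

        next = atomic_load(&tail->next);
        if (tail == &q->stub) {
            if (!next)
                return NULL;                  /* genuinely empty */
            q->tail = next;
            tail = next;
            next = atomic_load(&tail->next);
        }
        if (next) {
            q->tail = next;
            return tail;
        }
        head = atomic_load(&q->head);
        if (tail != head)
            return NULL;                      /* producer between exchange and link */
        mpsc_push(q, &q->stub);               /* re-insert sentinel to free the last node */
        next = atomic_load(&tail->next);
        if (!next)
            return NULL;
        q->tail = next;
        return tail;
    }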
2020-08-27  compat: backport kfree_sensitive and switch to it  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-18  noise: error out precomputed DH during handshake rather than config  (Jason A. Donenfeld)
We precompute the static-static ECDH during configuration time, in order to save an expensive computation later when receiving network packets. However, not all ECDH computations yield a contributory result. Prior to this, we were simply not letting those peers be added to the interface. However, this creates a strange inconsistency, since it was still possible to add other weird points, like a valid public key plus a low-order point, for which, just as with points that result in zeros, a handshake would not complete. In order to make the behavior more uniform and less surprising, simply allow all peers to be added. Then, we'll error out later when doing the crypto if there's an issue. This also adds more separation between the crypto layer and the configuration layer.
Discussed-with: Mathias Hall-Andersen <mathias@hall-andersen.dk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
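The shape of the handshake-time check looks roughly like the sketch below; the identifiers are illustrative rather than copied from noise.c. Configuration now accepts the peer even when the static-static ECDH result is all zeros, and the failure surfaces during the handshake instead, which then simply never completes.

    #include <linux/types.h>
    #include <linux/compiler.h>
    #include <crypto/algapi.h>   /* crypto_memneq(); crypto/utils.h on newer kernels */

    #define KEY_LEN 32

    static bool mix_precomputed_dh(const u8 precomputed[KEY_LEN])
    {
        static const u8 zero_point[KEY_LEN]; /* all-zero == non-contributory result */

        /* Constant-time compare; bail out of this handshake message only. */
        if (unlikely(!crypto_memneq(precomputed, zero_point, KEY_LEN)))
            return false;
        /* ... otherwise mix `precomputed` into the KDF chain as usual ... */
        return true;
    }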
2019-08-05  netlink: skip peers with invalid keys  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-07-11  noise: immediately rekey all peers after changing device private key  (Jason A. Donenfeld)
Reported-by: Derrick Pallas <derrick@pallas.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-06-28  peer: use LIST_HEAD macro  (Jason A. Donenfeld)
Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-06-25  global: switch to coarse ktime  (Jason A. Donenfeld)
Coarse ktime is broken until [1], which landed in 5.2, so we use fallback code on older kernels without the backport. The fallback code has also been improved significantly. It now only uses slower clocks on kernels < 3.17, at the expense of some accuracy we're not overly concerned about.

[1] https://lore.kernel.org/lkml/tip-e3ff9c3678b4d80e22d2557b68726174578eaf52@git.kernel.org/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-03-25  peerlookup: rename from hashtables  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-03-17  global: the _bh variety of rcu helpers have been unified  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-26  allowedips: maintain per-peer list of allowedips  (Jason A. Donenfeld)
This makes `wg show` and `wg showconf` and the like significantly faster, since we don't have to iterate through every node of the trie for every single peer. It also makes netlink cursor resumption much less problematic, since we're just iterating through a list, rather than having to save a traversal stack.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
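A rough sketch of the idea, with illustrative field and function names (the real trie node and peer structures carry more state): each trie node is also threaded onto a list owned by its peer, so dumping a peer's allowed IPs is a plain list walk instead of a full trie traversal per peer.

    #include <linux/list.h>
    #include <linux/types.h>

    struct allowedips_node {
        u8 bits[16];                  /* the network address */
        u8 cidr;                      /* and its prefix length */
        struct list_head peer_list;   /* threaded onto the owning peer's list */
        /* ... trie child pointers and RCU bookkeeping elided ... */
    };

    struct peer_stub {                /* stands in for struct wg_peer in this sketch */
        struct list_head allowedips_list;
    };

    static void node_attach(struct peer_stub *peer, struct allowedips_node *node)
    {
        list_add_tail(&node->peer_list, &peer->allowedips_list);
    }

    static void dump_allowedips(struct peer_stub *peer)
    {
        struct allowedips_node *node;

        list_for_each_entry(node, &peer->allowedips_list, peer_list) {
            /* emit node->bits / node->cidr to the netlink message here */
        }
    }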
2019-02-25  peer: only synchronize_rcu_bh and traverse trie once when removing all peers  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-03  hashtables: decouple hashtable allocations from the main device allocation  (Sultan Alsawaf)
The hashtable allocations are quite large, and cause the device allocation in the net framework to stall sometimes while it tries to find a contiguous region that can fit the device struct:

[<0000000000000000>] __switch_to+0x94/0xb8
[<0000000000000000>] __alloc_pages_nodemask+0x764/0x7e8
[<0000000000000000>] kmalloc_order+0x20/0x40
[<0000000000000000>] __kmalloc+0x144/0x1a0
[<0000000000000000>] alloc_netdev_mqs+0x5c/0x368
[<0000000000000000>] rtnl_create_link+0x48/0x180
[<0000000000000000>] rtnl_newlink+0x410/0x708
[<0000000000000000>] rtnetlink_rcv_msg+0x190/0x1f8
[<0000000000000000>] netlink_rcv_skb+0x4c/0xf8
[<0000000000000000>] rtnetlink_rcv+0x30/0x40
[<0000000000000000>] netlink_unicast+0x18c/0x208
[<0000000000000000>] netlink_sendmsg+0x19c/0x348
[<0000000000000000>] sock_sendmsg+0x3c/0x58
[<0000000000000000>] ___sys_sendmsg+0x290/0x2b0
[<0000000000000000>] __sys_sendmsg+0x58/0xa0
[<0000000000000000>] SyS_sendmsg+0x10/0x20
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

To fix the allocation stalls, decouple the hashtable allocations from the device allocation and allocate the hashtables with kvmalloc's implicit __GFP_NORETRY so that the allocations fall back to vmalloc with little resistance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
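A hedged sketch of what the decoupled allocation might look like; the structure layout and bucket count below are illustrative, not the module's exact definitions. kvzalloc() first attempts a physically contiguous allocation without retrying hard and then falls back to vmalloc, so bringing up a device no longer stalls hunting for a large contiguous region.

    #include <linux/slab.h>
    #include <linux/mm.h>          /* kvzalloc()/kvfree(); header varies by kernel version */
    #include <linux/hashtable.h>
    #include <linux/siphash.h>
    #include <linux/mutex.h>
    #include <linux/random.h>

    struct pubkey_hashtable {
        DECLARE_HASHTABLE(hashtable, 11);  /* bucket count illustrative; this is the large part */
        siphash_key_t key;
        struct mutex lock;
    };

    static struct pubkey_hashtable *pubkey_hashtable_alloc(void)
    {
        struct pubkey_hashtable *table = kvzalloc(sizeof(*table), GFP_KERNEL);

        if (!table)
            return NULL;
        get_random_bytes(&table->key, sizeof(table->key));
        hash_init(table->hashtable);
        mutex_init(&table->lock);
        return table;
    }

    static void pubkey_hashtable_free(struct pubkey_hashtable *table)
    {
        kvfree(table);
    }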
2019-01-07  global: update copyright  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27  send: consider dropped stage packets to be dropped  (Jason A. Donenfeld)
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-25  peer: another peer_remove cleanup  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08  global: more nits  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08  global: rename struct wireguard_ to struct wg_  (Jason A. Donenfeld)
This required a bit of pruning of our christmas trees.
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-02  global: prefix all functions with wg_  (Jason A. Donenfeld)
I understand why this must be done, though I'm not so happy about having to do it. In some places, it puts us over 80 chars and we have to break lines up in further ugly ways. And in general, I think this makes things harder to read. Yet another thing we must do to please upstream. Maybe this can be replaced in the future by some kind of automatic module namespacing logic in the linker, or even combined with LTO and aggressive symbol stripping.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-20  global: put SPDX identifier on its own line  (Jason A. Donenfeld)
The kernel has very specific rules correlating file type with comment type, and also SPDX identifiers can't be merged with other comments.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
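For a .c file like peer.c, the kernel's licensing rules call for the C99-style comment on the very first line, with the copyright notice kept in a separate comment block below it; the header shown here is an illustrative example rather than a quote of the file.

    // SPDX-License-Identifier: GPL-2.0
    /*
     * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
     */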
2018-09-04  global: always find OOM unlikely  (Jason A. Donenfeld)
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-04  global: prefer sizeof(*pointer) when possible  (Jason A. Donenfeld)
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
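The two style points from the entries above, shown together in a small illustrative allocation path (names assumed, not quoted from the source): the allocation size is tied to the pointer it fills, and the out-of-memory branch is annotated as unlikely.

    #include <linux/slab.h>
    #include <linux/err.h>

    static struct wg_peer *example_alloc(void)
    {
        struct wg_peer *peer;

        peer = kzalloc(sizeof(*peer), GFP_KERNEL);  /* not sizeof(struct wg_peer) */
        if (unlikely(!peer))                        /* OOM is always treated as unlikely */
            return ERR_PTR(-ENOMEM);
        return peer;
    }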
2018-08-28  global: run through clang-format  (Jason A. Donenfeld)
This is the worst commit in the whole repo, making the code much less readable, but so it goes with upstream maintainers. We are now woefully wrapped at 80 columns.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-04  send: switch handshake stamp to an atomic  (Jason A. Donenfeld)
Rather than abusing the handshake lock, we're much better off just using a boring atomic64 for this. It's simpler and performs better. Also, while we're at it, we set the handshake stamp both before and after the calculations, in case the calculations block for a really long time waiting for the RNG to initialize. Otherwise it's possible that when the RNG finally initializes, two handshakes are sent back to back, which isn't sensible.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
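A sketch of the pattern, with illustrative names (the real code keys the stamp off the peer and uses the module's own clock helpers): a plain atomic64 replaces the handshake lock for the timestamp, and it is written both before and after the potentially RNG-blocked work so a second initiation cannot fire back to back when the RNG finally comes up.

    #include <linux/atomic.h>
    #include <linux/ktime.h>

    static atomic64_t last_sent_handshake;

    static u64 stamp_now(void)                            /* assumed helper for this sketch */
    {
        return ktime_to_ns(ktime_get_boottime());
    }

    static bool initiation_due(u64 rekey_timeout_ns)
    {
        return stamp_now() >= (u64)atomic64_read(&last_sent_handshake) + rekey_timeout_ns;
    }

    static void send_initiation(void)
    {
        atomic64_set(&last_sent_handshake, stamp_now());  /* before: block re-entry */
        /* ... derive keys and transmit; may stall a long time on the RNG ... */
        atomic64_set(&last_sent_handshake, stamp_now());  /* after: restart the window */
    }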
2018-08-03  peer: ensure destruction doesn't race  (Jason A. Donenfeld)
Completely rework peer removal to ensure peers don't jump between contexts and create races.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-01  peer: ensure resources are freed when creation fails  (Jason A. Donenfeld)
And in general tighten up the logic of peer creation.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-31  peer: simplify rcu reference counts  (Jason A. Donenfeld)
Use RCU reference counts only when we must, and otherwise use a more reasonably named function.
Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-18  receive: disable NAPI busy polling  (Thomas Gschwantner)
This avoids adding one reference per peer to the napi_hash hashtable, as normally done by netif_napi_add(). Since we could potentially have up to 2^20 peers, this would make busy polling very slow globally. This approach is preferable to having only a single napi struct, because we get one gro_list per peer, which means packets can be combined nicely even if we have a large number of peers. This is also done by gro_cells_init() in net/core/gro_cells.c.
Signed-off-by: Thomas Gschwantner <tharre3@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-08  receive: use NAPI on the receive path  (Jonathan Neuschäfer)
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
[Jason: fixed up the flushing of the rx_queue in peer_remove]
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23  global: use fast boottime instead of normal boottime  (Jason A. Donenfeld)
Generally if we're inaccurate by a few nanoseconds, it doesn't matter.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23  global: use ktime boottime instead of jiffies  (Jason A. Donenfeld)
Since this is a network protocol, expirations need to be accounted for, even across system suspend. On real systems, this isn't a problem, since we're clearing all keys before suspend. But on Android, where we don't do that, this is something of a problem. So, we switch to using boottime instead of jiffies.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
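The practical difference, shown in an illustrative expiry check (names and parameters are assumptions for the sketch): jiffies stops counting while the system is suspended, whereas the boottime clock keeps advancing, so protocol timeouts keep their meaning after a phone sleeps.

    #include <linux/ktime.h>
    #include <linux/types.h>

    static bool keypair_expired(u64 birthdate_ns, u64 reject_after_time_ns)
    {
        u64 now = ktime_to_ns(ktime_get_boottime());  /* includes time spent suspended */

        return now - birthdate_ns > reject_after_time_ns;
    }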
2018-01-03  global: year bump  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-12-09  global: add SPDX tags to all files  (Greg Kroah-Hartman)
It's good to have SPDX identifiers in all files, as the Linux kernel developers are working to add these identifiers to all files. Update all files with the correct SPDX license identifier based on the license text of the project or based on the license in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boilerplate text.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Modified-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-29  device: clear last handshake timer on ifdown  (Jason A. Donenfeld)
Otherwise new handshakes might not occur immediately when the interface goes up and down. Also initialize peers with a properly zeroed handshake-jiffies value.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-10  allowedips: rename from routingtable  (Jason A. Donenfeld)
Makes it more clear that this is _not_ a routing table replacement.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-02  global: revert checkpatch.pl changes  (Jason A. Donenfeld)
These changes were suggested by checkpatch.pl, but actually cause big problems depending on the options. Revert.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31  global: style nits  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31  global: infuriating kernel iterator style  (Jason A. Donenfeld)
One types:

    for (i = 0 ...

So one should also type:

    for_each_obj (obj ...

But the upstream kernel style guidelines are insane, and so we must instead do:

    for_each_obj(obj ...

Ugly, but one must choose his battles wisely.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31  peer: store total number of peers instead of iterating  (Jason A. Donenfeld)
This is faster, since it means adding a new peer is O(1) instead of O(n). It's also safe to do because we're holding the device_update_lock on both the ++ and the --.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
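A minimal sketch of the counter, assuming a device structure with the device_update_lock mutex mentioned above; the num_peers field name is illustrative. Because both increment and decrement happen under that lock, a plain integer suffices and reading the count is O(1).

    #include <linux/mutex.h>
    #include <linux/lockdep.h>

    struct wg_device_stub {                 /* stands in for struct wg_device here */
        struct mutex device_update_lock;
        unsigned int num_peers;
    };

    static void peer_count_inc(struct wg_device_stub *wg)
    {
        lockdep_assert_held(&wg->device_update_lock);
        ++wg->num_peers;
    }

    static void peer_count_dec(struct wg_device_stub *wg)
    {
        lockdep_assert_held(&wg->device_update_lock);
        --wg->num_peers;
    }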
2017-10-31  global: accept decent checkpatch.pl suggestions  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-09  routingtable: only use device's mutex, not a special rt one  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-05  queueing: use ptr_ring instead of linked lists  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-03  global: add space around variable declarations  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  noise: use spinlock for rotating keys  (Jason A. Donenfeld)
This should only really be contended in extremely exceptional cases, so changing from a mutex to a spinlock is likely fine.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  peer: remove from RCU lists when the kref is zero  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  peer: ensure that lookup tables are added last  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  netlink: switch from ioctl to netlink for configuration  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-24  timers: convert to use netif_running  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-18  queue: entirely rework parallel system  (Jason A. Donenfeld)
This removes our dependency on padata and moves to a different mode of multiprocessing that is more efficient. This began as Samuel Holland's GSoC project and was gradually reworked/redesigned/rebased into this present commit, which is a combination of his initial contribution and my subsequent rewriting and redesigning.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>