path: root/src/peer.c
2021-06-04  peer: allocate in kmem_cache  (Jason A. Donenfeld)
With deployments having upwards of 600k peers now, this somewhat heavy structure could benefit from more fine-grained allocations. Specifically, instead of using a 2048-byte slab for a 1544-byte object, we can now use 1544-byte objects directly, thus saving almost 25% per-peer, or with 600k peers, that's a savings of 303 MiB. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
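A minimal kernel-style sketch of the idea, assuming a struct wg_peer defined elsewhere (as in peer.h); the helper names are illustrative, not necessarily the module's own:

    /* Dedicated slab cache sized exactly to the object, instead of the
     * generic kmalloc-2048 slab that a ~1544-byte kzalloc() would land in.
     * Sketch only; struct wg_peer is assumed to be defined elsewhere. */
    #include <linux/slab.h>

    static struct kmem_cache *peer_cache;

    static int peer_cache_init(void)
    {
        peer_cache = KMEM_CACHE(wg_peer, 0); /* packed at sizeof(struct wg_peer) */
        return peer_cache ? 0 : -ENOMEM;
    }

    static struct wg_peer *peer_alloc(void)
    {
        return kmem_cache_zalloc(peer_cache, GFP_KERNEL);
    }

    static void peer_free(struct wg_peer *peer)
    {
        kmem_cache_free(peer_cache, peer);
    }

    static void peer_cache_exit(void)
    {
        kmem_cache_destroy(peer_cache);
    }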
2021-06-02  global: use synchronize_net rather than synchronize_rcu  (Jason A. Donenfeld)
Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
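For reference, this is roughly what synchronize_net() does in net/core/dev.c (exact details vary by kernel version), which is why it is the right helper for grace-period waits that may run under rtnl:

    /* Approximate shape of the upstream helper: pick the expedited RCU
     * flavor when the caller already holds rtnl, so waiting out a grace
     * period does not stall other configuration changes. */
    void synchronize_net(void)
    {
        might_sleep();
        if (rtnl_is_locked())
            synchronize_rcu_expedited();
        else
            synchronize_rcu();
    }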
2021-02-18  queueing: get rid of per-peer ring buffers  (Jason A. Donenfeld)
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is a 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory.

In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm.

The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1].

Performance-wise, ordinarily list-based queues aren't preferable to ring buffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all in all the difference appears to be a wash.

Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check whether it was full with a simple head==tail comparison, which the list-based approach cannot do. But all in all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit.

[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
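A self-contained C11 sketch of the intrusive MPSC queue described in [1]; WireGuard's actual implementation threads the links through skb->prev rather than a dedicated node struct, and uses kernel atomics rather than <stdatomic.h>, so treat the names below as illustrative only:

    #include <stdatomic.h>
    #include <stddef.h>

    struct node {
        _Atomic(struct node *) next;
    };

    struct mpsc_queue {
        _Atomic(struct node *) head;  /* swung atomically by any producer */
        struct node *tail;            /* touched only by the single consumer */
        struct node stub;             /* sentinel so the queue is never empty */
    };

    static void mpsc_init(struct mpsc_queue *q)
    {
        atomic_store(&q->stub.next, NULL);
        atomic_store(&q->head, &q->stub);
        q->tail = &q->stub;
    }

    /* Multi-producer push. The window between the exchange and the store to
     * prev->next is the tight race from the commit message above: a consumer
     * arriving in that window sees an unlinked tail and simply stops early. */
    static void mpsc_push(struct mpsc_queue *q, struct node *n)
    {
        struct node *prev;

        atomic_store(&n->next, NULL);
        prev = atomic_exchange(&q->head, n);
        atomic_store(&prev->next, n);
    }

    /* Single-consumer pop; returns NULL when empty or when it ran into a
     * producer mid-push, in which case the caller just comes back later
     * (in WireGuard, the re-queued work item provides that retry). */
    static struct node *mpsc_pop(struct mpsc_queue *q)
    {
        struct node *tail = q->tail, *head, *next;

        next = atomic_load(&tail->next);
        if (tail == &q->stub) {
            if (!next)
                return NULL;                  /* genuinely empty */
            q->tail = next;
            tail = next;
            next = atomic_load(&tail->next);
        }
        if (next) {
            q->tail = next;
            return tail;
        }
        head = atomic_load(&q->head);
        if (tail != head)
            return NULL;                      /* producer between exchange and link */
        mpsc_push(q, &q->stub);               /* re-insert sentinel to free the last node */
        next = atomic_load(&tail->next);
        if (!next)
            return NULL;
        q->tail = next;
        return tail;
    }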
2020-08-27  compat: backport kfree_sensitive and switch to it  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-03-18  noise: error out precomputed DH during handshake rather than config  (Jason A. Donenfeld)
We precompute the static-static ECDH during configuration time, in order to save an expensive computation later when receiving network packets. However, not all ECDH computations yield a contributory result. Prior to this, we were simply not letting those peers be added to the interface. However, this creates a strange inconsistency, since it was still possible to add other weird points, like a valid public key plus a low-order point, for which, just as with points that result in zeros, a handshake would not complete. In order to make the behavior more uniform and less surprising, simply allow all peers to be added. Then, we'll error out later when doing the crypto if there's an issue. This also adds more separation between the crypto layer and the configuration layer.
Discussed-with: Mathias Hall-Andersen <mathias@hall-andersen.dk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
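The shape of the handshake-time check looks roughly like the sketch below; the identifiers are illustrative rather than copied from noise.c. Configuration now accepts the peer even when the static-static ECDH result is all zeros, and the failure surfaces during the handshake instead, which then simply never completes.

    #include <linux/types.h>
    #include <linux/compiler.h>
    #include <crypto/algapi.h>   /* crypto_memneq(); crypto/utils.h on newer kernels */

    #define KEY_LEN 32

    static bool mix_precomputed_dh(const u8 precomputed[KEY_LEN])
    {
        static const u8 zero_point[KEY_LEN]; /* all-zero == non-contributory result */

        /* Constant-time compare; bail out of this handshake message only. */
        if (unlikely(!crypto_memneq(precomputed, zero_point, KEY_LEN)))
            return false;
        /* ... otherwise mix `precomputed` into the KDF chain as usual ... */
        return true;
    }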
2019-08-05  netlink: skip peers with invalid keys  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-07-11  noise: immediately rekey all peers after changing device private key  (Jason A. Donenfeld)
Reported-by: Derrick Pallas <derrick@pallas.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-06-28  peer: use LIST_HEAD macro  (Jason A. Donenfeld)
Suggested-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-06-25  global: switch to coarse ktime  (Jason A. Donenfeld)
Coarse ktime is broken until [1], which landed in 5.2, so we use fallback code on older kernels without the backport. The fallback code has also been improved significantly. It now only uses slower clocks on kernels < 3.17, at the expense of some accuracy we're not overly concerned about.

[1] https://lore.kernel.org/lkml/tip-e3ff9c3678b4d80e22d2557b68726174578eaf52@git.kernel.org/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-03-25  peerlookup: rename from hashtables  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-03-17  global: the _bh variety of rcu helpers have been unified  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-26  allowedips: maintain per-peer list of allowedips  (Jason A. Donenfeld)
This makes `wg show` and `wg showconf` and the like significantly faster, since we don't have to iterate through every node of the trie for every single peer. It also makes netlink cursor resumption much less problematic, since we're just iterating through a list, rather than having to save a traversal stack.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
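A rough sketch of the idea, with illustrative field and function names (the real trie node and peer structures carry more state): each trie node is also threaded onto a list owned by its peer, so dumping a peer's allowed IPs is a plain list walk instead of a full trie traversal per peer.

    #include <linux/list.h>
    #include <linux/types.h>

    struct allowedips_node {
        u8 bits[16];                  /* the network address */
        u8 cidr;                      /* and its prefix length */
        struct list_head peer_list;   /* threaded onto the owning peer's list */
        /* ... trie child pointers and RCU bookkeeping elided ... */
    };

    struct peer_stub {                /* stands in for struct wg_peer in this sketch */
        struct list_head allowedips_list;
    };

    static void node_attach(struct peer_stub *peer, struct allowedips_node *node)
    {
        list_add_tail(&node->peer_list, &peer->allowedips_list);
    }

    static void dump_allowedips(struct peer_stub *peer)
    {
        struct allowedips_node *node;

        list_for_each_entry(node, &peer->allowedips_list, peer_list) {
            /* emit node->bits / node->cidr to the netlink message here */
        }
    }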
2019-02-25  peer: only synchronize_rcu_bh and traverse trie once when removing all peers  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2019-02-03  hashtables: decouple hashtable allocations from the main device allocation  (Sultan Alsawaf)
The hashtable allocations are quite large, and cause the device allocation in the net framework to stall sometimes while it tries to find a contiguous region that can fit the device struct:

[<0000000000000000>] __switch_to+0x94/0xb8
[<0000000000000000>] __alloc_pages_nodemask+0x764/0x7e8
[<0000000000000000>] kmalloc_order+0x20/0x40
[<0000000000000000>] __kmalloc+0x144/0x1a0
[<0000000000000000>] alloc_netdev_mqs+0x5c/0x368
[<0000000000000000>] rtnl_create_link+0x48/0x180
[<0000000000000000>] rtnl_newlink+0x410/0x708
[<0000000000000000>] rtnetlink_rcv_msg+0x190/0x1f8
[<0000000000000000>] netlink_rcv_skb+0x4c/0xf8
[<0000000000000000>] rtnetlink_rcv+0x30/0x40
[<0000000000000000>] netlink_unicast+0x18c/0x208
[<0000000000000000>] netlink_sendmsg+0x19c/0x348
[<0000000000000000>] sock_sendmsg+0x3c/0x58
[<0000000000000000>] ___sys_sendmsg+0x290/0x2b0
[<0000000000000000>] __sys_sendmsg+0x58/0xa0
[<0000000000000000>] SyS_sendmsg+0x10/0x20
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff

To fix the allocation stalls, decouple the hashtable allocations from the device allocation and allocate the hashtables with kvmalloc's implicit __GFP_NORETRY so that the allocations fall back to vmalloc with little resistance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
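A hedged sketch of what the decoupled allocation might look like; the structure layout and bucket count below are illustrative, not the module's exact definitions. kvzalloc() first attempts a physically contiguous allocation without retrying hard and then falls back to vmalloc, so bringing up a device no longer stalls hunting for a large contiguous region.

    #include <linux/slab.h>
    #include <linux/mm.h>          /* kvzalloc()/kvfree(); header varies by kernel version */
    #include <linux/hashtable.h>
    #include <linux/siphash.h>
    #include <linux/mutex.h>
    #include <linux/random.h>

    struct pubkey_hashtable {
        DECLARE_HASHTABLE(hashtable, 11);  /* bucket count illustrative; this is the large part */
        siphash_key_t key;
        struct mutex lock;
    };

    static struct pubkey_hashtable *pubkey_hashtable_alloc(void)
    {
        struct pubkey_hashtable *table = kvzalloc(sizeof(*table), GFP_KERNEL);

        if (!table)
            return NULL;
        get_random_bytes(&table->key, sizeof(table->key));
        hash_init(table->hashtable);
        mutex_init(&table->lock);
        return table;
    }

    static void pubkey_hashtable_free(struct pubkey_hashtable *table)
    {
        kvfree(table);
    }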
2019-01-07  global: update copyright  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-27  send: consider dropped stage packets to be dropped  (Jason A. Donenfeld)
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-25  peer: another peer_remove cleanup  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08  global: more nits  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-08  global: rename struct wireguard_ to struct wg_  (Jason A. Donenfeld)
This required a bit of pruning of our christmas trees.
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-10-02  global: prefix all functions with wg_  (Jason A. Donenfeld)
I understand why this must be done, though I'm not so happy about having to do it. In some places, it puts us over 80 chars and we have to break lines up in further ugly ways. And in general, I think this makes things harder to read. Yet another thing we must do to please upstream. Maybe this can be replaced in the future by some kind of automatic module namespacing logic in the linker, or even combined with LTO and aggressive symbol stripping.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-20  global: put SPDX identifier on its own line  (Jason A. Donenfeld)
The kernel has very specific rules correlating file type with comment type, and also SPDX identifiers can't be merged with other comments.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
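For a .c file like peer.c, the kernel's licensing rules call for the C99-style comment on the very first line, with the copyright notice kept in a separate comment block below it; the header shown here is an illustrative example rather than a quote of the file.

    // SPDX-License-Identifier: GPL-2.0
    /*
     * Copyright (C) 2015-2018 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved.
     */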
2018-09-04  global: always find OOM unlikely  (Jason A. Donenfeld)
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-09-04  global: prefer sizeof(*pointer) when possible  (Jason A. Donenfeld)
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
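The two style points from the entries above, shown together in a small illustrative allocation path (names assumed, not quoted from the source): the allocation size is tied to the pointer it fills, and the out-of-memory branch is annotated as unlikely.

    #include <linux/slab.h>
    #include <linux/err.h>

    static struct wg_peer *example_alloc(void)
    {
        struct wg_peer *peer;

        peer = kzalloc(sizeof(*peer), GFP_KERNEL);  /* not sizeof(struct wg_peer) */
        if (unlikely(!peer))                        /* OOM is always treated as unlikely */
            return ERR_PTR(-ENOMEM);
        return peer;
    }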
2018-08-28  global: run through clang-format  (Jason A. Donenfeld)
This is the worst commit in the whole repo, making the code much less readable, but so it goes with upstream maintainers. We are now woefully wrapped at 80 columns.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-04  send: switch handshake stamp to an atomic  (Jason A. Donenfeld)
Rather than abusing the handshake lock, we're much better off just using a boring atomic64 for this. It's simpler and performs better. Also, while we're at it, we set the handshake stamp both before and after the calculations, in case the calculations block for a really long time waiting for the RNG to initialize. Otherwise it's possible that when the RNG finally initializes, two handshakes are sent back to back, which isn't sensible.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
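A sketch of the pattern, with illustrative names (the real code keys the stamp off the peer and uses the module's own clock helpers): a plain atomic64 replaces the handshake lock for the timestamp, and it is written both before and after the potentially RNG-blocked work so a second initiation cannot fire back to back when the RNG finally comes up.

    #include <linux/atomic.h>
    #include <linux/ktime.h>

    static atomic64_t last_sent_handshake;

    static u64 stamp_now(void)                            /* assumed helper for this sketch */
    {
        return ktime_to_ns(ktime_get_boottime());
    }

    static bool initiation_due(u64 rekey_timeout_ns)
    {
        return stamp_now() >= (u64)atomic64_read(&last_sent_handshake) + rekey_timeout_ns;
    }

    static void send_initiation(void)
    {
        atomic64_set(&last_sent_handshake, stamp_now());  /* before: block re-entry */
        /* ... derive keys and transmit; may stall a long time on the RNG ... */
        atomic64_set(&last_sent_handshake, stamp_now());  /* after: restart the window */
    }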
2018-08-03  peer: ensure destruction doesn't race  (Jason A. Donenfeld)
Completely rework peer removal to ensure peers don't jump between contexts and create races.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-08-01  peer: ensure resources are freed when creation fails  (Jason A. Donenfeld)
And in general tighten up the logic of peer creation.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-31  peer: simplify rcu reference counts  (Jason A. Donenfeld)
Use RCU reference counts only when we must, and otherwise use a more reasonably named function.
Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-18  receive: disable NAPI busy polling  (Thomas Gschwantner)
This avoids adding one reference per peer to the napi_hash hashtable, as normally done by netif_napi_add(). Since we could potentially have up to 2^20 peers, this would make busy polling very slow globally. This approach is preferable to having only a single napi struct, because we get one gro_list per peer, which means packets can be combined nicely even if we have a large number of peers. This is also done by gro_cells_init() in net/core/gro_cells.c.
Signed-off-by: Thomas Gschwantner <tharre3@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-07-08  receive: use NAPI on the receive path  (Jonathan Neuschäfer)
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
[Jason: fixed up the flushing of the rx_queue in peer_remove]
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23  global: use fast boottime instead of normal boottime  (Jason A. Donenfeld)
Generally if we're inaccurate by a few nanoseconds, it doesn't matter.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2018-06-23  global: use ktime boottime instead of jiffies  (Jason A. Donenfeld)
Since this is a network protocol, expirations need to be accounted for, even across system suspend. On real systems, this isn't a problem, since we're clearing all keys before suspend. But on Android, where we don't do that, this is something of a problem. So, we switch to using boottime instead of jiffies.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
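The practical difference, shown in an illustrative expiry check (names and parameters are assumptions for the sketch): jiffies stops counting while the system is suspended, whereas the boottime clock keeps advancing, so protocol timeouts keep their meaning after a phone sleeps.

    #include <linux/ktime.h>
    #include <linux/types.h>

    static bool keypair_expired(u64 birthdate_ns, u64 reject_after_time_ns)
    {
        u64 now = ktime_to_ns(ktime_get_boottime());  /* includes time spent suspended */

        return now - birthdate_ns > reject_after_time_ns;
    }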
2018-01-03  global: year bump  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-12-09  global: add SPDX tags to all files  (Greg Kroah-Hartman)
It's good to have SPDX identifiers in all files, as the Linux kernel developers are working to add these identifiers to all files. Update all files with the correct SPDX license identifier based on the license text of the project or based on the license in the file itself. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boilerplate text.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Modified-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-29  device: clear last handshake timer on ifdown  (Jason A. Donenfeld)
Otherwise new handshakes might not occur immediately when the interface goes up and down. Also initialize peers with a properly zeroed handshake-jiffies value.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-10  allowedips: rename from routingtable  (Jason A. Donenfeld)
Makes it more clear that this is _not_ a routing table replacement.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-11-02  global: revert checkpatch.pl changes  (Jason A. Donenfeld)
These changes were suggested by checkpatch.pl, but actually cause big problems depending on the options. Revert.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31  global: style nits  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31  global: infuriating kernel iterator style  (Jason A. Donenfeld)
One types:

    for (i = 0 ...

So one should also type:

    for_each_obj (obj ...

But the upstream kernel style guidelines are insane, and so we must instead do:

    for_each_obj(obj ...

Ugly, but one must choose his battles wisely.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-31  peer: store total number of peers instead of iterating  (Jason A. Donenfeld)
This is faster, since it means adding a new peer is O(1) instead of O(n). It's also safe to do because we're holding the device_update_lock on both the ++ and the --.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
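A minimal sketch of the counter, assuming a device structure with the device_update_lock mutex mentioned above; the num_peers field name is illustrative. Because both increment and decrement happen under that lock, a plain integer suffices and reading the count is O(1).

    #include <linux/mutex.h>
    #include <linux/lockdep.h>

    struct wg_device_stub {                 /* stands in for struct wg_device here */
        struct mutex device_update_lock;
        unsigned int num_peers;
    };

    static void peer_count_inc(struct wg_device_stub *wg)
    {
        lockdep_assert_held(&wg->device_update_lock);
        ++wg->num_peers;
    }

    static void peer_count_dec(struct wg_device_stub *wg)
    {
        lockdep_assert_held(&wg->device_update_lock);
        --wg->num_peers;
    }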
2017-10-31  global: accept decent checkpatch.pl suggestions  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-09  routingtable: only use device's mutex, not a special rt one  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-05  queueing: use ptr_ring instead of linked lists  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-03  global: add space around variable declarations  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  noise: use spinlock for rotating keys  (Jason A. Donenfeld)
This should only really be contended in extremely exceptional cases, so changing from a mutex to a spinlock is likely fine.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  peer: remove from RCU lists when the kref is zero  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  peer: ensure that lookup tables are added last  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-10-02  netlink: switch from ioctl to netlink for configuration  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-24  timers: convert to use netif_running  (Jason A. Donenfeld)
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2017-09-18  queue: entirely rework parallel system  (Jason A. Donenfeld)
This removes our dependency on padata and moves to a different mode of multiprocessing that is more efficient. This began as Samuel Holland's GSoC project and was gradually reworked/redesigned/rebased into this present commit, which is a combination of his initial contribution and my subsequent rewriting and redesigning.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>