summaryrefslogtreecommitdiffhomepage
AgeCommit message (Collapse)Author
2021-08-08crypto: curve25519-x86_64: solve register constraints with reserved registersHEADmasterMathias Krause
The register constraints for the inline assembly in fsqr() and fsqr2() are pretty tight on what the compiler may assign to the remaining three register variables. The clobber list only allows the following to be used: RDI, RSI, RBP and R12. With RAP reserving R12 and a kernel having CONFIG_FRAME_POINTER=y, claiming RBP, there are only two registers left so the compiler rightfully complains about impossible constraints. Provide alternatives that'll allow a memory reference for 'out' to solve the allocation constraint dilemma for this configuration. Signed-off-by: Mathias Krause <minipli@grsecurity.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-08-08compat: account for grsecurity backports and changesMathias Krause
grsecurity kernels tend to carry additional backports and changes, like commit b60b87fc2996 ("netlink: add ethernet address policy types") or the SYM_FUNC_* changes. RAP nowadays hooks the latter, therefore no diversion to RAP_ENTRY is needed any more. Instead of relying on the kernel version test, also test for the macros we're about to define to not already be defined to account for these additional changes in the grsecurity patch without breaking compatibility to the older public ones. Also test for CONFIG_PAX instead of RAP_PLUGIN for the timer API related changes as these don't depend on the RAP plugin to be enabled but just a PaX/grsecurity patch to be applied. While there is no preprocessor knob for the latter, use CONFIG_PAX as this will likely be enabled in every kernel that uses the patch. Signed-off-by: Mathias Krause <minipli@grsecurity.net> [zx2c4: small changes to include a header nearby a macro def test] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-15compat: account for latest c8s backportsJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06qemu: increase default dmesg log sizeJason A. Donenfeld
The selftests currently parse the kernel log at the end to track potential memory leaks. With these tests now reading off the end of the buffer, due to recent optimizations, some creation messages were lost, making the tests think that there was a free without an alloc. Fix this by increasing the kernel log size. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-06qemu: add disgusting hacks for RHEL 8Jason A. Donenfeld
Red Hat does awful things to their kernel for RHEL 8, such that it doesn't even compile in most configurations. This is utter craziness, and their response to me sending patches to fix this stuff has been to stonewall for months on end and then do nothing. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: add missing __rcu annotation to satisfy sparseJason A. Donenfeld
A __rcu annotation got lost during refactoring, which caused sparse to become enraged. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: free empty intermediate nodes when removing single nodeJason A. Donenfeld
When removing single nodes, it's possible that that node's parent is an empty intermediate node, in which case, it too should be removed. Otherwise the trie fills up and never is fully emptied, leading to gradual memory leaks over time for tries that are modified often. There was originally code to do this, but was removed during refactoring in 2016 and never reworked. Now that we have proper parent pointers from the previous commits, we can implement this properly. In order to reduce branching and expensive comparisons, we want to keep the double pointer for parent assignment (which lets us easily chain up to the root), but we still need to actually get the parent's base address. So encode the bit number into the last two bits of the pointer, and pack and unpack it as needed. This is a little bit clumsy but is the fastest and less memory wasteful of the compromises. Note that we align the root struct here to a minimum of 4, because it's embedded into a larger struct, and we're relying on having the bottom two bits for our flag, which would only be 16-bit aligned on m68k. The existing macro-based helpers were a bit unwieldy for adding the bit packing to, so this commit replaces them with safer and clearer ordinary functions. We add a test to the randomized/fuzzer part of the selftests, to free the randomized tries by-peer, refuzz it, and repeat, until it's supposed to be empty, and then then see if that actually resulted in the whole thing being emptied. That combined with kmemcheck should hopefully make sure this commit is doing what it should. Along the way this resulted in various other cleanups of the tests and fixes for recent graphviz. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: allocate nodes in kmem_cacheJason A. Donenfeld
The previous commit moved from O(n) to O(1) for removal, but in the process introduced an additional pointer member to a struct that increased the size from 60 to 68 bytes, putting nodes in the 128-byte slab. With deployed systems having as many as 2 million nodes, this represents a significant doubling in memory usage (128 MiB -> 256 MiB). Fix this by using our own kmem_cache, that's sized exactly right. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: remove nodes in O(1)Jason A. Donenfeld
Previously, deleting peers would require traversing the entire trie in order to rebalance nodes and safely free them. This meant that removing 1000 peers from a trie with a half million nodes would take an extremely long time, during which we're holding the rtnl lock. Large-scale users were reporting 200ms latencies added to the networking stack as a whole every time their userspace software would queue up significant removals. That's a serious situation. This commit fixes that by maintaining a double pointer to the parent's bit pointer for each node, and then using the already existing node list belonging to each peer to go directly to the node, fix up its pointers, and free it with RCU. This means removal is O(1) instead of O(n), and we don't use gobs of stack. The removal algorithm has the same downside as the code that it fixes: it won't collapse needlessly long runs of fillers. We can enhance that in the future if it ever becomes a problem. This commit documents that limitation with a TODO comment in code, a small but meaningful improvement over the prior situation. Currently the biggest flaw, which the next commit addresses, is that because this increases the node size on 64-bit machines from 60 bytes to 68 bytes. 60 rounds up to 64, but 68 rounds up to 128. So we wind up using twice as much memory per node, because of power-of-two allocations, which is a big bummer. We'll need to figure something out there. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04allowedips: initialize list head in selftestJason A. Donenfeld
The randomized trie tests weren't initializing the dummy peer list head, resulting in a NULL pointer dereference when used. Fix this by initializing it in the randomized trie test, just like we do for the static unit test. While we're at it, all of the other strings like this have the word "self-test", so add it to the missing place here. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-04peer: allocate in kmem_cacheJason A. Donenfeld
With deployments having upwards of 600k peers now, this somewhat heavy structure could benefit from more fine-grained allocations. Specifically, instead of using a 2048-byte slab for a 1544-byte object, we can now use 1544-byte objects directly, thus saving almost 25% per-peer, or with 600k peers, that's a savings of 303 MiB. This also makes wireguard's memory usage more transparent in tools like slabtop and /proc/slabinfo. Suggested-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02global: use synchronize_net rather than synchronize_rcuJason A. Donenfeld
Many of the synchronization points are sometimes called under the rtnl lock, which means we should use synchronize_net rather than synchronize_rcu. Under the hood, this expands to using the expedited flavor of function in the event that rtnl is held, in order to not stall other concurrent changes. This fixes some very, very long delays when removing multiple peers at once, which would cause some operations to take several minutes. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02kbuild: do not use -O3Jason A. Donenfeld
Apparently, various versions of gcc have O3-related miscompiles. Looking at the difference between -O2 and -O3 for gcc 11 doesn't indicate miscompiles, but the difference also doesn't seem so significant for performance that it's worth risking. Link: https://lore.kernel.org/lkml/CAHk-=wjuoGyxDhAF8SsrTkN0-YfCx7E6jUN3ikC_tn2AKWTTsA@mail.gmail.com/ Link: https://lore.kernel.org/lkml/CAHmME9otB5Wwxp7H8bR_i2uH2esEMvoBMC8uEXBMH9p0q1s6Bw@mail.gmail.com/ Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-06-02netns: make sure rp_filter is disabled on vethcJason A. Donenfeld
Some distros may enable strict rp_filter by default, which will prevent vethc from receiving the packets with an unroutable reverse path address. Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-24version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-23Revert "compat: skb_mark_not_on_list will be backported to Ubuntu 18.04"Thadeu Lima de Souza Cascardo
This reverts commit cad80597c7947f0def83caf8cb56aff0149c83a8. Because this commit has not been backported so far, due to the implications of building Ubuntu's backport of wireguard in a timely manner. For now, reverting this fix would allow wireguard-linux-compat CI to work on Ubuntu 18.04. A different fix or the same one can be applied again when the time is right. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-04-22compat: update and improve detection of CentOS Stream 8Peter Georg
CentOS Stream 8 by now (4.18.0-301.1.el8) reports RHEL_MINOR=5. The current RHEL 8 minor release is still 3. RHEL 8.4 is in beta. Replace equal comparison by greater equal to (hopefully) be a little bit more future proof. Signed-off-by: Peter Georg <peter.georg@physik.uni-regensburg.de> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-03-07compat: icmp_ndo_send functions were backported extensivelyJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19qemu: bump default kernel versionJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-19compat: zero out skb->cb before icmpJason A. Donenfeld
This corresponds to the fancier upstream commit that's still on lkml, which passes a zeroed ip_options struct to __icmp_send. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18compat: skb_mark_not_on_list will be backported to Ubuntu 18.04Thadeu Lima de Souza Cascardo
linux commit 22f6bbb7bcfcef0b373b0502a7ff390275c575dd ("net: use skb_list_del_init() to remove from RX sublists") will be backported to Ubuntu 18.04 default kernel, which is based on linux 4.15. Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18queueing: get rid of per-peer ring buffersJason A. Donenfeld
Having two ring buffers per-peer means that every peer results in two massive ring allocations. On an 8-core x86_64 machine, this commit reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which is an 90% reduction. Ninety percent! With some single-machine deployments approaching 500,000 peers, we're talking about a reduction from 7 gigs of memory down to 700 megs of memory. In order to get rid of these per-peer allocations, this commit switches to using a list-based queueing approach. Currently GSO fragments are chained together using the skb->next pointer (the skb_list_* singly linked list approach), so we form the per-peer queue around the unused skb->prev pointer (which sort of makes sense because the links are pointing backwards). Use of skb_queue_* is not possible here, because that is based on doubly linked lists and spinlocks. Multiple cores can write into the queue at any given time, because its writes occur in the start_xmit path or in the udp_recv path. But reads happen in a single workqueue item per-peer, amounting to a multi-producer, single-consumer paradigm. The MPSC queue is implemented locklessly and never blocks. However, it is not linearizable (though it is serializable), with a very tight and unlikely race on writes, which, when hit (some tiny fraction of the 0.15% of partial adds on a fully loaded 16-core x86_64 system), causes the queue reader to terminate early. However, because every packet sent queues up the same workqueue item after it is fully added, the worker resumes again, and stopping early isn't actually a problem, since at that point the packet wouldn't have yet been added to the encryption queue. These properties allow us to avoid disabling interrupts or spinning. The design is based on Dmitry Vyukov's algorithm [1]. Performance-wise, ordinarily list-based queues aren't preferable to ringbuffers, because of cache misses when following pointers around. However, we *already* have to follow the adjacent pointers when working through fragments, so there shouldn't actually be any change there. A potential downside is that dequeueing is a bit more complicated, but the ptr_ring structure used prior had a spinlock when dequeueing, so all and all the difference appears to be a wash. Actually, from profiling, the biggest performance hit, by far, of this commit winds up being atomic_add_unless(count, 1, max) and atomic_ dec(count), which account for the majority of CPU time, according to perf. In that sense, the previous ring buffer was superior in that it could check if it was full by head==tail, which the list-based approach cannot do. But all and all, this enables us to get massive memory savings, allowing WireGuard to scale for real world deployments, without taking much of a performance hit. [1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18device: do not generate ICMP for non-IP packetsJason A. Donenfeld
If skb->protocol doesn't match the actual skb->data header, it's probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is expecting to reply to a valid IP packet. So this commit has that early mismatch case jump to a later error label. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-18selftests: test multiple parallel streamsJason A. Donenfeld
In order to test ndo_start_xmit being called in parallel, explicitly add separate tests, which should all run on different cores. This should help tease out bugs associated with queueing up packets from different cores in parallel. Currently, it hasn't found those types of bugs, but given future planned work, this is a useful regression to avoid. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-08peer: put frequently used members above cache linesJason A. Donenfeld
The is_dead boolean is checked for every single packet, while the internal_id member is used basically only for pr_debug messages. So it makes sense to hoist up is_dead into some space formerly unused by a struct hole, while demoting internal_api to below the lowest struct cache line. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-07compat: redefine version constants for sublevel>=256Jason A. Donenfeld
With the 4.4.256 and 4.9.256 kernels, the previous calculation for integer comparison overflowed. This commit redefines the broken constants to have more space for the sublevel. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-02-07compat: remove unused version.h headersJason A. Donenfeld
We don't need this in all files, and it just complicates things. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-24version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-24compat: skb_mark_not_on_list was backported to 4.14Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2021-01-13compat: SYM_FUNC_* was backported to c8sJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-21version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-21socket: remove bogus __be32 annotationJann Horn
The endpoint->src_if4 has nothing to do with fixed-endian numbers; remove the bogus annotation. This was introduced in https://git.zx2c4.com/wireguard-monolithic-historical/commit?id=14e7d0a499a676ec55176c0de2f9fcbd34074a82 in the historical WireGuard repo because the old code used to zero-initialize multiple members as follows: endpoint->src4.s_addr = endpoint->src_if4 = fl.saddr = 0; Because fl.saddr is fixed-endian and an assignment returns a value with the type of its left operand, this meant that sparse detected an assignment between values of different endianness. Since then, this assignment was already split up into separate statements; just the cast survived. Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-21global: avoid double unlikely() notation when using IS_ERR()Antonio Quartulli
The definition of IS_ERR() already applies the unlikely() notation when checking the error status of the passed pointer. For this reason there is no need to have the same notation outside of IS_ERR() itself. Clean up code by removing redundant notation. Signed-off-by: Antonio Quartulli <a@unstable.cc> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-19simd: detect -rt kernels >= 5.4Jason A. Donenfeld
The 5.4 series of -rt kernels moved from PREEMPT_RT_BASE/PREEMPT_RT_FULL to PREEMPT_RT, so we have to account for it here. Otherwise users get scheduling-while-atomic splats. Reported-by: Erik Schuitema <erik@essd.nl> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-16gitignore: ignore intermediary build fileL.W.Reek
Signed-off-by: L.W.Reek <syphyr@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-12-14compat: drop rhel 8.2, add rhel 8.4 supportJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-12version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-12qemu: bump default testing versionJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-12compat: SYM_FUNC_{START,END} were backported to 5.4Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-11-04qemu: drop build support for rhel 8.2Jason A. Donenfeld
This reverts commit feb89cab65c6ab1a6cbeeaaeb11b1a174772cea8. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-10-29netns: check that route_me_harder packets use the right skJason A. Donenfeld
If netfilter changes the packet mark, the packet is rerouted. The ip_route_me_harder family of functions fails to use the right sk, opting to instead use skb->sk, resulting in a routing loop when used with tunnels. Fixing this inside of the compat layer with skb_orphan would work but would cause other problems, by disabling TSQ, so instead we warn if the calling kernel hasn't yet backported the fix for this. Reported-by: Chen Minqiang <ptpt52@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-09-09noise: take lock when removing handshake entry from tableJason A. Donenfeld
Eric reported that syzkaller found a race of this variety: CPU 1 CPU 2 -------------------------------------------|--------------------------------------- wg_index_hashtable_replace(old, ...) | if (hlist_unhashed(&old->index_hash)) | | wg_index_hashtable_remove(old) | hlist_del_init_rcu(&old->index_hash) | old->index_hash.pprev = NULL hlist_replace_rcu(&old->index_hash, ...) | *old->index_hash.pprev | Syzbot wasn't actually able to reproduce this more than once or create a reproducer, because the race window between checking "hlist_unhashed" and calling "hlist_replace_rcu" is just so small. Adding an mdelay(5) or similar there helps make this demonstrable using this simple script: #!/bin/bash set -ex trap 'kill $pid1; kill $pid2; ip link del wg0; ip link del wg1' EXIT ip link add wg0 type wireguard ip link add wg1 type wireguard wg set wg0 private-key <(wg genkey) listen-port 9999 wg set wg1 private-key <(wg genkey) peer $(wg show wg0 public-key) endpoint 127.0.0.1:9999 persistent-keepalive 1 wg set wg0 peer $(wg show wg1 public-key) ip link set wg0 up yes link set wg1 up | ip -force -batch - & pid1=$! yes link set wg1 down | ip -force -batch - & pid2=$! wait The fundumental underlying problem is that we permit calls to wg_index_ hashtable_remove(handshake.entry) without requiring the caller to take the handshake mutex that is intended to protect members of handshake during mutations. This is consistently the case with calls to wg_index_ hashtable_insert(handshake.entry) and wg_index_hashtable_replace( handshake.entry), but it's missing from a pertinent callsite of wg_ index_hashtable_remove(handshake.entry). So, this patch makes sure that mutex is taken. The original code was a little bit funky though, in the form of: remove(handshake.entry) lock(), memzero(handshake.some_members), unlock() remove(handshake.entry) The original intention of that double removal pattern outside the lock appears to be some attempt to prevent insertions that might happen while locks are dropped during expensive crypto operations, but actually, all callers of wg_index_hashtable_insert(handshake.entry) take the write lock and then explicitly check handshake.state, as they should, which the aforementioned memzero clears, which means an insertion should already be impossible. And regardless, the original intention was necessarily racy, since it wasn't guaranteed that something else would run after the unlock() instead of after the remove(). So, from a soundness perspective, it seems positive to remove what looks like a hack at best. The crash from both syzbot and from the script above is as follows: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 7395 Comm: kworker/0:3 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: wg-kex-wg1 wg_packet_handshake_receive_worker RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: wg_noise_handshake_begin_session+0x752/0xc9a drivers/net/wireguard/noise.c:820 wg_receive_handshake_packet drivers/net/wireguard/receive.c:183 [inline] wg_packet_handshake_receive_worker+0x33b/0x730 drivers/net/wireguard/receive.c:220 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Note that this fixes the same issue as the previous commit, but in a more direct way. Upstream, the commit message of that previous commit has been changed to: wireguard: peerlookup: take lock before checking hash in replace operation Eric's suggested fix for the previous commit's mentioned race condition was to simply take the table->lock in wg_index_hashtable_replace(). The table->lock of the hash table is supposed to protect the bucket heads, not the entires, but actually, since all the mutator functions are already taking it, it makes sense to take it too for the test to hlist_unhashed, as a defense in depth measure, so that it no longer races with deletions, regardless of what other locks are protecting individual entries. This is sensible from a performance perspective because, as Eric pointed out, the case of being unhashed is already the unlikely case, so this won't add common contention. And comparing instructions, this basically doesn't make much of a difference other than pushing and popping %r13, used by the new `bool ret`. More generally, I like the idea of locking consistency across table mutator functions, and this might let me rest slightly easier at night. Since we've already tagged it, we're not going to change it at this point, but I include mention of it here for reference. Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-09-08version: bumpJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-09-08peerlookup: take lock before checking hash in replace operationJason A. Donenfeld
Eric reported that syzkaller found a race of this variety: CPU 1 CPU 2 -------------------------------------------|--------------------------------------- wg_index_hashtable_replace(old, ...) | if (hlist_unhashed(&old->index_hash)) | | wg_index_hashtable_remove(old) | hlist_del_init_rcu(&old->index_hash) | old->index_hash.pprev = NULL hlist_replace_rcu(&old->index_hash, ...) | *old->index_hash.pprev | The table->lock of the hash table is supposed to protect the bucket heads, not the entires, but actually, since all the mutator functions are already taking it, it makes sense to take it too for the test to hlist_unhashed, so that it no longer races with deletions. This is fine because, as Eric pointed out, the case of being unhashed is already the unlikely case, so this won't add common contention. And comparing instructions, this basically doesn't make much of a difference other than pushing and popping %r13, used by the new `bool ret`. The syzkaller crash is as follows: general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007] CPU: 0 PID: 7395 Comm: kworker/0:3 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: wg-kex-wg1 wg_packet_handshake_receive_worker RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: wg_noise_handshake_begin_session+0x752/0xc9a drivers/net/wireguard/noise.c:820 wg_receive_handshake_packet drivers/net/wireguard/receive.c:183 [inline] wg_packet_handshake_receive_worker+0x33b/0x730 drivers/net/wireguard/receive.c:220 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Modules linked in: ---[ end trace 0d737db78b72da84 ]--- RIP: 0010:hlist_replace_rcu include/linux/rculist.h:505 [inline] RIP: 0010:wg_index_hashtable_replace+0x176/0x330 drivers/net/wireguard/peerlookup.c:174 Code: 00 fc ff df 48 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 44 01 00 00 48 b9 00 00 00 00 00 fc ff df 48 8b 45 10 48 89 c6 48 c1 ee 03 <80> 3c 0e 00 0f 85 06 01 00 00 48 85 d2 4c 89 28 74 47 e8 a3 4f b5 RSP: 0018:ffffc90006a97bf8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888050ffc4f8 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88808e04e010 RBP: ffff88808e04e000 R08: 0000000000000001 R09: ffff8880543d0000 R10: ffffed100a87a000 R11: 000000000000016e R12: ffff8880543d0000 R13: ffff88808e04e008 R14: ffff888050ffc508 R15: ffff888050ffc500 FS: 0000000000000000(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f5505db0 CR3: 0000000097cf7000 CR4: 00000000001526f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27compat: backport NLA policy macrosJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27netlink: consistently use NLA_POLICY_MIN_LEN()Johannes Berg
Change places that open-code NLA_POLICY_MIN_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27netlink: consistently use NLA_POLICY_EXACT_LEN()Johannes Berg
Change places that open-code NLA_POLICY_EXACT_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2020-08-27compat: backport kfree_sensitive and switch to itJason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>