Age | Commit message (Collapse) | Author |
|
Having two ring buffers per-peer means that every peer results in two
massive ring allocations. On an 8-core x86_64 machine, this commit
reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
is an 90% reduction. Ninety percent! With some single-machine
deployments approaching 500,000 peers, we're talking about a reduction
from 7 gigs of memory down to 700 megs of memory.
In order to get rid of these per-peer allocations, this commit switches
to using a list-based queueing approach. Currently GSO fragments are
chained together using the skb->next pointer (the skb_list_* singly
linked list approach), so we form the per-peer queue around the unused
skb->prev pointer (which sort of makes sense because the links are
pointing backwards). Use of skb_queue_* is not possible here, because
that is based on doubly linked lists and spinlocks. Multiple cores can
write into the queue at any given time, because its writes occur in the
start_xmit path or in the udp_recv path. But reads happen in a single
workqueue item per-peer, amounting to a multi-producer, single-consumer
paradigm.
The MPSC queue is implemented locklessly and never blocks. However, it
is not linearizable (though it is serializable), with a very tight and
unlikely race on writes, which, when hit (some tiny fraction of the
0.15% of partial adds on a fully loaded 16-core x86_64 system), causes
the queue reader to terminate early. However, because every packet sent
queues up the same workqueue item after it is fully added, the worker
resumes again, and stopping early isn't actually a problem, since at
that point the packet wouldn't have yet been added to the encryption
queue. These properties allow us to avoid disabling interrupts or
spinning. The design is based on Dmitry Vyukov's algorithm [1].
Performance-wise, ordinarily list-based queues aren't preferable to
ringbuffers, because of cache misses when following pointers around.
However, we *already* have to follow the adjacent pointers when working
through fragments, so there shouldn't actually be any change there. A
potential downside is that dequeueing is a bit more complicated, but the
ptr_ring structure used prior had a spinlock when dequeueing, so all and
all the difference appears to be a wash.
Actually, from profiling, the biggest performance hit, by far, of this
commit winds up being atomic_add_unless(count, 1, max) and atomic_
dec(count), which account for the majority of CPU time, according to
perf. In that sense, the previous ring buffer was superior in that it
could check if it was full by head==tail, which the list-based approach
cannot do.
But all and all, this enables us to get massive memory savings, allowing
WireGuard to scale for real world deployments, without taking much of a
performance hit.
[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
If skb->protocol doesn't match the actual skb->data header, it's
probably not a good idea to pass it off to icmp{,v6}_ndo_send, which is
expecting to reply to a valid IP packet. So this commit has that early
mismatch case jump to a later error label.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The definition of IS_ERR() already applies the unlikely() notation
when checking the error status of the passed pointer. For this
reason there is no need to have the same notation outside of
IS_ERR() itself.
Clean up code by removing redundant notation.
Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
WireGuard uses skb->protocol to determine packet type, and bails out if
it's not set or set to something it's not expecting. For AF_PACKET
injection, we need to support its call chain of:
packet_sendmsg -> packet_snd -> packet_parse_headers ->
dev_parse_header_protocol -> parse_protocol
Without a valid parse_protocol, this returns zero, and wireguard then
rejects the skb. So, this wires up the ip_tunnel handler for layer 3
packets for that case.
Reported-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Before, we took a reference to the creating netns if the new netns was
different. This caused issues with circular references, with two
wireguard interfaces swapping namespaces. The solution is to rather not
take any extra references at all, but instead simply invalidate the
creating netns pointer when that netns is deleted.
In order to prevent this from happening again, this commit improves the
rough object leak tracking by allowing it to account for created and
destroyed interfaces, aside from just peers and keys. That then makes it
possible to check for the object leak when having two interfaces take a
reference to each others' namespaces.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
We carry out checks to the effect of:
if (skb->protocol != wg_examine_packet_protocol(skb))
goto err;
By having wg_skb_examine_untrusted_ip_hdr return 0 on failure, this
means that the check above still passes in the case where skb->protocol
is zero, which is possible to hit with AF_PACKET:
struct sockaddr_pkt saddr = { .spkt_device = "wg0" };
unsigned char buffer[5] = { 0 };
sendto(socket(AF_PACKET, SOCK_PACKET, /* skb->protocol = */ 0),
buffer, sizeof(buffer), 0, (const struct sockaddr *)&saddr, sizeof(saddr));
Additional checks mean that this isn't actually a problem in the code
base, but I could imagine it becoming a problem later if the function is
used more liberally.
I would prefer to fix this by having wg_examine_packet_protocol return a
32-bit ~0 value on failure, which will never match any value of
skb->protocol, which would simply change the generated code from a mov
to a movzx. However, sparse complains, and adding __force casts doesn't
seem like a good idea, so instead we just add a simple helper function
to check for the zero return value. Since wg_examine_packet_protocol
itself gets inlined, this winds up not adding an additional branch to
the generated code, since the 0 return value already happens in a
mergable branch.
Reported-by: Fabian Freyer <fabianfreyer@radicallyopensecurity.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
It turns out there's an easy way to get packets queued up while still
having an MTU of zero, and that's via persistent keep alive. This commit
makes sure that in whatever condition, we don't wind up dividing by
zero. Note that an MTU of zero for a wireguard interface is something
quasi-valid, so I don't think the correct fix is to limit it via
min_mtu. This can be reproduced easily with:
ip link add wg0 type wireguard
ip link add wg1 type wireguard
ip link set wg0 up mtu 0
ip link set wg1 up
wg set wg0 private-key <(wg genkey)
wg set wg1 listen-port 1 private-key <(wg genkey) peer $(wg show wg0 public-key)
wg set wg0 peer $(wg show wg1 public-key) persistent-keepalive 1 endpoint 127.0.0.1:1
However, while min_mtu=0 seems fine, it makes sense to restrict the
max_mtu. This commit also restricts the maximum MTU to the greatest
number for which rounding up to the padding multiple won't overflow a
signed integer. Packets this large were always rejected anyway
eventually, due to checks deeper in, but it seems more sound not to even
let the administrator configure something that won't work anyway.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This is a small test to ensure that icmp_ndo_send is actually doing the
right with with regards to the source address. It tests this by
ensuring that the error comes back along the right path.
Also, backport the new ndo function for this.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Reported-by: Derrick Pallas <derrick@pallas.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Coarse ktime is broken until [1] in 5.2 and kernels without the
backport, so we use fallback code there.
The fallback code has also been improved significantly. It now only uses
slower clocks on kernels < 3.17, at the expense of some accuracy we're
not overly concerned about.
[1] https://lore.kernel.org/lkml/tip-e3ff9c3678b4d80e22d2557b68726174578eaf52@git.kernel.org/
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The hashtable allocations are quite large, and cause the device allocation in
the net framework to stall sometimes while it tries to find a contiguous region
that can fit the device struct:
[<0000000000000000>] __switch_to+0x94/0xb8
[<0000000000000000>] __alloc_pages_nodemask+0x764/0x7e8
[<0000000000000000>] kmalloc_order+0x20/0x40
[<0000000000000000>] __kmalloc+0x144/0x1a0
[<0000000000000000>] alloc_netdev_mqs+0x5c/0x368
[<0000000000000000>] rtnl_create_link+0x48/0x180
[<0000000000000000>] rtnl_newlink+0x410/0x708
[<0000000000000000>] rtnetlink_rcv_msg+0x190/0x1f8
[<0000000000000000>] netlink_rcv_skb+0x4c/0xf8
[<0000000000000000>] rtnetlink_rcv+0x30/0x40
[<0000000000000000>] netlink_unicast+0x18c/0x208
[<0000000000000000>] netlink_sendmsg+0x19c/0x348
[<0000000000000000>] sock_sendmsg+0x3c/0x58
[<0000000000000000>] ___sys_sendmsg+0x290/0x2b0
[<0000000000000000>] __sys_sendmsg+0x58/0xa0
[<0000000000000000>] SyS_sendmsg+0x10/0x20
[<0000000000000000>] el0_svc_naked+0x34/0x38
[<0000000000000000>] 0xffffffffffffffff
To fix the allocation stalls, decouple the hashtable allocations from the device
allocation and allocate the hashtables with kvmalloc's implicit __GFP_NORETRY
so that the allocations fall back to vmalloc with little resistance.
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This way other devices that use Android style wakelocks will also have
the same semantics. We also move this logic into the handler so that
it's slightly cleaner and gives us some opportunity to leave a normal
comment.
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
If struct timer_list has not been setup, it is zeroed, in which case
timer_pending is false, so calling del_timer is safe. Calling del_timer
is also safe on a timer that has already been del_timer'd. And calling
del_timer is safe after a peer is dead, since the whole point of it
being dead is that no more timers are created and all contexts
eventually stop. Finally del_timer uses a lock, which means it's safe to
call it concurrently.
Therefore, we do not need any guards around calls to del_timer. While
we're at it, we can get rid of the old lingering timers_enabled boolean
which wasn't doing anything anyway anymore.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This required a bit of pruning of our christmas trees.
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
I understand why this must be done, though I'm not so happy about having
to do it. In some places, it puts us over 80 chars and we have to break
lines up in further ugly ways. And in general, I think this makes things
harder to read. Yet another thing we must do to please upstream.
Maybe this can be replaced in the future by some kind of automatic
module namespacing logic in the linker, or even combined with LTO and
aggressive symbol stripping.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The kernel has very specific rules correlating file type with comment
type, and also SPDX identifiers can't be merged with other comments.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This is the worst commit in the whole repo, making the code much less
readable, but so it goes with upstream maintainers.
We are now woefully wrapped at 80 columns.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Rather than abusing the handshake lock, we're much better off just using
a boring atomic64 for this. It's simpler and performs better.
Also, while we're at it, we set the handshake stamp both before and
after the calculations, in case the calculations block for a really long
time waiting for the RNG to initialize. Otherwise it's possible that
when the RNG finally initializes, two handshakes are sent back to back,
which isn't sensible.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Generally if we're inaccurate by a few nanoseconds, it doesn't matter.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Since this is a network protocol, expirations need to be accounted for,
even across system suspend. On real systems, this isn't a problem, since
we're clearing all keys before suspend. But on Android, where we don't
do that, this is something of a problem. So, we switch to using boottime
instead of jiffies.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The new flow offloading feature at the moment does not set the dst. We
have a patch pending to fix this upstream, but in the meantime, work
around it here.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This avoids an O(n^2) traversal in favor of an O(n) one.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
It might be that a particular route has a different MTU than the
interface, via `ip route add ... dev wg0 mtu 1281`, for example. In this
case, it's important that we don't accidently pad beyond the end of the
MTU. We accomplish that in this patch by carrying forward the MTU from
the dst if it exists. We also add a unit test for this issue.
Reported-by: Roman Mamedov <rm.wg@romanrm.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
It's good to have SPDX identifiers in all files as the Linux kernel
developers are working to add these identifiers to all files.
Update all files with the correct SPDX license identifier based on the license
text of the project or based on the license in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of the
full boiler plate text.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Modified-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Otherwise new handshakes might not occur immediately when the interface
goes up and down.
Also initialize peers to having a proper zeroed handshake jiffies.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This makes sense for the security model of laptops, but not for clicking
phones on and off, where we actually want to be able to handle incoming
packets.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
When an interface is down, the socket port can change freely. A socket
will be allocated when the interface comes up, and if a socket can't be
allocated, the interface doesn't come up.
However, a socket port can change while the interface is up. In this
case, if a new socket with a new port cannot be allocated, it's
important to keep the interface in a consistent state. The choices are
either to bring down the interface or to preserve the old socket. This
patch implements the latter.
Reported-by: Marc-Antoine Perennou <keruspe@exherbo.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Makes it more clear that this _not_ a routing table replacement.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|