With deployments having upwards of 600k peers now, this somewhat heavy
structure could benefit from more fine-grained allocations.
Specifically, instead of using a 2048-byte slab for a 1544-byte object,
we can now use 1544-byte objects directly, thus saving almost 25%
per-peer, or with 600k peers, that's a savings of roughly 300 MB. This also
makes wireguard's memory usage more transparent in tools like slabtop
and /proc/slabinfo.
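Roughly, the change amounts to giving the peer its own slab cache (a
sketch; the actual cache setup and error handling in the tree may differ
slightly):

    #include <linux/slab.h>

    static struct kmem_cache *peer_cache;

    int __init wg_peer_init(void)
    {
            /* A cache sized exactly to struct wg_peer, rather than the
             * generic kmalloc-2048 slab that a 1544-byte kzalloc lands in. */
            peer_cache = KMEM_CACHE(wg_peer, 0);
            return peer_cache ? 0 : -ENOMEM;
    }

    /* Allocation and freeing then become: */
    peer = kmem_cache_zalloc(peer_cache, GFP_KERNEL);
    kmem_cache_free(peer_cache, peer);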
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Having two ring buffers per-peer means that every peer results in two
massive ring allocations. On an 8-core x86_64 machine, this commit
reduces the per-peer allocation from 18,688 bytes to 1,856 bytes, which
is a 90% reduction. Ninety percent! With some single-machine
deployments approaching 500,000 peers, we're talking about a reduction
from 7 gigs of memory down to 700 megs of memory.
In order to get rid of these per-peer allocations, this commit switches
to using a list-based queueing approach. Currently GSO fragments are
chained together using the skb->next pointer (the skb_list_* singly
linked list approach), so we form the per-peer queue around the unused
skb->prev pointer (which sort of makes sense because the links are
pointing backwards). Use of skb_queue_* is not possible here, because
that is based on doubly linked lists and spinlocks. Multiple cores can
write into the queue at any given time, because its writes occur in the
start_xmit path or in the udp_recv path. But reads happen in a single
workqueue item per-peer, amounting to a multi-producer, single-consumer
paradigm.
The MPSC queue is implemented locklessly and never blocks. However, it
is not linearizable (though it is serializable), with a very tight and
unlikely race on writes, which, when hit (some tiny fraction of the
0.15% of partial adds on a fully loaded 16-core x86_64 system), causes
the queue reader to terminate early. However, because every packet sent
queues up the same workqueue item after it is fully added, the worker
resumes again, and stopping early isn't actually a problem, since at
that point the packet wouldn't have yet been added to the encryption
queue. These properties allow us to avoid disabling interrupts or
spinning. The design is based on Dmitry Vyukov's algorithm [1].
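For reference, a condensed sketch of that queue, with skb->prev standing
in for the intrusive next pointer (structure, naming, and the 1024 cap
here are illustrative and not identical to the in-tree code):

    #include <linux/skbuff.h>
    #include <linux/atomic.h>

    #define MAX_QUEUED_PACKETS 1024      /* illustrative cap */
    #define NEXT(skb) ((skb)->prev)      /* reuse the otherwise-unused prev pointer */

    struct prev_queue {
            struct sk_buff *head, *tail; /* head: producers, tail: consumer */
            struct sk_buff empty;        /* stub node, never a real packet */
            atomic_t count;
    };

    static void prev_queue_init(struct prev_queue *queue)
    {
            NEXT(&queue->empty) = NULL;
            queue->head = queue->tail = &queue->empty;
            atomic_set(&queue->count, 0);
    }

    static void __prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
    {
            WRITE_ONCE(NEXT(skb), NULL);
            /* Swing head to the new node, then link the old head to it.
             * Between these two steps the chain is momentarily broken,
             * which is the benign race that can make the consumer stop
             * early; the re-queued worker then picks it back up. */
            WRITE_ONCE(NEXT(xchg_release(&queue->head, skb)), skb);
    }

    static bool prev_queue_enqueue(struct prev_queue *queue, struct sk_buff *skb)
    {
            if (!atomic_add_unless(&queue->count, 1, MAX_QUEUED_PACKETS))
                    return false;
            __prev_queue_enqueue(queue, skb);
            return true;
    }

    static struct sk_buff *prev_queue_dequeue(struct prev_queue *queue)
    {
            struct sk_buff *tail = queue->tail, *next = smp_load_acquire(&NEXT(tail));

            if (tail == &queue->empty) {
                    if (!next)
                            return NULL;             /* queue is empty */
                    queue->tail = next;
                    tail = next;
                    next = smp_load_acquire(&NEXT(next));
            }
            if (next) {
                    queue->tail = next;
                    atomic_dec(&queue->count);
                    return tail;
            }
            if (tail != READ_ONCE(queue->head))
                    return NULL;                     /* producer mid-publish; retry later */
            __prev_queue_enqueue(queue, &queue->empty);
            next = smp_load_acquire(&NEXT(tail));
            if (next) {
                    queue->tail = next;
                    atomic_dec(&queue->count);
                    return tail;
            }
            return NULL;
    }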
Performance-wise, ordinarily list-based queues aren't preferable to
ringbuffers, because of cache misses when following pointers around.
However, we *already* have to follow the adjacent pointers when working
through fragments, so there shouldn't actually be any change there. A
potential downside is that dequeueing is a bit more complicated, but the
ptr_ring structure used prior had a spinlock when dequeueing, so all in
all the difference appears to be a wash.
Actually, from profiling, the biggest performance hit, by far, of this
commit winds up being atomic_add_unless(count, 1, max) and
atomic_dec(count), which account for the majority of CPU time, according to
perf. In that sense, the previous ring buffer was superior in that it
could check if it was full by head==tail, which the list-based approach
cannot do.
But all in all, this enables us to get massive memory savings, allowing
WireGuard to scale for real world deployments, without taking much of a
performance hit.
[1] http://www.1024cores.net/home/lock-free-algorithms/queues/intrusive-mpsc-node-based-queue
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The is_dead boolean is checked for every single packet, while the
internal_id member is used basically only for pr_debug messages. So it
makes sense to hoist up is_dead into some space formerly unused by a
struct hole, while demoting internal_id to below the lowest struct
cache line.
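Schematically, the layout change looks like this (abbreviated, not the
full struct):

    struct wg_peer {
            struct wg_device *device;
            ...                     /* other hot, per-packet members */
            bool is_dead;           /* fills a former padding hole up in the
                                       frequently-accessed cache lines */
            ...
            u64 internal_id;        /* pr_debug only; demoted below the last
                                       hot cache line */
    };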
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This makes `wg show` and `wg showconf` and the like significantly
faster, since we don't have to iterate through every node of the trie
for every single peer. It also makes netlink cursor resumption much less
problematic, since we're just iterating through a list, rather than
having to save a traversal stack.
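Concretely, the device now keeps a flat list of peers next to the
allowedips trie, and dumps just walk that list. A rough sketch (member
names illustrative; dump_one_peer() is a hypothetical stand-in):

    #include <linux/list.h>

    /* In struct wg_device: the list head, protected by device_update_lock. */
    struct list_head peer_list;
    /* In struct wg_peer: the membership node for that list. */
    struct list_head peer_list;

    /* On peer creation, under device_update_lock: */
    list_add_tail(&peer->peer_list, &wg->peer_list);

    /* `wg show` / netlink dumps then become a linear walk: */
    list_for_each_entry(peer, &wg->peer_list, peer_list)
            dump_one_peer(peer);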
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
If struct timer_list has not been set up, it is zeroed, in which case
timer_pending is false, so calling del_timer is safe. Calling del_timer
is also safe on a timer that has already been del_timer'd. And calling
del_timer is safe after a peer is dead, since the whole point of it
being dead is that no more timers are created and all contexts
eventually stop. Finally del_timer uses a lock, which means it's safe to
call it concurrently.
Therefore, we do not need any guards around calls to del_timer. While
we're at it, we can get rid of the old lingering timers_enabled boolean
which wasn't doing anything anyway anymore.
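In other words, guarded calls like the following (timer name just as an
example) become unconditional:

    /* before */
    if (peer->timers_enabled)
            del_timer(&peer->timer_retransmit_handshake);

    /* after: safe on a zeroed timer, on an already-deleted timer, after the
     * peer is dead, and under concurrent callers, since del_timer takes the
     * timer base lock internally */
    del_timer(&peer->timer_retransmit_handshake);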
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This required a bit of pruning of our christmas trees.
Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
I understand why this must be done, though I'm not so happy about having
to do it. In some places, it puts us over 80 chars and we have to break
lines up in further ugly ways. And in general, I think this makes things
harder to read. Yet another thing we must do to please upstream.
Maybe this can be replaced in the future by some kind of automatic
module namespacing logic in the linker, or even combined with LTO and
aggressive symbol stripping.
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The kernel has very specific rules correlating file type with comment
type, and also SPDX identifiers can't be merged with other comments.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This is the worst commit in the whole repo, making the code much less
readable, but so it goes with upstream maintainers.
We are now woefully wrapped at 80 columns.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Rather than abusing the handshake lock, we're much better off just using
a boring atomic64 for this. It's simpler and performs better.
Also, while we're at it, we set the handshake stamp both before and
after the calculations, in case the calculations block for a really long
time waiting for the RNG to initialize. Otherwise it's possible that
when the RNG finally initializes, two handshakes are sent back to back,
which isn't sensible.
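Schematically (the middle call is a hypothetical stand-in for the actual
handshake-creation path):

    /* Stamp first, so another handshake isn't started while this one is
     * blocked waiting for the RNG to initialize... */
    atomic64_set(&peer->last_sent_handshake, ktime_get_boottime_ns());
    create_and_send_handshake_initiation(peer);     /* may block on the RNG */
    /* ...and stamp again afterwards, so the rekey timeout is measured from
     * when the initiation actually went out. */
    atomic64_set(&peer->last_sent_handshake, ktime_get_boottime_ns());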
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Completely rework peer removal to ensure peers don't jump between
contexts and create races.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
After we atomic_set, the peer is allowed to be freed, which means if we
want to continue to reference it, we need to bump the reference count.
This was introduced a few commits ago by b713ab0e when implementing some
simplification suggestions.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Use RCU reference counts only when we must, and otherwise use a more
reasonably named function.
Reported-by: Jann Horn <jann@thejh.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com>
[Jason: fixed up the flushing of the rx_queue in peer_remove]
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Generally if we're inaccurate by a few nanoseconds, it doesn't matter.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Since this is a network protocol, expirations need to be accounted for,
even across system suspend. On real systems, this isn't a problem, since
we're clearing all keys before suspend. But on Android, where we don't
do that, this is something of a problem. So, we switch to using boottime
instead of jiffies.
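For instance, an expiration check built on boottime nanoseconds keeps
counting across suspend, unlike jiffies (a sketch, not the exact in-tree
helper):

    #include <linux/ktime.h>
    #include <linux/timekeeping.h>

    static inline bool birthdate_has_expired(u64 birthday_ns, u64 expiration_secs)
    {
            /* CLOCK_BOOTTIME advances during suspend, so stale keys and
             * handshakes still expire on schedule after the system wakes. */
            return (s64)(birthday_ns + expiration_secs * NSEC_PER_SEC) <=
                   (s64)ktime_get_boottime_ns();
    }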
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
It's good to have SPDX identifiers in all files as the Linux kernel
developers are working to add these identifiers to all files.
Update all files with the correct SPDX license identifier based on the license
text of the project or based on the license in the file itself. The SPDX
identifier is a legally binding shorthand, which can be used instead of the
full boilerplate text.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Modified-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This gets us nanoseconds instead of microseconds, which is better, and
we can do this pretty much without freaking out existing userspace,
which doesn't actually make use of the nano/micro seconds field:
zx2c4@thinkpad ~ $ cat a.c
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
int main(void)
{
        puts(sizeof(struct timeval) == sizeof(struct timespec) ? "success" : "failure");
}
zx2c4@thinkpad ~ $ gcc a.c -m64 && ./a.out
success
zx2c4@thinkpad ~ $ gcc a.c -m32 && ./a.out
success
This doesn't solve the y2038 problem, but timespec64 isn't yet a thing in
userspace.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This is faster, since it means adding a new peer is O(1) instead of
O(n). It's also safe to do because we're holding the device_update_lock
on both the ++ and the --.
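That is, roughly (names illustrative; both sites run under
device_update_lock):

    if (wg->num_peers >= MAX_PEERS_PER_DEVICE)
            return ERR_PTR(-E2BIG);         /* O(1) check instead of a list walk */
    ...
    ++wg->num_peers;                        /* when the peer is added */
    ...
    --wg->num_peers;                        /* when the peer is removed */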
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Since the peer list is protected by the device_update_lock, and since
items are removed from the peer list before putting their final
reference, we don't actually need to take a reference when iterating.
This allows us to simplify the macro considerably.
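The macro can then be a thin wrapper over a plain list walk (sketched
here; the real macro name differs):

    /* Valid only while device_update_lock is held: peers are removed from
     * the list before their final reference is put, so no kref get/put is
     * needed per iteration step. */
    #define for_each_peer(wg, peer) \
            list_for_each_entry(peer, &(wg)->peer_list, peer_list)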
Suggested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Sultan Alsawaf <sultanxda@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This removes our dependency on padata and moves to a different mode of
multiprocessing that is more efficient.
This began as Samuel Holland's GSoC project and was gradually
reworked/redesigned/rebased into this present commit, which is a
combination of his initial contribution and my subsequent rewriting and
redesigning.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
We store the destination IP of incoming packets as the source IP of
outgoing packets. When we send outgoing packets, we then ask the routing
table for which interface to use and which source address, given our
inputs of the destination address and a suggested source address. This
all is good and fine, since it means we'll successfully reply using the
correct source address, correlating with the destination address for
incoming packets. However, what happens when default routes change? Or
when interface IP addresses change?
Prior to this commit, after getting the response from the routing table
of the source address, destination address, and interface, we would then
make sure that the source address actually belonged to the outbound
interface. If it didn't, we'd reset our source address to zero and
re-ask the routing table, in which case the routing table would then
give us the default IP address for sending that packet. This worked
mostly fine for most purposes, but there was a problem: what if
WireGuard legitimately accepted an inbound packet on a default interface
using an IP of another interface? In this case, falling back to asking
for the default source IP was not a good strategy, since it'd nearly
always mean we'd fail to reply using the right source.
So, this commit changes the algorithm slightly. Rather than falling back
to using the default IP if the preferred source IP doesn't belong to the
outbound interface, we have two checks: we make sure that the source IP
address belongs to _some_ interface on the system, no matter which one
(so long as it's within the network namespace), and we check whether or
not the interface of an incoming packet matches the returned interface
for the outbound traffic. If both these conditions are true, then we
proceed with using this source IP address. If not, we fall back to the
default IP address.
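A rough sketch of the IPv4 side of that logic (abbreviated;
saddr_on_some_iface() is a hypothetical stand-in for the
address-ownership check):

    rt = ip_route_output_flow(net, &fl, sock);
    if (!IS_ERR(rt) && fl.saddr &&
        (!saddr_on_some_iface(net, fl.saddr) ||
         rt->dst.dev->ifindex != endpoint->src_if4)) {
            /* The cached source is stale, or the packet that created it came
             * in on a different interface: clear it and re-ask the routing
             * table for a default source address. */
            endpoint->src4.s_addr = fl.saddr = 0;
            ip_rt_put(rt);
            rt = ip_route_output_flow(net, &fl, sock);
    }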
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Mathias Hall-Andersen <mathias@hall-andersen.dk>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
If it's time to rekey, and the responder sends a message, the initiator
will begin the rekeying when sending his response message. In the worst
case, this response message will actually just be the keepalive. This
generally works well, with the one edge case of the message arriving
less than 10 seconds before key expiration, in which case the keepalive is
not sufficient. In this case, we simply rehandshake immediately.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
With the prior behavior, when sending a packet, we checked to see if it
was about time to start a new handshake, and if we were past a certain
time, we started it. For the responder, we made that time a bit further
in the future than for the initiator, to prevent the thundering herd
problem of them both starting at the same time. However, this was
flawed.
If both parties stopped communicating after 2.2 minutes, and then one
party decided to initiate a TCP connection before the 3 minute mark, the
currently open session would be used. However, because it was after the
2.2 minute mark, both peers would try to initiate a handshake upon
sending their first packet. The errant flow was as follows:
1. Peer A sends SYN.
2. Peer A sees that his key is getting old and initiates new handshake.
3. Peer B receives SYN and sends ACK.
4. Peer B sees that his key is getting old and initiates new handshake.
Since these events happened after the 2.2 minute mark, there's no delay
between handshake initiations, and problems begin. The new behavior is
changed to:
1. Peer A sends SYN.
2. Peer A sees that his key is getting old and initiates new handshake.
3. Peer B receives SYN and sends ACK.
4. Peer B sees that his key is getting old and schedules a delayed
handshake for 12.5 seconds in the future.
5. Peer B receives handshake initiation and cancels scheduled handshake.
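A sketch of the mechanism (timer name illustrative):

    /* Step 4: instead of initiating immediately, arm a one-shot timer. */
    mod_timer(&peer->timer_delayed_handshake,
              jiffies + msecs_to_jiffies(12500));
    ...
    /* Step 5: receiving the peer's handshake initiation disarms it. */
    del_timer(&peer->timer_delayed_handshake);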
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The C standard states:
A declaration of a parameter as ``array of type'' shall be adjusted to ``qualified pointer to
type'', where the type qualifiers (if any) are those specified within the [ and ] of the
array type derivation. If the keyword static also appears within the [ and ] of the
array type derivation, then for each call to the function, the value of the corresponding
actual argument shall provide access to the first element of an array with at least as many
elements as specified by the size expression.
By changing void func(int array[4]) to void func(int array[static 4]),
we automatically get the compiler checking argument sizes for us, which
is quite nice.
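For example (a hypothetical function, just to show what the compiler can
now check):

    #include <linux/types.h>

    void consume_key(const u8 key[static 32]);

    u8 short_key[16];
    consume_key(short_key);     /* clang, for instance, warns that the argument
                                   has only 16 elements but 32 are required */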
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|