Age | Commit message (Collapse) | Author |
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The previous code had been proved in Z3, but this new code from upstream
KreMLin is directly generated from the F*, which is preferable. The
assembly generated is identical.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This is not useful for WireGuard, but for the general use case we
probably want it this way, and the speed difference is mostly lost in
the noise.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Remove signed right shifts. Previously u64_gte_mask was only
correct for x < 2^63.
Z3 script proving correctness:
>>> from z3 import *
>>>
>>> x = BitVec("x", 64)
>>> y = BitVec("y", 64)
>>>
>>> t = LShR(x^((x^y)|((x-y)^y)), 63) - 1
>>>
>>> prove(If(UGE(x, y), BitVecVal(-1, 64), BitVecVal(0, 64)) == t)
proved
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Avoid signed right shift.
Z3 script showing equivalence:
>>> from z3 import *
>>>
>>> x = BitVec("x", 64)
>>> y = BitVec("y", 64)
>>>
>>> # Before
... x_ = ~(x ^ y)
>>> x_ &= x_ << 32
>>> x_ &= x_ << 16
>>> x_ &= x_ << 8
>>> x_ &= x_ << 4
>>> x_ &= x_ << 2
>>> x_ &= x_ << 1
>>> x_ >>= 63
>>>
>>> # After
... y_ = x ^ y
>>> y_ = y_ | -y_
>>> y_ = LShR(y_, 63) - 1
>>>
>>> prove(x_ == y_)
proved
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This causes problems with RAP and KERNEXEC for PaX, as r12 is a
reserved register.
Suggested-by: PaX Team <pageexec@freemail.hu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
At this stage the value if C[4] is at most ((2^256-1) + 38*(2^256-1)) / 2^256 = 38,
so there is no need to use a wide multiplication.
Change inspired by Andy Polyakov's OpenSSL implementation.
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Correctness can be quickly verified with the following z3py script:
>>> from z3 import *
>>> x = BitVec("x", 256) # any 256-bit value
>>> ref = URem(x, 2**255 - 19) # correct value
>>> t = Extract(255, 255, x); x &= 2**255 - 1; # btrq $63, %3
>>> u = If(t != 0, BitVecVal(38, 256), BitVecVal(19, 256)) # cmovncl %k5, %k4
>>> x += u # addq %4, %0; adcq $0, %1; adcq $0, %2; adcq $0, %3;
>>> t = Extract(255, 255, x); x &= 2**255 - 1; # btrq $63, %3
>>> u = If(t != 0, BitVecVal(0, 256), BitVecVal(19, 256)) # cmovncl %k5, %k4
>>> x -= u # subq %4, %0; sbbq $0, %1; sbbq $0, %2; sbbq $0, %3;
>>> prove(x == ref)
proved
Change inspired by Andy Polyakov's OpenSSL implementation.
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
The wide multiplication by 38 in mul_a24_eltfp25519_1w is redundant:
(2^256-1) * 121666 / 2^256 is at most 121665, and therefore a 64-bit
multiplication can never overflow.
Change inspired by Andy Polyakov's OpenSSL implementation.
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Suggested-by: Shlomi Steinberg <shlomi@shlomisteinberg.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Otherwise these constants will be merged wrong or excluded, and we'll
wind up with wrong calculations. While bfd (the normal kernel linker)
doesn't seem to mind, recent versions of gold do bad things.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Reported-by: Peter Korsgaard <peter@korsgaard.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
In rt kernels, spinlocks call schedule(), which means preemption can't
be disabled. The FPU disables preemption. Hence, we can either
restructure things to move the calls to kernel_fpu_begin/end to be
really close to the actual crypto routines, or we can do the slower
lazier solution of just not using the FPU at all on -rt kernels. This
patch goes with the latter lazy solution.
The reason why we don't place the calls to kernel_fpu_begin/end close to
the crypto routines in the first place is that they're very expensive,
as it usually involves a call to XSAVE. So on sane kernels, we benefit
from only having to call it once.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This reduces memory access and the total opaque size.
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: René van Dorst <opensource@vdorst.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
GCC 8.1 does not know about the invariant `0 <= ctx->num < POLY1305_BLOCK_SIZE`.
This results in a warning that `memcpy(ctx->data + num, inp, len);` may
overflow the `data` field, which is correct for arbitrary values of `num`.
To make the invariant explicit we ensure that `num` is in the required range.
An alternative would be to change `ctx->num` to a 4-bit bitfield at the point
of declaration.
This changes the code from `test ebp, ebp; jz end` to `and ebp, 15; jz
end`, which have identical performance characteristics.
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
We're referencing these constants as one contiguous blob, so if there's
any merging that goes on with other constants elsewhere (such as the
kernel's current poly1305 implementation that we hope to replace), then
these will be reordered and have the wrong values.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Also add cselect optimization.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
It's faster and doesn't use the FPU.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This deals with alignment more easily and also helps squelch a
clang-analyzer warning.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
This reverts commit da4ff396cc5d5e0ff21f9ecbc2f951c048c63fff and adds
some optimizations to hacl64.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
For now, it's faster:
hacl64: 109782 cycles per call
fiat64: 108984 cycles per call
It's quite possible this commit will be reverted with nice changes from
INRIA, though.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
While this has a negative performance impact on x86_64, it has a
positive performance impact on smaller machines, which is where we're
actually using this code. For example, an A53:
Before: fiat32: 228605 cycles per call
After: fiat32: 188307 cycles per call
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|
|
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
|