Age | Commit message | Author |
|
Previously we assumed that interrupts are always disabled in kernel
mode, but there is a case where the Go runtime switches to a goroutine
that was saved in host mode. On restore, the popf instruction is used
to restore flags, which means that all flags the goroutine had in host
mode are restored in kernel mode. And in host mode, interrupts are
always enabled.
Long story short, we can't use the IF flag to determine whether a task
is running in user or kernel mode.
This patch reworks the code so that in userspace, the first bit of the
IOPL flag is always set. This doesn't give a task any new privileges
because CPL in userspace is always 3. But then we can use this flag to
distinguish user and kernel modes. The IOPL flag is never set in the
kernel and host modes.
Reported-by: syzbot+5036b325a8eb15c030cf@syzkaller.appspotmail.com
Reported-by: syzbot+034d580e89ad67b8dc75@syzkaller.appspotmail.com
Signed-off-by: Andrei Vagin <avagin@gmail.com>
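
The check described in this message can be sketched as follows. This is a minimal illustration, not the code from the patch: the constant and function names are hypothetical, and only the bit positions (IF is bit 9 of EFLAGS, IOPL occupies bits 12-13) come from the x86 architecture.

```go
package main

import "fmt"

const (
	// eflagsIF is the interrupt enable flag (bit 9 of EFLAGS).
	eflagsIF uint64 = 1 << 9
	// eflagsIOPL0 is the low bit of the two-bit IOPL field (bits 12-13).
	eflagsIOPL0 uint64 = 1 << 12
)

// userFlags (hypothetical name) returns the flags to load when entering
// user mode: the IOPL marker bit is always set.
func userFlags(eflags uint64) uint64 {
	return eflags | eflagsIOPL0
}

// isUserMode (hypothetical name) reports whether saved flags belong to
// user mode. It deliberately ignores IF, which may be set in kernel mode
// after popf restores flags that were saved in host mode.
func isUserMode(eflags uint64) bool {
	return eflags&eflagsIOPL0 != 0
}

func main() {
	hostSaved := eflagsIF // flags saved in host mode: IF set, IOPL clear
	fmt.Println("restored host flags look like user mode:", isUserMode(hostSaved))
	fmt.Println("user flags look like user mode:", isUserMode(userFlags(eflagsIF)))
}
```

Keying the check on a bit that only the user-mode entry path sets means a stray IF bit can no longer be mistaken for user mode.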
|
|
PiperOrigin-RevId: 334674481
|
|
PiperOrigin-RevId: 332069743
|
|
OCI configuration includes support for specifying seccomp filters. In runc,
these filter configurations are converted into seccomp BPF programs and loaded
into the kernel via libseccomp. runsc needs to be a static binary so, for
runsc, we cannot rely on a C library and need to implement the functionality
in Go.
The generator added here implements basic support for taking OCI seccomp
configuration and converting it into a seccomp BPF program with the same
behavior as a program generated by libseccomp.
- New conditional operations were added to pkg/seccomp to support operations
available in OCI.
- AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect
that syscalls matching the conditionals result in the provided action not
simply SCMP_RET_ALLOW.
- BuildProgram in pkg/seccomp no longer panics if provided an empty list of
rules. It now builds a program with the architecture sanity check only.
- ProgramBuilder now allows adding labels that are unused. However, backwards
jumps are still not permitted.
Fixes #510
PiperOrigin-RevId: 331938697
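
As an illustration of the conditional operations mentioned above, the sketch below evaluates the comparison operators that OCI seccomp argument conditions can name. It is not the generator from this change and does not use pkg/seccomp: the argCond type and matches method are hypothetical, and the masked-equality semantics ((arg & value) == valueTwo) are assumed to follow libseccomp.

```go
package main

import "fmt"

// argCond is a hypothetical stand-in for one OCI seccomp argument condition.
type argCond struct {
	Op       string // OCI operator name, e.g. "SCMP_CMP_EQ"
	Value    uint64
	ValueTwo uint64 // only used by SCMP_CMP_MASKED_EQ
}

// matches reports whether a syscall argument satisfies the condition.
func (c argCond) matches(arg uint64) (bool, error) {
	switch c.Op {
	case "SCMP_CMP_NE":
		return arg != c.Value, nil
	case "SCMP_CMP_LT":
		return arg < c.Value, nil
	case "SCMP_CMP_LE":
		return arg <= c.Value, nil
	case "SCMP_CMP_EQ":
		return arg == c.Value, nil
	case "SCMP_CMP_GE":
		return arg >= c.Value, nil
	case "SCMP_CMP_GT":
		return arg > c.Value, nil
	case "SCMP_CMP_MASKED_EQ":
		// Assumed semantics: Value is the mask, ValueTwo the expected result.
		return (arg & c.Value) == c.ValueTwo, nil
	}
	return false, fmt.Errorf("unsupported operator %q", c.Op)
}

func main() {
	c := argCond{Op: "SCMP_CMP_MASKED_EQ", Value: 0xf0, ValueTwo: 0x10}
	ok, err := c.matches(0x1f)
	fmt.Println(ok, err)
}
```

A real generator would emit BPF instructions with the same truth table rather than evaluating the condition in Go.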
|
|
Some optimizations in this PR:
1. Move ASID from TTBR0 to TTBR1
2. tlb_flush_all
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
Some CPUs (e.g. ampere-emag) can speculate past an ERET instruction and potentially perform
speculative accesses to memory before processing the exception return.
Since the register state is often controlled by a lower privilege level
at the point of an ERET, this could potentially be used as part of a
side-channel attack.
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
PiperOrigin-RevId: 330777900
|
|
PiperOrigin-RevId: 328639254
|
|
This immediately revealed an escape analysis violation (!), where
the sync.Map was being used in a context where escapes were not
allowed. The fix is relatively minor and is included.
PiperOrigin-RevId: 328611237
|
|
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
This enables pre-release testing with 1.16. The intention is to replace these
with a nogo check before the next release.
PiperOrigin-RevId: 328193911
|
|
Our "Preconditions:" blocks are very useful to determine the input invariants,
but they are a bit inconsistent throughout the codebase, which makes them harder
to read (particularly cases with 5+ conditions in a single paragraph).
I've reformatted all of the cases to fit in simple rules:
1. Cases with a single condition are placed on a single line.
2. Cases with multiple conditions are placed in a bulleted list.
This format has been added to the style guide.
I've also mentioned "Postconditions:", though those are much less frequently
used, and all uses already match this style.
PiperOrigin-RevId: 327687465
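
The two rules can be illustrated with a short sketch. The type and functions below are hypothetical and exist only to carry the comments. The `*` bullet marker is an assumption, since the message only says "bulleted list".

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical type used only to demonstrate the comment format.
type counter struct {
	mu sync.Mutex
	n  int
}

// incLocked increments the counter.
//
// Preconditions: c.mu must be locked.
func (c *counter) incLocked() {
	c.n++
}

// addLocked adds delta to the counter.
//
// Preconditions:
// * c.mu must be locked.
// * delta must be non-negative.
func (c *counter) addLocked(delta int) {
	c.n += delta
}

func main() {
	c := &counter{}
	c.mu.Lock()
	c.incLocked()
	c.addLocked(2)
	c.mu.Unlock()
	fmt.Println(c.n)
}
```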
|
|
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
It indicates that the Sentry has changed the state of the thread and
that subsequent calls to PullFullState() must do nothing.
PiperOrigin-RevId: 325567415
|
|
PiperOrigin-RevId: 325546308
|
|
Currently, gvisor has KPTI (Kernel PageTable Isolation) between
gr0 and gr3. But the upper half of the userCR3 contains the
whole sentry kernel, which makes the kernel vulnerable to
gr3 applications through CPU bugs.
This patch implements full KPTI functionality for gvisor. It doesn't
map the whole kernel in the upper half. It maps only the text section
of the binary and the entry area required by the ISA. The entry area
contains the global idt, the percpu gdt/tss etc. The entry area
packs all these together, which is less than 350k for 512 vCPUs.
The text section is normally nonsensitive. It is possible to
map only the entry functions (interrupt handlers etc.),
but that requires some hacks.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
|
|
kernelEntry is split out from CPU. It contains the minimal CPU-specific
arch state that can be mapped in the upper half of the address space.
This prepares for KPTI for gvisor.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
|
|
m.Get() guarantees that if any OS thread TID is in guest,
m.vCPUs[TID] points to the vCPU in which the OS thread TID is running.
So if m.Get() returns with the current context in guest,
its vCPU must be the same as what Get() returns.
So bluepill() doesn't need to check whether the vCPU matches.
The check needs to access the %gs register, which will no longer point
to the vCPU once KPTI for gvisor is enabled. We could still
fetch the vCPU pointer from %gs later (when %gs points to kernelEntry),
but that needs ENTRY_CPU_SELF, which is generated by
ring0/offset_amd64.go. So we simply remove the check.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
|
|
Call jumpToKernel() in sysret()/iret() so that there is less
code and data in the upper half and, in particular, the current
goroutine's stack and user regs are not accessed from the
upper half (also with help from the previous patches, which leave
less code in the userCR3 context).
jumpToUser() is not needed, because the current goroutine's stack
and the return value on the stack are lower-half addresses.
This prepares for KPTI for gvisor.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
|
|
KernelCR3 takes effect as early as possible so that less code
runs in the userCR3 environment. This prepares for the next
patches, which leave less code and data in the upper half
in preparation for KPTI for gvisor.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
|
|
UserCR3 takes effect as late as possible so that less code
runs in the userCR3 environment. This prepares for the next
patches, which leave less code and data in the upper half
in preparation for KPTI for gvisor.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antfin.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
|
|
PiperOrigin-RevId: 324748508
|
|
PiperOrigin-RevId: 324309862
|
|
PiperOrigin-RevId: 324127810
|
|
PiperOrigin-RevId: 324125938
|
|
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
I disabled DAIF (Debug, SError, IRQ, FIQ) in guest kernel mode
and enabled them in guest user mode.
This ensures that all DAIF exceptions come from guest user mode,
so the 'TestBounceStress' case can pass on Arm64.
Test steps:
1. cd pkg/sentry/platform/kvm
2. bazel test kvm_test --strip=never --test_output=streamed
Signed-off-by: Bin Lu <bin.lu@arm.com>
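
A minimal sketch of the masking policy described in this message, assuming it is applied to a saved PSTATE/SPSR value. The helper names are hypothetical. The bit positions (F=6, I=7, A=8, D=9) are the architectural DAIF mask bits, where a set bit means the exception class is masked.

```go
package main

import "fmt"

const (
	pstateF uint64 = 1 << 6 // FIQ mask
	pstateI uint64 = 1 << 7 // IRQ mask
	pstateA uint64 = 1 << 8 // SError mask
	pstateD uint64 = 1 << 9 // Debug mask

	pstateDAIF = pstateD | pstateA | pstateI | pstateF
)

// kernelPSTATE (hypothetical name) masks all four exception classes.
func kernelPSTATE(p uint64) uint64 {
	return p | pstateDAIF
}

// userPSTATE (hypothetical name) unmasks all four exception classes.
func userPSTATE(p uint64) uint64 {
	return p &^ pstateDAIF
}

func main() {
	fmt.Printf("kernel: %#x\n", kernelPSTATE(0))
	fmt.Printf("user:   %#x\n", userPSTATE(pstateDAIF))
}
```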
|
|
full context switch: add fpsimd load/store support to container
application.
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
PiperOrigin-RevId: 323456118
|
|
PiperOrigin-RevId: 323455097
|
|
The subsequent systrap changes will need to import memmap from
the platform package.
PiperOrigin-RevId: 323409486
|
|
We need to correctly distinguish instruction_abort/data_abort for
mem_abort@Arm64.
So, EC/WNR/FSC in esr_el1 should be checked.
Signed-off-by: Bin Lu <bin.lu@arm.com>
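
The decoding that this message refers to can be sketched as below. The type and function names are hypothetical and this is not the code from the patch. The field layout (EC in bits 31:26, WnR in bit 6, FSC in bits 5:0) and the abort exception classes (0x20/0x21 for instruction aborts, 0x24/0x25 for data aborts) follow the Arm architecture.

```go
package main

import "fmt"

const (
	esrECShift = 26
	esrECMask  = 0x3f
	esrWnR     = 1 << 6 // write-not-read, only meaningful for data aborts
	esrFSCMask = 0x3f   // fault status code

	ecInstrAbortLowerEL = 0x20
	ecInstrAbortSameEL  = 0x21
	ecDataAbortLowerEL  = 0x24
	ecDataAbortSameEL   = 0x25
)

// memAbort is a hypothetical summary of a decoded memory abort.
type memAbort struct {
	Instruction bool   // true for an instruction abort
	Write       bool   // true if a data abort was caused by a write
	FSC         uint64 // fault status code
}

// decodeMemAbort returns ok=false if esr does not describe a memory abort.
func decodeMemAbort(esr uint64) (a memAbort, ok bool) {
	switch (esr >> esrECShift) & esrECMask {
	case ecInstrAbortLowerEL, ecInstrAbortSameEL:
		return memAbort{Instruction: true, FSC: esr & esrFSCMask}, true
	case ecDataAbortLowerEL, ecDataAbortSameEL:
		return memAbort{Write: esr&esrWnR != 0, FSC: esr & esrFSCMask}, true
	}
	return memAbort{}, false
}

func main() {
	esr := uint64(ecDataAbortLowerEL)<<esrECShift | esrWnR | 0x07
	a, ok := decodeMemAbort(esr)
	fmt.Printf("%+v %v\n", a, ok)
}
```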
|
|
At present, when running the syscall_kvm test, we need to
enable handling of ESR_ELx_SYS64_ISS_SYS_CNTVCT/ESR_ELx_SYS64_ISS_SYS_CNTFRQ to
successfully pass the test.
I set CNTKCTL_EL1.EL0VCTEN==1/CNTKCTL_EL1.EL0PCTEN==1, so that the related cases can pass.
Signed-off-by: Bin Lu <bin.lu@arm.com>
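
As a small sketch of the register setting named above, the helper below ORs the two enable bits into a CNTKCTL_EL1 value. The function name is hypothetical. The bit positions (EL0PCTEN is bit 0, EL0VCTEN is bit 1) are architectural. Writing the value back to the system register requires assembly and is not shown.

```go
package main

import "fmt"

const (
	cntkctlEL0PCTEN uint64 = 1 << 0 // allow EL0 access to the physical counter
	cntkctlEL0VCTEN uint64 = 1 << 1 // allow EL0 access to the virtual counter
)

// enableEL0Counters (hypothetical name) returns v with both EL0 counter
// access bits set, leaving all other bits unchanged.
func enableEL0Counters(v uint64) uint64 {
	return v | cntkctlEL0PCTEN | cntkctlEL0VCTEN
}

func main() {
	fmt.Printf("%#x\n", enableEL0Counters(0))
}
```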
|
|
This patch loads/saves TLS for the container application.
Related issue: full context-switch support for Arm64 #1238
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/2761 from lubinszARM:pr_tls_2 cb5dbca1c9c3f378002406da7a58887f9b5032b3
PiperOrigin-RevId: 322887044
|
|
Support ASID operations, so that TLB performance can be optimized
by combining them with nG.
Signed-off-by: Bin Lu <bin.lu@arm.com>
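
A rough sketch of how an ASID and the nG bit fit together, assuming a 16-bit ASID stored in bits 63:48 of a TTBR value and the nG (not-global) flag at bit 11 of a stage 1 page table entry. The helper names are hypothetical and this is not the code from the patch.

```go
package main

import "fmt"

const (
	ttbrASIDShift        = 48
	ttbrBaseMask  uint64 = (1 << ttbrASIDShift) - 1
	pteNG         uint64 = 1 << 11 // entry is tagged with the current ASID
)

// ttbrWithASID (hypothetical name) combines a translation table base
// address with an ASID.
func ttbrWithASID(tableBase uint64, asid uint16) uint64 {
	return (tableBase & ttbrBaseMask) | uint64(asid)<<ttbrASIDShift
}

// asidOf extracts the ASID from a TTBR value.
func asidOf(ttbr uint64) uint16 {
	return uint16(ttbr >> ttbrASIDShift)
}

// userPTE (hypothetical name) marks an entry as non-global so that its TLB
// entries are matched per ASID and need not be flushed on every address
// space switch.
func userPTE(pte uint64) uint64 {
	return pte | pteNG
}

func main() {
	ttbr := ttbrWithASID(0x40000000, 5)
	fmt.Printf("ttbr=%#x asid=%d\n", ttbr, asidOf(ttbr))
}
```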
|
|
PiperOrigin-RevId: 321060717
|
|
Split the kvm unit test cases so that the unit tests pass on Arm64.
I will add the tls and full-context test cases for Arm64 later.
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
At present, when running the syscall_kvm test, we need to
enable handling of ESR_ELx_SYS64_ISS_SYS_CTR_READ to
successfully pass the test.
I set SCTLR_EL1.UCT==1, so that the related cases can pass.
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
There are 3 types of asynchronous exceptions on Arm64: SError, IRQ, FIQ.
In this case, we use the SError injection method in bluepillHandler to force the guest to quit,
so that the "TestBounce" test case can pass on Arm64.
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
PiperOrigin-RevId: 315812219
|
|
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
This analysis also catches a potential bug, which is a split on mapPhysical.
This would have led to a potential guest exit during mapping (although this
would have been handled by the now-unnecessary retryInGuest loop).
PiperOrigin-RevId: 315025106
|
|
PiperOrigin-RevId: 314186752
|
|
None of the dependencies have changed in 1.15. It may be possible to simplify
some of the wrappers in rawfile following 1.13, but that can come in a later
change.
PiperOrigin-RevId: 313863264
|
|
amd64 uses 'HLT' to leave the guest.
Unlike amd64, arm64 can only use mmio_exit/psci to leave the guest.
So I designed HYPERCALL_VMEXIT to be compatible with both amd64 and arm64.
To keep it simple, I used the address of the exception table as the
MMIO base address, so that I can trigger an MMIO exit by forcibly writing to this space.
Then, in host user space, I can use this address to work out
which hypercall was issued.
Signed-off-by: Bin Lu <bin.lu@arm.com>
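
The address arithmetic can be sketched as below. Everything here is illustrative: the function names, the 8-byte slot size, and the bounds check are assumptions, not details taken from the patch. The only idea borrowed from the message is that the guest writes to base plus an offset and the host recovers the hypercall number from the faulting MMIO address.

```go
package main

import "fmt"

// hypercallSlotSize is an assumed spacing between hypercall addresses.
const hypercallSlotSize uintptr = 8

// hypercallAddr (hypothetical name) is the address the guest would write
// to in order to issue hypercall n.
func hypercallAddr(base, n uintptr) uintptr {
	return base + n*hypercallSlotSize
}

// hypercallFromAddr (hypothetical name) is the host-side inverse. It
// returns ok=false if addr is not a valid hypercall address.
func hypercallFromAddr(base, addr, count uintptr) (n uintptr, ok bool) {
	if addr < base {
		return 0, false
	}
	off := addr - base
	if off%hypercallSlotSize != 0 || off/hypercallSlotSize >= count {
		return 0, false
	}
	return off / hypercallSlotSize, true
}

func main() {
	const base uintptr = 0x10000 // stand-in for the exception table address
	addr := hypercallAddr(base, 3)
	n, ok := hypercallFromAddr(base, addr, 8)
	fmt.Println(n, ok)
}
```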
|
|
It's a workaround to treat PROT_NONE as RDONLY temporarily.
TODO(gvisor.dev/issue/2686): PROT_NONE should be specially treated.
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
PiperOrigin-RevId: 308472331
|
|
PiperOrigin-RevId: 308347744
|
|
PiperOrigin-RevId: 307941984
|