Age | Commit message (Collapse) | Author |
|
BounceToKernel will make vCPU quit from guest ring3 to guest ring0, but
vCPUWaiter is not cleared when we unlock the vCPU, when next time this vCPU
enter guest mode ring3, vCPU may enter guest mode with vCPUWaiter bit setted,
this will cause the following BounceToKernel to this vCPU hangs at
waitUntilNot.
Halt may workaroud this issue, because halt process will reset vCPU status into
vCPUUser, and notify all waiter for vCPU state change, but if there is no
exception or syscall in this period, BounceToKernel will hang at waitUntilNot.
PiperOrigin-RevId: 256299660
|
|
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)
Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.
This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)
PiperOrigin-RevId: 256285890
|
|
Adds support level documentation for all syscalls. Removes the Undocumented
utility function to discourage usage while leaving SupportUndocumented as the
default support level for Syscall structs.
PiperOrigin-RevId: 256281927
|
|
PiperOrigin-RevId: 256234390
|
|
fileOpAt holds references on the Dirents passed as arguments to the callback,
and drops refs when finished, so we don't need to DecRef those Dirents
ourselves
However, all Dirents that we get from FindInode/FindLink must be DecRef'd.
This CL cleans up the ref-counting logic, and fixes some refcount issues in the
process.
PiperOrigin-RevId: 256220882
|
|
It feels like "reticulating splines" is missing from the list of meaningless
syslog messages.
Signed-off-by: Ahmet Alp Balkan <ahmetb@google.com>
|
|
Fix two leaks for connectionless Unix sockets:
* Double connect: Subsequent connects would leak a reference on the previously
connected endpoint.
* Close unconnected: Sockets which were not connected at the time of closure
would leak a reference on their receiver.
PiperOrigin-RevId: 256070451
|
|
This fixes the case when an app tries to create a file that already exists, and
is a symlink to itself. A test was added.
PiperOrigin-RevId: 256044811
|
|
PiperOrigin-RevId: 255711454
|
|
These are unfortunately unused and unmaintained. They can be brought back in
the future if need requires it.
PiperOrigin-RevId: 255697132
|
|
These syscalls require filesystem support that gVisor does not provide, and is
not planning to implement. Their absense should not trigger an event.
PiperOrigin-RevId: 255692871
|
|
PiperOrigin-RevId: 255687771
|
|
PiperOrigin-RevId: 255679453
|
|
Right now, if we can't create a stub process, we will see this error:
panic: unable to activate mm: resource temporarily unavailable
It would be better to know the root cause of this "resource temporarily
unavailable".
PiperOrigin-RevId: 255656831
|
|
PiperOrigin-RevId: 255644277
|
|
Readdir of /proc/x/task/ will get direntry entries
from tasks of specified taskgroup. Now the tasks
slice is unsorted, use sort.SearchInts search entry
from the slice may cause infinity loops.
The fix is sort the slice before search.
This issue could be easily reproduced via following
steps, revise Readdir in pkg/sentry/fs/proc/task.go,
force set taskInts into test slice
[]int{1, 11, 7, 5, 10, 6, 8, 3, 9, 2, 4},
then run docker image and run ls /proc/1/task, the
command will cause infinity loops.
|
|
Get/Set pipe size and ioctl support were missing from
overlayfs. It required moving the pipe.Sizer interface
to fs so that overlay could get access.
Fixes #318
PiperOrigin-RevId: 255511125
|
|
Addresses obvious typos, in the documentation only.
COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/443 from Pixep:fix/documentation-spelling 4d0688164eafaf0b3010e5f4824b35d1e7176d65
PiperOrigin-RevId: 255477779
|
|
Currently, the overlay dirCache is only used for a single logical use of
getdents. i.e., it is discard when the FD is closed or seeked back to
the beginning.
But the initial work of getting the directory contents can be quite
expensive (particularly sorting large directories), so we should keep it
as long as possible.
This is very similar to the readdirCache in fs/gofer.
Since the upper filesystem does not have to allow caching readdir
entries, the new CacheReaddir MountSourceOperations method controls this
behavior.
This caching should be trivially movable to all Inodes if desired,
though that adds an additional copy step for non-overlay Inodes.
(Overlay Inodes already do the extra copy).
PiperOrigin-RevId: 255477592
|
|
PiperOrigin-RevId: 255465635
|
|
The code was wrongly assuming that only read access was
required from the lower overlay when checking for permissions.
This allowed non-writable files to be writable in the overlay.
Fixes #316
PiperOrigin-RevId: 255263686
|
|
If we have a symlink whose target does not exist, creating the symlink (either
via 'creat' or 'open' with O_CREAT flag) should create the target of the
symlink. Previously, gVisor would error with EEXIST in this case
PiperOrigin-RevId: 255232944
|
|
Updates #179
PiperOrigin-RevId: 255081565
|
|
Credentials are immutable and even before these changes we could read them
without locks, but we needed to take a task lock to get a credential object
from a task object.
It is possible to avoid this lock, if we will guarantee that a credential
object will not be changed after setting it on a task.
PiperOrigin-RevId: 254989492
|
|
Makes CLOCK_BOOTTIME available with
* clock_gettime
* timerfd_create
* clock_gettime vDSO
CLOCK_BOOTTIME is implemented as an alias to CLOCK_MONOTONIC.
CLOCK_MONOTONIC already keeps track of time across save
and restore. This is the closest possible behavior to Linux
CLOCK_BOOTIME, as there is no concept of suspend/resume.
Updates google/gvisor#218
|
|
For files with O_APPEND, a file write operation gets a file size and uses it as
offset to call an inode write operation. This means that all other operations
which can change a file size should be blocked while the write operation doesn't
complete.
PiperOrigin-RevId: 254873771
|
|
This prevents a race before PDEATH_SIG can take effect during
a sentry crash.
Discovered and solution by avagin@.
PiperOrigin-RevId: 254871534
|
|
PiperOrigin-RevId: 254854346
|
|
The tracee is stopped early during process exit, when registers are still
available, allowing the tracer to see where the exit occurred, whereas the
normal exit notifi? cation is done after the process is finished exiting.
Without this option, dumpAndPanic fails to get registers.
PiperOrigin-RevId: 254852917
|
|
The previous number was for the arm architecture.
Also change the statx tests to force them to run on gVisor, which would have
caught this issue.
PiperOrigin-RevId: 254846831
|
|
New options are:
runsc debug --strace=off|all|function1,function2
runsc debug --log-level=warning|info|debug
runsc debug --log-packets=true|false
Updates #407
PiperOrigin-RevId: 254843128
|
|
There will be a deadloop when we use getdents to read /proc/{pid}/task
of an exited process
Like this:
Process A is running
Process B: open /proc/{pid of A}/task
Process A exits
Process B: getdents /proc/{pid of A}/task
Then, process B will fall into deadloop, and return "." and ".."
in loops and never ends.
This patch returns ENOENT when use getdents to read /proc/{pid}/task
if the process is just exited.
Signed-off-by: chris.zn <chris.zn@antfin.com>
|
|
We don't have the plumbing for btime yet, so that field is left off. The
returned mask indicates that btime is absent.
Fixes #343
PiperOrigin-RevId: 254575752
|
|
FileMaxOffset is a special case when lseek(d, 0, SEEK_END) has been called.
PiperOrigin-RevId: 254498777
|
|
PiperOrigin-RevId: 254482180
|
|
PiperOrigin-RevId: 254450309
|
|
defer here doesn't improve readability, but we know it slower that
the explicit call.
PiperOrigin-RevId: 254441473
|
|
PiperOrigin-RevId: 254428866
|
|
Otherwise every call to, say, fs.ContextCanAccessFile() in a benchmark
using contexttest allocates new auth.Credentials, a new
auth.UserNamespace, ...
PiperOrigin-RevId: 254261051
|
|
These are the only packages missing docs:
https://godoc.org/gvisor.dev/gvisor
PiperOrigin-RevId: 254261022
|
|
PiperOrigin-RevId: 254253777
|
|
The sendfile syscall's backing doSplice contained a race with regard to
blocking. If the first attempt failed with syserror.ErrWouldBlock and then
the blocking file became ready before registering a waiter, we would just
return the ErrWouldBlock (even if we were supposed to block).
PiperOrigin-RevId: 254114432
|
|
And methods that do more traversals should use the remaining count rather than
resetting.
PiperOrigin-RevId: 254041720
|
|
This allows tasks to have distinct mount namespace, instead of all sharing the
kernel's root mount namespace.
Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).
In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.
Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better naem.
PiperOrigin-RevId: 254009310
|
|
Test fails because it's reading 4KB instead of the
expected 64KB. Changed the test to read pipe buffer
size instead of hardcode and added some logging in
case the reason for failure was not pipe buffer size.
PiperOrigin-RevId: 253916040
|
|
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.
With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.
PiperOrigin-RevId: 253746644
|
|
When leader of process group (session) exit, the process
group ID (session ID) is holding by other processes in
the process group, so the process group ID (session ID)
can not be reused.
If reusing the process group ID (seession ID) as new process
group ID for new process, this will cause session create
failed, and later runsc crash when access process group.
The fix skip the tid if it is using by a process group
(session) when allocating a new tid.
We could easily reproduce the runsc crash follow
these steps:
1. build test program, and run inside container
int main(int argc, char *argv[])
{
pid_t cpid, spid;
cpid = fork();
if (cpid == -1) {
perror("fork");
exit(EXIT_FAILURE);
}
if (cpid == 0) {
pid_t sid = setsid();
printf("Start New Session %ld\n",sid);
printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n",
getpid(),getppid(),getpgid(getpid()),getsid(getpid()));
spid = fork();
if (spid == 0) {
setpgid(getpid(), getpid());
printf("Set GrandSon as New Process Group\n");
printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n",
getpid(),getppid(),getpgid(getpid()),getsid(getpid()));
while(1) {
usleep(1);
}
}
sleep(3);
exit(0);
} else {
exit(0);
}
return 0;
}
2. build hello program
int main(int argc, char *argv[])
{
printf("Current PID is %ld\n", (long) getpid());
return 0;
}
3. run script on host which run hello inside container, you can
speed up the test with set TasksLimit as lower value.
for (( i=0; i<65535; i++ ))
do
docker exec <container id> /test/hello
done
4. when hello process reusing the process group of loop process,
runsc will crash.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8]
goroutine 612475 [running]:
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0)
pkg/sentry/kernel/sessions.go:160 +0x78
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0)
pkg/sentry/kernel/task_exit.go:672 +0x2b7
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0)
pkg/sentry/kernel/task_exit.go:542 +0xc4
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc)
pkg/sentry/kernel/task_run.go:91 +0x194
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
pkg/sentry/kernel/task_start.go:286 +0xfe
|
|
The implementation is similar to linux where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.
This ensures that the advertised window grows at a reasonable rate to accomodate
for the sender's rate and prevents large amounts of data being held in stack
buffers if the application is not actively reading or not reading fast enough.
The original paper that was used to implement the linux receive buffer auto-
tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf
NOTE: Linux does not implement DRS as defined in that paper, it's just a good
reference to understand the solution space.
Updates #230
PiperOrigin-RevId: 253168283
|
|
All functions which allocate objects containing AtomicRefCounts will soon need
a context.
PiperOrigin-RevId: 253147709
|
|
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.
This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.
The new test is related, but does not cover this particular situation.
Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.
Fixes #346
PiperOrigin-RevId: 253138340
|