Age | Commit message (Collapse) | Author |
|
`runsc ps` currently return pid for a task's immediate pid namespace,
which is confusing when there're multiple pid namespaces. We should
return only pids in the root namespace.
Before:
```
1000 1 0 0 ? 02:24 250ms chrome
1000 1 0 0 ? 02:24 40ms dumb-init
1000 1 0 0 ? 02:24 240ms chrome
1000 2 1 0 ? 02:24 2.78s node
```
After:
```
UID PID PPID C TTY STIME TIME CMD
1000 1 0 0 ? 12:35 0s dumb-init
1000 2 1 7 ? 12:35 240ms node
1000 13 2 21 ? 12:35 2.33s chrome
1000 27 13 3 ? 12:35 260ms chrome
```
Signed-off-by: Daniel Dao <dqminh@cloudflare.com>
|
|
PiperOrigin-RevId: 358445320
|
|
Because we lack gVisor-internal cgroups, we take the CPU usage of the entire pod
and divide it proportionally according to sentry-internal usage stats.
This fixes `kubectl top pods`, which gets a pod's CPU usage by summing the usage
of its containers.
Addresses #172.
PiperOrigin-RevId: 355229833
|
|
This fixes reference leaks related to accidentally forgetting to DecRef()
after calling one or the other.
PiperOrigin-RevId: 336918922
|
|
VFS1 and VFS2 host FDs have different dupping behavior,
making error prone to code for both. Change the contract
so that FDs are released as they are used, so the caller
can simple defer a block that closes all remaining files.
This also addresses handling of partial failures.
With this fix, more VFS2 tests can be enabled.
Updates #1487
PiperOrigin-RevId: 330112266
|
|
context is passed to DecRef() and Release() which is
needed for SO_LINGER implementation.
PiperOrigin-RevId: 324672584
|
|
Run vs. exec, VFS1 vs. VFS2 were executable lookup were
slightly different from each other. Combine them all
into the same logic.
PiperOrigin-RevId: 315426443
|
|
PiperOrigin-RevId: 313871804
|
|
Updates #1623, #1487
PiperOrigin-RevId: 309777922
|
|
Using the host-defined file owner matches VFS1. It is more correct to use the
host-defined mode, since the cached value may become out of date. However,
kernfs.Inode.Mode() does not return an error--other filesystems on kernfs are
in-memory so retrieving mode should not fail. Therefore, if the host syscall
fails, we rely on a cached value instead.
Updates #1672.
PiperOrigin-RevId: 303220864
|
|
This saves one pointer dereference per VFS access.
Updates #1623
PiperOrigin-RevId: 295216176
|
|
- Added fsbridge package with interface that can be used to open
and read from VFS1 and VFS2 files.
- Converted ELF loader to use fsbridge
- Added VFS2 types to FSContext
- Added vfs.MountNamespace to ThreadGroup
Updates #1623
PiperOrigin-RevId: 295183950
|
|
runsc debug --ps list all processes with all threads. This option is added to
the debug command but not to the ps command, because it is going to be used for
debug purposes and we want to add any useful information without thinking about
backward compatibility.
This will help to investigate syzkaller issues.
PiperOrigin-RevId: 285013668
|
|
Threadgroups already know their TTY (if they have one), which now contains the
TTY Index, and is returned in the Processes() call.
PiperOrigin-RevId: 284263850
|
|
We can get the mount namespace from the CreateProcessArgs in all cases where we
need it. This also gets rid of kernel.Destroy method, since the only thing it
was doing was DecRefing the mounts.
Removing the need to call kernel.SetRootMountNamespace also allowed for some
more simplifications in the container fs setup code.
PiperOrigin-RevId: 261357060
|
|
The different containers in a sandbox used only one pid
namespace before. This results in that a container can see
the processes in another container in the same sandbox.
This patch use different pid namespace for different containers.
Signed-off-by: chris.zn <chris.zn@antfin.com>
|
|
This keeps all container filesystem completely separate from eachother
(including from the root container filesystem), and allows us to get rid of the
"__runsc_containers__" directory.
It also simplifies container startup/teardown as we don't have to muck around
in the root container's filesystem.
PiperOrigin-RevId: 259613346
|
|
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire
package to itself and didn't serve much use (it was freely cast between types,
and served as more of an annoyance than providing any protection.)
Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup
operation, and 10-15 ns per concurrent lookup operation of savings.
This also fixes two tangential usage issues with the FDMap. Namely, non-atomic
use of NewFDFrom and associated calls to Remove (that are both racy and fail to
drop the reference on the underlying file.)
PiperOrigin-RevId: 256285890
|
|
This can be merged after:
https://github.com/google/gvisor-website/pull/77
or
https://github.com/google/gvisor-website/pull/78
PiperOrigin-RevId: 253132620
|
|
Based on the guidelines at
https://opensource.google.com/docs/releasing/authors/.
1. $ rg -l "Google LLC" | xargs sed -i 's/Google LLC.*/The gVisor Authors./'
2. Manual fixup of "Google Inc" references.
3. Add AUTHORS file. Authors may request to be added to this file.
4. Point netstack AUTHORS to gVisor AUTHORS. Drop CONTRIBUTORS.
Fixes #209
PiperOrigin-RevId: 245823212
Change-Id: I64530b24ad021a7d683137459cafc510f5ee1de9
|
|
PiperOrigin-RevId: 245818639
Change-Id: I03703ef0fb9b6675955637b9fe2776204c545789
|
|
If a background process tries to read from a TTY, linux sends it a SIGTTIN
unless the signal is blocked or ignored, or the process group is an orphan, in
which case the syscall returns EIO.
See drivers/tty/n_tty.c:n_tty_read()=>job_control().
If a background process tries to write a TTY, set the termios, or set the
foreground process group, linux then sends a SIGTTOU. If the signal is ignored
or blocked, linux allows the write. If the process group is an orphan, the
syscall returns EIO.
See drivers/tty/tty_io.c:tty_check_change().
PiperOrigin-RevId: 234044367
Change-Id: I009461352ac4f3f11c5d42c43ac36bb0caa580f9
|
|
Updated error messages so that it doesn't print full Go struct representations
when running a new container in a sandbox. For example, this occurs frequently
when commands are not found when doing a 'kubectl exec'.
PiperOrigin-RevId: 219729141
Change-Id: Ic3a7bc84cd7b2167f495d48a1da241d621d3ca09
|
|
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
|
|
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.
However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.
This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.
One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.
To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.
Example:
root@:/usr/local/apache2# sleep 100
^C
root@:/usr/local/apache2# sleep 100
^Z
[1]+ Stopped sleep 100
root@:/usr/local/apache2# fg
sleep 100
^C
root@:/usr/local/apache2#
PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
|
|
There was a race where we checked task.Parent() != nil, and then later called
task.Parent() again, assuming that it is not nil. If the task is exiting, the
parent may have been set to nil in between the two calls, causing a panic.
This CL changes the code to only call task.Parent() once.
PiperOrigin-RevId: 215274456
Change-Id: Ib5a537312c917773265ec72016014f7bc59a5f59
|
|
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.
PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
|
|
Old code was returning ID of the thread that created
the child process. It should be returning the ID of
the parent process instead.
PiperOrigin-RevId: 214720910
Change-Id: I95715c535bcf468ecf1ae771cccd04a4cd345b36
|
|
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
code.
- Processes not waited on will consume memory (like a zombie process)
PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
|
|
The contract in ExecArgs says that a reference on ExecArgs.Root must be held
for the lifetime of the struct, but the caller is free to drop the ref after
that.
As a result, proc.Exec must take an additional ref on Root when it constructs
the CreateProcessArgs, since that holds a pointer to Root as well. That ref is
dropped in CreateProcess.
PiperOrigin-RevId: 212828348
Change-Id: I7f44a612f337ff51a02b873b8a845d3119408707
|
|
This is different from the existing -pid-file flag, which saves a host pid.
PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
|
|
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.
Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute. So we have to
do the path lookup much later than we previously were.
PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
|
|
PiperOrigin-RevId: 211999211
Change-Id: I5968dd1a8313d3e49bb6e6614e130107495de41d
|
|
Imported file needs to be closed after it's
been imported.
PiperOrigin-RevId: 211732472
Change-Id: Ia9249210558b77be076bcce465b832a22eed301f
|
|
This CL adds terminal support for "docker exec". We previously only supported
consoles for the container process, but not exec processes.
The SYS_IOCTL syscall was added to the default seccomp filter list, but only
for ioctls that get/set winsize and termios structs. We need to allow these
ioctl for all containers because it's possible to run "exec -ti" on a
container that was started without an attached console, after the filters
have been installed.
Note that control-character signals are still not properly supported.
Tested with:
$ docker run --runtime=runsc -it alpine
In another terminial:
$ docker exec -it <containerid> /bin/sh
PiperOrigin-RevId: 210185456
Change-Id: I6d2401e53a7697bb988c120a8961505c335f96d9
|
|
Resume checks the status of the container and unpauses the kernel
if its status is paused. Otherwise nothing happens.
Tests were added to ensure that the process is in the correct state
after various commands.
PiperOrigin-RevId: 201251234
Change-Id: Ifd11b336c33b654fea6238738f864fcf2bf81e19
|
|
Detachable exec commands are handled in the client entirely and the detach option is not used anymore.
PiperOrigin-RevId: 195181272
Change-Id: I6e82a2876d2c173709c099be59670f71702e5bf0
|
|
PiperOrigin-RevId: 194583126
Change-Id: Ica1d8821a90f74e7e745962d71801c598c652463
|