Age | Commit message (Collapse) | Author |
|
Use private futexes for performance and to align with other runtime uses.
PiperOrigin-RevId: 219422634
Change-Id: Ief2af5e8302847ea6dc246e8d1ee4d64684ca9dd
|
|
PiperOrigin-RevId: 219151173
Change-Id: I73014ea648ae485692ea0d44860c87f4365055cb
|
|
This change also adds extensive testing to the p9 package via mocks. The sanity
checks and type checks are moved from the gofer into the core package, where
they can be more easily validated.
PiperOrigin-RevId: 218296768
Change-Id: I4fc3c326e7bf1e0e140a454cbacbcc6fd617ab55
|
|
Added events for *ctl syscalls that may have multiple different commands.
For runsc, each syscall event is only logged once. For *ctl syscalls, use
the cmd as identifier, not only the syscall number.
PiperOrigin-RevId: 218015941
Change-Id: Ie3c19131ae36124861e9b492a7dbe1765d9e5e59
|
|
PiperOrigin-RevId: 217951017
Change-Id: Ie08bf6987f98467d07457bcf35b5f1ff6e43c035
|
|
It's hard to resolve symlinks inside the sandbox because rootfs and mounts
may be read-only, forcing us to create mount points inside lower layer of an
overlay, **before** the volumes are mounted.
Since the destination must already be resolved outside the sandbox when creating
mounts, take this opportunity to rewrite the spec with paths resolved.
"runsc boot" will use the "resolved" spec to load mounts. In addition, symlink
traversals were disabled while mounting containers inside the sandbox.
It haven't been able to write a good test for it. So I'm relying on manual tests
for now.
PiperOrigin-RevId: 217749904
Change-Id: I7ac434d5befd230db1488446cda03300cc0751a9
|
|
We were closing the FD directly. If the test then created a new socket pair
with the same FD, in-flight RPCs would get directed to the new socket and break
the test.
Instead, we should use unet.Socket.Close(), which allows any in-flight RPCs to
finish.
PiperOrigin-RevId: 217608491
Change-Id: I8c5a76638899ba30f33ca976e6fac967fa0aadbf
|
|
Now containers run with "docker run -it" support control characters like ^C and
^Z.
This required refactoring our signal handling a bit. Signals delivered to the
"runsc boot" process are turned into loader.Signal calls with the appropriate
delivery mode. Previously they were always sent directly to PID 1.
PiperOrigin-RevId: 217566770
Change-Id: I5b7220d9a0f2b591a56335479454a200c6de8732
|
|
PiperOrigin-RevId: 217548429
Change-Id: Ie640c881fdc4fc70af58c8ca834df1ac531e519a
|
|
--pid allows specific processes to be signalled rather than the container root
process or all processes in the container. containerd needs to SIGKILL exec'd
processes that timeout and check whether processes are still alive.
PiperOrigin-RevId: 217547636
Change-Id: I2058ebb548b51c8eb748f5884fb88bad0b532e45
|
|
PiperOrigin-RevId: 216780438
Change-Id: Ide637fe36f8d2a61fea9e5b16d1b3401f2540416
|
|
This is a breaking change if you're using --debug-log-dir.
The fix is to replace it with --debug-log and add a '/' at
the end:
--debug-log-dir=/tmp/runsc ==> --debug-log=/tmp/runsc/
PiperOrigin-RevId: 216761212
Change-Id: I244270a0a522298c48115719fa08dad55e34ade1
|
|
This change introduces a new flags to create/run called
--user-log. Logs to this files are visible to users and
are meant to help debugging problems with their images
and containers.
For now only unsupported syscalls are sent to this log,
and only minimum support was added. We can build more
infrastructure around it as needed.
PiperOrigin-RevId: 216735977
Change-Id: I54427ca194604991c407d49943ab3680470de2d0
|
|
PiperOrigin-RevId: 216616873
Change-Id: I4d974ab968058eadd01542081e18a987ef08f50a
|
|
Change-Id: I1fb9f5b47a264a7617912f6f56f995f3c4c5e578
PiperOrigin-RevId: 216591484
|
|
Currently, in the face of FileMem fragmentation and a large sendmsg or
recvmsg call, host sockets may pass > 1024 iovecs to the host, which
will immediately cause the host to return EMSGSIZE.
When we detect this case, use a single intermediate buffer to pass to
the kernel, copying to/from the src/dst buffer.
To avoid creating unbounded intermediate buffers, enforce message size
checks and truncation w.r.t. the send buffer size. The same
functionality is added to netstack unix sockets for feature parity.
PiperOrigin-RevId: 216590198
Change-Id: I719a32e71c7b1098d5097f35e6daf7dd5190eff7
|
|
Sandbox creation uses the limits and reservations configured in the
OCI spec and set cgroup options accordinly. Then it puts both the
sandbox and gofer processes inside the cgroup.
It also allows the cgroup to be pre-configured by the caller. If the
cgroup already exists, sandbox and gofer processes will join the
cgroup but it will not modify the cgroup with spec limits.
PiperOrigin-RevId: 216538209
Change-Id: If2c65ffedf55820baab743a0edcfb091b89c1019
|
|
PiperOrigin-RevId: 216431260
Change-Id: Ia6e5c8d506940148d10ff2884cf4440f470e5820
|
|
We were previously only sending to the originator of the process group.
Integration test was changed to test this behavior. It fails without the
corresponding code change.
PiperOrigin-RevId: 216297263
Change-Id: I7e41cfd6bdd067f4b9dc215e28f555fb5088916f
|
|
PiperOrigin-RevId: 216281263
Change-Id: Ie0c189e7f5934b77c6302336723bc1181fd2866c
|
|
We were previously using the sandbox process's stdio as the root container's
stdio. This makes it difficult/impossible to distinguish output application
output from sandbox output, such as panics, which are always written to stderr.
Also close the console socket when we are done with it.
PiperOrigin-RevId: 215585180
Change-Id: I980b8c69bd61a8b8e0a496fd7bc90a06446764e0
|
|
PiperOrigin-RevId: 215574070
Change-Id: Ib36e804adebaf756adb9cbc2752be9789691530b
|
|
Terminal support in runsc relies on host tty file descriptors that are imported
into the sandbox. Application tty ioctls are sent directly to the host fd.
However, those host tty ioctls are associated in the host kernel with a host
process (in this case runsc), and the host kernel intercepts job control
characters like ^C and send signals to the host process. Thus, typing ^C into a
"runsc exec" shell will send a SIGINT to the runsc process.
This change makes "runsc exec" handle all signals, and forward them into the
sandbox via the "ContainerSignal" urpc method. Since the "runsc exec" is
associated with a particular container process in the sandbox, the signal must
be associated with the same container process.
One big difficulty is that the signal should not necessarily be sent to the
sandbox process started by "exec", but instead must be sent to the foreground
process group for the tty. For example, we may exec "bash", and from bash call
"sleep 100". A ^C at this point should SIGINT sleep, not bash.
To handle this, tty files inside the sandbox must keep track of their
foreground process group, which is set/get via ioctls. When an incoming
ContainerSignal urpc comes in, we look up the foreground process group via the
tty file. Unfortunately, this means we have to expose and cache the tty file in
the Loader.
Note that "runsc exec" now handles signals properly, but "runs run" does not.
That will come in a later CL, as this one is complex enough already.
Example:
root@:/usr/local/apache2# sleep 100
^C
root@:/usr/local/apache2# sleep 100
^Z
[1]+ Stopped sleep 100
root@:/usr/local/apache2# fg
sleep 100
^C
root@:/usr/local/apache2#
PiperOrigin-RevId: 215334554
Change-Id: I53cdce39653027908510a5ba8d08c49f9cf24f39
|
|
And remove multicontainer option.
PiperOrigin-RevId: 215236981
Change-Id: I9fd1d963d987e421e63d5817f91a25c819ced6cb
|
|
PiperOrigin-RevId: 214976251
Change-Id: I631348c3886f41f63d0e77e7c4f21b3ede2ab521
|
|
PiperOrigin-RevId: 214975659
Change-Id: I7bd31a2c54f03ff52203109da312e4206701c44c
|
|
It's easier to manage a single map with processes that we're interested
to track. This will make the next change to clean up the map on destroy
easier.
PiperOrigin-RevId: 214894210
Change-Id: I099247323a0487cd0767120df47ba786fac0926d
|
|
We already forward TCSETS and TCSETSW. TCSETSF is roughly equivalent but
discards pending input.
The filters were relaxed to allow host ioctls with TCSETSF argument.
This fixes programs like "passwd" that prevent user input from being displayed
on the terminal.
Before:
root@b8a0240fc836:/# passwd
Enter new UNIX password: 123
Retype new UNIX password: 123
passwd: password updated successfully
After:
root@ae6f5dabe402:/# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
PiperOrigin-RevId: 214869788
Change-Id: I31b4d1373c1388f7b51d0f2f45ce40aa8e8b0b58
|
|
In order to implement kill --all correctly, the Sentry needs
to track all tasks that belong to a given container. This change
introduces ContainerID to the task, that gets inherited by all
children. 'kill --all' then iterates over all tasks comparing the
ContainerID field to find all processes that need to be signalled.
PiperOrigin-RevId: 214841768
Change-Id: I693b2374be8692d88cc441ef13a0ae34abf73ac6
|
|
This makes the flow slightly simpler (no need to call
Loader.SetRootContainer). And this is required change to tag
tasks with container ID inside the Sentry.
PiperOrigin-RevId: 214795210
Change-Id: I6ff4af12e73bb07157f7058bb15fd5bb88760884
|
|
This CL changes the semantics of the "--file-access" flag so that it only
affects the root filesystem. The default remains "exclusive" which is the
common use case, as neither Docker nor K8s supports sharing the root.
Keeping the root fs as "exclusive" means that the fs-intensive work done during
application startup will mostly be cacheable, and thus faster.
Non-root bind mounts will always be shared.
This CL also removes some redundant FSAccessType validations. We validate this
flag in main(), so we can assume it is valid afterwards.
PiperOrigin-RevId: 214359936
Change-Id: I7e75d7bf52dbd7fa834d0aacd4034868314f3b51
|
|
PiperOrigin-RevId: 213908919
Change-Id: I74eff99a5360bb03511b946f4cb5658bb5fc40c7
|
|
Destroy flushes dirent references, which triggers many async close operations.
We must wait for those to finish before returning from Destroy, otherwise we
may kill the gofer, causing a cascade of failing RPCs and leading to an
inconsistent FS state.
PiperOrigin-RevId: 213884637
Change-Id: Id054b47fc0f97adc5e596d747c08d3b97a1d1f71
|
|
The issue with the previous change was that the stdin/stdout/stderr passed to
the sentry were dup'd by host.ImportFile. This left a dangling FD that by never
closing caused containerd to timeout waiting on container stop.
PiperOrigin-RevId: 213753032
Change-Id: Ia5e4c0565c42c8610d3b59f65599a5643b0901e4
|
|
This method will:
1. Stop the container process if it is still running.
2. Unmount all sanadbox-internal mounts for the container.
3. Delete the contaner root directory inside the sandbox.
Destroy is idempotent, and safe to call concurrantly.
This fixes a bug where after stopping a container, we cannot unmount the
container root directory on the host. This bug occured because the sandbox
dirent cache was holding a dirent with a host fd corresponding to a file inside
the container root on the host. The dirent cache did not know that the
container had exited, and kept the FD open, preventing us from unmounting on
the host.
Now that we unmount (and flush) all container mounts inside the sandbox, any
host FDs donated by the gofer will be closed, and we can unmount the container
root on the host.
PiperOrigin-RevId: 213737693
Change-Id: I28c0ff4cd19a08014cdd72fec5154497e92aacc9
|
|
Capabilities.Set() adds capabilities,
but doesn't remove existing ones that might have been loaded. Fixed
the code and added tests.
PiperOrigin-RevId: 213726369
Change-Id: Id7fa6fce53abf26c29b13b9157bb4c6616986fba
|
|
PiperOrigin-RevId: 213715511
Change-Id: I3e41b583c6138edbdeba036dfb9df4864134fc12
|
|
`docker run --cpuset-cpus=/--cpus=` will generate cpu resource info in config.json
(runtime spec file). When nginx worker_connections is configured as auto, the worker is
generated according to the number of CPUs. If the cgroup is already set on the host, but
it is not displayed correctly in the sandbox, performance may be degraded.
This patch can get cpus info from spec file and apply to sentry on bootup, so the
/proc/cpuinfo can show the correct cpu numbers. `lscpu` and other commands rely on
`/sys/devices/system/cpu/online` are also affected by this patch.
e.g.
--cpuset-cpus=2,3 -> cpu number:2
--cpuset-cpus=4-7 -> cpu number:4
--cpus=2.8 -> cpu number:3
--cpus=0.5 -> cpu number:1
Change-Id: Ideb22e125758d4322a12be7c51795f8018e3d316
PiperOrigin-RevId: 213685199
|
|
PiperOrigin-RevId: 213504354
Change-Id: Iadd42f0ca4b7e7a9eae780bee9900c7233fb4f3f
|
|
panic() during init() can be hard to debug.
Updates #100
PiperOrigin-RevId: 213391932
Change-Id: Ic103f1981c5b48f1e12da3b42e696e84ffac02a9
|
|
This makes `runsc wait` behave more like waitpid()/wait4() in that:
- Once a process has run to completion, you can wait on it and get its exit
code.
- Processes not waited on will consume memory (like a zombie process)
PiperOrigin-RevId: 213358916
Change-Id: I5b5eca41ce71eea68e447380df8c38361a4d1558
|
|
Stdin/out/err weren't being sent to the sentry.
PiperOrigin-RevId: 213307171
Change-Id: Ie4b634a58b1b69aa934ce8597e5cc7a47a2bcda2
|
|
This CL:
1) Fix `runsc wait`, it now also works after the container exits;
2) Generate correct container state in Load;
2) Make sure `Destory` cleanup everything before successfully return.
PiperOrigin-RevId: 212900107
Change-Id: Ie129cbb9d74f8151a18364f1fc0b2603eac4109a
|
|
This is different from the existing -pid-file flag, which saves a host pid.
PiperOrigin-RevId: 212713968
Change-Id: I2c486de8dd5cfd9b923fb0970165ef7c5fc597f0
|
|
It was only used by whitelistfs, which was removed in
bc81f3fe4a042a15343d2eab44da32d818ac1ade.
PiperOrigin-RevId: 212666374
Change-Id: Ia35e6dc9d68c1a3b015d5b5f71ea3e68e46c5bed
|
|
PiperOrigin-RevId: 212557844
Change-Id: I414de848e75d57ecee2c05e851d05b607db4aa57
|
|
We were previously openining the platform device (i.e. /dev/kvm) inside the
platfrom constructor (i.e. kvm.New). This requires that we have RW access to
the platform device when constructing the platform.
However, now that the runsc sandbox process runs as user "nobody", it is not
able to open the platform device.
This CL changes the kvm constructor to take the platform device FD, rather than
opening the device file itself. The device file is opened outside of the
sandbox and passed to the sandbox process.
PiperOrigin-RevId: 212505804
Change-Id: I427e1d9de5eb84c84f19d513356e1bb148a52910
|
|
PiperOrigin-RevId: 212483372
Change-Id: If95f32a8e41126cf3dc8bd6c8b2fb0fcfefedc6d
|
|
We must use a context.Context with a Root Dirent that corresponds to the
container's chroot. Previously we were using the root context, which does not
have a chroot.
Getting the correct context required refactoring some of the path-lookup code.
We can't lookup the path without a context.Context, which requires
kernel.CreateProcArgs, which we only get inside control.Execute. So we have to
do the path lookup much later than we previously were.
PiperOrigin-RevId: 212064734
Change-Id: I84a5cfadacb21fd9c3ab9c393f7e308a40b9b537
|
|
PiperOrigin-RevId: 212028121
Change-Id: If9c2c62f3be103e2bb556b8d154c169888e34369
|