summaryrefslogtreecommitdiffhomepage
path: root/runsc
AgeCommit message (Collapse)Author
2020-05-13Enable overlayfs_stale_read by default for runsc.Jamie Liu
Linux 4.18 and later make reads and writes coherent between pre-copy-up and post-copy-up FDs representing the same file on an overlay filesystem. However, memory mappings remain incoherent: - Documentation/filesystems/overlayfs.rst, "Non-standard behavior": "If a file residing on a lower layer is opened for read-only and then memory mapped with MAP_SHARED, then subsequent changes to the file are not reflected in the memory mapping." - fs/overlay/file.c:ovl_mmap() passes through to the underlying FD without any management of coherence in the overlay. - Experimentally on Linux 5.2: ``` $ cat mmap_cat_page.c #include <err.h> #include <fcntl.h> #include <stdio.h> #include <string.h> #include <sys/mman.h> #include <unistd.h> int main(int argc, char **argv) { if (argc < 2) { errx(1, "syntax: %s [FILE]", argv[0]); } const int fd = open(argv[1], O_RDONLY); if (fd < 0) { err(1, "open(%s)", argv[1]); } const size_t page_size = sysconf(_SC_PAGE_SIZE); void* page = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0); if (page == MAP_FAILED) { err(1, "mmap"); } for (;;) { write(1, page, strnlen(page, page_size)); if (getc(stdin) == EOF) { break; } } return 0; } $ gcc -O2 -o mmap_cat_page mmap_cat_page.c $ mkdir lowerdir upperdir workdir overlaydir $ echo old > lowerdir/file $ sudo mount -t overlay -o "lowerdir=lowerdir,upperdir=upperdir,workdir=workdir" none overlaydir $ ./mmap_cat_page overlaydir/file old ^Z [1]+ Stopped ./mmap_cat_page overlaydir/file $ echo new > overlaydir/file $ cat overlaydir/file new $ fg ./mmap_cat_page overlaydir/file old ``` Therefore, while the VFS1 gofer client's behavior of reopening read FDs is only necessary pre-4.18, replacing existing memory mappings (in both sentry and application address spaces) with mappings of the new FD is required regardless of kernel version, and this latter behavior is common to both VFS1 and VFS2. Re-document accordingly, and change the runsc flag to enabled by default. New test: - Before this CL: https://source.cloud.google.com/results/invocations/5b222d2c-e918-4bae-afc4-407f5bac509b - After this CL: https://source.cloud.google.com/results/invocations/f28c747e-d89c-4d8c-a461-602b33e71aab PiperOrigin-RevId: 311361267
2020-05-13Use VFS2 mount namesFabricio Voznika
Updates #1487 PiperOrigin-RevId: 311356385
2020-05-12Adjust a few log messagesFabricio Voznika
PiperOrigin-RevId: 311234146
2020-05-11Automated rollback of changelist 310417191Bhasker Hariharan
PiperOrigin-RevId: 310963404
2020-05-10Stop avoiding preadv2 and pwritev2, and add them to the filters.Nicolas Lacasse
Some code paths needed these syscalls anyways, so they should be included in the filters. Given that we depend on these syscalls in some cases, there's no real reason to avoid them any more. PiperOrigin-RevId: 310829126
2020-05-07Allocate device numbers for VFS2 filesystems.Jamie Liu
Updates #1197, #1198, #1672 PiperOrigin-RevId: 310432006
2020-05-07Automated rollback of changelist 309339316Bhasker Hariharan
PiperOrigin-RevId: 310417191
2020-05-07Update privateunixsocket TODOs.Dean Deng
Synthetic sockets do not have the race condition issue in VFS2, and we will get rid of privateunixsocket as well. Fixes #1200. PiperOrigin-RevId: 310386474
2020-05-06Fix runsc syscall documentation generation.Adin Scannell
We can register any number of tables with any number of architectures, and need not limit the definitions to the architecture in question. This allows runsc to generate documentation for all architectures simultaneously. Similarly, this simplifies the VFSv2 patching process. PiperOrigin-RevId: 310224827
2020-05-04Enable TestRunNonRoot on VFS2Fabricio Voznika
Also added back the default test dimension back which was dropped in a previous refactor. PiperOrigin-RevId: 309797327
2020-05-04Mount VSFS2 filesystem using root credentialsFabricio Voznika
PiperOrigin-RevId: 309787938
2020-05-04Add TTY support on VFS2 to runscFabricio Voznika
Updates #1623, #1487 PiperOrigin-RevId: 309777922
2020-04-30Enable FIFO QDisc by default in runsc.Bhasker Hariharan
Updates #231 PiperOrigin-RevId: 309339316
2020-04-30FIFO QDisc implementationBhasker Hariharan
Updates #231 PiperOrigin-RevId: 309323808
2020-04-29Merge pull request #2487 from moricho:fix/bindmountgVisor bot
PiperOrigin-RevId: 309082540
2020-04-28Merge pull request #2558 from prattmic:forward_signalgVisor bot
PiperOrigin-RevId: 308829800
2020-04-27Merge pull request #2544 from prattmic:runsc_do_cleanupgVisor bot
PiperOrigin-RevId: 308727526
2020-04-27runsc: extend do network cleanupMichael Pratt
Previously we unconditionally failed to cleanup the networking files (hostname, resolve.conf, hosts), and failed to cleanup the netns, etc on partial setup failure. We can drop the iptables commands from cleanup, as the routes automatically go away when the device is deleted. Those commands were failing previously. Forward signals to the container, allowing it to exit normally when a signal is received, and then for runsc to run the cleanup. This doesn't cover cleanup when runsc is signalled before the container start, it covers the most common case. Fixes #2539 Fixes #2540
2020-04-27container: use sighandling packageMichael Pratt
Use the sighandling package for Container.ForwardSignals, for consistency with other signal forwarding. Fixes #2546
2020-04-27Update container.gokevin.xu
typo, should be `start` in comments
2020-04-26refactor and add test for bindmountmoricho
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-25Add container tests passing with VFS2Zach Koopmans
Several tests are passing after getting TestAppExitStatus (run /bin/true) changes. Make versions that run via VFS2 so that we know what is and isn't working. In addition, fix bug in VFSFile ReadFull. For the TestExePath test in container_test.go, the case "unmasked" will return 0 bytes read with no EOF err, causing the ReadFull call to spin. PiperOrigin-RevId: 308428126
2020-04-25add bind/rbind options for mountmoricho
Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-25fix behavior of `getMountNameAndOptions` when options include either bind or ↵moricho
rbind Signed-off-by: moricho <ikeda.morito@gmail.com>
2020-04-24VFS2: Get HelloWorld image tests to pass with VFS2Zach Koopmans
This change includes: - Modifications to loader_test.go to get TestCreateMountNamespace to pass with VFS2. - Changes necessary to get TestHelloWorld in image tests to pass with VFS2. This means runsc can run the hello-world container with docker on VSF2. Note: Containers that use sockets will not run with these changes. See "//test/image/...". Any tests here with sockets currently fail (which is all of them but HelloWorld). PiperOrigin-RevId: 308363072
2020-04-24Propagate PID limit from OCI to sandbox cgroupFabricio Voznika
Closes #2489 PiperOrigin-RevId: 308362434
2020-04-24Plumb context.Context into kernfs.Inode.Open().Dean Deng
PiperOrigin-RevId: 308304793
2020-04-24Move hostfs mount to Kernel struct.Dean Deng
This is needed to set up host fds passed through a Unix socket. Note that the host package depends on kernel, so we cannot set up the hostfs mount directly in Kernel.Init as we do for sockfs and pipefs. Also, adjust sockfs to make its setup look more like hostfs's and pipefs's. PiperOrigin-RevId: 308274053
2020-04-23Add vfs.MkdirOptions.ForSyntheticMountpoint.Jamie Liu
PiperOrigin-RevId: 308143529
2020-04-23Simplify Docker test infrastructure.Adin Scannell
This change adds a layer of abstraction around the internal Docker APIs, and eliminates all direct dependencies on Dockerfiles in the infrastructure. A subsequent change will automated the generation of local images (with efficient caching). Note that this change drops the use of bazel container rules, as that experiment does not seem to be viable. PiperOrigin-RevId: 308095430
2020-04-22Move user home detection to its own library.Nicolas Lacasse
PiperOrigin-RevId: 307977689
2020-04-22Specify a memory file in platform.New().Andrei Vagin
PiperOrigin-RevId: 307941984
2020-04-20Add a functional vm_test for root_test.Adin Scannell
This change renames the tools/images directory to tools/vm for clarity, and adds a functional vm_test. Sharding is also added to the same test, and some documentation added around key flags & variables to describe how they work. Subsequent changes will add vm_tests for other cases, such as the runtime tests. PiperOrigin-RevId: 307492245
2020-04-17Add test name to boot and gofer log filesFabricio Voznika
This is to make easier to find corresponding logs in case test fails. PiperOrigin-RevId: 307104283
2020-04-17Get /bin/true to run on VFS2Zach Koopmans
Included: - loader_test.go RunTest and TestStartSignal VFS2 - container_test.go TestAppExitStatus on VFS2 - experimental flag added to runsc to turn on VFS2 Note: shared mounts are not yet supported. PiperOrigin-RevId: 307070753
2020-04-16Preserve log FD after execveFabricio Voznika
PiperOrigin-RevId: 306908296
2020-04-14Merge pull request #2212 from aaronlu:dup_stdioFDsgVisor bot
PiperOrigin-RevId: 306477639
2020-04-10Add logging message for noNewPrivileges OCI option.Ian Lewis
noNewPrivileges is ignored if set to false since gVisor assumes that PR_SET_NO_NEW_PRIVS is always enabled. PiperOrigin-RevId: 305991947
2020-04-10Use O_CLOEXEC when dup'ing FDsFabricio Voznika
The sentry doesn't allow execve, but it's a good defense in-depth measure. PiperOrigin-RevId: 305958737
2020-04-09Merge pull request #2253 from amscanne:nogogVisor bot
PiperOrigin-RevId: 305807868
2020-04-09Don't unconditionally set --panic-signalFabricio Voznika
Closes #2393 PiperOrigin-RevId: 305793027
2020-04-08Clean up TODOsFabricio Voznika
PiperOrigin-RevId: 305592245
2020-04-08Fix all printf formatting errors.Adin Scannell
Updates #2243
2020-04-08Fix all copy locks violations.Adin Scannell
This required minor restructuring of how system call tables were saved and restored, but it makes way more sense this way. Updates #2243
2020-04-07Add friendlier messages for frequently encountered errors.Ian Lewis
Issue #2270 Issue #1765 PiperOrigin-RevId: 305385436
2020-04-07Update TODO to #238Ian Lewis
Move TODO to #238 so that proper synchronization of operations is handled when we create the urpc client. Issue #238 Fixes #512 PiperOrigin-RevId: 305383924
2020-04-07Don't map the 0 uid into a sandbox user namespaceAndrei Vagin
Starting with go1.13, we can specify ambient capabilities when we execute a new process with os/exe.Cmd. PiperOrigin-RevId: 305366706
2020-04-07Remove TODOs for local gofer extended attributes.Dean Deng
PiperOrigin-RevId: 305344989
2020-04-01Automated rollback of changelist 303799678Adin Scannell
PiperOrigin-RevId: 304221302
2020-03-31checkpoint/restore: make sure the donated stdioFDs have the same valueAaron Lu
Suppose I start a runsc container using kvm platform like this: $ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash The donating FD and the corresponding cmdline for runsc-sandbox is: D0313 17:50:12.608203 44389 x:0] Donating FD 3: "1.txt" D0313 17:50:12.608214 44389 x:0] Donating FD 4: "control_server_socket" D0313 17:50:12.608224 44389 x:0] Donating FD 5: "|0" D0313 17:50:12.608229 44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json" D0313 17:50:12.608234 44389 x:0] Donating FD 7: "|1" D0313 17:50:12.608238 44389 x:0] Donating FD 8: "sandbox IO FD" D0313 17:50:12.608242 44389 x:0] Donating FD 9: "/dev/kvm" D0313 17:50:12.608246 44389 x:0] Donating FD 10: "/dev/stdin" D0313 17:50:12.608249 44389 x:0] Donating FD 11: "/dev/stdout" D0313 17:50:12.608253 44389 x:0] Donating FD 12: "/dev/stderr" D0313 17:50:12.608257 44389 x:0] Starting sandbox: /proc/self/exe [runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log= --max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt --debug-log-format=text --file-access=exclusive --overlay=false --fsgofer-host-uds=false --network=sandbox --log-packets=false --platform=kvm --strace=false --strace-syscalls=--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true --num-network-channels=1 --rootless=false --alsologtostderr=false --ref-leak-mode=disabled --gso=true --software-gso=true --overlayfs-stale-read=false --shared-volume= --debug-log-fd=3 --panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true --setup-root --cpu-num 32 --total-memory 4294967296 rootbash] Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12. If I restore a container from the checkpoint image which is derived by checkpointing the above rootbash container, but either omit the platform switch or specify to use ptrace platform explicitely: $ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash the donating FD and corresponding cmdline for runsc-sandbox is: D0313 17:50:15.258632 44452 x:0] Donating FD 3: "1.txt" D0313 17:50:15.258640 44452 x:0] Donating FD 4: "control_server_socket" D0313 17:50:15.258645 44452 x:0] Donating FD 5: "|0" D0313 17:50:15.258648 44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json" D0313 17:50:15.258653 44452 x:0] Donating FD 7: "|1" D0313 17:50:15.258657 44452 x:0] Donating FD 8: "sandbox IO FD" D0313 17:50:15.258661 44452 x:0] Donating FD 9: "/dev/stdin" D0313 17:50:15.258675 44452 x:0] Donating FD 10: "/dev/stdout" D0313 17:50:15.258680 44452 x:0] Donating FD 11: "/dev/stderr" D0313 17:50:15.258684 44452 x:0] Starting sandbox: /proc/self/exe [runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log= --max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt --debug-log-format=text --file-access=exclusive --overlay=false --fsgofer-host-uds=false --network=sandbox --log-packets=false --platform=ptrace --strace=false --strace-syscalls= --strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true --num-network-channels=1 --rootless=false --alsologtostderr=false --ref-leak-mode=disabled --gso=true --software-gso=true --overlayfs-stale-read=false --shared-volume= --debug-log-fd=3 --panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9 --stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory 4294967296 restored_rootbash] Note this time, stdioFDs starts from 9 and stderr's FD is 11(so the saved host.descritor.origFD which is 12 for stderr is no longer valid). For the three host FD based files, The s.Dev and s.Ino derived from fstat(fd) shall all be the same and since the two fields are used as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId which is the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same. Note that for MultiDevice m, m.cache records the mapping of key to value and m.rcache records the mapping of value to key. If same value doesn't map to the same key, it will panic on restore. Now that stderr's origFD 12 is no longer valid(it happens to be /memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived from fstat(fd=12) in host.inodeFileState.afterLoad() will neither be correct. But its InodeID is still the same as saved, MultiDevice.Load() will complain about the same value(InodeID) being mapped to different keys (different from stdin and stdout's) and panic with: "MultiDevice's caches are inconsistent". Solve this problem by making sure stdioFDs for root container's init task are always the same on initial start and on restore time, no matter what cmdline user has used: debug log specified or not, platform changed or not etc. shall not affect the ability to restore. Fixes #1844.