diff options
author | Aaron Lu <ziqian.lzq@antfin.com> | 2020-03-16 15:12:56 +0800 |
---|---|---|
committer | Aaron Lu <ziqian.lzq@antfin.com> | 2020-03-31 11:37:11 +0800 |
commit | 0cfdd47391d30dfe8214e2d11bdad9b27419ad26 (patch) | |
tree | 0f085b21acbbe3754329ee99334317aa576c07fd | |
parent | 32a133537e61bbceb6a0a16c95815495d8f17a35 (diff) |
checkpoint/restore: make sure the donated stdioFDs have the same value
Suppose I start a runsc container using kvm platform like this:
$ sudo runsc --debug=true --debug-log=1.txt --platform=kvm run rootbash
The donating FD and the corresponding cmdline for runsc-sandbox is:
D0313 17:50:12.608203 44389 x:0] Donating FD 3: "1.txt"
D0313 17:50:12.608214 44389 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:12.608224 44389 x:0] Donating FD 5: "|0"
D0313 17:50:12.608229 44389 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:12.608234 44389 x:0] Donating FD 7: "|1"
D0313 17:50:12.608238 44389 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:12.608242 44389 x:0] Donating FD 9: "/dev/kvm"
D0313 17:50:12.608246 44389 x:0] Donating FD 10: "/dev/stdin"
D0313 17:50:12.608249 44389 x:0] Donating FD 11: "/dev/stdout"
D0313 17:50:12.608253 44389 x:0] Donating FD 12: "/dev/stderr"
D0313 17:50:12.608257 44389 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=kvm --strace=false --strace-syscalls=--strace-log-size=1024
--watchdog-action=Panic --panic-signal=-1 --profile=false --net-raw=true
--num-network-channels=1 --rootless=false --alsologtostderr=false
--ref-leak-mode=disabled --gso=true --software-gso=true
--overlayfs-stale-read=false --shared-volume= --debug-log-fd=3
--panic-signal=15 boot --bundle=/home/ziqian.lzq/bundle/bash/runsc
--controller-fd=4 --mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8
--device-fd=9 --stdio-fds=10 --stdio-fds=11 --stdio-fds=12 --pidns=true
--setup-root --cpu-num 32 --total-memory 4294967296 rootbash]
Note stdioFDs starts from 10 with kvm platform and stderr's FD is 12.
If I restore a container from the checkpoint image which is derived
by checkpointing the above rootbash container, but either omit the
platform switch or specify to use ptrace platform explicitely:
$ sudo runsc --debug=true --debug-log=1.txt restore --image-path=some_path restored_rootbash
the donating FD and corresponding cmdline for runsc-sandbox is:
D0313 17:50:15.258632 44452 x:0] Donating FD 3: "1.txt"
D0313 17:50:15.258640 44452 x:0] Donating FD 4: "control_server_socket"
D0313 17:50:15.258645 44452 x:0] Donating FD 5: "|0"
D0313 17:50:15.258648 44452 x:0] Donating FD 6: "/home/ziqian.lzq/bundle/bash/runsc/config.json"
D0313 17:50:15.258653 44452 x:0] Donating FD 7: "|1"
D0313 17:50:15.258657 44452 x:0] Donating FD 8: "sandbox IO FD"
D0313 17:50:15.258661 44452 x:0] Donating FD 9: "/dev/stdin"
D0313 17:50:15.258675 44452 x:0] Donating FD 10: "/dev/stdout"
D0313 17:50:15.258680 44452 x:0] Donating FD 11: "/dev/stderr"
D0313 17:50:15.258684 44452 x:0] Starting sandbox: /proc/self/exe
[runsc-sandbox --root=/run/containerd/runsc/default --debug=true --log=
--max-threads=256 --reclaim-period=5 --log-format=text --debug-log=1.txt
--debug-log-format=text --file-access=exclusive --overlay=false
--fsgofer-host-uds=false --network=sandbox --log-packets=false
--platform=ptrace --strace=false --strace-syscalls=
--strace-log-size=1024 --watchdog-action=Panic --panic-signal=-1
--profile=false --net-raw=true --num-network-channels=1 --rootless=false
--alsologtostderr=false --ref-leak-mode=disabled --gso=true
--software-gso=true --overlayfs-stale-read=false --shared-volume=
--debug-log-fd=3 --panic-signal=15 boot
--bundle=/home/ziqian.lzq/bundle/bash/runsc --controller-fd=4
--mounts-fd=5 --spec-fd=6 --start-sync-fd=7 --io-fds=8 --stdio-fds=9
--stdio-fds=10 --stdio-fds=11 --setup-root --cpu-num 32 --total-memory
4294967296 restored_rootbash]
Note this time, stdioFDs starts from 9 and stderr's FD is 11(so the
saved host.descritor.origFD which is 12 for stderr is no longer valid).
For the three host FD based files, The s.Dev and s.Ino derived from
fstat(fd) shall all be the same and since the two fields are used
as device.MultiDeviceKey, the host.inodeFileState.sattr.InodeId which is
the value of MultiDevice.Map(MultiDeviceKey), shall also all be the same.
Note that for MultiDevice m, m.cache records the mapping of key to value
and m.rcache records the mapping of value to key. If same value doesn't
map to the same key, it will panic on restore.
Now that stderr's origFD 12 is no longer valid(it happens to be
/memfd:runsc-memory in my test on restore), the s.Dev and s.Ino derived
from fstat(fd=12) in host.inodeFileState.afterLoad() will neither be
correct. But its InodeID is still the same as saved, MultiDevice.Load()
will complain about the same value(InodeID) being mapped to different
keys (different from stdin and stdout's) and panic with: "MultiDevice's
caches are inconsistent".
Solve this problem by making sure stdioFDs for root container's init
task are always the same on initial start and on restore time, no matter
what cmdline user has used: debug log specified or not, platform changed
or not etc. shall not affect the ability to restore.
Fixes #1844.
-rw-r--r-- | runsc/boot/loader.go | 30 |
1 files changed, 29 insertions, 1 deletions
diff --git a/runsc/boot/loader.go b/runsc/boot/loader.go index e7ca98134..1ed46bdb9 100644 --- a/runsc/boot/loader.go +++ b/runsc/boot/loader.go @@ -175,6 +175,9 @@ type Args struct { UserLogFD int } +// make sure stdioFDs are always the same on initial start and on restore +const startingStdioFD = 64 + // New initializes a new kernel loader configured by spec. // New also handles setting up a kernel for restoring a container. func New(args Args) (*Loader, error) { @@ -319,6 +322,21 @@ func New(args Args) (*Loader, error) { return nil, fmt.Errorf("creating pod mount hints: %v", err) } + var stdioFDs []int + newfd := startingStdioFD + for _, fd := range args.StdioFDs { + err := syscall.Dup3(fd, newfd, syscall.O_CLOEXEC) + if err != nil { + return nil, fmt.Errorf("dup3 of stdioFDs failed: %v", err) + } + stdioFDs = append(stdioFDs, newfd) + err = syscall.Close(fd) + if err != nil { + return nil, fmt.Errorf("close original stdioFDs failed: %v", err) + } + newfd++ + } + eid := execID{cid: args.ID} l := &Loader{ k: k, @@ -327,7 +345,7 @@ func New(args Args) (*Loader, error) { watchdog: dog, spec: args.Spec, goferFDs: args.GoferFDs, - stdioFDs: args.StdioFDs, + stdioFDs: stdioFDs, rootProcArgs: procArgs, sandboxID: args.ID, processes: map[execID]*execProcess{eid: {}}, @@ -569,6 +587,16 @@ func (l *Loader) run() error { } }) + // l.stdioFDs are derived from dup() in boot.New() and they are now dup()ed again + // either in createFDTable() during initial start or in descriptor.initAfterLoad() + // during restore, we can release l.stdioFDs now. + for _, fd := range l.stdioFDs { + err := syscall.Close(fd) + if err != nil { + return fmt.Errorf("close dup()ed stdioFDs: %v", err) + } + } + log.Infof("Process should have started...") l.watchdog.Start() return l.k.Start() |