gvisor - Container Runtime Sandbox

Age	Commit message	Author
2019-06-14	Skip tid allocation which is using	Yong He
	When the leader of a process group (or session) exits, the process group ID (session ID) is still held by the other processes in the group, so it cannot be reused. If the ID is reused as the process group ID of a new process, session creation fails and runsc later crashes when it accesses the process group. The fix skips a tid during allocation if it is still in use by a process group (session).

	The runsc crash is easy to reproduce with these steps:

	1. Build the test program and run it inside the container:

		#include <stdio.h>
		#include <stdlib.h>
		#include <unistd.h>

		int main(int argc, char *argv[])
		{
			pid_t cpid, spid;

			cpid = fork();
			if (cpid == -1) {
				perror("fork");
				exit(EXIT_FAILURE);
			}
			if (cpid == 0) {
				/* Child: become the leader of a new session. */
				pid_t sid = setsid();
				printf("Start New Session %ld\n", (long)sid);
				printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n",
				       (long)getpid(), (long)getppid(),
				       (long)getpgid(getpid()), (long)getsid(getpid()));
				spid = fork();
				if (spid == 0) {
					/* Grandchild: new process group, then loop forever. */
					setpgid(getpid(), getpid());
					printf("Set GrandSon as New Process Group\n");
					printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n",
					       (long)getpid(), (long)getppid(),
					       (long)getpgid(getpid()), (long)getsid(getpid()));
					while (1) {
						usleep(1);
					}
				}
				/* The session leader exits while the grandchild keeps the session alive. */
				sleep(3);
				exit(0);
			} else {
				exit(0);
			}
			return 0;
		}

	2. Build the hello program:

		#include <stdio.h>
		#include <unistd.h>

		int main(int argc, char *argv[])
		{
			printf("Current PID is %ld\n", (long) getpid());
			return 0;
		}

	3. Run a script on the host that runs hello inside the container. You can speed up the test by setting TasksLimit to a lower value.

		for (( i=0; i<65535; i++ ))
		do
			docker exec <container id> /test/hello
		done

	4. When a hello process reuses the process group ID of the looping process, runsc crashes:

		panic: runtime error: invalid memory address or nil pointer dereference
		[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8]

		goroutine 612475 [running]:
		gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0)
			pkg/sentry/kernel/sessions.go:160 +0x78
		gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0)
			pkg/sentry/kernel/task_exit.go:672 +0x2b7
		gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0)
			pkg/sentry/kernel/task_exit.go:542 +0xc4
		gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc)
			pkg/sentry/kernel/task_run.go:91 +0x194
		created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
			pkg/sentry/kernel/task_start.go:286 +0xfe
2019-06-11	Eat sendfile partial error	Adin Scannell
	For sendfile(2), we propagate a TCP error through the system call layer. This should be eaten if there is a partial result. This change also adds a test to ensure that there is no panic in this case, for both TCP sockets and unix domain sockets. PiperOrigin-RevId: 252746192
2019-06-11	kokoro: don't overwrite test results for different runtimes	Andrei Vagin
	PiperOrigin-RevId: 252724255
2019-06-11	Explicitly reference workspace root in test command	Michael Pratt
	oh-my-zsh aliases ... to ../.. [1]. Add an explicit reference to workspace root to work around the alias. [1] https://github.com/robbyrussell/oh-my-zsh/blob/master/lib/directories.zsh Fixes #341 PiperOrigin-RevId: 252720590
2019-06-11	Add support to mount pod shared tmpfs mounts	Fabricio Voznika
	Parse annotations containing 'gvisor.dev/spec/mount' that gives hints about how mounts are shared between containers inside a pod. This information can be used to better inform how to mount these volumes inside gVisor. For example, a volume that is shared between containers inside a pod can be bind mounted inside the sandbox, instead of being two independent mounts. For now, this information is used to allow the same tmpfs mounts to be shared between containers which wasn't possible before. PiperOrigin-RevId: 252704037
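A minimal sketch of how annotations under the 'gvisor.dev/spec/mount' prefix could be grouped per mount. The key layout `<prefix>/<mount name>/<field>`, the `MountHint` type, the `ParseMountHints` function, and the sample values are assumptions made for illustration; the commit message only names the prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// mountPrefix is the annotation prefix named in the commit message, with a
// trailing slash. Everything after it is an assumed layout: "<name>/<field>".
const mountPrefix = "gvisor.dev/spec/mount/"

// MountHint is a hypothetical container for the hints of one named mount.
type MountHint struct {
	Name   string
	Fields map[string]string
}

// ParseMountHints groups annotations that start with mountPrefix by mount
// name. Annotations without the prefix are ignored; keys with the prefix that
// do not match the assumed layout are reported as errors.
func ParseMountHints(annotations map[string]string) (map[string]*MountHint, error) {
	hints := make(map[string]*MountHint)
	for key, val := range annotations {
		if !strings.HasPrefix(key, mountPrefix) {
			continue
		}
		parts := strings.Split(strings.TrimPrefix(key, mountPrefix), "/")
		if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
			return nil, fmt.Errorf("malformed mount annotation %q", key)
		}
		name, field := parts[0], parts[1]
		if hints[name] == nil {
			hints[name] = &MountHint{Name: name, Fields: make(map[string]string)}
		}
		hints[name].Fields[field] = val
	}
	return hints, nil
}

func main() {
	// Sample annotations; the field names and values are hypothetical.
	hints, err := ParseMountHints(map[string]string{
		"gvisor.dev/spec/mount/scratch/type":  "tmpfs",
		"gvisor.dev/spec/mount/scratch/share": "pod",
		"unrelated.example/key":               "ignored",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	h := hints["scratch"]
	fmt.Println(h.Name, h.Fields["type"], h.Fields["share"])
}
```

Grouping by mount name first makes it straightforward to decide later, per volume, whether two containers should share one mount inside the sandbox.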
2019-06-11	Use net.HardwareAddr for FDBasedLink.LinkAddress	Fabricio Voznika
	net.HardwareAddr prints in a formatted, human-readable form in the log. PiperOrigin-RevId: 252699551
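To illustrate the difference this makes in log output, the sketch below prints the same six bytes once as a plain byte slice and once as a net.HardwareAddr. The helper name `describe` and the sample address are made up for the example.

```go
package main

import (
	"fmt"
	"net"
)

// describe returns how a link address looks when logged as a raw byte slice
// and when logged as a net.HardwareAddr.
func describe(addr []byte) (raw, formatted string) {
	return fmt.Sprintf("%v", addr), net.HardwareAddr(addr).String()
}

func main() {
	raw, formatted := describe([]byte{0x02, 0x42, 0xac, 0x11, 0x00, 0x02})
	fmt.Println("raw:      ", raw)
	fmt.Println("formatted:", formatted)
}
```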
2019-06-11	Fix broken pipe error building version file	Fabricio Voznika
	(11:34:09) ERROR: /tmpfs/src/github/repo/runsc/BUILD:82:1: Couldn't build file runsc/version.txt: Executing genrule //runsc:deb-version failed (Broken pipe): bash failed: error executing command PiperOrigin-RevId: 252691902
2019-06-11	gvisor/test: create a per-testcase directory for runsc logs	Andrei Vagin
	Otherwise it's hard to find a directory for a specific test case. PiperOrigin-RevId: 252636901
2019-06-10	Add introspection for Linux/AMD64 syscalls	Ian Lewis
	Adds simple introspection for syscall compatibility information to Linux/AMD64. Syscalls registered in the syscall table now have associated metadata like name, support level, notes, and URLs to relevant issues. Syscall information can be exported as a table, JSON, or CSV using the new 'runsc help syscalls' command. Users can use this info to debug and get info on the compatibility of the version of runsc they are running or to generate documentation. PiperOrigin-RevId: 252558304
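A rough sketch of what per-syscall metadata and its JSON/CSV export might look like. The `SyscallInfo` type, its field names, the support level string, and the `ToJSON`/`ToCSV` helpers are assumptions for illustration; they are not taken from the runsc source.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"encoding/json"
	"fmt"
	"strconv"
)

// SyscallInfo is a hypothetical metadata record for one syscall table entry.
type SyscallInfo struct {
	Number  int      `json:"number"`
	Name    string   `json:"name"`
	Support string   `json:"support"`
	Note    string   `json:"note,omitempty"`
	URLs    []string `json:"urls,omitempty"`
}

// ToJSON renders the records as a JSON array.
func ToJSON(infos []SyscallInfo) (string, error) {
	b, err := json.Marshal(infos)
	return string(b), err
}

// ToCSV renders the records as CSV with a header row.
func ToCSV(infos []SyscallInfo) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"number", "name", "support", "note"}); err != nil {
		return "", err
	}
	for _, in := range infos {
		rec := []string{strconv.Itoa(in.Number), in.Name, in.Support, in.Note}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	// Sample data; the support level and note are made up.
	infos := []SyscallInfo{
		{Number: 0, Name: "read", Support: "full"},
		{Number: 237, Name: "mbind", Support: "partial", Note: "single NUMA node"},
	}
	j, _ := ToJSON(infos)
	c, _ := ToCSV(infos)
	fmt.Println(j)
	fmt.Print(c)
}
```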
2019-06-10	Move //pkg/sentry/platform/procid to //pkg/procid.	Jamie Liu
	PiperOrigin-RevId: 252501653
2019-06-10	Fixes to listen backlog handling.	Bhasker Hariharan
	Changes netstack to conform to current Linux behaviour: if the backlog is full, we drop the SYN and do not send a SYN-ACK. Similarly, we allow up to backlog connections to be in SYN-RCVD state as long as the backlog is not full. We also now drop a SYN if SYN cookies are in use and the backlog for the listening endpoint is full. Added new tests to confirm the behaviour. Also reverted the change to increase the backlog in TcpPortReuseMultiThread syscall test. Fixes #236 PiperOrigin-RevId: 252500462
2019-06-10	Store more information in the kernel socket table.	Rahat Mahmood
	Store enough information in the kernel socket table to distinguish between different types of sockets. Previously we were only storing the socket family, but this isn't enough to classify sockets. For example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets are SOCK_DGRAM sockets with a particular protocol. Instead of creating more sub-tables, flatten the socket table and provide a filtering mechanism based on the socket entry. Also generate and store a socket entry index ("sl" in linux) which allows us to output entries in a stable order from procfs. PiperOrigin-RevId: 252495895
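The idea of one flat table plus a filter can be sketched as follows. All names here (`SocketEntry`, `SocketTable`, `Record`, `List`) are hypothetical, and the numeric constants mirror the Linux values for AF_INET, SOCK_STREAM and SOCK_DGRAM only to keep the example self-contained.

```go
package main

import "fmt"

// Linux ABI values, repeated here so the sketch has no platform dependency.
const (
	afInet     = 2
	sockStream = 1
	sockDgram  = 2
)

// SocketEntry is a hypothetical flat table entry. ID plays the role of the
// "sl" index: it is assigned once and gives a stable output order.
type SocketEntry struct {
	ID       uint64
	Family   int
	Type     int
	Protocol int
}

// SocketTable is a flat list of entries instead of one sub-table per family.
type SocketTable struct {
	nextID  uint64
	entries []SocketEntry
}

// Record adds a socket to the table and returns its entry.
func (t *SocketTable) Record(family, typ, protocol int) SocketEntry {
	t.nextID++
	e := SocketEntry{ID: t.nextID, Family: family, Type: typ, Protocol: protocol}
	t.entries = append(t.entries, e)
	return e
}

// List returns the entries accepted by keep, in ID order.
func (t *SocketTable) List(keep func(SocketEntry) bool) []SocketEntry {
	var out []SocketEntry
	for _, e := range t.entries {
		if keep(e) {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	var t SocketTable
	t.Record(afInet, sockStream, 0) // a TCPv4-like socket
	t.Record(afInet, sockDgram, 0)  // a UDPv4-like socket
	tcp := t.List(func(e SocketEntry) bool {
		return e.Family == afInet && e.Type == sockStream
	})
	fmt.Println(len(tcp), tcp[0].ID)
}
```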
2019-06-07	Move //pkg/sentry/memutil to //pkg/memutil.	Jamie Liu
	PiperOrigin-RevId: 252124156
2019-06-06	BUILD: Use runsc to generate version	Adin Scannell
	This also ensures BUILD files are correctly formatted. PiperOrigin-RevId: 251990267
2019-06-06	Change visibility of //pkg/sentry/time.	Jamie Liu
	PiperOrigin-RevId: 251965598
2019-06-06	Add alsologtostderr option	Fabricio Voznika
	When set, sends log messages to the error log:

		sudo ./runsc --logtostderr do ls
		I0531 17:59:58.105064 144564 x:0] ***************************
		I0531 17:59:58.105087 144564 x:0] Args: [runsc --logtostderr do ls]
		I0531 17:59:58.105112 144564 x:0] PID: 144564
		I0531 17:59:58.105125 144564 x:0] UID: 0, GID: 0
		[...]

	PiperOrigin-RevId: 251964377
2019-06-06	Cap initial usermem.CopyStringIn buffer size.	Jamie Liu
	Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this size is measurably inefficient in most cases: most paths will not be this long, 4 KB is a lot of bytes to zero, and as of this writing the Go runtime allocator maps only two 4 KB objects to each 8 KB span, necessitating a call to runtime.mcache.refill() on ~every other call. Limit the initial buffer size to 256 B instead, and geometrically reallocate if necessary. PiperOrigin-RevId: 251960441
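The strategy described above (start at 256 B, grow geometrically up to maxlen) can be sketched like this. The function below is not the usermem implementation: a plain byte slice stands in for user memory, and the error values are placeholders defined for the example.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// initialBufSize is the initial buffer size named in the commit message.
const initialBufSize = 256

// Placeholder errors for this sketch.
var (
	errNameTooLong = errors.New("no NUL terminator within maxlen bytes")
	errFault       = errors.New("source ended before a NUL terminator")
)

// copyStringIn copies a NUL-terminated string of at most maxlen bytes out of
// src. It starts with a small buffer and doubles it (capped at maxlen) only
// when the string turns out to be longer. maxlen must not be negative.
func copyStringIn(src []byte, maxlen int) (string, error) {
	size := initialBufSize
	if size > maxlen {
		size = maxlen
	}
	buf := make([]byte, size)
	done := 0
	for {
		n := copy(buf[done:], src[done:])
		if i := bytes.IndexByte(buf[done:done+n], 0); i >= 0 {
			return string(buf[:done+i]), nil
		}
		done += n
		if done < len(buf) {
			return "", errFault // src ran out before the buffer did.
		}
		if done >= maxlen {
			return "", errNameTooLong
		}
		newSize := 2 * len(buf)
		if newSize > maxlen {
			newSize = maxlen
		}
		grown := make([]byte, newSize)
		copy(grown, buf)
		buf = grown
	}
}

func main() {
	s, err := copyStringIn([]byte("/etc/hostname\x00trailing"), 4096)
	fmt.Printf("%q %v\n", s, err)
}
```

For typical short paths only the 256-byte buffer is ever allocated; a path near PATH_MAX costs a handful of reallocations instead.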
2019-06-06	Use common definition of SockType.	Rahat Mahmood
	SockType isn't specific to unix domain sockets, and the current definition basically mirrors the linux ABI's definition. PiperOrigin-RevId: 251956740
2019-06-06	Add the gVisor gitter badge to the README	Ian Lewis
	Moves the build badge to just below the logo and adds the gitter badge next to it for consistency. PiperOrigin-RevId: 251956383
2019-06-06	Copy up parent when binding UDS on overlayfs	Fabricio Voznika
	Overlayfs was expecting the parent to exist when bind(2) was called, which may not be the case. The fix is to copy the parent directory to the upper layer before binding the UDS. There is no good place to add tests for it. Syscall tests would be ideal, but it's hard to guarantee that the directory where the socket is created hasn't been touched before (and thus copied the parent to the upper layer). Added it to runsc integration tests for now. If it turns out we have lots of these kinds of tests, we can consider moving them somewhere more appropriate. PiperOrigin-RevId: 251954156
2019-06-06	"Implement" mbind(2).	Jamie Liu
	We still only advertise a single NUMA node, and ignore mempolicy accordingly, but mbind() at least now succeeds and has effects reflected by get_mempolicy(). Also fix handling of nodemasks: round sizes to unsigned long (as documented and done by Linux), and zero trailing bits when copying them out. PiperOrigin-RevId: 251950859
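The nodemask handling can be illustrated with a small sketch: sizes are rounded up to whole unsigned longs (64 bits on Linux/AMD64), and bits beyond the requested count are zeroed before the mask is copied out. The function names are hypothetical, only a single-word mask is modeled, and the way Linux interprets the maxnode argument itself is deliberately not reproduced here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// bitsPerLong is the size of unsigned long on Linux/AMD64.
const bitsPerLong = 64

// nodemaskBytes returns the number of bytes needed to hold nbits bits after
// rounding up to a whole number of unsigned longs.
func nodemaskBytes(nbits uint) uint {
	return (nbits + bitsPerLong - 1) / bitsPerLong * (bitsPerLong / 8)
}

// encodeNodemask builds the byte representation of a node mask that fits in
// one word. Bits at or above nbits are cleared, and any additional words are
// left as zero.
func encodeNodemask(nodes uint64, nbits uint) []byte {
	buf := make([]byte, nodemaskBytes(nbits)) // zero-filled by make
	if nbits < bitsPerLong {
		nodes &= (uint64(1) << nbits) - 1
	}
	if len(buf) >= 8 {
		binary.LittleEndian.PutUint64(buf, nodes)
	}
	return buf
}

func main() {
	fmt.Println(nodemaskBytes(1), nodemaskBytes(64), nodemaskBytes(65))
	fmt.Println(encodeNodemask(0xff, 4))
}
```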
2019-06-06	Implement reclaim-driven MemoryFile eviction.	Jamie Liu
	PiperOrigin-RevId: 251950660
2019-06-06	Remove tmpfs restriction from test	Fabricio Voznika
	runsc supports UDS over gofer mounts and tmpfs is not needed for this test. PiperOrigin-RevId: 251944870
2019-06-06	Track and export socket state.	Rahat Mahmood
	This is necessary for implementing network diagnostic interfaces like /proc/net/{tcp,udp,unix} and sock_diag(7). For pass-through endpoints such as hostinet, we obtain the socket state from the backend. For netstack, we add explicit tracking of TCP states. PiperOrigin-RevId: 251934850
2019-06-06	Add overlay dimension to FS related syscall tests	Fabricio Voznika
	PiperOrigin-RevId: 251929314
2019-06-06	Try increase listen backlog.	Rahat Mahmood
	PiperOrigin-RevId: 251928000
2019-06-06	Internal change.	Googler
	PiperOrigin-RevId: 251902567
2019-06-06	Send error message to docker/kubectl exec on failure	Fabricio Voznika
	Containerd uses the last error message sent to the log to print as failure cause for create/exec. This required a few changes in the logging logic for runsc:
	- cmd.Errorf/Fatalf: now writes a message with 'error' level to containerd log, in addition to stderr and debug logs, like before.
	- log.Infof/Warningf/Fatalf: are not sent to containerd log anymore. They are mostly used for debugging and not useful to containerd. In most cases, --debug-log is enabled and this avoids the log messages from being duplicated.
	- stderr is not used as default log destination anymore. Some commands assume stdio is for the container/process running inside the sandbox and it's better to never use it for logging. By default, logs are suppressed now.
	PiperOrigin-RevId: 251881815
2019-06-06	Add multi-fd support to fdbased endpoint.	Bhasker Hariharan
	This allows an fdbased endpoint to have multiple underlying fds from which packets can be read and dispatched/written to. This should allow for higher throughput as well as better scalability of the network stack as the number of connections increases. Updates #231 PiperOrigin-RevId: 251852825
2019-06-05	netstack/sniffer: log GSO attributes	Andrei Vagin
	PiperOrigin-RevId: 251788534
2019-06-05	Shutdown host sockets on internal shutdown	Michael Pratt
	This is required to make the shutdown visible to peers outside the sandbox. The readClosed / writeClosed fields were dropped, as they were preventing a shutdown socket from reading the remainder of queued bytes. The host syscalls will return the appropriate errors for shutdown. The control message tests have been split out of socket_unix.cc to make the (few) remaining tests accessible to testing inherited host UDS, which don't support sending control messages. Updates #273 PiperOrigin-RevId: 251763060
2019-06-05	netstack/tcp: fix calculating a number of outstanding packets	Andrei Vagin
	In case of GSO, a segment can contain more than one packet and we need to use the pCount() helper to get the number of packets. PiperOrigin-RevId: 251743020
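A simplified sketch of why counting segments undercounts packets under GSO. The helper is named pCount after the one mentioned in the commit message, but its body, the `segment` type, and the treatment of empty segments as one packet are assumptions; the netstack implementation may differ.

```go
package main

import "fmt"

// segment is a hypothetical stand-in for a TCP segment queued for sending.
type segment struct {
	payloadLen int
}

// pCount returns how many packets of at most mss payload bytes the segment
// turns into on the wire. A segment without payload still counts as one
// packet. mss must be positive.
func pCount(s segment, mss int) int {
	if s.payloadLen == 0 {
		return 1
	}
	return (s.payloadLen + mss - 1) / mss
}

// outstanding sums the packet counts of all in-flight segments.
func outstanding(segs []segment, mss int) int {
	total := 0
	for _, s := range segs {
		total += pCount(s, mss)
	}
	return total
}

func main() {
	segs := []segment{{payloadLen: 1460}, {payloadLen: 4000}, {payloadLen: 0}}
	fmt.Println("segments:", len(segs), "packets:", outstanding(segs, 1460))
}
```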
2019-06-05	Adjust route when looping multicast packets	Chris Kuiper
	Multicast packets are special in that their destination address does not identify a specific interface. When sending out such a packet the multicast address is the remote address, but for incoming packets it is the local address. Hence, when looping a multicast packet, the route needs to be tweaked to reflect this. PiperOrigin-RevId: 251739298
2019-06-05	Give test instantiations meaningful names.	Ian Gudger
	PiperOrigin-RevId: 251737069
2019-06-05	Bump googletest version.	Nicolas Lacasse
	PiperOrigin-RevId: 251716439
2019-06-05	Implement dumpability tracking and checks	Michael Pratt
	We don't actually support core dumps, but some applications want to get/set dumpability, which still has an effect in procfs. Lack of support for set-uid binaries or fs creds simplifies things a bit. As-is, processes started via CreateProcess (i.e., init and sentryctl exec) have normal dumpability. I'm a bit torn on whether sentryctl exec tasks should be dumpable, but since they have no parent, normal UID/GID checks should at least protect them. PiperOrigin-RevId: 251712714
2019-06-04	Building containerd with go modules is broken, use GOPATH.	Adin Scannell
	PiperOrigin-RevId: 251583707
2019-06-04	Fix data race in synRcvdState.	Bhasker Hariharan
	When checking the length of the acceptedChan we should hold the endpoint mutex; otherwise a SYN received while the listening socket is being closed can result in a data race, where the cleanupLocked routine sets acceptedChan to nil while an in-progress handshake goroutine tries to check it at the same time. PiperOrigin-RevId: 251537697
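The pattern behind this fix, reading a field under the same mutex that guards writes to it, can be sketched as follows. The `endpoint` type and its methods are simplified stand-ins, not the netstack code; only the field name acceptedChan comes from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a hypothetical, reduced listening endpoint.
type endpoint struct {
	mu           sync.Mutex
	acceptedChan chan int // guarded by mu; nil once the endpoint is closed
}

// acceptQueueFull reports whether no more connections can be queued. It takes
// the mutex so that it never observes acceptedChan while close is changing it.
func (e *endpoint) acceptQueueFull() bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	return e.acceptedChan == nil || len(e.acceptedChan) == cap(e.acceptedChan)
}

// close drops the accept queue under the same mutex.
func (e *endpoint) close() {
	e.mu.Lock()
	e.acceptedChan = nil
	e.mu.Unlock()
}

func main() {
	e := &endpoint{acceptedChan: make(chan int, 1)}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); _ = e.acceptQueueFull() }() // handshake side
	go func() { defer wg.Done(); e.close() }()               // close side
	wg.Wait()
	fmt.Println("full after close:", e.acceptQueueFull())
}
```

Running a variant without the lock in acceptQueueFull under `go run -race` is one way to see the kind of report this change is meant to eliminate.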
2019-06-04	Drop one dirent reference after referenced by file	Yong He
	When a pipe is created, a dirent for the pipe is created with its initial reference count set to 0. Because a dirent is only destroyed when its reference count drops to -1, the dirent already holds an 'initial reference' after creation. For the dirent to be destroyed once all references are released, the 'initial reference' must be dropped as soon as someone else holds a reference to the dirent, such as fs.NewFile; otherwise the reference count stays at 0 forever and the dirent leaks. Besides pipes, timerfd/eventfd/epoll have the same problem.

	Here is a simple C test case that leaks dirents for pipe/timerfd/eventfd/epoll. After running it, profile the runsc process with pprof; you will find many pipe/timerfd/eventfd/epoll dirents that were not freed:

		#include <stdio.h>
		#include <stdlib.h>
		#include <string.h>
		#include <time.h>
		#include <unistd.h>
		#include <sys/epoll.h>
		#include <sys/eventfd.h>
		#include <sys/timerfd.h>

		int main(int argc, char *argv[])
		{
			int i;
			int n;
			int pipefd[2];

			if (argc != 3) {
				printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
				return 1;
			}
			n = strtol(argv[2], NULL, 10);
			/* Create and immediately close n file descriptors of the chosen kind. */
			if (strcmp(argv[1], "epoll") == 0) {
				for (i = 0; i < n; ++i)
					close(epoll_create(1));
			} else if (strcmp(argv[1], "timerfd") == 0) {
				for (i = 0; i < n; ++i)
					close(timerfd_create(CLOCK_REALTIME, 0));
			} else if (strcmp(argv[1], "eventfd") == 0) {
				for (i = 0; i < n; ++i)
					close(eventfd(0, 0));
			} else if (strcmp(argv[1], "pipe") == 0) {
				for (i = 0; i < n; ++i)
					if (pipe(pipefd) == 0) {
						close(pipefd[0]);
						close(pipefd[1]);
					}
			}
			printf("%s %s test finished\r\n", argv[1], argv[2]);
			return 0;
		}

	Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2 PiperOrigin-RevId: 251531096
2019-06-04	Use github directory if it exists.	Adin Scannell
	Unfortunately, kokoro names the top-level directory per the SCM type. This means there's no way to make the job names match; we simply need to probe for the existence of the correct directory. PiperOrigin-RevId: 251519409
2019-06-04	Remove the Dirent field from Pipe.	Nicolas Lacasse
	Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe raises difficult questions about the lifecycle of the Pipe and Dirent. Fortunately, we can side-step those questions by removing the Dirent field from Pipe entirely. We only need the Dirent when constructing fs.Files (which are ref-counted), and in GetFile (when a Dirent is passed to us anyways). PiperOrigin-RevId: 251497628
2019-06-04	Fix Kokoro revision and 'go get usage'	Adin Scannell
	As a convenience for debugging, also factor the scripts such that they can be run without Kokoro. In the future, this may be used to add additional presubmit hooks that run without Kokoro. PiperOrigin-RevId: 251474868
2019-06-03	Resolve impossible dependencies.	Adin Scannell
	PiperOrigin-RevId: 251377523
2019-06-03	gvisor/sock/unix: pass creds when a message is sent between unconnected sockets	Andrei Vagin
	and don't report a sender address if it doesn't have one PiperOrigin-RevId: 251371284
2019-06-03	gvisor/fs: return a proper error from FileWriter.Write in case of a short-write	Andrei Vagin
	The io.Writer contract requires that Write writes all available bytes and does not return short writes. This causes errors with io.Copy, since our own Write interface does not have this same contract. PiperOrigin-RevId: 251368730
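A small sketch of adapting a writer that may silently write fewer bytes than requested to the io.Writer contract, which requires a non-nil error whenever n < len(p). The `fileWriter` type below is a simplified stand-in for illustration, not the gVisor FileWriter.

```go
package main

import (
	"fmt"
	"io"
)

// fileWriter wraps a write function that may return a short count with a nil
// error, as an interface without the io.Writer contract is allowed to do.
type fileWriter struct {
	write func(p []byte) (int, error)
}

// Write implements io.Writer. A short write without an error is turned into
// io.ErrShortWrite so that callers such as io.Copy see a proper failure.
func (w fileWriter) Write(p []byte) (int, error) {
	n, err := w.write(p)
	if err == nil && n < len(p) {
		return n, io.ErrShortWrite
	}
	return n, err
}

func main() {
	// A backend that only ever accepts half of what it is given.
	half := fileWriter{write: func(p []byte) (int, error) { return len(p) / 2, nil }}
	var dst io.Writer = half
	n, err := dst.Write([]byte("abcd"))
	fmt.Println(n, err)
}
```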
2019-06-03	Refactor container FS setup	Fabricio Voznika
	No change in functionality. Added containerMounter object to keep state while the mounts are processed. This will help upcoming changes to share mounts per-pod. PiperOrigin-RevId: 251350096
2019-06-03	Remove 'clearStatus' option from container.Wait*PID()	Fabricio Voznika
	clearStatus was added to allow detached execution to wait on the exec'd process and retrieve its exit status. However, it's not currently used. Both docker and gvisor-containerd-shim wait on the "shim" process and retrieve the exit status from there. We could change gvisor-containerd-shim to use waits, but it will end up also consuming a process for the wait, which is similar to having the shim process. Closes #234 PiperOrigin-RevId: 251349490
2019-06-03	Allow specification of origin in cloudbuild.	Adin Scannell
	PiperOrigin-RevId: 251347966
2019-06-03	Delete debug log lines left by mistake.	Bhasker Hariharan
	Updates #236 PiperOrigin-RevId: 251337915
2019-06-03	Remove duplicate socket tests	Michael Pratt
	socket_unix_abstract.cc: Subset of socket_abstract.cc
	socket_unix_filesystem.cc: Subset of socket_filesystem.cc
	PiperOrigin-RevId: 251297117