|
Methods that do more traversals should use the remaining count rather than
resetting it.
PiperOrigin-RevId: 254041720
|
|
This allows tasks to have distinct mount namespaces, instead of all sharing the
kernel's root mount namespace.
Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).
In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.
Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better name.
PiperOrigin-RevId: 254009310
|
|
The test fails because it reads 4KB instead of the
expected 64KB. Changed the test to read the pipe buffer
size instead of hardcoding it, and added some logging in
case the failure was not caused by the pipe buffer size.
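For reference, the pipe buffer size can be queried at runtime rather than assumed.
A minimal Go sketch using golang.org/x/sys/unix (illustration only, not the actual
test code):

    package pipesize

    import (
            "fmt"

            "golang.org/x/sys/unix"
    )

    // pipeBufferSize returns the kernel's buffer size for the given pipe fd,
    // instead of hardcoding 64KB, using the F_GETPIPE_SZ fcntl.
    func pipeBufferSize(fd int) (int, error) {
            sz, err := unix.FcntlInt(uintptr(fd), unix.F_GETPIPE_SZ, 0)
            if err != nil {
                    return 0, fmt.Errorf("F_GETPIPE_SZ failed: %w", err)
            }
            return sz, nil
    }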
PiperOrigin-RevId: 253916040
|
|
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.
With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.
PiperOrigin-RevId: 253746644
|
|
When the leader of a process group (session) exits, the process
group ID (session ID) is still held by other processes in
the process group, so the process group ID (session ID)
cannot be reused.
If that process group ID (session ID) is reused as the process
group ID of a new process, session creation fails and runsc
later crashes when accessing the process group.
The fix skips a TID when allocating a new one if it is still
in use by a process group (session).
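Conceptually, the allocation check looks like the following Go sketch (hypothetical
names and bookkeeping, not the actual kernel code):

    package tidsketch

    // pidNamespace is a hypothetical stand-in for the kernel's PID namespace
    // bookkeeping; only the fields needed for the check are shown.
    type pidNamespace struct {
            tasks         map[int]struct{} // TIDs of live tasks
            processGroups map[int]struct{} // IDs still referenced by process groups
            sessions      map[int]struct{} // IDs still referenced by sessions
            last          int
            max           int
    }

    // allocateTID returns the next TID that is not in use by a task, a process
    // group, or a session, so a still-referenced group/session ID is never reused.
    func (ns *pidNamespace) allocateTID() (int, bool) {
            for i := 0; i < ns.max; i++ {
                    tid := ns.last + 1 + i
                    if tid > ns.max {
                            tid = tid - ns.max // wrap around
                    }
                    if _, ok := ns.tasks[tid]; ok {
                            continue
                    }
                    if _, ok := ns.processGroups[tid]; ok {
                            continue // skip: still in use as a process group ID
                    }
                    if _, ok := ns.sessions[tid]; ok {
                            continue // skip: still in use as a session ID
                    }
                    ns.last = tid
                    return tid, true
            }
            return 0, false // no TID available
    }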
The runsc crash can easily be reproduced with the following steps:
1. Build the test program and run it inside the container:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    pid_t cpid, spid;
    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (cpid == 0) {
        /* Child: start a new session (becomes session and group leader). */
        pid_t sid = setsid();
        printf("Start New Session %ld\n", (long)sid);
        printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n",
               (long)getpid(), (long)getppid(),
               (long)getpgid(getpid()), (long)getsid(getpid()));
        spid = fork();
        if (spid == 0) {
            /* Grandchild: move into its own process group and keep running. */
            setpgid(getpid(), getpid());
            printf("Set GrandSon as New Process Group\n");
            printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n",
                   (long)getpid(), (long)getppid(),
                   (long)getpgid(getpid()), (long)getsid(getpid()));
            while (1) {
                usleep(1);
            }
        }
        sleep(3);
        exit(0);
    } else {
        /* Parent exits immediately; the group/session IDs stay in use. */
        exit(0);
    }
    return 0;
}
2. Build the hello program:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    printf("Current PID is %ld\n", (long)getpid());
    return 0;
}
3. Run a script on the host that runs hello inside the container; you can
speed up the test by setting TasksLimit to a lower value.
for (( i=0; i<65535; i++ ))
do
docker exec <container id> /test/hello
done
4. When the hello process reuses the process group ID held by the looping
process, runsc crashes:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8]
goroutine 612475 [running]:
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0)
pkg/sentry/kernel/sessions.go:160 +0x78
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0)
pkg/sentry/kernel/task_exit.go:672 +0x2b7
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0)
pkg/sentry/kernel/task_exit.go:542 +0xc4
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc)
pkg/sentry/kernel/task_run.go:91 +0x194
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
pkg/sentry/kernel/task_start.go:286 +0xfe
|
|
The implementation is similar to Linux, where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.
This ensures that the advertised window grows at a reasonable rate to accommodate
the sender's rate, and prevents large amounts of data from being held in stack
buffers if the application is not actively reading or not reading fast enough.
The original paper that was used to implement the Linux receive buffer auto-
tuning is available at https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf
NOTE: Linux does not implement DRS exactly as defined in that paper; it's just a
good reference for understanding the solution space.
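A minimal Go sketch of the idea, with hypothetical names rather than the actual
netstack types: track how many bytes the application copied out over roughly one
RTT, and grow the receive buffer when consumption approaches the current size.

    package tcpsketch

    import "time"

    type rcvAutoParams struct {
            copied     int           // bytes consumed by the application since the last adjustment
            measureAt  time.Time     // start of the current measurement interval
            rtt        time.Duration // smoothed round-trip time estimate
            rcvBufSize int           // current receive buffer size
            maxRcvBuf  int           // upper bound on the receive buffer
    }

    // onRead is called when the application copies n bytes out of the endpoint.
    func (p *rcvAutoParams) onRead(n int, now time.Time) {
            p.copied += n
            if now.Sub(p.measureAt) < p.rtt {
                    return // keep accumulating within the current RTT window
            }
            // If the application consumed a large fraction of the buffer in one
            // RTT, the advertised window is likely limiting the sender; grow the
            // buffer (here by doubling, capped at the maximum).
            if p.copied > p.rcvBufSize/2 && p.rcvBufSize < p.maxRcvBuf {
                    p.rcvBufSize *= 2
                    if p.rcvBufSize > p.maxRcvBuf {
                            p.rcvBufSize = p.maxRcvBuf
                    }
            }
            p.copied = 0
            p.measureAt = now
    }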
Updates #230
PiperOrigin-RevId: 253168283
|
|
All functions which allocate objects containing AtomicRefCounts will soon need
a context.
PiperOrigin-RevId: 253147709
|
|
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.
This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.
The new test is related, but does not cover this particular situation.
Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.
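Schematically, the fix breaks a classic lock-ordering cycle. A hedged Go sketch
with hypothetical stand-ins for waiter.Queue and FileAsync (not the actual gVisor
types): the call into the queue happens only after the FileAsync mutex is dropped.

    package asyncsketch

    import "sync"

    type Entry struct{}

    type Queue struct {
            mu      sync.RWMutex
            entries map[*Entry]struct{}
    }

    func (q *Queue) EventUnregister(e *Entry) {
            q.mu.Lock()
            defer q.mu.Unlock()
            delete(q.entries, e)
    }

    type FileAsync struct {
            mu         sync.Mutex
            registered bool
            e          Entry
    }

    // Unregister releases f.mu before calling into the queue. Notify holds
    // q.mu while invoking the FileAsync callback (which takes f.mu), so if
    // Unregister called EventUnregister with f.mu still held, the two paths
    // could each hold one lock and wait on the other. Dropping f.mu first
    // breaks the cycle.
    func (f *FileAsync) Unregister(q *Queue) {
            f.mu.Lock()
            wasRegistered := f.registered
            f.registered = false
            f.mu.Unlock()
            if wasRegistered {
                    q.EventUnregister(&f.e)
            }
    }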
Fixes #346
PiperOrigin-RevId: 253138340
|
|
SO_TYPE was already implemented for everything but netlink sockets.
PiperOrigin-RevId: 253138157
|
|
This can be merged after:
https://github.com/google/gvisor-website/pull/77
or
https://github.com/google/gvisor-website/pull/78
PiperOrigin-RevId: 253132620
|
|
PiperOrigin-RevId: 253122166
|
|
PiperOrigin-RevId: 252918338
|
|
Change-Id: I7457a11de4725e1bf3811420c505d225b1cb6943
|
|
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.
PiperOrigin-RevId: 252889093
|
|
PiperOrigin-RevId: 252855280
|
|
For sendfile(2), we propagate a TCP error through the system call layer.
This should be eaten if there is a partial result. This change also adds
a test to ensure that there is no panic in this case, for both TCP sockets
and unix domain sockets.
PiperOrigin-RevId: 252746192
|
|
Parse annotations containing 'gvisor.dev/spec/mount' that give
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.
For now, this information is used to allow the same tmpfs mounts
to be shared between containers, which wasn't possible before.
PiperOrigin-RevId: 252704037
|
|
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.
Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this information to debug, to
check the compatibility of the version of runsc they are running, or to
generate documentation.
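A rough Go sketch of the kind of metadata involved; field and constant names are
illustrative, not the actual runsc types.

    package syscallinfo

    // SupportLevel describes how completely a syscall is implemented.
    type SupportLevel int

    const (
            Unimplemented SupportLevel = iota
            Partial
            Full
    )

    // SyscallInfo sketches the per-syscall metadata that can be registered
    // alongside a syscall table entry and exported by 'runsc help syscalls'.
    type SyscallInfo struct {
            Num     uintptr      // syscall number on linux/amd64
            Name    string       // e.g. "read"
            Support SupportLevel // support level
            Note    string       // free-form compatibility notes
            URLs    []string     // links to relevant issues
    }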
PiperOrigin-RevId: 252558304
|
|
PiperOrigin-RevId: 252501653
|
|
Changes netstack to conform to current Linux behaviour, where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly, we allow up to
backlog connections to be in SYN-RCVD state as long as the backlog is not full.
We also now drop a SYN if SYN cookies are in use and the backlog for the
listening endpoint is full.
Added new tests to confirm the behaviour.
Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.
Fixes #236
PiperOrigin-RevId: 252500462
|
|
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.
Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.
Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.
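For illustration, a hedged Go sketch of a flattened table with a stable index and
a filter (hypothetical names, not the actual kernel package types):

    package socktable

    import "sync"

    // Entry is a hypothetical flattened socket table record.
    type Entry struct {
            SL       uint64 // stable index, analogous to "sl" in Linux procfs output
            Family   int    // e.g. AF_INET
            Type     int    // e.g. SOCK_DGRAM
            Protocol int    // e.g. IPPROTO_ICMP
            Sock     interface{}
    }

    type Table struct {
            mu      sync.Mutex
            nextSL  uint64
            entries map[uint64]*Entry
    }

    // Add assigns the next index and records the socket in a single flat table.
    func (t *Table) Add(e *Entry) {
            t.mu.Lock()
            defer t.mu.Unlock()
            if t.entries == nil {
                    t.entries = make(map[uint64]*Entry)
            }
            t.nextSL++
            e.SL = t.nextSL
            t.entries[e.SL] = e
    }

    // Filter returns entries matching a predicate, e.g. "TCPv4 sockets only",
    // instead of maintaining per-type sub-tables.
    func (t *Table) Filter(keep func(*Entry) bool) []*Entry {
            t.mu.Lock()
            defer t.mu.Unlock()
            var out []*Entry
            for _, e := range t.entries {
                    if keep(e) {
                            out = append(out, e)
                    }
            }
            return out
    }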
PiperOrigin-RevId: 252495895
|
|
Change-Id: I83ae1079f3dcba6b018f59ab7898decab5c211d2
|
|
PiperOrigin-RevId: 252124156
|
|
Change-Id: I7445033b1970cbba3f2ed0682fe520dce02d8fad
|
|
PiperOrigin-RevId: 251965598
|
|
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which
passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this
size is measurably inefficient in most cases: most paths will not be
this long, 4 KB is a lot of bytes to zero, and as of this writing the Go
runtime allocator maps only two 4 KB objects to each 8 KB span,
necessitating a call to runtime.mcache.refill() on ~every other call.
Limit the initial buffer size to 256 B instead, and geometrically
reallocate if necessary.
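The growth strategy looks roughly like the following Go sketch; readMem is a
stand-in for reading task memory and the real CopyStringIn API differs.

    package copysketch

    import "bytes"

    // copyInString illustrates the geometric-growth idea: read into a small
    // buffer first and only grow (by doubling) if no NUL terminator is found,
    // instead of allocating maxlen (e.g. PATH_MAX = 4096) bytes up front.
    func copyInString(readMem func(addr uintptr, buf []byte) (int, error), addr uintptr, maxlen int) (string, error) {
            bufLen := 256
            if bufLen > maxlen {
                    bufLen = maxlen
            }
            var got []byte
            for {
                    buf := make([]byte, bufLen-len(got))
                    n, err := readMem(addr+uintptr(len(got)), buf)
                    got = append(got, buf[:n]...)
                    if i := bytes.IndexByte(got, 0); i >= 0 {
                            return string(got[:i]), nil // found the terminating NUL
                    }
                    if err != nil {
                            return string(got), err
                    }
                    if len(got) >= maxlen {
                            return string(got), nil // caller decides how to handle the overlong path
                    }
                    bufLen *= 2 // geometric growth
                    if bufLen > maxlen {
                            bufLen = maxlen
                    }
            }
    }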
PiperOrigin-RevId: 251960441
|
|
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.
PiperOrigin-RevId: 251956740
|
|
Overlayfs was expecting the parent directory to exist in the
upper layer when bind(2) was called, which may not be the case.
The fix is to copy the parent directory to the upper layer
before binding the UDS.
There is no good place to add tests for this. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus already copied the parent to the upper layer).
Added it to the runsc integration tests for now. If it turns
out we have lots of these kinds of tests, we can consider
moving them somewhere more appropriate.
PiperOrigin-RevId: 251954156
|
|
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().
Also fix handling of nodemasks: round sizes up to a multiple of the size
of an unsigned long (as documented and done by Linux), and zero trailing
bits when copying them out.
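Roughly, the rounding and masking look like this Go sketch (illustration only,
not the actual mempolicy code):

    package nodemask

    const bitsPerULong = 64 // unsigned long on linux/amd64

    // allocNodemask rounds the requested number of node bits up to a whole
    // number of unsigned longs, as Linux documents for get_mempolicy/mbind.
    func allocNodemask(maxNode uint64) []uint64 {
            words := (maxNode + bitsPerULong - 1) / bitsPerULong
            return make([]uint64, words)
    }

    // clearTrailing zeroes the bits above maxNode before the mask is copied
    // out to the application, so unused trailing bits never leak.
    func clearTrailing(mask []uint64, maxNode uint64) {
            if rem := maxNode % bitsPerULong; rem != 0 && len(mask) > 0 {
                    mask[len(mask)-1] &= (uint64(1) << rem) - 1
            }
    }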
PiperOrigin-RevId: 251950859
|
|
PiperOrigin-RevId: 251950660
|
|
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).
For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.
PiperOrigin-RevId: 251934850
|
|
This allows an fdbased endpoint to have multiple underlying fds from which
packets can be read and dispatched, and to which they can be written.
This should allow for higher throughput as well as better scalability of the
network stack as the number of connections increases.
Updates #231
PiperOrigin-RevId: 251852825
|
|
PiperOrigin-RevId: 251788534
|
|
This is required to make the shutdown visible to peers outside the
sandbox.
The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.
The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.
Updates #273
PiperOrigin-RevId: 251763060
|
|
In the case of GSO, a segment can contain more than one packet,
and we need to use the pCount() helper to get the number of packets.
PiperOrigin-RevId: 251743020
|
|
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.
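Schematically, the tweak when looping the packet back looks like this Go sketch
(hypothetical Route type, not the actual netstack code):

    package mcast

    // Route is a hypothetical stand-in for a netstack route.
    type Route struct {
            LocalAddress  string
            RemoteAddress string
    }

    // loopbackRoute builds the route used to deliver an outgoing multicast
    // packet back to the sending stack: on the way out the multicast group is
    // the remote address, but for the looped (incoming) copy it must appear as
    // the local address, with the sender as the remote.
    func loopbackRoute(out Route) Route {
            return Route{
                    LocalAddress:  out.RemoteAddress, // the multicast group
                    RemoteAddress: out.LocalAddress,  // the sender itself
            }
    }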
PiperOrigin-RevId: 251739298
|
|
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.
Lack of support for set-uid binaries or fs creds simplifies things a
bit.
As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but since they have no parent, normal
UID/GID checks should at least protect them.
PiperOrigin-RevId: 251712714
|
|
When checking the length of acceptedChan we should hold the
endpoint mutex; otherwise a SYN received while the listening socket
is being closed can result in a data race, where the cleanupLocked
routine sets acceptedChan to nil while an in-progress handshake
goroutine checks it at the same time.
PiperOrigin-RevId: 251537697
|
|
When a pipe is created, a dirent for the pipe is created
with its initial reference count set to 0.
Because a dirent is only destroyed once its reference count
drops to -1, the dirent effectively already holds an
'initial reference' when it is created.
To destroy the dirent after all references are released,
the correct approach is to drop that 'initial reference'
as soon as someone else holds a reference to the dirent
(such as fs.NewFile); otherwise the dirent's reference count
stays at 0 forever, and the dirent is leaked.
Besides pipes, timerfd/eventfd/epoll have the same
problem.
Here is a simple C program that leaks dirents
for pipe/timerfd/eventfd/epoll. After running it,
profiling the runsc process with pprof shows many
dirents for pipe/timerfd/eventfd/epoll that were never
freed:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>

int main(int argc, char *argv[])
{
    int i;
    int n;
    int pipefd[2];
    if (argc != 3) {
        printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
        return 1;
    }
    n = strtol(argv[2], NULL, 10);
    if (strcmp(argv[1], "epoll") == 0) {
        for (i = 0; i < n; ++i)
            close(epoll_create(1));
    } else if (strcmp(argv[1], "timerfd") == 0) {
        for (i = 0; i < n; ++i)
            close(timerfd_create(CLOCK_REALTIME, 0));
    } else if (strcmp(argv[1], "eventfd") == 0) {
        for (i = 0; i < n; ++i)
            close(eventfd(0, 0));
    } else if (strcmp(argv[1], "pipe") == 0) {
        for (i = 0; i < n; ++i)
            if (pipe(pipefd) == 0) {
                close(pipefd[0]);
                close(pipefd[1]);
            }
    }
    printf("%s %s test finished\r\n", argv[1], argv[2]);
    return 0;
}
Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2
PiperOrigin-RevId: 251531096
|
|
Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe
raises difficult questions about the lifecycle of the Pipe and Dirent.
Fortunately, we can side-step those questions by removing the Dirent field from
Pipe entirely. We only need the Dirent when constructing fs.Files (which are
ref-counted), and in GetFile (when a Dirent is passed to us anyways).
PiperOrigin-RevId: 251497628
|
|
Also, don't report a sender address if there isn't one.
PiperOrigin-RevId: 251371284
|
|
The io.Writer contract requires that Write writes all available
bytes and does not return short writes. This causes errors with
io.Copy, since our own Write interface does not have this same
contract.
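A hedged Go sketch of the adaptation, using a hypothetical inner interface rather
than the actual sentry types: loop until the whole buffer is consumed so the
wrapper satisfies the io.Writer contract.

    package writefull

    import "io"

    // shortWriter is a stand-in for an interface whose Write may legitimately
    // return n < len(p) without an error (unlike io.Writer).
    type shortWriter interface {
            Write(p []byte) (int, error)
    }

    // fullWriter adapts a shortWriter to the io.Writer contract by retrying
    // until every byte is written or an error occurs.
    type fullWriter struct {
            w shortWriter
    }

    func (f fullWriter) Write(p []byte) (int, error) {
            written := 0
            for written < len(p) {
                    n, err := f.w.Write(p[written:])
                    written += n
                    if err != nil {
                            return written, err
                    }
                    if n == 0 {
                            // Avoid spinning forever if the inner writer makes no progress.
                            return written, io.ErrShortWrite
                    }
            }
            return written, nil
    }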
PiperOrigin-RevId: 251368730
|
|
Updates #236
PiperOrigin-RevId: 251337915
|
|
Right now, mremap allows remapping a memory region above MaxUserAddress,
which means that we can change the stub region.
PiperOrigin-RevId: 251266886
|
|
PiperOrigin-RevId: 250976665
|
|
Netstack sets the unprocessed segment queue size to match the receive
buffer size. This is not required, as this queue only needs to hold enough
segments for a short duration before the endpoint goroutine can process them.
Updates #230
PiperOrigin-RevId: 250976323
|
|
Change-Id: Ib589906175a59dae315405a28f2d7f525ff8877f
|
|
There is no reason to do the recursion manually, since
Inode.BoundEndpoint will do it for us.
PiperOrigin-RevId: 250794903
|
|
Function signatures are not validated during compilation. Since
they are not exported, they can change at any time. The guard
ensures that they are verified at least on every version upgrade.
PiperOrigin-RevId: 250733742
|
|
The Netstack listen loop can get stuck if SYN cookies are in use and the app is slow to
accept incoming connections. Further, we continue to complete the handshake for a
connection even if the backlog is full. This creates a problem when a lot of
connections come in rapidly and we end up with lots of completed connections
just hanging around to be delivered.
These fixes change netstack behaviour to mirror what Linux does, as described
in the following article:
http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html
Now, when cookies are not in use, Netstack silently drops the ACK to a SYN-ACK
and does not complete the handshake if the backlog is full. This results in the
connection staying in a half-complete state. Eventually the sender will
retransmit the ACK, and if the backlog has space we will transition to a connected
state and deliver the endpoint.
Similarly, when cookies are in use, we do not try to create an endpoint unless
there is space in the accept queue for the newly created endpoint. If
there is no space then we again silently drop the ACK, as the endpoint can simply be
recreated when the ACK is retransmitted by the peer.
We also now use the backlog to cap the size of the SYN-RCVD queue for a given
endpoint. So at any time there can be N connections in the backlog and N in a
SYN-RCVD state if the application is not accepting connections. Any new SYNs
will be dropped.
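A rough Go sketch of the decision described above, with hypothetical names rather
than the actual netstack code:

    package backlogsketch

    // listenCtx is a hypothetical stand-in for a listening endpoint's state.
    type listenCtx struct {
            synRcvdCount int  // connections currently in SYN-RCVD
            accepted     int  // completed connections waiting to be accept()ed
            backlog      int  // configured backlog
            synCookies   bool // whether SYN cookies are in use
    }

    // onSYN decides whether to start a handshake for a new SYN.
    func (l *listenCtx) onSYN() bool {
            if l.accepted >= l.backlog {
                    return false // backlog full: drop the SYN, send no SYN-ACK
            }
            if !l.synCookies && l.synRcvdCount >= l.backlog {
                    return false // cap SYN-RCVD at the backlog size
            }
            return true
    }

    // onACK decides whether to complete the handshake when the final ACK of the
    // three-way handshake arrives; dropping it is safe since the peer retransmits.
    func (l *listenCtx) onACK() bool {
            return l.accepted < l.backlog
    }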
This CL also fixes another small bug where we mark a new endpoint which has not
completed the handshake as connected. We should wait until the handshake successfully
completes before marking it connected.
Updates #236
PiperOrigin-RevId: 250717817
|