gvisor - Container Runtime Sandbox

Age	Commit message	Author
2020-12-02	Add /proc/sys/kernel/sem.	Jing Chen
	PiperOrigin-RevId: 345178956
2020-11-30	Fix deadlock in UDP handleControlPacket path.	Bhasker Hariharan
	Fixing the sendto deadlock exposed yet another deadlock: a lock inversion on the handleControlPacket path, where e.mu and demuxer.epsByNIC.mu are acquired in the reverse order from, say, the RegisterTransportEndpoint call in endpoint.Connect(). This fix sidesteps the issue by making endpoint.state an atomic, which removes the need to acquire e.mu in e.HandleControlPacket. PiperOrigin-RevId: 344939895
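	The general pattern can be sketched as follows. This is a minimal illustration, not the gVisor code: the `endpoint` type, its state constants, and the `setState`/`connected` helpers are hypothetical names chosen for the example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Hypothetical endpoint states; the values are illustrative only.
const (
	stateInitial uint32 = iota
	stateConnected
)

// endpoint is a hypothetical stand-in for a transport endpoint.
type endpoint struct {
	mu    sync.RWMutex // still guards the endpoint's other fields
	state uint32       // accessed only through sync/atomic
}

// setState publishes a new state without taking e.mu.
func (e *endpoint) setState(s uint32) { atomic.StoreUint32(&e.state, s) }

// connected reads the state without taking e.mu, so a caller that already
// holds some other lock cannot create a lock-order inversion with e.mu.
func (e *endpoint) connected() bool {
	return atomic.LoadUint32(&e.state) == stateConnected
}

func main() {
	e := &endpoint{}
	fmt.Println("connected before:", e.connected())
	e.setState(stateConnected)
	fmt.Println("connected after:", e.connected())
}
```

	Reading the state atomically only helps when the reader needs nothing else that e.mu protects; any path that touches other guarded fields still has to respect the lock order.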
2020-11-23	Omit sandbox from chown test.	Adin Scannell
	This test fails because it must include additional UIDs. Omit the bazel sandbox to ensure that it can function correctly. PiperOrigin-RevId: 343927190
2020-11-23	Ignore permission failures in CheckDuplicatesRecursively.	Adin Scannell
	Not all files are always accessible by the process itself. This was specifically seen with map_files, but there's no rule that every entry must be accessible by the process itself. PiperOrigin-RevId: 343919117
2020-11-19	Internal change.	gVisor bot
	PiperOrigin-RevId: 343419851
2020-11-19	Propagate IP address prefix from host to netstack	Fabricio Voznika
	Closes #4022 PiperOrigin-RevId: 343378647
2020-11-18	[netstack] Move SO_KEEPALIVE and SO_ACCEPTCONN option to SocketOptions.	Ayush Ranjan
	PiperOrigin-RevId: 343217712
2020-11-18	[netstack] Move SO_REUSEPORT and SO_REUSEADDR option to SocketOptions.	Ayush Ranjan
	This change also introduces: - `SocketOptionsHandler` interface which can be implemented by endpoints to handle endpoint specific behavior on SetSockOpt. This is analogous to what Linux does. - `DefaultSocketOptionsHandler` which is a default implementation of the above. This is embedded in all endpoints so that we don't have to uselessly implement empty functions. Endpoints with specific behavior can override the embedded method by manually defining their own implementation. PiperOrigin-RevId: 343158301
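	A minimal sketch of the embedding pattern described above. The two handler type names come from the commit message; the `OnReuseAddressSet` method and the `udpEndpoint`/`tcpEndpoint` types are assumptions made up for illustration and may not match the real interface.

```go
package main

import "fmt"

// SocketOptionsHandler lets an endpoint react to option changes.
// The method set shown here is an assumption for illustration.
type SocketOptionsHandler interface {
	OnReuseAddressSet(v bool)
}

// DefaultSocketOptionsHandler is a no-op implementation meant to be embedded.
type DefaultSocketOptionsHandler struct{}

// OnReuseAddressSet does nothing by default.
func (DefaultSocketOptionsHandler) OnReuseAddressSet(bool) {}

// udpEndpoint (hypothetical) relies entirely on the embedded default.
type udpEndpoint struct {
	DefaultSocketOptionsHandler
}

// tcpEndpoint (hypothetical) overrides the embedded method.
type tcpEndpoint struct {
	DefaultSocketOptionsHandler
	reuseAddrChanges int
}

// OnReuseAddressSet shadows the promoted default method.
func (e *tcpEndpoint) OnReuseAddressSet(bool) { e.reuseAddrChanges++ }

func main() {
	udp, tcp := &udpEndpoint{}, &tcpEndpoint{}
	for _, h := range []SocketOptionsHandler{udp, tcp} {
		h.OnReuseAddressSet(true)
	}
	fmt.Println("tcp handler calls:", tcp.reuseAddrChanges)
}
```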
2020-11-18	[netstack] Move SO_NO_CHECK option to SocketOptions.	Ayush Ranjan
	PiperOrigin-RevId: 343146856
2020-11-18	[netstack] Move SO_PASSCRED option to SocketOptions.	Ayush Ranjan
	This change also makes the following fixes: - Make SocketOptions use atomic operations instead of having to acquire/drop locks upon each get/set option. - Make documentation more consistent. - Remove tcpip.SocketOptions from socketOpsCommon because it already exists in transport.Endpoint. - Refactor get/set socket options tests to be easily extendable. PiperOrigin-RevId: 343103780
2020-11-17	Fix possible deadlock in UDP.Write().	Bhasker Hariharan
	In UDP endpoint.Write(), sendUDP is called with e.mu RLocked. But if this happens to send a datagram over loopback which ends up generating an ICMP response of, say, ErrNoPortReachable, the handling of the response in HandleControlPacket also acquires e.mu using RLock. This is mostly fine unless there is a competing caller trying to acquire e.mu in exclusive mode using Lock(). This will deadlock, as a caller waiting in Lock() disallows any new RLock() calls to ensure it can actually acquire the lock. This is documented at https://golang.org/pkg/sync/#RWMutex. This change releases the endpoint mutex before calling sendUDP to resolve the possibility of the deadlock. Reported-by: syzbot+537989797548c66e8ee3@syzkaller.appspotmail.com Reported-by: syzbot+eb0b73b4ab486f7673ba@syzkaller.appspotmail.com PiperOrigin-RevId: 342894148
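	The sync.RWMutex behavior behind this deadlock can be observed in isolation. The sketch below is not gVisor code; `writerBlocksNewReaders` is a hypothetical helper, and it assumes Go 1.18 or later for `TryRLock`. It holds a read lock, starts a writer that blocks in Lock(), and then probes whether a further read lock is still granted.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// writerBlocksNewReaders reports whether a new read lock is refused while
// another goroutine is waiting in Lock() behind an existing reader.
func writerBlocksNewReaders() bool {
	var mu sync.RWMutex
	mu.RLock() // the "outer" reader, like Write() holding e.mu

	done := make(chan struct{})
	go func() {
		mu.Lock() // blocks until the outer reader releases
		mu.Unlock()
		close(done)
	}()

	// Poll until the writer is queued; from then on TryRLock fails. A plain
	// RLock() here would block forever, which is the deadlock described.
	blocked := false
	deadline := time.Now().Add(2 * time.Second)
	for time.Now().Before(deadline) {
		if !mu.TryRLock() {
			blocked = true
			break
		}
		mu.RUnlock()
		time.Sleep(time.Millisecond)
	}

	mu.RUnlock() // release the outer reader so the writer can finish
	<-done
	return blocked
}

func main() {
	fmt.Println("new readers blocked by waiting writer:", writerBlocksNewReaders())
}
```

	Releasing the read lock before making the call that may re-enter, as the change does, removes the nested acquisition entirely.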
2020-11-17	Fix SO_ERROR behavior for TCP in gVisor.	Bhasker Hariharan
	Fixes the behaviour of SO_ERROR for TCP sockets. In Linux it returns sk->sk_err, and if sk->sk_err is 0 then it returns sk->sk_soft_err. In gVisor TCP, endpoint.HardError is the equivalent of sk->sk_err and endpoint.LastError holds soft errors. This change brings this into alignment with Linux such that both hard/soft errors are cleared when retrieved using getsockopt(.. SO_ERROR) on a socket. Fixes #3812 PiperOrigin-RevId: 342868552
2020-11-16	Reset watchdog timer between sendfile() iterations.	Jamie Liu
	As part of this, change Task.interrupted() to not drain Task.interruptChan, and do so explicitly using new function Task.unsetInterrupted() instead. PiperOrigin-RevId: 342768365
2020-11-13	Disable save/restore in PartialBadBufferTest.SendMsgTCP.	Jamie Liu
	PiperOrigin-RevId: 342314586
2020-11-12	Deflake tcp_socket test.	Mithun Iyer
	Increase the wait time for the thread to be blocked on read/write syscall. PiperOrigin-RevId: 342204627
2020-11-12	Refactor SOL_SOCKET options	Nayana Bidari
	Store all the socket level options in a struct and call {Get/Set}SockOpt on this struct. This will avoid implementing socket level options on all endpoints. This CL contains implementing one socket level option for tcp and udp endpoints. PiperOrigin-RevId: 342203981
2020-11-09	Skip `EventHUp` notify in `FIN_WAIT2` on a socket close.	Mithun Iyer
	This Notify was added as part of cl/279106406; but notifying `EventHUp` in `FIN_WAIT2` is incorrect, as we want to only notify later on `TIME_WAIT` or a reset. However, we do need to notify any blocked waiters of an activity on the endpoint with `EventIn`|`EventOut`. PiperOrigin-RevId: 341490913
2020-11-09	net: connect to the ipv4 localhost returns ENETUNREACH if the address isn't set	Andrei Vagin
	cl/340002915 modified the code to return EADDRNOTAVAIL if connect is called for a localhost address which isn't set. But actually, Linux returns EADDRNOTAVAIL for ipv6 addresses and ENETUNREACH for ipv4 addresses. Updates #4735 PiperOrigin-RevId: 341479129
2020-11-06	Implement command GETNCNT for semctl.	Jing Chen
	PiperOrigin-RevId: 341154192
2020-11-06	Fix infinite loop when splicing to pipes/eventfds.	Nicolas Lacasse
	Writes to pipes of size < PIPE_BUF are guaranteed to be atomic, so writes larger than that will return EAGAIN if the pipe has capacity < PIPE_BUF. Writes to eventfds will return EAGAIN if the write would cause the eventfd value to go over the max. In both such cases, calling Ready() on the FD will return true (because it is possible to write), but specific kinds of writes will in fact return EAGAIN. This CL fixes an infinite loop in splice and sendfile (VFS1 and VFS2) by skipping the readiness check for the outfile in send, splice, and tee. PiperOrigin-RevId: 341102260
2020-11-06	Do not send to the zero port	Ghanan Gowripalan
	Port 0 is not meant to identify any remote port so attempting to send a packet to it should return an error. PiperOrigin-RevId: 341009528
2020-11-05	Deflake semaphore_test.	Jamie Liu
	- Disable saving in tests that wait for EINTR. - Do not execute async-signal-unsafe code after fork() (see fork(2)'s manpage, "After a fork in a multithreaded program ...") - Check for errors returned by semctl(GETZCNT). PiperOrigin-RevId: 340901353
2020-11-02	Implement command GETZCNT for semctl.	Jing Chen
	PiperOrigin-RevId: 340389884
2020-11-02	Clean up the code of setupTimeWaitClose	Andrei Vagin
	The active_closefd has to be shut down only for write, otherwise the second poll will always return immediately. The second poll should not be called from a separate thread. PiperOrigin-RevId: 340319071
2020-11-01	Fix returned error when deleting non-existent address	Ian Lewis
	PiperOrigin-RevId: 340149214
2020-10-31	net/tcpip: connect to unset loopback address has to return EADDRNOTAVAIL	Andrei Vagin
	In the docker container, the ipv6 loopback address is not set, and connect("::1") has to return EADDRNOTAVAIL in this case. Without this fix, it returns EHOSTUNREACH. PiperOrigin-RevId: 340002915
2020-10-30	Separate kernel.Task.AsCopyContext() into CopyContext() and OwnCopyContext().	Jamie Liu
	kernel.copyContext{t} cannot be used outside of t's task goroutine, for three reasons: - t.CopyScratchBuffer() is task-goroutine-local. - Calling t.MemoryManager() without running on t's task goroutine or locking t.mu violates t.MemoryManager()'s preconditions. - kernel.copyContext passes t as context.Context to MM IO methods, which is illegal outside of t's task goroutine (cf. kernel.Task.Value()). Fix this by splitting AsCopyContext() into CopyContext() (which takes an explicit context.Context and is usable outside of the task goroutine) and OwnCopyContext() (which uses t as context.Context, but is only usable by t's task goroutine). PiperOrigin-RevId: 339933809
2020-10-28	Merge pull request #2849 from lubinszARM:pr_memory_barrier	gVisor bot
	PiperOrigin-RevId: 339504677
2020-10-27	Wake up any waiters on an ICMP error on UDP socket.	Bhasker Hariharan
	This change wakes up any waiters when we receive an ICMP port unreachable control packet on a UDP socket, and also sets waiter.EventErr in the result returned by Readiness() when e.lastError is not nil. The latter is required where an epoll()/poll() is done after the error is already handled, since we will never notify again in such cases. PiperOrigin-RevId: 339370469
2020-10-27	Implement /proc/[pid]/mem	Lennart
	This PR implements /proc/[pid]/mem for `pkg/sentry/fs` (refer to #2716) and `pkg/sentry/fsimpl`. @majek COPYBARA_INTEGRATE_REVIEW=https://github.com/google/gvisor/pull/4060 from lnsp:proc-pid-mem 2caf9021254646f441be618a9bb5528610e44d43 PiperOrigin-RevId: 339369629
2020-10-27	Add basic address deletion to netlink	Ian Lewis
	Updates #3921 PiperOrigin-RevId: 339195417
2020-10-26	Implement command IPC_STAT for semctl.	Jing Chen
	PiperOrigin-RevId: 339166854
2020-10-26	Fix SCM Rights S/R reference leak.	Dean Deng
	Control messages collected when peeking into a socket were being leaked. PiperOrigin-RevId: 339114961
2020-10-24	Avoid excessive save/restore cycles in socket_ipv4_udp_unbound tests.	Jamie Liu
	PiperOrigin-RevId: 338805321
2020-10-23	Support VFS2 save/restore.	Jamie Liu
	Inode number consistency checks are now skipped in save/restore tests for reasons described in greatest detail in StatTest.StateDoesntChangeAfterRename. They pass in VFS1 due to the bug described in new test case SimpleStatTest.DifferentFilesHaveDifferentDeviceInodeNumberPairs. Fixes #1663 PiperOrigin-RevId: 338776148
2020-10-23	Fix socket_ipv4_udp_unbound_loopback_test_linux	Zach Koopmans
	Handle "Resource temporarily unavailable" EAGAIN errors with a select call before calling recvmsg. Also rename similar helper call from "RecvMsgTimeout" to "RecvTimeout", because it calls "recv". PiperOrigin-RevId: 338761695
2020-10-23	Support getsockopt for SO_ACCEPTCONN.	Nayana Bidari
	The SO_ACCEPTCONN option is used only on getsockopt(). When this option is specified, getsockopt() indicates whether socket listening is enabled for the socket. A value of zero indicates that socket listening is disabled; non-zero that it is enabled. PiperOrigin-RevId: 338703206
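	From an application's point of view, the option can be queried as in the sketch below. It assumes a Linux host and uses only the standard `syscall` package; `acceptConn` and `beforeAndAfterListen` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"syscall"
)

// acceptConn returns the SO_ACCEPTCONN value for the socket fd.
func acceptConn(fd int) (int, error) {
	return syscall.GetsockoptInt(fd, syscall.SOL_SOCKET, syscall.SO_ACCEPTCONN)
}

// beforeAndAfterListen queries SO_ACCEPTCONN on a fresh TCP socket and
// again after the socket has been bound to loopback and put in listen mode.
func beforeAndAfterListen() (before, after int, err error) {
	fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
	if err != nil {
		return 0, 0, err
	}
	defer syscall.Close(fd)

	if before, err = acceptConn(fd); err != nil {
		return before, after, err
	}
	// Port is left at 0 so the kernel picks a free one.
	addr := &syscall.SockaddrInet4{Addr: [4]byte{127, 0, 0, 1}}
	if err = syscall.Bind(fd, addr); err != nil {
		return before, after, err
	}
	if err = syscall.Listen(fd, 1); err != nil {
		return before, after, err
	}
	after, err = acceptConn(fd)
	return before, after, err
}

func main() {
	before, after, err := beforeAndAfterListen()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("before listen:", before, "after listen:", after)
}
```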
2020-10-23	Decrement e.synRcvdCount once handshake is complete.	Bhasker Hariharan
	Earlier the count was dropped only after calling e.deliverAccepted. This led to an issue where there were no connections in SYN-RCVD state for the listening endpoint but e.synRcvdCount would not be zero, because it was being reduced only when handleSynSegment returned after deliverAccepted returned. This issue is seen when the Nth SYN for a listen backlog of size N, which would cause the listen backlog to be full, gets dropped occasionally. This happens when the new SYN arrives after the previously completed endpoint has been delivered to the accept queue but before synRcvdCount has been decremented, because the goroutine running handleSynSegment has not yet completed. PiperOrigin-RevId: 338690646
2020-10-23	Rewrite reference leak checker without finalizers.	Dean Deng
	Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. 
- Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
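	The map-based approach can be sketched as follows. This is a simplified illustration of the idea, not the gVisor leak checker; `registry`, `Register`, `Unregister`, and `Leaked` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// registry is a hypothetical table of live reference-counted objects.
// Values carry extra logging info so the objects themselves stay small.
type registry struct {
	mu   sync.Mutex
	live map[interface{}]string
}

func newRegistry() *registry {
	return &registry{live: make(map[interface{}]string)}
}

// Register records obj as live; call it when the object is created.
func (r *registry) Register(obj interface{}, info string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.live[obj] = info
}

// Unregister removes obj on destruction and panics if it was never
// registered, which catches objects that skipped leak-check setup.
func (r *registry) Unregister(obj interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.live[obj]; !ok {
		panic("unregistering an object that was never registered")
	}
	delete(r.live, obj)
}

// Leaked returns the info strings of all objects still live, sorted so
// the report is deterministic. Call it once at exit.
func (r *registry) Leaked() []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	out := make([]string, 0, len(r.live))
	for _, info := range r.live {
		out = append(out, info)
	}
	sort.Strings(out)
	return out
}

func main() {
	type file struct{ name string }
	reg := newRegistry()
	a, b := &file{"a"}, &file{"b"}
	reg.Register(a, "file a")
	reg.Register(b, "file b")
	reg.Unregister(a) // a is destroyed properly; b is never released
	fmt.Println("leaked at exit:", reg.Leaked())
}
```

	Because the report is produced by walking the map at exit, the result does not depend on when or whether garbage collection runs.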
2020-10-19	Fix runsc tests on VFS2 overlay.	Jamie Liu
	- Check the sticky bit in overlay.filesystem.UnlinkAt(). Fixes StickyTest.StickyBitPermDenied. - When configuring a VFS2 overlay in runsc, copy the lower layer's root owner/group/mode to the upper layer's root (as in the VFS1 equivalent, boot.addOverlay()). This makes the overlay root owned by UID/GID 65534 with mode 0755 rather than owned by UID/GID 0 with mode 01777. Fixes CreateTest.CreateFailsOnUnpermittedDir, which assumes that the test cannot create files in /. - MknodTest.UnimplementedTypesReturnError assumes that the creation of device special files is not supported. However, while the VFS2 gofer client still doesn't support device special files, VFS2 tmpfs does, and in the overlay test dimension mknod() targets a tmpfs upper layer. The test initially has all capabilities, including CAP_MKNOD, so its creation of these files succeeds. Constrain these tests to VFS1. - Rename overlay.nonDirectoryFD to overlay.regularFileFD and only use it for regular files, using the original FD for pipes and device special files. This is more consistent with Linux (which gets the original inode_operations, and therefore file_operations, for these file types from ovl_fill_inode() => init_special_inode()) and fixes remaining mknod and pipe tests. - Read/write 1KB at a time in PipeTest.Streaming, rather than 4 bytes. This isn't strictly necessary, but it makes the test less obnoxiously slow on ptrace. Fixes #4407 PiperOrigin-RevId: 337971042
2020-10-16	Use POSIX interval timers in flock test.	Dean Deng
	ualarm(2) is obsolete. Move IntervalTimer into a test util, where it can be used by flock tests. These tests were flaky with TSAN, probably because it slowed the tests down enough that the alarm was expiring before flock() was called. Use an interval timer so that even if we miss the first alarm (or more), flock() is still guaranteed to be interrupted. PiperOrigin-RevId: 337578751
2020-10-15	sockets: ignore io.EOF from view.ReadAt	Andrei Vagin
	Reported-by: syzbot+5466463b7604c2902875@syzkaller.appspotmail.com PiperOrigin-RevId: 337451896
2020-10-09	TCP Receive window advertisement fixes.	Bhasker Hariharan
	The fix in commit 028e045da93b7c1c26417e80e4b4e388b86a713d was incorrect, as it can cause the right edge of the window to shrink when we announce a zero window due to the receive buffer being full, because that is done before the check for whether the window is being shrunk by the selected window. Further, the window was calculated purely on available space, but in cases where we are getting full sized segments it makes more sense to use the actual bytes being held. This CL changes to use the lower of the total available space vs the available space in the maximal window we could advertise minus the actual payload bytes being held. This change also cleans up the code so that the window selection logic is not duplicated between getSendParams() and windowCrossedACKThresholdLocked. PiperOrigin-RevId: 336404827
2020-10-09	test/syscall/iptables: don't use designated initializers	Andrei Vagin
	test/syscalls/linux/iptables.cc:130:3: error: C99 designator 'name' outside aggregate initializer 130 | }; | PiperOrigin-RevId: 336331738
2020-10-06	Implement membarrier(2) commands other than *_SYNC_CORE.	Jamie Liu
	Updates #267 PiperOrigin-RevId: 335713923
2020-10-03	Fix kcov enabling and disabling procedures.	Dean Deng
	- When the KCOV_ENABLE_TRACE ioctl is called with the trace kind KCOV_TRACE_PC, the kcov mode should be set to KCOV_MODE_TRACE_PC. - When the owning task of kcov exits, the memory mapping should not be cleared so it can be used by other tasks. - Add more tests (also tested on native Linux kcov). PiperOrigin-RevId: 335202585
2020-09-30	ip6tables: redirect support	Kevin Krakauer
	Adds support for the IPv6-compatible redirect target. Redirection is a limited form of DNAT, where the destination is always the localhost. Updates #3549. PiperOrigin-RevId: 334698344
2020-09-30	avoid the random memory barrier issue in mmap testing on Arm64	Bin Lu
	There is a new random issue on some Arm64 machines. The scenario can be summarized as follows: sometimes, the content at the func() pointer is still a 0 opcode. The probability of this kind of issue is very low, and it is currently seen only on some machines. After inserting a simple memory barrier, the issue was gone. The code that uses the memory barrier directly is as follows: memcpy(reinterpret_cast<void*>(addr), machine_code, sizeof(machine_code)); isb() func = reinterpret_cast<uint32_t (*)(void)>(addr); Signed-off-by: Bin Lu <bin.lu@arm.com>
2020-09-29	Add /proc/[pid]/cwd	Fabricio Voznika
	PiperOrigin-RevId: 334478850
2020-09-29	iptables: refactor to make targets extendable	Kevin Krakauer
	Like matchers, targets should use a module-like register/lookup system. This replaces the brittle switch statements we had before. The only behavior change is supporting IPT_GET_REVISION_TARGET. This makes it much easier to add IPv6 redirect in the next change. Updates #3549. PiperOrigin-RevId: 334469418