summaryrefslogtreecommitdiffhomepage
path: root/pkg
AgeCommit message (Collapse)Author
2019-06-10Store more information in the kernel socket table.Rahat Mahmood
Store enough information in the kernel socket table to distinguish between different types of sockets. Previously we were only storing the socket family, but this isn't enough to classify sockets. For example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets are SOCK_DGRAM sockets with a particular protocol. Instead of creating more sub-tables, flatten the socket table and provide a filtering mechanism based on the socket entry. Also generate and store a socket entry index ("sl" in linux) which allows us to output entries in a stable order from procfs. PiperOrigin-RevId: 252495895
2019-06-07Move //pkg/sentry/memutil to //pkg/memutil.Jamie Liu
PiperOrigin-RevId: 252124156
2019-06-06Change visibility of //pkg/sentry/time.Jamie Liu
PiperOrigin-RevId: 251965598
2019-06-06Cap initial usermem.CopyStringIn buffer size.Jamie Liu
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this size is measurably inefficient in most cases: most paths will not be this long, 4 KB is a lot of bytes to zero, and as of this writing the Go runtime allocator maps only two 4 KB objects to each 8 KB span, necessitating a call to runtime.mcache.refill() on ~every other call. Limit the initial buffer size to 256 B instead, and geometrically reallocate if necessary. PiperOrigin-RevId: 251960441
2019-06-06Use common definition of SockType.Rahat Mahmood
SockType isn't specific to unix domain sockets, and the current definition basically mirrors the linux ABI's definition. PiperOrigin-RevId: 251956740
2019-06-06Copy up parent when binding UDS on overlayfsFabricio Voznika
Overlayfs was expecting the parent to exist when bind(2) was called, which may not be the case. The fix is to copy the parent directory to the upper layer before binding the UDS. There is not good place to add tests for it. Syscall tests would be ideal, but it's hard to guarantee that the directory where the socket is created hasn't been touched before (and thus copied the parent to the upper layer). Added it to runsc integration tests for now. If it turns out we have lots of these kind of tests, we can consider moving them somewhere more appropriate. PiperOrigin-RevId: 251954156
2019-06-06"Implement" mbind(2).Jamie Liu
We still only advertise a single NUMA node, and ignore mempolicy accordingly, but mbind() at least now succeeds and has effects reflected by get_mempolicy(). Also fix handling of nodemasks: round sizes to unsigned long (as documented and done by Linux), and zero trailing bits when copying them out. PiperOrigin-RevId: 251950859
2019-06-06Implement reclaim-driven MemoryFile eviction.Jamie Liu
PiperOrigin-RevId: 251950660
2019-06-06Track and export socket state.Rahat Mahmood
This is necessary for implementing network diagnostic interfaces like /proc/net/{tcp,udp,unix} and sock_diag(7). For pass-through endpoints such as hostinet, we obtain the socket state from the backend. For netstack, we add explicit tracking of TCP states. PiperOrigin-RevId: 251934850
2019-06-06Add multi-fd support to fdbased endpoint.Bhasker Hariharan
This allows an fdbased endpoint to have multiple underlying fd's from which packets can be read and dispatched/written to. This should allow for higher throughput as well as better scalability of the network stack as number of connections increases. Updates #231 PiperOrigin-RevId: 251852825
2019-06-05netstack/sniffer: log GSO attributesAndrei Vagin
PiperOrigin-RevId: 251788534
2019-06-05Shutdown host sockets on internal shutdownMichael Pratt
This is required to make the shutdown visible to peers outside the sandbox. The readClosed / writeClosed fields were dropped, as they were preventing a shutdown socket from reading the remainder of queued bytes. The host syscalls will return the appropriate errors for shutdown. The control message tests have been split out of socket_unix.cc to make the (few) remaining tests accessible to testing inherited host UDS, which don't support sending control messages. Updates #273 PiperOrigin-RevId: 251763060
2019-06-05netstack/tcp: fix calculating a number of outstanding packetsAndrei Vagin
In case of GSO, a segment can container more than one packet and we need to use the pCount() helper to get a number of packets. PiperOrigin-RevId: 251743020
2019-06-05Adjust route when looping multicast packetsChris Kuiper
Multicast packets are special in that their destination address does not identify a specific interface. When sending out such a packet the multicast address is the remote address, but for incoming packets it is the local address. Hence, when looping a multicast packet, the route needs to be tweaked to reflect this. PiperOrigin-RevId: 251739298
2019-06-05Implement dumpability tracking and checksMichael Pratt
We don't actually support core dumps, but some applications want to get/set dumpability, which still has an effect in procfs. Lack of support for set-uid binaries or fs creds simplifies things a bit. As-is, processes started via CreateProcess (i.e., init and sentryctl exec) have normal dumpability. I'm a bit torn on whether sentryctl exec tasks should be dumpable, but at least since they have no parent normal UID/GID checks should protect them. PiperOrigin-RevId: 251712714
2019-06-04Fix data race in synRcvdState.Bhasker Hariharan
When checking the length of the acceptedChan we should hold the endpoint mutex otherwise a syn received while the listening socket is being closed can result in a data race where the cleanupLocked routine sets acceptedChan to nil while a handshake goroutine in progress could try and check it at the same time. PiperOrigin-RevId: 251537697
2019-06-04Drop one dirent reference after referenced by fileYong He
When pipe is created, a dirent of pipe will be created and its initial reference is set as 0. Cause all dirent will only be destroyed when the reference decreased to -1, so there is already a 'initial reference' of dirent after it created. For destroying dirent after all reference released, the correct way is to drop the 'initial reference' once someone hold a reference to the dirent, such as fs.NewFile, otherwise the reference of dirent will stay 0 all the time, and will cause memory leak of dirent. Except pipe, timerfd/eventfd/epoll has the same problem Here is a simple case to create memory leak of dirent for pipe/timerfd/eventfd/epoll in C langange, after run the case, pprof the runsc process, you will find lots dirents of pipe/timerfd/eventfd/epoll not freed: int main(int argc, char *argv[]) { int i; int n; int pipefd[2]; if (argc != 3) { printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]); } n = strtol(argv[2], NULL, 10); if (strcmp(argv[1], "epoll") == 0) { for (i = 0; i < n; ++i) close(epoll_create(1)); } else if (strcmp(argv[1], "timerfd") == 0) { for (i = 0; i < n; ++i) close(timerfd_create(CLOCK_REALTIME, 0)); } else if (strcmp(argv[1], "eventfd") == 0) { for (i = 0; i < n; ++i) close(eventfd(0, 0)); } else if (strcmp(argv[1], "pipe") == 0) { for (i = 0; i < n; ++i) if (pipe(pipefd) == 0) { close(pipefd[0]); close(pipefd[1]); } } printf("%s %s test finished\r\n",argv[1],argv[2]); return 0; } Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2 PiperOrigin-RevId: 251531096
2019-06-04Remove the Dirent field from Pipe.Nicolas Lacasse
Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe raises difficult questions about the lifecycle of the Pipe and Dirent. Fortunately, we can side-step those questions by removing the Dirent field from Pipe entirely. We only need the Dirent when constructing fs.Files (which are ref-counted), and in GetFile (when a Dirent is passed to us anyways). PiperOrigin-RevId: 251497628
2019-06-03gvisor/sock/unix: pass creds when a message is sent between unconnected socketsAndrei Vagin
and don't report a sender address if it doesn't have one PiperOrigin-RevId: 251371284
2019-06-03gvisor/fs: return a proper error from FileWriter.Write in case of a short-writeAndrei Vagin
The io.Writer contract requires that Write writes all available bytes and does not return short writes. This causes errors with io.Copy, since our own Write interface does not have this same contract. PiperOrigin-RevId: 251368730
2019-06-03Delete debug log lines left by mistake.Bhasker Hariharan
Updates #236 PiperOrigin-RevId: 251337915
2019-06-03gvisor: validate a new map region in the mremap syscallAndrei Vagin
Right now, mremap allows to remap a memory region over MaxUserAddress, this means that we can change the stub region. PiperOrigin-RevId: 251266886
2019-05-31Disable certain tests that are flaky under race detector.Bhasker Hariharan
PiperOrigin-RevId: 250976665
2019-05-31Change segment queue limit to be of fixed size.Bhasker Hariharan
Netstack sets the unprocessed segment queue size to match the receive buffer size. This is not required as this queue only needs to hold enough for a short duration before the endpoint goroutine can process it. Updates #230 PiperOrigin-RevId: 250976323
2019-05-30Simplify overlayBoundEndpoint.Nicolas Lacasse
There is no reason to do the recursion manually, since Inode.BoundEndpoint will do it for us. PiperOrigin-RevId: 250794903
2019-05-30Add build guard to files using go:linknameFabricio Voznika
Funcion signatures are not validated during compilation. Since they are not exported, they can change at any time. The guard ensures that they are verified at least on every version upgrade. PiperOrigin-RevId: 250733742
2019-05-30Fixes to TCP listen behavior.Bhasker Hariharan
Netstack listen loop can get stuck if cookies are in-use and the app is slow to accept incoming connections. Further we continue to complete handshake for a connection even if the backlog is full. This creates a problem when a lots of connections come in rapidly and we end up with lots of completed connections just hanging around to be delivered. These fixes change netstack behaviour to mirror what linux does as described here in the following article http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html Now when cookies are not in-use Netstack will silently drop the ACK to a SYN-ACK and not complete the handshake if the backlog is full. This will result in the connection staying in a half-complete state. Eventually the sender will retransmit the ACK and if backlog has space we will transition to a connected state and deliver the endpoint. Similarly when cookies are in use we do not try and create an endpoint unless there is space in the accept queue to accept the newly created endpoint. If there is no space then we again silently drop the ACK as we can just recreate it when the ACK is retransmitted by the peer. We also now use the backlog to cap the size of the SYN-RCVD queue for a given endpoint. So at any time there can be N connections in the backlog and N in a SYN-RCVD state if the application is not accepting connections. Any new SYNs will be dropped. This CL also fixes another small bug where we mark a new endpoint which has not completed handshake as connected. We should wait till handshake successfully completes before marking it connected. Updates #236 PiperOrigin-RevId: 250717817
2019-05-30Update procid for Go 1.13Michael Pratt
Upstream Go has no changes here. PiperOrigin-RevId: 250602731
2019-05-30Add VmData field to /proc/{pid}/statuschris.zn
VmData is the size of private data segments. It has the same meaning as in Linux. Change-Id: Iebf1ae85940a810524a6cde9c2e767d4233ddb2a PiperOrigin-RevId: 250593739
2019-05-30Add support for collecting execution trace to runsc.Bhasker Hariharan
Updates #220 PiperOrigin-RevId: 250532302
2019-05-30gvisor: socket() returns EPROTONOSUPPORT if protocol is not supportedAndrei Vagin
PiperOrigin-RevId: 250426407
2019-05-30Always wait on tracee childrenMichael Pratt
After bf959931ddb88c4e4366e96dd22e68fa0db9527c ("wait/ptrace: assume __WALL if the child is traced") (Linux 4.7), tracees are always eligible for waiting, regardless of type. PiperOrigin-RevId: 250399527
2019-05-30Remove obsolete bug.Adin Scannell
The original bug is no longer relevant, and the FIXME here contains lots of obsolete information. PiperOrigin-RevId: 249924036
2019-05-24Remove obsolete TODO.Adin Scannell
We don't need to model internal interfaces after the system call interfaces (which are objectively worse and simply use a flag to distinguish between two logically different operations). PiperOrigin-RevId: 249916814 Change-Id: I45d02e0ec0be66b782a685b1f305ea027694cab9
2019-05-24Wrap comments and reword in common present tenseMichael Pratt
PiperOrigin-RevId: 249888234 Change-Id: Icfef32c3ed34809c34100c07e93e9581c786776e
2019-05-24Remove unused wakersTamir Duberstein
These wakers are uselessly allocated and passed around; nothing ever listens for notifications on them. The code here appears to be vestigial, so removing it and allowing a nil waker to be passed seems appropriate. PiperOrigin-RevId: 249879320 Change-Id: Icd209fb77cc0dd4e5c49d7a9f2adc32bf88b4b71
2019-05-23gvisor: interrupt the sendfile system call if a task has been interruptedAndrei Vagin
sendfile can be called for a big range and it can require significant amount of time to process it, so we need to handle task interrupts in this system call. PiperOrigin-RevId: 249781023 Change-Id: Ifc2ec505d74c06f5ee76f93b8d30d518ec2d4015
2019-05-23Added boilerplate code for ext4 fs.Ayush Ranjan
Initialized BUILD with license Mount is still unimplemented and is not meant to be part of this CL. Rest of the fs interface is implemented. Referenced the Linux kernel appropriately when needed PiperOrigin-RevId: 249741997 Change-Id: Id1e4c7c9e68b3f6946da39896fc6a0c3dcd7f98c
2019-05-23Initial support for bind mountsFabricio Voznika
Separate MountSource from Mount. This is needed to allow mounts to be shared by multiple containers within the same pod. PiperOrigin-RevId: 249617810 Change-Id: Id2944feb7e4194951f355cbe6d4944ae3c02e468
2019-05-22Fix the signature for gopark.Bhasker Hariharan
gopark's signature was changed from having a string reason to a uint8. See: https://github.com/golang/go/commit/4d7cf3fedbc382215df5ff6167ee9782a9cc9375 This broke execution tracing of the sentry. Switching to the right signature makes tracing work again. Updates #220 PiperOrigin-RevId: 249565311 Change-Id: If77fd276cecb37d4003c8222f6de510b8031a074
2019-05-22Log unhandled faults only at DEBUG level.Adin Scannell
PiperOrigin-RevId: 249561399 Change-Id: Ic73c68c8538bdca53068f38f82b7260939addac2
2019-05-22Add WCLONE / WALL support to waitidMichael Pratt
The previous commit adds WNOTHREAD support to waitid, so we may as well complete the upstream change. Linux added WCLONE, WALL, WNOTHREAD support to waitid(2) in 91c4e8ea8f05916df0c8a6f383508ac7c9e10dba ("wait: allow sys_waitid() to accept __WNOTHREAD/__WCLONE/__WALL"). i.e., Linux 4.7. PiperOrigin-RevId: 249560587 Change-Id: Iff177b0848a3f7bae6cb5592e44500c5a942fbeb
2019-05-22Remove obsolete TODO.Adin Scannell
There no obvious reason to require that BlockSize and StatFS are MountSource operations. Today they are in INodeOperations, and they can be moved elsewhere in the future as part of a normal refactor process. PiperOrigin-RevId: 249549982 Change-Id: Ib832e02faeaf8253674475df4e385bcc53d780f3
2019-05-22Add support for wait(WNOTHREAD)Michael Pratt
PiperOrigin-RevId: 249537694 Change-Id: Iaa4bca73a2d8341e03064d59a2eb490afc3f80da
2019-05-22UDP and TCP raw socket support.Kevin Krakauer
PiperOrigin-RevId: 249511348 Change-Id: I34539092cc85032d9473ff4dd308fc29dc9bfd6b
2019-05-22Move wait constants to abi/linux packageMichael Pratt
Updates #214 PiperOrigin-RevId: 249483756 Change-Id: I0d3cf4112bed75a863d5eb08c2063fbc506cd875
2019-05-21Clean up pipe internals and add fcntl supportAdin Scannell
Pipe internals are made more efficient by avoiding garbage collection. A pool is now used that can be shared by all pipes, and buffers are chained via an intrusive list. The documentation for pipe structures and methods is also simplified and clarified. The pipe tests are now parameterized, so that they are run on all different variants (named pipes, small buffers, default buffers). The pipe buffer sizes are exposed by fcntl, which is now supported by this change. A size change test has been added to the suite. These new tests uncovered a bug regarding the semantics of open named pipes with O_NONBLOCK, which is also fixed by this CL. This fix also addresses the lack of the O_LARGEFILE flag for named pipes. PiperOrigin-RevId: 249375888 Change-Id: I48e61e9c868aedb0cadda2dff33f09a560dee773
2019-05-21Fix inconsistencies in ELF anonymous mappingsMichael Pratt
* A segment with filesz == 0, memsz > 0 should be an anonymous only mapping. We were failing to load such an ELF. * Anonymous pages are always mapped RW, regardless of the segment protections. PiperOrigin-RevId: 249355239 Change-Id: I251e5c0ce8848cf8420c3aadf337b0d77b1ad991
2019-05-21Refactor fdbased endpoint dispatcher code.Bhasker Hariharan
This is in preparation to support an fdbased endpoint that can read/dispatch packets from multiple underlying fds. Updates #231 PiperOrigin-RevId: 249337074 Change-Id: Id7d375186cffcf55ae5e38986e7d605a96916d35
2019-05-21Add basic plumbing for splice and stub implementation.Adin Scannell
This does not actually implement an efficient splice or sendfile. Rather, it adds a generic plumbing to the file internals so that this can be added. All file implementations use the stub fileutil.NoSplice implementation, which causes sendfile and splice to fall back to an internal copy. A basic splice system call interface is added, along with a test. PiperOrigin-RevId: 249335960 Change-Id: Ic5568be2af0a505c19e7aec66d5af2480ab0939b