|
Methods that do more traversals should use the remaining count rather than
resetting it.
PiperOrigin-RevId: 254041720
|
|
This allows tasks to have distinct mount namespaces, instead of all sharing the
kernel's root mount namespace.
Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).
In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.
Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better name.
PiperOrigin-RevId: 254009310
|
|
The test fails because it reads 4KB instead of the
expected 64KB. Changed the test to read the pipe buffer
size instead of hardcoding it, and added some logging in
case the failure was not caused by the pipe buffer size.
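For reference, the pipe buffer size can be queried at runtime rather than assumed.
A minimal Go sketch using golang.org/x/sys/unix (illustration only, not the actual
test code):

    package pipesize

    import (
            "fmt"

            "golang.org/x/sys/unix"
    )

    // pipeBufferSize returns the kernel's buffer size for the given pipe fd,
    // instead of hardcoding 64KB, using the F_GETPIPE_SZ fcntl.
    func pipeBufferSize(fd int) (int, error) {
            sz, err := unix.FcntlInt(uintptr(fd), unix.F_GETPIPE_SZ, 0)
            if err != nil {
                    return 0, fmt.Errorf("F_GETPIPE_SZ failed: %w", err)
            }
            return sz, nil
    }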
PiperOrigin-RevId: 253916040
|
|
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.
With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.
PiperOrigin-RevId: 253746644
|
|
When the leader of a process group (session) exits, the process
group ID (session ID) is still held by other processes in
the process group, so the process group ID (session ID)
cannot be reused.
If that process group ID (session ID) is reused as the process
group ID of a new process, session creation fails and runsc
later crashes when accessing the process group.
The fix skips a TID when allocating a new one if it is still
in use by a process group (session).
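Conceptually, the allocation check looks like the following Go sketch (hypothetical
names and bookkeeping, not the actual kernel code):

    package tidsketch

    // pidNamespace is a hypothetical stand-in for the kernel's PID namespace
    // bookkeeping; only the fields needed for the check are shown.
    type pidNamespace struct {
            tasks         map[int]struct{} // TIDs of live tasks
            processGroups map[int]struct{} // IDs still referenced by process groups
            sessions      map[int]struct{} // IDs still referenced by sessions
            last          int
            max           int
    }

    // allocateTID returns the next TID that is not in use by a task, a process
    // group, or a session, so a still-referenced group/session ID is never reused.
    func (ns *pidNamespace) allocateTID() (int, bool) {
            for i := 0; i < ns.max; i++ {
                    tid := ns.last + 1 + i
                    if tid > ns.max {
                            tid = tid - ns.max // wrap around
                    }
                    if _, ok := ns.tasks[tid]; ok {
                            continue
                    }
                    if _, ok := ns.processGroups[tid]; ok {
                            continue // skip: still in use as a process group ID
                    }
                    if _, ok := ns.sessions[tid]; ok {
                            continue // skip: still in use as a session ID
                    }
                    ns.last = tid
                    return tid, true
            }
            return 0, false // no TID available
    }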
The runsc crash can easily be reproduced with the following steps:
1. Build the test program and run it inside the container:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    pid_t cpid, spid;
    cpid = fork();
    if (cpid == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
    if (cpid == 0) {
        /* Child: start a new session (becomes session and group leader). */
        pid_t sid = setsid();
        printf("Start New Session %ld\n", (long)sid);
        printf("Child PID %ld / PPID %ld / PGID %ld / SID %ld\n",
               (long)getpid(), (long)getppid(),
               (long)getpgid(getpid()), (long)getsid(getpid()));
        spid = fork();
        if (spid == 0) {
            /* Grandchild: move into its own process group and keep running. */
            setpgid(getpid(), getpid());
            printf("Set GrandSon as New Process Group\n");
            printf("GrandSon PID %ld / PPID %ld / PGID %ld / SID %ld\n",
                   (long)getpid(), (long)getppid(),
                   (long)getpgid(getpid()), (long)getsid(getpid()));
            while (1) {
                usleep(1);
            }
        }
        sleep(3);
        exit(0);
    } else {
        /* Parent exits immediately; the group/session IDs stay in use. */
        exit(0);
    }
    return 0;
}
2. Build the hello program:
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    printf("Current PID is %ld\n", (long)getpid());
    return 0;
}
3. Run a script on the host that runs hello inside the container; you can
speed up the test by setting TasksLimit to a lower value.
for (( i=0; i<65535; i++ ))
do
docker exec <container id> /test/hello
done
4. When the hello process reuses the process group ID held by the looping
process, runsc crashes:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x79f0c8]
goroutine 612475 [running]:
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*ProcessGroup).decRefWithParent(0x0, 0x0)
pkg/sentry/kernel/sessions.go:160 +0x78
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).exitNotifyLocked(0xc000663500, 0x0)
pkg/sentry/kernel/task_exit.go:672 +0x2b7
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*runExitNotify).execute(0x0, 0xc000663500, 0x0, 0x0)
pkg/sentry/kernel/task_exit.go:542 +0xc4
gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).run(0xc000663500, 0xc)
pkg/sentry/kernel/task_run.go:91 +0x194
created by gvisor.googlesource.com/gvisor/pkg/sentry/kernel.(*Task).Start
pkg/sentry/kernel/task_start.go:286 +0xfe
|
|
The implementation is similar to Linux, where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.
This ensures that the advertised window grows at a reasonable rate to accommodate
the sender's rate, and prevents large amounts of data from being held in stack
buffers if the application is not actively reading or not reading fast enough.
The original paper that was used to implement the Linux receive buffer auto-
tuning is available at https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf
NOTE: Linux does not implement DRS exactly as defined in that paper; it's just a
good reference for understanding the solution space.
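A minimal Go sketch of the idea, with hypothetical names rather than the actual
netstack types: track how many bytes the application copied out over roughly one
RTT, and grow the receive buffer when consumption approaches the current size.

    package tcpsketch

    import "time"

    type rcvAutoParams struct {
            copied     int           // bytes consumed by the application since the last adjustment
            measureAt  time.Time     // start of the current measurement interval
            rtt        time.Duration // smoothed round-trip time estimate
            rcvBufSize int           // current receive buffer size
            maxRcvBuf  int           // upper bound on the receive buffer
    }

    // onRead is called when the application copies n bytes out of the endpoint.
    func (p *rcvAutoParams) onRead(n int, now time.Time) {
            p.copied += n
            if now.Sub(p.measureAt) < p.rtt {
                    return // keep accumulating within the current RTT window
            }
            // If the application consumed a large fraction of the buffer in one
            // RTT, the advertised window is likely limiting the sender; grow the
            // buffer (here by doubling, capped at the maximum).
            if p.copied > p.rcvBufSize/2 && p.rcvBufSize < p.maxRcvBuf {
                    p.rcvBufSize *= 2
                    if p.rcvBufSize > p.maxRcvBuf {
                            p.rcvBufSize = p.maxRcvBuf
                    }
            }
            p.copied = 0
            p.measureAt = now
    }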
Updates #230
PiperOrigin-RevId: 253168283
|
|
All functions which allocate objects containing AtomicRefCounts will soon need
a context.
PiperOrigin-RevId: 253147709
|
|
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.
This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.
The new test is related, but does not cover this particular situation.
Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.
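Schematically, the fix breaks a classic lock-ordering cycle. A hedged Go sketch
with hypothetical stand-ins for waiter.Queue and FileAsync (not the actual gVisor
types): the call into the queue happens only after the FileAsync mutex is dropped.

    package asyncsketch

    import "sync"

    type Entry struct{}

    type Queue struct {
            mu      sync.RWMutex
            entries map[*Entry]struct{}
    }

    func (q *Queue) EventUnregister(e *Entry) {
            q.mu.Lock()
            defer q.mu.Unlock()
            delete(q.entries, e)
    }

    type FileAsync struct {
            mu         sync.Mutex
            registered bool
            e          Entry
    }

    // Unregister releases f.mu before calling into the queue. Notify holds
    // q.mu while invoking the FileAsync callback (which takes f.mu), so if
    // Unregister called EventUnregister with f.mu still held, the two paths
    // could each hold one lock and wait on the other. Dropping f.mu first
    // breaks the cycle.
    func (f *FileAsync) Unregister(q *Queue) {
            f.mu.Lock()
            wasRegistered := f.registered
            f.registered = false
            f.mu.Unlock()
            if wasRegistered {
                    q.EventUnregister(&f.e)
            }
    }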
Fixes #346
PiperOrigin-RevId: 253138340
|
|
SO_TYPE was already implemented for everything but netlink sockets.
PiperOrigin-RevId: 253138157
|
|
This can be merged after:
https://github.com/google/gvisor-website/pull/77
or
https://github.com/google/gvisor-website/pull/78
PiperOrigin-RevId: 253132620
|
|
PiperOrigin-RevId: 253122166
|
|
PiperOrigin-RevId: 252918338
|
|
Change-Id: I7457a11de4725e1bf3811420c505d225b1cb6943
|
|
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.
PiperOrigin-RevId: 252889093
|
|
PiperOrigin-RevId: 252855280
|
|
For sendfile(2), we propagate a TCP error through the system call layer.
This should be eaten if there is a partial result. This change also adds
a test to ensure that there is no panic in this case, for both TCP sockets
and unix domain sockets.
PiperOrigin-RevId: 252746192
|
|
Parse annotations containing 'gvisor.dev/spec/mount' that give
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.
For now, this information is used to allow the same tmpfs mounts
to be shared between containers, which wasn't possible before.
PiperOrigin-RevId: 252704037
|
|
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.
Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this information to debug, to
check the compatibility of the version of runsc they are running, or to
generate documentation.
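A rough Go sketch of the kind of metadata involved; field and constant names are
illustrative, not the actual runsc types.

    package syscallinfo

    // SupportLevel describes how completely a syscall is implemented.
    type SupportLevel int

    const (
            Unimplemented SupportLevel = iota
            Partial
            Full
    )

    // SyscallInfo sketches the per-syscall metadata that can be registered
    // alongside a syscall table entry and exported by 'runsc help syscalls'.
    type SyscallInfo struct {
            Num     uintptr      // syscall number on linux/amd64
            Name    string       // e.g. "read"
            Support SupportLevel // support level
            Note    string       // free-form compatibility notes
            URLs    []string     // links to relevant issues
    }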
PiperOrigin-RevId: 252558304
|
|
PiperOrigin-RevId: 252501653
|
|
Changes netstack to conform to current Linux behaviour, where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly, we allow up to
backlog connections to be in SYN-RCVD state as long as the backlog is not full.
We also now drop a SYN if SYN cookies are in use and the backlog for the
listening endpoint is full.
Added new tests to confirm the behaviour.
Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.
Fixes #236
PiperOrigin-RevId: 252500462
|
|
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.
Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.
Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.
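For illustration, a hedged Go sketch of a flattened table with a stable index and
a filter (hypothetical names, not the actual kernel package types):

    package socktable

    import "sync"

    // Entry is a hypothetical flattened socket table record.
    type Entry struct {
            SL       uint64 // stable index, analogous to "sl" in Linux procfs output
            Family   int    // e.g. AF_INET
            Type     int    // e.g. SOCK_DGRAM
            Protocol int    // e.g. IPPROTO_ICMP
            Sock     interface{}
    }

    type Table struct {
            mu      sync.Mutex
            nextSL  uint64
            entries map[uint64]*Entry
    }

    // Add assigns the next index and records the socket in a single flat table.
    func (t *Table) Add(e *Entry) {
            t.mu.Lock()
            defer t.mu.Unlock()
            if t.entries == nil {
                    t.entries = make(map[uint64]*Entry)
            }
            t.nextSL++
            e.SL = t.nextSL
            t.entries[e.SL] = e
    }

    // Filter returns entries matching a predicate, e.g. "TCPv4 sockets only",
    // instead of maintaining per-type sub-tables.
    func (t *Table) Filter(keep func(*Entry) bool) []*Entry {
            t.mu.Lock()
            defer t.mu.Unlock()
            var out []*Entry
            for _, e := range t.entries {
                    if keep(e) {
                            out = append(out, e)
                    }
            }
            return out
    }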
PiperOrigin-RevId: 252495895
|
|
Change-Id: I83ae1079f3dcba6b018f59ab7898decab5c211d2
|
|
PiperOrigin-RevId: 252124156
|
|
Change-Id: I7445033b1970cbba3f2ed0682fe520dce02d8fad
|
|
PiperOrigin-RevId: 251965598
|
|
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which
passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this
size is measurably inefficient in most cases: most paths will not be
this long, 4 KB is a lot of bytes to zero, and as of this writing the Go
runtime allocator maps only two 4 KB objects to each 8 KB span,
necessitating a call to runtime.mcache.refill() on ~every other call.
Limit the initial buffer size to 256 B instead, and geometrically
reallocate if necessary.
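The growth strategy looks roughly like the following Go sketch; readMem is a
stand-in for reading task memory and the real CopyStringIn API differs.

    package copysketch

    import "bytes"

    // copyInString illustrates the geometric-growth idea: read into a small
    // buffer first and only grow (by doubling) if no NUL terminator is found,
    // instead of allocating maxlen (e.g. PATH_MAX = 4096) bytes up front.
    func copyInString(readMem func(addr uintptr, buf []byte) (int, error), addr uintptr, maxlen int) (string, error) {
            bufLen := 256
            if bufLen > maxlen {
                    bufLen = maxlen
            }
            var got []byte
            for {
                    buf := make([]byte, bufLen-len(got))
                    n, err := readMem(addr+uintptr(len(got)), buf)
                    got = append(got, buf[:n]...)
                    if i := bytes.IndexByte(got, 0); i >= 0 {
                            return string(got[:i]), nil // found the terminating NUL
                    }
                    if err != nil {
                            return string(got), err
                    }
                    if len(got) >= maxlen {
                            return string(got), nil // caller decides how to handle the overlong path
                    }
                    bufLen *= 2 // geometric growth
                    if bufLen > maxlen {
                            bufLen = maxlen
                    }
            }
    }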
PiperOrigin-RevId: 251960441
|
|
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.
PiperOrigin-RevId: 251956740
|
|
Overlayfs was expecting the parent directory to exist in the
upper layer when bind(2) was called, which may not be the case.
The fix is to copy the parent directory to the upper layer
before binding the UDS.
There is no good place to add tests for this. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus already copied the parent to the upper layer).
Added it to the runsc integration tests for now. If it turns
out we have lots of these kinds of tests, we can consider
moving them somewhere more appropriate.
PiperOrigin-RevId: 251954156
|
|
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().
Also fix handling of nodemasks: round sizes up to a multiple of the size
of an unsigned long (as documented and done by Linux), and zero trailing
bits when copying them out.
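Roughly, the rounding and masking look like this Go sketch (illustration only,
not the actual mempolicy code):

    package nodemask

    const bitsPerULong = 64 // unsigned long on linux/amd64

    // allocNodemask rounds the requested number of node bits up to a whole
    // number of unsigned longs, as Linux documents for get_mempolicy/mbind.
    func allocNodemask(maxNode uint64) []uint64 {
            words := (maxNode + bitsPerULong - 1) / bitsPerULong
            return make([]uint64, words)
    }

    // clearTrailing zeroes the bits above maxNode before the mask is copied
    // out to the application, so unused trailing bits never leak.
    func clearTrailing(mask []uint64, maxNode uint64) {
            if rem := maxNode % bitsPerULong; rem != 0 && len(mask) > 0 {
                    mask[len(mask)-1] &= (uint64(1) << rem) - 1
            }
    }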
PiperOrigin-RevId: 251950859
|
|
PiperOrigin-RevId: 251950660
|
|
This is necessary for implementing network diagnostic interfaces like
/proc/net/{tcp,udp,unix} and sock_diag(7).
For pass-through endpoints such as hostinet, we obtain the socket
state from the backend. For netstack, we add explicit tracking of TCP
states.
PiperOrigin-RevId: 251934850
|
|
This allows an fdbased endpoint to have multiple underlying fds from which
packets can be read and dispatched, and to which they can be written.
This should allow for higher throughput as well as better scalability of the
network stack as the number of connections increases.
Updates #231
PiperOrigin-RevId: 251852825
|
|
PiperOrigin-RevId: 251788534
|
|
This is required to make the shutdown visible to peers outside the
sandbox.
The readClosed / writeClosed fields were dropped, as they were
preventing a shutdown socket from reading the remainder of queued bytes.
The host syscalls will return the appropriate errors for shutdown.
The control message tests have been split out of socket_unix.cc to make
the (few) remaining tests accessible to testing inherited host UDS,
which don't support sending control messages.
Updates #273
PiperOrigin-RevId: 251763060
|
|
In the case of GSO, a segment can contain more than one packet,
and we need to use the pCount() helper to get the number of packets.
PiperOrigin-RevId: 251743020
|
|
Multicast packets are special in that their destination address does not
identify a specific interface. When sending out such a packet the multicast
address is the remote address, but for incoming packets it is the local
address. Hence, when looping a multicast packet, the route needs to be
tweaked to reflect this.
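Schematically, the tweak when looping the packet back looks like this Go sketch
(hypothetical Route type, not the actual netstack code):

    package mcast

    // Route is a hypothetical stand-in for a netstack route.
    type Route struct {
            LocalAddress  string
            RemoteAddress string
    }

    // loopbackRoute builds the route used to deliver an outgoing multicast
    // packet back to the sending stack: on the way out the multicast group is
    // the remote address, but for the looped (incoming) copy it must appear as
    // the local address, with the sender as the remote.
    func loopbackRoute(out Route) Route {
            return Route{
                    LocalAddress:  out.RemoteAddress, // the multicast group
                    RemoteAddress: out.LocalAddress,  // the sender itself
            }
    }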
PiperOrigin-RevId: 251739298
|
|
We don't actually support core dumps, but some applications want to
get/set dumpability, which still has an effect in procfs.
Lack of support for set-uid binaries or fs creds simplifies things a
bit.
As-is, processes started via CreateProcess (i.e., init and sentryctl
exec) have normal dumpability. I'm a bit torn on whether sentryctl exec
tasks should be dumpable, but since they have no parent, normal
UID/GID checks should at least protect them.
PiperOrigin-RevId: 251712714
|
|
When checking the length of acceptedChan we should hold the
endpoint mutex; otherwise a SYN received while the listening socket
is being closed can result in a data race, where the cleanupLocked
routine sets acceptedChan to nil while an in-progress handshake
goroutine checks it at the same time.
PiperOrigin-RevId: 251537697
|
|
When a pipe is created, a dirent for the pipe is created
with its initial reference count set to 0.
Because a dirent is only destroyed once its reference count
drops to -1, the dirent effectively already holds an
'initial reference' when it is created.
To destroy the dirent after all references are released,
the correct approach is to drop that 'initial reference'
as soon as someone else holds a reference to the dirent
(such as fs.NewFile); otherwise the dirent's reference count
stays at 0 forever, and the dirent is leaked.
Besides pipes, timerfd/eventfd/epoll have the same
problem.
Here is a simple C program that leaks dirents
for pipe/timerfd/eventfd/epoll. After running it,
profiling the runsc process with pprof shows many
dirents for pipe/timerfd/eventfd/epoll that were never
freed:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>

int main(int argc, char *argv[])
{
    int i;
    int n;
    int pipefd[2];
    if (argc != 3) {
        printf("Usage: %s epoll|timerfd|eventfd|pipe <iterations>\n", argv[0]);
        return 1;
    }
    n = strtol(argv[2], NULL, 10);
    if (strcmp(argv[1], "epoll") == 0) {
        for (i = 0; i < n; ++i)
            close(epoll_create(1));
    } else if (strcmp(argv[1], "timerfd") == 0) {
        for (i = 0; i < n; ++i)
            close(timerfd_create(CLOCK_REALTIME, 0));
    } else if (strcmp(argv[1], "eventfd") == 0) {
        for (i = 0; i < n; ++i)
            close(eventfd(0, 0));
    } else if (strcmp(argv[1], "pipe") == 0) {
        for (i = 0; i < n; ++i)
            if (pipe(pipefd) == 0) {
                close(pipefd[0]);
                close(pipefd[1]);
            }
    }
    printf("%s %s test finished\r\n", argv[1], argv[2]);
    return 0;
}
Change-Id: Ia1b8a1fb9142edb00c040e44ec644d007f81f5d2
PiperOrigin-RevId: 251531096
|
|
Dirents are ref-counted, but Pipes are not. Holding a Dirent inside of a Pipe
raises difficult questions about the lifecycle of the Pipe and Dirent.
Fortunately, we can side-step those questions by removing the Dirent field from
Pipe entirely. We only need the Dirent when constructing fs.Files (which are
ref-counted), and in GetFile (when a Dirent is passed to us anyways).
PiperOrigin-RevId: 251497628
|
|
Also, don't report a sender address if there isn't one.
PiperOrigin-RevId: 251371284
|
|
The io.Writer contract requires that Write writes all available
bytes and does not return short writes. This causes errors with
io.Copy, since our own Write interface does not have this same
contract.
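A hedged Go sketch of the adaptation, using a hypothetical inner interface rather
than the actual sentry types: loop until the whole buffer is consumed so the
wrapper satisfies the io.Writer contract.

    package writefull

    import "io"

    // shortWriter is a stand-in for an interface whose Write may legitimately
    // return n < len(p) without an error (unlike io.Writer).
    type shortWriter interface {
            Write(p []byte) (int, error)
    }

    // fullWriter adapts a shortWriter to the io.Writer contract by retrying
    // until every byte is written or an error occurs.
    type fullWriter struct {
            w shortWriter
    }

    func (f fullWriter) Write(p []byte) (int, error) {
            written := 0
            for written < len(p) {
                    n, err := f.w.Write(p[written:])
                    written += n
                    if err != nil {
                            return written, err
                    }
                    if n == 0 {
                            // Avoid spinning forever if the inner writer makes no progress.
                            return written, io.ErrShortWrite
                    }
            }
            return written, nil
    }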
PiperOrigin-RevId: 251368730
|
|
Updates #236
PiperOrigin-RevId: 251337915
|
|
Right now, mremap allows remapping a memory region above MaxUserAddress,
which means that we can change the stub region.
PiperOrigin-RevId: 251266886
|
|
PiperOrigin-RevId: 250976665
|
|
Netstack sets the unprocessed segment queue size to match the receive
buffer size. This is not required, as this queue only needs to hold enough
segments for a short duration before the endpoint goroutine can process them.
Updates #230
PiperOrigin-RevId: 250976323
|
|
Change-Id: Ib589906175a59dae315405a28f2d7f525ff8877f
|
|
There is no reason to do the recursion manually, since
Inode.BoundEndpoint will do it for us.
PiperOrigin-RevId: 250794903
|
|
Function signatures are not validated during compilation. Since
they are not exported, they can change at any time. The guard
ensures that they are verified at least on every version upgrade.
PiperOrigin-RevId: 250733742
|
|
The Netstack listen loop can get stuck if SYN cookies are in use and the app is slow to
accept incoming connections. Further, we continue to complete the handshake for a
connection even if the backlog is full. This creates a problem when a lot of
connections come in rapidly and we end up with lots of completed connections
just hanging around to be delivered.
These fixes change netstack behaviour to mirror what Linux does, as described
in the following article:
http://veithen.io/2014/01/01/how-tcp-backlog-works-in-linux.html
Now, when cookies are not in use, Netstack silently drops the ACK to a SYN-ACK
and does not complete the handshake if the backlog is full. This results in the
connection staying in a half-complete state. Eventually the sender will
retransmit the ACK, and if the backlog has space we will transition to a connected
state and deliver the endpoint.
Similarly, when cookies are in use, we do not try to create an endpoint unless
there is space in the accept queue for the newly created endpoint. If
there is no space then we again silently drop the ACK, as the endpoint can simply be
recreated when the ACK is retransmitted by the peer.
We also now use the backlog to cap the size of the SYN-RCVD queue for a given
endpoint. So at any time there can be N connections in the backlog and N in a
SYN-RCVD state if the application is not accepting connections. Any new SYNs
will be dropped.
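A rough Go sketch of the decision described above, with hypothetical names rather
than the actual netstack code:

    package backlogsketch

    // listenCtx is a hypothetical stand-in for a listening endpoint's state.
    type listenCtx struct {
            synRcvdCount int  // connections currently in SYN-RCVD
            accepted     int  // completed connections waiting to be accept()ed
            backlog      int  // configured backlog
            synCookies   bool // whether SYN cookies are in use
    }

    // onSYN decides whether to start a handshake for a new SYN.
    func (l *listenCtx) onSYN() bool {
            if l.accepted >= l.backlog {
                    return false // backlog full: drop the SYN, send no SYN-ACK
            }
            if !l.synCookies && l.synRcvdCount >= l.backlog {
                    return false // cap SYN-RCVD at the backlog size
            }
            return true
    }

    // onACK decides whether to complete the handshake when the final ACK of the
    // three-way handshake arrives; dropping it is safe since the peer retransmits.
    func (l *listenCtx) onACK() bool {
            return l.accepted < l.backlog
    }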
This CL also fixes another small bug where we mark a new endpoint which has not
completed the handshake as connected. We should wait until the handshake successfully
completes before marking it connected.
Updates #236
PiperOrigin-RevId: 250717817
|