Age | Commit message (Collapse) | Author |
|
Makes CLOCK_BOOTTIME available with
* clock_gettime
* timerfd_create
* clock_gettime vDSO
CLOCK_BOOTTIME is implemented as an alias to CLOCK_MONOTONIC.
CLOCK_MONOTONIC already keeps track of time across save
and restore. This is the closest possible behavior to Linux
CLOCK_BOOTIME, as there is no concept of suspend/resume.
Updates google/gvisor#218
|
|
For files with O_APPEND, a file write operation gets a file size and uses it as
offset to call an inode write operation. This means that all other operations
which can change a file size should be blocked while the write operation doesn't
complete.
PiperOrigin-RevId: 254873771
|
|
This prevents a race before PDEATH_SIG can take effect during
a sentry crash.
Discovered and solution by avagin@.
PiperOrigin-RevId: 254871534
|
|
PiperOrigin-RevId: 254854346
|
|
The tracee is stopped early during process exit, when registers are still
available, allowing the tracer to see where the exit occurred, whereas the
normal exit notifi? cation is done after the process is finished exiting.
Without this option, dumpAndPanic fails to get registers.
PiperOrigin-RevId: 254852917
|
|
The previous number was for the arm architecture.
Also change the statx tests to force them to run on gVisor, which would have
caught this issue.
PiperOrigin-RevId: 254846831
|
|
New options are:
runsc debug --strace=off|all|function1,function2
runsc debug --log-level=warning|info|debug
runsc debug --log-packets=true|false
Updates #407
PiperOrigin-RevId: 254843128
|
|
We don't have the plumbing for btime yet, so that field is left off. The
returned mask indicates that btime is absent.
Fixes #343
PiperOrigin-RevId: 254575752
|
|
Today we have the logic split in two places between endpoint Read() and the
worker goroutine which actually sends a zero window. This change makes it so
that when a zero window ACK is sent we set a flag in the endpoint which can be
read by the endpoint to decide if it should notify the worker to send a
nonZeroWindow update.
The worker now does not do the check again but instead sends an ACK and flips
the flag right away.
Similarly today when SO_RECVBUF is set the SetSockOpt call has logic
to decide if a zero window update is required. Rather than do that we move
the logic to the worker goroutine and it can check the zeroWindow flag
and send an update if required.
PiperOrigin-RevId: 254505447
|
|
FileMaxOffset is a special case when lseek(d, 0, SEEK_END) has been called.
PiperOrigin-RevId: 254498777
|
|
Currently, the path tracking in the gofer involves an O(n) lookup of
child fidRefs. This causes a significant overhead on unlinks in
directories with lots of child fidRefs (<4k).
In this transition, pathNode moves from sync.Map to normal synchronized
maps. There is a small chance of contention in walk, but the lock is
held for a very short time (and sync.Map also had a chance of requiring
locking).
OTOH, sync.Map makes it very difficult to add a fidRef reverse map.
PiperOrigin-RevId: 254489952
|
|
This test will occasionally fail waiting to read a packet. From repeated runs,
I've seen it up to 1.5s for waitForPackets to complete.
PiperOrigin-RevId: 254484627
|
|
PiperOrigin-RevId: 254482180
|
|
Flipcall is a (conceptually) simple local-only RPC mechanism. Compared
to unet, Flipcall does not support passing FDs (support for which will
be provided out of band by another package), requires users to establish
connections manually, and requires user management of concurrency since
each connected Endpoint pair supports only a single RPC at a time;
however, it improves performance by using shared memory for data
(reducing memory copies) and using futexes for control signaling (which
is much cheaper than sendto/recvfrom/sendmsg/recvmsg).
PiperOrigin-RevId: 254471986
|
|
PiperOrigin-RevId: 254450309
|
|
Neither fidRefs or children are (directly) synchronized by mu. Remove
the preconditions that say so.
That said, the surrounding does enforce some synchronization guarantees
(e.g., fidRef.renameChildTo does not atomically replace the child in the
maps). I've tried to note the need for callers to do this
synchronization.
I've also renamed the maps to what are (IMO) clearer names. As is, it is
not obvious that pathNode.fidRefs is a map of *child* fidRefs rather
than self fidRefs.
PiperOrigin-RevId: 254446965
|
|
defer here doesn't improve readability, but we know it slower that
the explicit call.
PiperOrigin-RevId: 254441473
|
|
PiperOrigin-RevId: 254428866
|
|
Otherwise every call to, say, fs.ContextCanAccessFile() in a benchmark
using contexttest allocates new auth.Credentials, a new
auth.UserNamespace, ...
PiperOrigin-RevId: 254261051
|
|
These are the only packages missing docs:
https://godoc.org/gvisor.dev/gvisor
PiperOrigin-RevId: 254261022
|
|
PiperOrigin-RevId: 254254058
|
|
PiperOrigin-RevId: 254253777
|
|
The sendfile syscall's backing doSplice contained a race with regard to
blocking. If the first attempt failed with syserror.ErrWouldBlock and then
the blocking file became ready before registering a waiter, we would just
return the ErrWouldBlock (even if we were supposed to block).
PiperOrigin-RevId: 254114432
|
|
Otherwise future renames may miss Renamed calls.
PiperOrigin-RevId: 254060946
|
|
And methods that do more traversals should use the remaining count rather than
resetting.
PiperOrigin-RevId: 254041720
|
|
This allows tasks to have distinct mount namespace, instead of all sharing the
kernel's root mount namespace.
Currently, the only way for a task to get a different mount namespace than the
kernel's root is by explicitly setting a different MountNamespace in
CreateProcessArgs, and nothing does this (yet).
In a follow-up CL, we will set CreateProcessArgs.MountNamespace when creating a
new container inside runsc.
Note that "MountNamespace" is a poor term for this thing. It's more like a
distinct VFS tree. When we get around to adding real mount namespaces, this
will need a better naem.
PiperOrigin-RevId: 254009310
|
|
Test fails because it's reading 4KB instead of the
expected 64KB. Changed the test to read pipe buffer
size instead of hardcode and added some logging in
case the reason for failure was not pipe buffer size.
PiperOrigin-RevId: 253916040
|
|
sockets, pipes and other non-seekable file descriptors don't
use file.offset, so we don't need to update it.
With this change, we will be able to call file operations
without locking the file.mu mutex. This is already used for
pipes in the splice system call.
PiperOrigin-RevId: 253746644
|
|
The implementation is similar to linux where we track the number of bytes
consumed by the application to grow the receive buffer of a given TCP endpoint.
This ensures that the advertised window grows at a reasonable rate to accomodate
for the sender's rate and prevents large amounts of data being held in stack
buffers if the application is not actively reading or not reading fast enough.
The original paper that was used to implement the linux receive buffer auto-
tuning is available @ https://public.lanl.gov/radiant/pubs/drs/lacsi2001.pdf
NOTE: Linux does not implement DRS as defined in that paper, it's just a good
reference to understand the solution space.
Updates #230
PiperOrigin-RevId: 253168283
|
|
All functions which allocate objects containing AtomicRefCounts will soon need
a context.
PiperOrigin-RevId: 253147709
|
|
The deadlock can occur when both ends of a connected Unix socket which has
FIOASYNC enabled on at least one end are closed at the same time. One end
notifies that it is closing, calling (*waiter.Queue).Notify which takes
waiter.Queue.mu (as a read lock) and then calls (*FileAsync).Callback, which
takes FileAsync.mu. The other end tries to unregister for notifications by
calling (*FileAsync).Unregister, which takes FileAsync.mu and calls
(*waiter.Queue).EventUnregister which takes waiter.Queue.mu.
This is fixed by moving the calls to waiter.Waitable.EventRegister and
waiter.Waitable.EventUnregister outside of the protection of any mutex used
in (*FileAsync).Callback.
The new test is related, but does not cover this particular situation.
Also fix a data race on FileAsync.e.Callback. (*FileAsync).Callback checked
FileAsync.e.Callback under the protection of FileAsync.mu, but the waiter
calling (*FileAsync).Callback could not and did not. This is fixed by making
FileAsync.e.Callback immutable before passing it to the waiter for the first
time.
Fixes #346
PiperOrigin-RevId: 253138340
|
|
SO_TYPE was already implemented for everything but netlink sockets.
PiperOrigin-RevId: 253138157
|
|
This can be merged after:
https://github.com/google/gvisor-website/pull/77
or
https://github.com/google/gvisor-website/pull/78
PiperOrigin-RevId: 253132620
|
|
PiperOrigin-RevId: 253122166
|
|
PiperOrigin-RevId: 252918338
|
|
This CL also cleans up the error returned for setting congestion
control which was incorrectly returning EINVAL instead of ENOENT.
PiperOrigin-RevId: 252889093
|
|
PiperOrigin-RevId: 252855280
|
|
For sendfile(2), we propagate a TCP error through the system call layer.
This should be eaten if there is a partial result. This change also adds
a test to ensure that there is no panic in this case, for both TCP sockets
and unix domain sockets.
PiperOrigin-RevId: 252746192
|
|
Parse annotations containing 'gvisor.dev/spec/mount' that gives
hints about how mounts are shared between containers inside a
pod. This information can be used to better inform how to mount
these volumes inside gVisor. For example, a volume that is shared
between containers inside a pod can be bind mounted inside the
sandbox, instead of being two independent mounts.
For now, this information is used to allow the same tmpfs mounts
to be shared between containers which wasn't possible before.
PiperOrigin-RevId: 252704037
|
|
Adds simple introspection for syscall compatibility information to Linux/AMD64.
Syscalls registered in the syscall table now have associated metadata like
name, support level, notes, and URLs to relevant issues.
Syscall information can be exported as a table, JSON, or CSV using the new
'runsc help syscalls' command. Users can use this info to debug and get info
on the compatibility of the version of runsc they are running or to generate
documentation.
PiperOrigin-RevId: 252558304
|
|
PiperOrigin-RevId: 252501653
|
|
Changes netstack to confirm to current linux behaviour where if the backlog is
full then we drop the SYN and do not send a SYN-ACK. Similarly we allow upto
backlog connections to be in SYN-RCVD state as long as the backlog is not full.
We also now drop a SYN if syn cookies are in use and the backlog for the
listening endpoint is full.
Added new tests to confirm the behaviour.
Also reverted the change to increase the backlog in TcpPortReuseMultiThread
syscall test.
Fixes #236
PiperOrigin-RevId: 252500462
|
|
Store enough information in the kernel socket table to distinguish
between different types of sockets. Previously we were only storing
the socket family, but this isn't enough to classify sockets. For
example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets
are SOCK_DGRAM sockets with a particular protocol.
Instead of creating more sub-tables, flatten the socket table and
provide a filtering mechanism based on the socket entry.
Also generate and store a socket entry index ("sl" in linux) which
allows us to output entries in a stable order from procfs.
PiperOrigin-RevId: 252495895
|
|
PiperOrigin-RevId: 252124156
|
|
PiperOrigin-RevId: 251965598
|
|
Almost (?) all uses of CopyStringIn are via linux.copyInPath(), which
passes maxlen = linux.PATH_MAX = 4096. Pre-allocating a buffer of this
size is measurably inefficient in most cases: most paths will not be
this long, 4 KB is a lot of bytes to zero, and as of this writing the Go
runtime allocator maps only two 4 KB objects to each 8 KB span,
necessitating a call to runtime.mcache.refill() on ~every other call.
Limit the initial buffer size to 256 B instead, and geometrically
reallocate if necessary.
PiperOrigin-RevId: 251960441
|
|
SockType isn't specific to unix domain sockets, and the current
definition basically mirrors the linux ABI's definition.
PiperOrigin-RevId: 251956740
|
|
Overlayfs was expecting the parent to exist when bind(2)
was called, which may not be the case. The fix is to copy
the parent directory to the upper layer before binding
the UDS.
There is not good place to add tests for it. Syscall tests
would be ideal, but it's hard to guarantee that the
directory where the socket is created hasn't been touched
before (and thus copied the parent to the upper layer).
Added it to runsc integration tests for now. If it turns
out we have lots of these kind of tests, we can consider
moving them somewhere more appropriate.
PiperOrigin-RevId: 251954156
|
|
We still only advertise a single NUMA node, and ignore mempolicy
accordingly, but mbind() at least now succeeds and has effects reflected
by get_mempolicy().
Also fix handling of nodemasks: round sizes to unsigned long (as
documented and done by Linux), and zero trailing bits when copying them
out.
PiperOrigin-RevId: 251950859
|
|
PiperOrigin-RevId: 251950660
|