Age | Commit message | Author |
|
Replace mknod call with mknodat equivalent to protect
against symlink attacks. Also added Mknod tests.
Remove goferfs reliance on gofer to check for file
existence before creating a synthetic entry.
Updates #2923
PiperOrigin-RevId: 327544516
|
|
PiperOrigin-RevId: 327300635
|
|
Running garbage collection enqueues all finalizers, which are used by the
refs/refs_vfs2 packages to detect reference leaks. Note that even with GC,
there is no guarantee that all finalizers will be run before the program exits.
This is a best effort attempt to activate leak checks as much as possible.
Updates #3545.
PiperOrigin-RevId: 325834438
|
|
Earlier we were using NLink to decide if /tmp is empty or not. However, NLink
at best tells us about the number of subdirectories (via the ".." entries).
NLink = n + 2 for n subdirectories. But it does not tell us if the directory is
empty. There still might be non-directory files. We could also not rely on
NLink because host overlayfs always returned 1.
VFS1 uses Readdir to decide if the directory is empty. Used a similar approach.
We now use IterDirents to decide if the "/tmp" directory is empty.
Fixes #3369
PiperOrigin-RevId: 325554234
|
|
PiperOrigin-RevId: 325266487
|
|
Also removes `--profile-goroutine` because it's equivalent
to `debug --stacks`.
PiperOrigin-RevId: 325061502
|
|
The loader dups the stdio FDs into stable FDs starting at a fixed
number. During tests, it's possible that the target FD is already
in use. Added a check to error out early so it's easier to debug failures.
Also bumped up the starting FD number to prevent collisions.
PiperOrigin-RevId: 324917299
|
|
A context is now passed to DecRef() and Release(), which is
needed for the SO_LINGER implementation.
PiperOrigin-RevId: 324672584
|
|
9P2000.L is silent as to how readdir RPCs interact with directory mutation. The
most performant option is for Treaddir with offset=0 to restart iteration,
avoiding needing to walk+open+clunk a new directory fid between invocations of
getdents64(2), and the VFS2 gofer client assumes this is the case. Make this
actually true for the runsc fsgofer.
Fixes #3344, #3345, #3355
PiperOrigin-RevId: 324090384
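
One way to model the "offset=0 restarts iteration" behavior is an in-memory sketch. The dirHandle type and its snapshot approach are assumptions for illustration, not the fsgofer implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// dirHandle is a hypothetical model of a server-side directory fid.
type dirHandle struct {
	live     map[string]struct{} // Current directory contents.
	snapshot []string            // Entries captured when iteration (re)started.
}

// readdir returns up to count names starting at offset. An offset of 0
// restarts iteration by taking a fresh snapshot, so the client can reuse
// the same fid instead of opening a new one.
func (d *dirHandle) readdir(offset, count int) []string {
	if offset == 0 {
		d.snapshot = d.snapshot[:0]
		for name := range d.live {
			d.snapshot = append(d.snapshot, name)
		}
		sort.Strings(d.snapshot)
	}
	if offset >= len(d.snapshot) {
		return nil
	}
	end := offset + count
	if end > len(d.snapshot) {
		end = len(d.snapshot)
	}
	// Copy so that a later restart cannot overwrite the caller's slice.
	return append([]string(nil), d.snapshot[offset:end]...)
}

func main() {
	d := &dirHandle{live: map[string]struct{}{"a": {}, "b": {}}}
	fmt.Println(d.readdir(0, 10)) // First pass over the directory.

	d.live["c"] = struct{}{}      // Directory mutated between getdents64 calls.
	fmt.Println(d.readdir(2, 10)) // Continues the old iteration.
	fmt.Println(d.readdir(0, 10)) // Restart observes the new entry.
}
```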
|
|
PiperOrigin-RevId: 324080111
|
|
PiperOrigin-RevId: 323638518
|
|
The bazel server was being started as the wrong user, leading to issues
where the container would suddenly exit during a build.
We can also simplify the waiting logic by starting the container in two
separate steps: those that must complete first, then the asynchronous bit.
PiperOrigin-RevId: 323391161
|
|
... when it is possible.
The guitar gVisorKernel*Workflow-s run tests with the local execution_method.
In this case, blaze runs test cases locally without sandboxes. This means
that all tests run in the same network namespace. We have a few tests that
use hard-coded network ports, and they can fail if one of these ports is
already in use by someone else or by another test case.
PiperOrigin-RevId: 323137254
|
|
Implement WalkGetAttr() to reuse the stat that is already
needed for Walk(). In addition, cache file QID, so it
doesn't need to stat the file to compute it.
open(2) time improved by 10%:
Baseline: 6780 ns
Change: 6083 ns
Also fixed file type which was not being set in all places.
PiperOrigin-RevId: 323102560
|
|
Allow FUSE filesystems to be mounted using libfuse.
The appropriate flags and mount options are parsed and
understood by fusefs.
|
|
Open tries to reuse the control file to save syscalls and
file descriptors when opening a file. However, when the
control file was opened using O_PATH (e.g. no file permission
to open readonly), Open() would not check for it.
PiperOrigin-RevId: 322821729
|
|
Updates #173
PiperOrigin-RevId: 322665518
|
|
PiperOrigin-RevId: 321449877
|
|
Now it calls pkt.Data.ToView() when writing the packet. This may require
copying when the packet is large, which makes the worst case even worse.
This is sent out as a separate preparation change because it requires syscall
filter changes. It will be followed by the change that adopts the new
PacketHeader API.
PiperOrigin-RevId: 321447003
|
|
Much like the boot process, apply pdeathsig to the gofer for cases where
the sandbox lifecycle is attached to the parent (runsc run/do).
This isn't strictly necessary, as the gofer normally exits once the
sentry disappears, but this makes that extra reliable.
|
|
PiperOrigin-RevId: 321411758
|
|
- Combine process creation code that is shared between
root and subcontainer processes
- Move root container information into a struct for
clarity
Updates #2714
PiperOrigin-RevId: 321204798
|
|
PiperOrigin-RevId: 321053634
|
|
The go.mod dependency tree for the shim was somehow contradictory. After
resolving these issues (e.g. explicitly importing k8s 1.14, pulling a
specific dbus version), and adding all dependencies, the shim can now be
built as part of the regular bazel tree.
As part of this process, minor cleanup was done in all the source files:
headers were standardized (and include "The gVisor Authors" in addition
to the "The containerd Authors" if originally derived from containerd
sources), and comments were cleaned up to meet coding standards.
This change makes the containerd installation dynamic, so that multiple
versions can be tested, and drops the static installer for the VM image
itself.
This change also updates test/root/crictl_test.go and related utilities,
so that the containerd tests can be run on any version (and in cases
where it applies, they can be run on both v1 and v2 as parameterized
tests).
|
|
Adds a netns flag to runsc spec that allows users to specify a network
namespace path when creating a sample config.json file. Also, adds the ability
to specify the command arguments used when running the container.
This will make it easier for new users to create sample OCI bundles without
having to edit the config.json by hand.
PiperOrigin-RevId: 320486267
|
|
This change gates all FUSE commands (by gating /dev/fuse) behind a runsc
flag. In order to use FUSE commands, use the --fuse flag with the --vfs2
flag. Check if FUSE is enabled by running dmesg in the sandbox.
|
|
Container restart test is disabled for VFS2 for now.
Updates #1487
PiperOrigin-RevId: 320296401
|
|
PiperOrigin-RevId: 320281516
|
|
Removed VDSO dependency on VFS1.
Resolves #2921
PiperOrigin-RevId: 320122176
|
|
This change fixes a few things:
- creating sockets using mknod(2) is supported via vfs2
- fsgofer can create regular files via mknod(2)
- mode = 0 for mknod(2) will be interpreted as regular file in vfs2 as well
Updates #2923
PiperOrigin-RevId: 320074267
|
|
|
|
|
|
|
|
Updates #2912 #1035
PiperOrigin-RevId: 318162565
|
|
Linux controls socket send/receive buffers using a few sysctl variables
- net.core.rmem_default
- net.core.rmem_max
- net.core.wmem_max
- net.core.wmem_default
- net.ipv4.tcp_rmem
- net.ipv4.tcp_wmem
The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).
The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.
Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.
This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values, and the default buffer size for UDP/RAW sockets is
updated to match the Linux value of 212 KiB, up from the really low current
value of 32 KiB.
Updates #3043
Fixes #3043
PiperOrigin-RevId: 318089805
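
The relationship between a default and a maximum can be sketched as a simple clamp. bufLimits and effectiveSize are hypothetical, the minimum and maximum values are illustrative, and real kernels apply further adjustments to the requested size that this sketch ignores.

```go
package main

import "fmt"

// bufLimits is a hypothetical stand-in for one set of limits, e.g. the
// rmem_default/rmem_max pair plus a minimum.
type bufLimits struct {
	Min, Default, Max int
}

// effectiveSize returns the buffer size to use. A non-positive request
// means the application never set a size, so the default applies;
// otherwise the request is clamped to [Min, Max].
func (l bufLimits) effectiveSize(requested int) int {
	switch {
	case requested <= 0:
		return l.Default
	case requested < l.Min:
		return l.Min
	case requested > l.Max:
		return l.Max
	default:
		return requested
	}
}

func main() {
	// Default taken from the commit message; Min and Max are made up.
	udp := bufLimits{Min: 4 << 10, Default: 212 << 10, Max: 4 << 20}

	fmt.Println(udp.effectiveSize(0))       // No setsockopt: default.
	fmt.Println(udp.effectiveSize(64 << 10)) // Within limits: unchanged.
	fmt.Println(udp.effectiveSize(8 << 20))  // Above the maximum: clamped.
}
```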
|
|
Previously, it was not possible to encode/decode an object graph which
contained a pointer to a field within another type. This was because the
encoder was previously unable to disambiguate a pointer to an object and a
pointer within the object.
This CL remedies this by constructing an address map tracking the full memory
range each object occupies. The encoded Refvalue message has been extended to
allow references to child objects within another object. Because the encoding
process may learn about object structure over time, we cannot encode any
objects until the entire graph has been generated.
This CL also updates the state package to use standard interfaces instead of
reflection-based dispatch in order to improve performance overall. This
includes a custom wire protocol to significantly reduce the number of
allocations and take advantage of structure packing.
As part of these changes, there are a small number of minor changes in other
places of the code base:
* The lists used during encoding are changed to use intrusive lists with the
objectEncodeState directly, which required that the ilist Len() method is
updated to work properly with the ElementMapper mechanism.
* A bug is fixed in the list code wherein Remove() called on an element that is
already removed can corrupt the list (removing the element if there's only a
single element). Now the behavior is correct.
* Standard error wrapping is introduced.
* Compressio was updated to implement the new wire.Reader and wire.Writer
interface methods directly. The lack of a ReadByte and WriteByte caused issues
not due to interface dispatch, but because underlying slices for a Read or
Write call through an interface would always escape to the heap!
* Statify has been updated to support the new APIs.
See README.md for a description of how the new mechanism works.
PiperOrigin-RevId: 318010298
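
The address-map idea can be illustrated with a small range lookup that resolves a pointer into the interior of an object. This is a sketch under assumptions: addrMap and its methods are hypothetical, the addresses are synthetic, and the state package's real data structure may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// addrRange records the half-open memory range [start, end) of one object.
type addrRange struct {
	start, end uintptr
	id         int
}

// addrMap keeps non-overlapping ranges sorted by start address.
type addrMap struct {
	ranges []addrRange
}

// add registers an object of the given size at start.
func (m *addrMap) add(start, size uintptr, id int) {
	i := sort.Search(len(m.ranges), func(i int) bool {
		return m.ranges[i].start >= start
	})
	m.ranges = append(m.ranges, addrRange{})
	copy(m.ranges[i+1:], m.ranges[i:])
	m.ranges[i] = addrRange{start: start, end: start + size, id: id}
}

// lookup returns the object containing addr and the offset within it, so
// a pointer to a field resolves to (enclosing object, field offset).
func (m *addrMap) lookup(addr uintptr) (id int, offset uintptr, ok bool) {
	i := sort.Search(len(m.ranges), func(i int) bool {
		return m.ranges[i].start > addr
	})
	if i == 0 {
		return 0, 0, false
	}
	r := m.ranges[i-1]
	if addr >= r.end {
		return 0, 0, false
	}
	return r.id, addr - r.start, true
}

func main() {
	var m addrMap
	m.add(0x2000, 32, 2) // Synthetic addresses for illustration.
	m.add(0x1000, 16, 1)

	fmt.Println(m.lookup(0x1000)) // Pointer to the object itself.
	fmt.Println(m.lookup(0x2008)) // Pointer to a field inside object 2.
	fmt.Println(m.lookup(0x1800)) // Not inside any tracked object.
}
```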
|
|
Support is limited to the functionality that exists in VFS1.
Updates #2923 #1035
PiperOrigin-RevId: 317981417
|
|
Metadata was useful for debugging and safety, but enough tests exist that we
should see failures when (de)serialization is broken. It made stack
initialization more cumbersome and it's also getting in the way of ip6tables.
PiperOrigin-RevId: 317210653
|
|
Updates #173,#6
Fixes #2888
PiperOrigin-RevId: 317087652
|
|
--tx-checksum-offload=<true|false>
enable TX checksum offload (default: false)
--rx-checksum-offload=<true|false>
enable RX checksum offload (default: true)
Fixes #2989
PiperOrigin-RevId: 316781309
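
A minimal sketch of parsing two boolean flags with these names and defaults using the standard flag package. offloadConfig and parseOffloadFlags are hypothetical and do not reflect how runsc registers its flags.

```go
package main

import (
	"flag"
	"fmt"
)

// offloadConfig holds the two checksum offload settings.
type offloadConfig struct {
	TX, RX bool
}

// parseOffloadFlags parses args using the defaults from the text above:
// TX offload off, RX offload on.
func parseOffloadFlags(args []string) (offloadConfig, error) {
	var c offloadConfig
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.BoolVar(&c.TX, "tx-checksum-offload", false, "enable TX checksum offload")
	fs.BoolVar(&c.RX, "rx-checksum-offload", true, "enable RX checksum offload")
	err := fs.Parse(args)
	return c, err
}

func main() {
	defaults, _ := parseOffloadFlags(nil)
	fmt.Printf("defaults: %+v\n", defaults)

	custom, err := parseOffloadFlags([]string{
		"--tx-checksum-offload=true",
		"--rx-checksum-offload=false",
	})
	fmt.Printf("custom: %+v err=%v\n", custom, err)
}
```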
|
|
The previous format skipped many important structs that
are pointers, especially for cgroups. Change to print
as json, removing parts of the spec that are not relevant.
Also removed debug message from gofer that can be very
noisy when directories are large.
PiperOrigin-RevId: 316713267
|
|
Fixes #701
PiperOrigin-RevId: 316025635
|
|
Major differences from existing overlay filesystems:
- Linux allows lower layers in an overlay to require revalidation, but not the
upper layer. VFS1 allows the upper layer in an overlay to require
revalidation, but not the lower layer. VFS2 does not allow any layers to
require revalidation. (Now that vfs.MkdirOptions.ForSyntheticMountpoint
exists, no uses of overlay in VFS1 are believed to require upper layer
revalidation; in particular, the requirement that the upper layer support the
creation of "trusted." extended attributes for whiteouts effectively required
the upper filesystem to be tmpfs in most cases.)
- Like VFS1, but unlike Linux, VFS2 overlay does not attempt to make mutations
of the upper layer atomic using a working directory and features like
RENAME_WHITEOUT. (This may change in the future, since not having a working
directory makes error recovery for some operations, e.g. rmdir, particularly
painful.)
- Like Linux, but unlike VFS1, VFS2 represents whiteouts using character
devices with rdev == 0; the equivalent of the whiteout attribute on
directories is xattr trusted.overlay.opaque = "y"; and there is no equivalent
to the whiteout attribute on non-directories since non-directories are never
merged with lower layers.
- Device and inode numbers work as follows:
- In Linux, modulo the xino feature and a special case for when all layers
are the same filesystem:
- Directories use the overlay filesystem's device number and an
ephemeral inode number assigned by the overlay.
- Non-directories that have been copied up use the device and inode
number assigned by the upper filesystem.
- Non-directories that have not been copied up use a per-(overlay,
layer)-pair device number and the inode number assigned by the lower
filesystem.
- In VFS1, device and inode numbers always come from the lower layer unless
"whited out"; this has the adverse effect of requiring interaction with
the lower filesystem even for non-directory files that exist on the upper
layer.
- In VFS2, device and inode numbers are assigned as in Linux, except that
xino and the samefs special case are not supported.
- Like Linux, but unlike VFS1, VFS2 does not attempt to maintain memory mapping
coherence across copy-up. (This may have to change in the future, as users
may be dependent on this property.)
- Like Linux, but unlike VFS1, VFS2 uses the overlayfs mounter's credentials
when interacting with the overlay's layers, rather than the caller's.
- Like Linux, but unlike VFS1, VFS2 permits multiple lower layers in an
overlay.
- Like Linux, but unlike VFS1, VFS2's overlay filesystem is
application-mountable.
Updates #1199
PiperOrigin-RevId: 316019067
|
|
- Set hugetlb related fields
- Add realtime scheduler related fields
- Beef up unit tests
Updates #2713
PiperOrigin-RevId: 315797979
|
|
LinuxPids.Limit is the only optional cgroup field in OCI that
is not a pointer. If the value is 0 or negative, it should be
skipped.
PiperOrigin-RevId: 315791909
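
A sketch of that rule, assuming simplified local types in place of the OCI Go types. linuxPids, resources and pidsLimit are hypothetical.

```go
package main

import "fmt"

// linuxPids mimics the shape described above: Limit is a plain value, so
// "unset" cannot be expressed as a nil pointer.
type linuxPids struct {
	Limit int64
}

// resources is a cut-down container for the pids section.
type resources struct {
	Pids *linuxPids
}

// pidsLimit returns the limit to apply and whether one should be applied.
// Zero or negative values are treated as "not set" and skipped.
func pidsLimit(r *resources) (int64, bool) {
	if r == nil || r.Pids == nil || r.Pids.Limit <= 0 {
		return 0, false
	}
	return r.Pids.Limit, true
}

func main() {
	fmt.Println(pidsLimit(&resources{Pids: &linuxPids{Limit: 100}}))
	fmt.Println(pidsLimit(&resources{Pids: &linuxPids{Limit: 0}}))
	fmt.Println(pidsLimit(&resources{Pids: &linuxPids{Limit: -1}}))
}
```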
|
|
PiperOrigin-RevId: 315583963
|
|
The executable lookups for run vs. exec and for VFS1 vs. VFS2 were
slightly different from each other. Combine them all
into the same logic.
PiperOrigin-RevId: 315426443
|
|
This is mostly syscall plumbing; VFS2 already implements the internals of
mounts. In addition to the syscall definitions, the following mount-related
mechanisms are updated:
- Implement MS_NOATIME for VFS2, but only for tmpfs and goferfs. The other VFS2
filesystems don't implement node-level timestamps yet.
- Implement the 'mode', 'uid' and 'gid' mount options for VFS2's tmpfs.
- Plumb mount namespace ownership, which is necessary for checking appropriate
capabilities during mount(2).
Updates #1035
PiperOrigin-RevId: 315035352
|
|
PiperOrigin-RevId: 314997564
|
|
IPTables.connections contains a sync.RWMutex. Copying it will trigger copylocks
analysis. Tested by manually enabling nogo tests.
sync.RWMutex is added to IPTables for the additional race condition discovered.
PiperOrigin-RevId: 314817019
|