summaryrefslogtreecommitdiffhomepage
path: root/pkg
AgeCommit message (Collapse)Author
2020-09-16Implement OpenAt() for verity fsChong Cai
OpenAt() for verity fs is implemented by opening both the target file or directory and the corresponding Merkle tree file in the underlying file system. Generally they are only open for read. In allowRuntimeEnable mode, the Merkle tree file is also open for write. PiperOrigin-RevId: 332116423
2020-09-16Automated rollback of changelist 329526153Nayana Bidari
PiperOrigin-RevId: 332097286
2020-09-16Bind loopback subnets' lifetime to perm addressGhanan Gowripalan
The lifetime of addreses in a loopback interface's associated subnets should be bound to their respective permanent addresses. This change also fixes a race when the stack attempts to get an IPv4 rereferencedNetworkEndpoint for an address in an associated subnet on a loopback interface. Before this change, the stack would only check if an IPv4 address is contained in an associated subnet while holding a read lock but wouldn't do this same check after releasing the read lock for a write lock to create a temporary address. This may cause the stack to bind the lifetime of the address to a new (temporary) endpoint instead of the associated subnet's permanent address. Test: integration_test.TestLoopbackSubnetLifetimeBoundToAddr PiperOrigin-RevId: 332094719
2020-09-16Implement PRead for verity fsChong Cai
PRead is implemented by read from the underlying file in blocks, and verify each block. The verified contents are saved into the output buffer. PiperOrigin-RevId: 332092267
2020-09-16Gracefully translate unknown errno.Ting-Yu Wang
Neither POSIX.1 nor Linux defines an upperbound for errno. PiperOrigin-RevId: 332085017
2020-09-16Merge pull request #3893 from lubinszARM:pr_n1_03gVisor bot
PiperOrigin-RevId: 332069743
2020-09-16Receive broadcast packets on interested endpointsGhanan Gowripalan
When a broadcast packet is received by the stack, the packet should be delivered to each endpoint that may be interested in the packet. This includes all any address and specified broadcast address listeners. Test: integration_test.TestReuseAddrAndBroadcast PiperOrigin-RevId: 332060652
2020-09-16Rename marshal.Task to marshal.CopyContext.Rahat Mahmood
CopyContext is a better name for the interface because from go-marshal's perspective, the interface has nothing to do with a task. A kernel.Task happens to implement the interface, but so can other things like MemoryManager and IO sequences. PiperOrigin-RevId: 331959678
2020-09-15Enable automated marshalling for the syscall package.Rahat Mahmood
PiperOrigin-RevId: 331940975
2020-09-15Add support for OCI seccomp filters in the sandbox.Ian Lewis
OCI configuration includes support for specifying seccomp filters. In runc, these filter configurations are converted into seccomp BPF programs and loaded into the kernel via libseccomp. runsc needs to be a static binary so, for runsc, we cannot rely on a C library and need to implement the functionality in Go. The generator added here implements basic support for taking OCI seccomp configuration and converting it into a seccomp BPF program with the same behavior as a program generated by libseccomp. - New conditional operations were added to pkg/seccomp to support operations available in OCI. - AllowAny and AllowValue were renamed to MatchAny and EqualTo to better reflect that syscalls matching the conditionals result in the provided action not simply SCMP_RET_ALLOW. - BuildProgram in pkg/seccomp no longer panics if provided an empty list of rules. It now builds a program with the architecture sanity check only. - ProgramBuilder now allows adding labels that are unused. However, backwards jumps are still not permitted. Fixes #510 PiperOrigin-RevId: 331938697
2020-09-15Implement gvisor verity fs ioctl with GETFLAGSChong Cai
PiperOrigin-RevId: 331905347
2020-09-15Improve syserror_test.Jamie Liu
- It's very difficult to prevent returnErrnoAsError and returnError from being optimized out. Instead, replace BenchmarkReturn* with BenchmarkAssign*, which store to globalError. - Compare to a non-nil globalError in BenchmarkCompare* and BenchmarkSwitch*. New results: BenchmarkAssignErrno BenchmarkAssignErrno-12 1000000000 0.615 ns/op BenchmarkAssignError BenchmarkAssignError-12 1000000000 0.626 ns/op BenchmarkCompareErrno BenchmarkCompareErrno-12 1000000000 0.522 ns/op BenchmarkCompareError BenchmarkCompareError-12 1000000000 3.54 ns/op BenchmarkSwitchErrno BenchmarkSwitchErrno-12 1000000000 1.45 ns/op BenchmarkSwitchError BenchmarkSwitchError-12 536315757 10.9 ns/op PiperOrigin-RevId: 331875387
2020-09-15Invert dependency between the context and amutex packages.Jamie Liu
This is to allow the syserror package to depend on the context package in a future change. PiperOrigin-RevId: 331866252
2020-09-15Support setting STATX_SIZE for kernfs.InodeAttrs.Dean Deng
Make setting STATX_SIZE a no-op, if it is valid for the given permissions and file type. Also update proc tests, which were overfitted before. Fixes #3842. Updates #1193. PiperOrigin-RevId: 331861087
2020-09-15Move reusable IPv4 test code into a testutil module and refactor itArthur Sfez
The refactor aims to simplify the package, by replacing the Go channel with a PacketBuffer slice. This code will be reused by tests for IPv6 fragmentation. PiperOrigin-RevId: 331860411
2020-09-15Release FDTable lock before dropping the fds.Nayana Bidari
This is needed for SO_LINGER, where close() is blocked for linger timeout and we are holding the FDTable lock for the entire timeout which will not allow us to create/delete other fds. We have to release the locks and then drop the fds. PiperOrigin-RevId: 331844185
2020-09-15Read vfs2 epoll events atomically.Jamie Liu
Discovered by ayushranjan@: VFS2 was employing the following algorithm for fetching ready events from an epoll instance: - Create a statically sized EpollEvent slice on the stack of size 16. - Pass that to EpollInstance.ReadEvents() to populate. - EpollInstance.ReadEvents() requeues level-triggered events that it returns back into the ready queue. - Write the results to usermem. - If the number of results were = 16 then recall EpollInstance.ReadEvents() in the hopes of getting more. But this will cause duplication of the "requeued" ready level-triggered events. So if the ready queue has >= 16 ready events, the EpollWait for loop will spin until it fills the usermem with `maxEvents` events. Fixes #3521 PiperOrigin-RevId: 331840527
2020-09-15RFC: design for a 9P replacementJamie Liu
Tentatively `lisafs` (LInux SAndbox FileSystem). PiperOrigin-RevId: 331839246
2020-09-15Merge pull request #3895 from btw616:fix/issue-3894gVisor bot
PiperOrigin-RevId: 331824411
2020-09-15Don't conclude broadcast from route destinationGhanan Gowripalan
The routing table (in its current) form should not be used to make decisions about whether a remote address is a broadcast address or not (for IPv4). Note, a destination subnet does not always map to a network. E.g. RouterA may have a route to 192.168.0.0/22 through RouterB, but RouterB may be configured with 4x /24 subnets on 4 different interfaces. See https://github.com/google/gvisor/issues/3938. PiperOrigin-RevId: 331819868
2020-09-15Fix proc.(*fdDir).IterDirents for VFS2Tiwei Bie
Currently the returned offset is an index, and we can't use it to find the next fd to serialize, because getdents should iterate correctly despite mutation of fds. Instead, we can return the next fd to serialize plus 2 (which accounts for "." and "..") as the offset. Fixes: #3894 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-14Add note about gofer link(2) limitationFabricio Voznika
PiperOrigin-RevId: 331648296
2020-09-14Store multicast memberships in a setTamir Duberstein
This is simpler and more performant. PiperOrigin-RevId: 331639978
2020-09-14Test RST handling in TIME_WAIT.Mithun Iyer
gVisor stack ignores RSTs when in TIME_WAIT which is not the default Linux behavior. Add a packetimpact test to test the same. Also update code comments to reflect the rationale for the current gVisor behavior. PiperOrigin-RevId: 331629879
2020-09-14Correct FDSize in /proc/[pid]/status.Jamie Liu
In Linux, FDSize is fs/proc/array.c:task_state() => struct fdtable::max_fds, which is set to the underlying array's length in fs/file.c:alloc_fdtable(). Follow-up changes: - Remove FDTable.GetRefs() and FDTable.GetRefsVFS2(), which are unused. - Reset FDTable.used to 0 during restore, since the subsequent calls to FDTable.setAll() increment it again, causing its value to be doubled. (After this CL, FDTable.used is only used to avoid reallocation in FDTable.GetFDs(), so this fix is not very visible.) PiperOrigin-RevId: 331588190
2020-09-12Cap reassembled IPv6 packets at 65535 octetsToshi Kikuchi
IPv4 can accept 65536-octet reassembled packets. Test: - ipv4_test.TestInvalidFragments - ipv4_test.TestReceiveFragments - ipv6.TestInvalidIPv6Fragments - ipv6.TestReceiveIPv6Fragments Fixes #3770 PiperOrigin-RevId: 331382977
2020-09-11Move the 'marshal' and 'primitive' packages to the 'pkg' directory.Rahat Mahmood
PiperOrigin-RevId: 331256608
2020-09-11Implement copy-up-coherent mmap for VFS2 overlayfs.Jamie Liu
This is very similar to copy-up-coherent mmap in the VFS1 overlay, with the minor wrinkle that there is no fs.InodeOperations.Mappable(). Updates #1199 PiperOrigin-RevId: 331206314
2020-09-11Fix host unix socket to not swallow EOF incorrectly.Bhasker Hariharan
Fixes an error where in case of a receive buffer larger than the host send buffer size for a host backed unix dgram socket we would end up swallowing EOF from recvmsg syscall causing the read() to block forever. PiperOrigin-RevId: 331192810
2020-09-11arm64 mm: asid and tlb supportBin Lu
Some optimizations in this pr: 1, Move ASID from TTBR0 to TTBR1 2, tlb_flush_all Signed-off-by: Bin Lu <bin.lu@arm.com>
2020-09-10arm64:place an SB sequence following an ERET instructionBin Lu
Some CPUs(eg: ampere-emag) can speculate past an ERET instruction and potentially perform speculative accesses to memory before processing the exception return. Since the register state is often controlled by a lower privilege level at the point of an ERET, this could potentially be used as part of a side-channel attack. Signed-off-by: Bin Lu <bin.lu@arm.com>
2020-09-09Unlock VFS.mountMu before FilesystemImpl calls for ↵Jamie Liu
/proc/[pid]/{mounts,mountinfo}. Also move VFS.MakeSyntheticMountpoint() (which is a utility wrapper around VFS.MkdirAllAt(), itself a utility wrapper around VFS.MkdirAt()) to not be in the middle of the implementation of these proc files. Fixes #3878 PiperOrigin-RevId: 330843106
2020-09-09Don't write VFS2 gofer client timestamps back on dentry destruction.Jamie Liu
This feature is too expensive for runsc, even with setattrclunk, because fsgofer.localFile.SetAttr() ends up needing to call reopenProcFD(), incurring two string allocations for the FD pathname, an fd.FD allocation, and two calls to runtime.SetFinalizer() when the fd.FD is created and closed respectively (b/133767962) (plus the actual cost of the syscalls, which is negligible). PiperOrigin-RevId: 330843012
2020-09-09Don't sched_setaffinity in ptrace platform.Jamie Liu
PiperOrigin-RevId: 330777900
2020-09-08Implement synthetic mountpoints for kernfs.Jamie Liu
PiperOrigin-RevId: 330629897
2020-09-08[vfs] overlayfs: Fix socket tests.Ayush Ranjan
- BindSocketThenOpen test was expecting the incorrect error when opening a socket. Fixed that. - VirtualFilesystem.BindEndpointAt should not require pop.Path.Begin.Ok() because the filesystem implementations do not need to walk to the parent dentry. This check also exists for MknodAt, MkdirAt, RmdirAt, SymlinkAt and UnlinkAt but those filesystem implementations also need to walk to the parent denty. So that check is valid. Added some syscall tests to test this. PiperOrigin-RevId: 330625220
2020-09-08Add check for both child and childMerkle ENOENTgVisor bot
The check in verity walk returns error for non ENOENT cases, and all ENOENT results should be checked. This case was missing. PiperOrigin-RevId: 330604771
2020-09-08Implement ioctl with enable veritygVisor bot
ioctl with FS_IOC_ENABLE_VERITY is added to verity file system to enable a file as verity file. For a file, a Merkle tree is built with its data. For a directory, a Merkle tree is built with the root hashes of its children. PiperOrigin-RevId: 330604368
2020-09-08[vfs] overlayfs: decref VD when not using it.Ayush Ranjan
overlay/filesystem.go:lookupLocked() did not DecRef the VD on some error paths when it would not end up saving or using the VD. PiperOrigin-RevId: 330589742
2020-09-08Honor readonly flag for root mountFabricio Voznika
Updates #1487 PiperOrigin-RevId: 330580699
2020-09-08Increase resolution timeout for TestCacheResolutionSam Balana
Fixes pkg/tcpip/stack:stack_test flake experienced while running TestCacheResolution with gotsan. This occurs when the test-runner takes longer than the resolution timeout to call linkAddrCache.get. In this test we don't care about the resolution timeout, so set it to the maximum and rely on test-runner timeouts to avoid deadlocks. PiperOrigin-RevId: 330566250
2020-09-08Merge pull request #3856 from btw616:fix/issue-3855gVisor bot
PiperOrigin-RevId: 330565414
2020-09-08Fix data race in tcp.GetSockOpt.Bhasker Hariharan
e.ID can't be read without holding e.mu. GetSockOpt was reading e.ID when looking up OriginalDst without holding e.mu. PiperOrigin-RevId: 330562293
2020-09-08Improve type safety for transport protocol optionsGhanan Gowripalan
The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-08[vfs] Capitalize x in the {Get/Set/Remove/List}xattr functions.Ayush Ranjan
PiperOrigin-RevId: 330554450
2020-09-08Fix the use after nil check on args.MountNamespaceVFS2Tiwei Bie
The args.MountNamespaceVFS2 is used again after the nil check, instead, mntnsVFS2 which holds the expected reference should be used. This patch fixes this issue. Fixes: #3855 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-04Simplify FD handling for container start/execFabricio Voznika
VFS1 and VFS2 host FDs have different dupping behavior, making error prone to code for both. Change the contract so that FDs are released as they are used, so the caller can simple defer a block that closes all remaining files. This also addresses handling of partial failures. With this fix, more VFS2 tests can be enabled. Updates #1487 PiperOrigin-RevId: 330112266
2020-09-03Adjust input file offset when sendfile only completes a partial write.Dean Deng
Fixes #3779. PiperOrigin-RevId: 330057268
2020-09-03Use fine-grained mutex for stack.cleanupEndpoints.Bhasker Hariharan
stack.cleanupEndpoints is protected by the stack.mu but that can cause contention as the stack mutex is already acquired in a lot of hot paths during new endpoint creation /cleanup etc. Moving this to a fine grained mutex should reduce contention on the stack.mu. PiperOrigin-RevId: 330026151
2020-09-03Use atomic.Value for Stack.tcpProbeFunc.Jamie Liu
b/166980357#comment56 shows: - 837 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).StartTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop - 695 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).CompleteTransportEndpointCleanup gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1 gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop - 3882 goroutines blocked in: gvisor/pkg/sync/sync.(*RWMutex).Lock gvisor/pkg/tcpip/stack/stack.(*Stack).GetTCPProbe gvisor/pkg/tcpip/transport/tcp/tcp.newEndpoint gvisor/pkg/tcpip/transport/tcp/tcp.(*protocol).NewEndpoint gvisor/pkg/tcpip/stack/stack.(*Stack).NewEndpoint All of these are contending on Stack.mu. Stack.StartTransportEndpointCleanup() and Stack.CompleteTransportEndpointCleanup() insert/delete TransportEndpoints in a map (Stack.cleanupEndpoints), and the former also does endpoint unregistration while holding Stack.mu, so it's not immediately clear how feasible it is to replace the map with a mutex-less implementation or how much doing so would help. However, Stack.GetTCPProbe() just reads a function object (Stack.tcpProbeFunc) that is almost always nil (as far as I can tell, Stack.AddTCPProbe() is only called in tests), and it's called for every new TCP endpoint. So converting it to an atomic.Value should significantly reduce contention on Stack.mu, improving TCP endpoint creation latency and allowing TCP endpoint cleanup to proceed. PiperOrigin-RevId: 330004140