summaryrefslogtreecommitdiffhomepage
path: root/pkg
AgeCommit message (Collapse)Author
2021-10-27Reduce eventFD notifications on transmit.Bhasker Hariharan
When transmitting packets we only need to notify if the peer is not already processing packets. sharedData region is used to enable/disable notifications and the peer will disable notifications when its actively processing packets and enable notifications just before it goes to sleep waiting on packets. This allows more efficient transmit as the sharedmem endpoint does not need to notify on eventFD and incur an expensive host systemcall when the peer is already awake. PiperOrigin-RevId: 406018843
2021-10-27rename tcp_conntrack inbound/outbound to reply/originalKevin Krakauer
Connection tracking is agnostic to whether the packet is inbound or outbound. It cares who initiated the connection. The naming can get confusing as conntrack can track connections originating from any host. Part of resolving #6736. PiperOrigin-RevId: 405997540
2021-10-27NAT ICMPv4 errorsGhanan Gowripalan
...so a NAT-ed connection's socket can handle ICMP errors. Updates #5916. PiperOrigin-RevId: 405970089
2021-10-27Record counts of packets with unknown L3/L4 numbersNick Brown
Previously, we recorded a single aggregated count. These per-protocol counts can help us debug field issues when frames are dropped for this reason. PiperOrigin-RevId: 405913911
2021-10-26Simplify vfs.NewDisconnectedMount signature and callpoints.Ayush Ranjan
vfs.NewDisconnectedMount has no error paths. Its much prettier without the error return value. Also simplify MountDisconnected which would immediately drop the refs taken by NewDisconnectedMount. Instead make it directly call newMount. PiperOrigin-RevId: 405767966
2021-10-26Validate an icmp header before accessing itAndrei Vagin
A header can't be smaller than header.ICMPv4MinimumSize. Reported-by: syzbot+57b68b14b4f6a58bf985@syzkaller.appspotmail.com PiperOrigin-RevId: 405748438
2021-10-26platform/kvm: map vdso and vvar into a guest address spaceAndrei Vagin
Right now, each vdso call triggers vmexit. VDSO and VVAR pages are mapped with VM_IO and get_user_pages fails for such vma-s. KVM was not able to handle this case up to the v4.8 kernel. This problem was fixed by add6a0cd1c5ba ("KVM: MMU: try to fix up page faults before giving up"). For some unknown reasons, it still doesn't work in case of nested virtualization. Before: BenchmarkKernelVDSO-6 252519 4598 ns/op After: BenchmarkKernelVDSO-6 34431957 34.91 ns/op PiperOrigin-RevId: 405715941
2021-10-26Obtain ref on root dentry in mqfs.GetFilesystem.Ayush Ranjan
As documented in FilesystemType.GetFilesystem, a reference should be taken on the returned dentry and filesystem by GetFilesystem implementation. mqfs did not do that. Additionally cleanup and clarify ref counting of dentry, filesystem and mount in mqfs. Reported-by: syzbot+a2c54bfb6e1525228e5f@syzkaller.appspotmail.com Reported-by: syzbot+ccd305cdab11cfebbfff@syzkaller.appspotmail.com PiperOrigin-RevId: 405700565
2021-10-26Move attestation definitions to standalone packageChong Cai
PiperOrigin-RevId: 405698863
2021-10-26Change Notify() to use unix.RawSyscall.Bhasker Hariharan
eventfd.Notify() uses unix.Write which will eventually call unix.Syscall which will yield the current go processor resulting in the Go scheduler parking the current goroutine till the syscall returns. But in most cases where Notify() is called there is no reason to yield as the caller probably wants to continue doing something right afterwards. Like in the case of the sharedmem endpoint which may still have more packets to write. PiperOrigin-RevId: 405693801
2021-10-26Ensure statfs::f_namelen is set by VFS2 gofer statfs/fstatfs.Jamie Liu
VFS1 discards the value of f_namelen returned by the filesystem and returns NAME_MAX unconditionally instead, so it doesn't run into this. Also set f_frsize for completeness. PiperOrigin-RevId: 405579707
2021-10-25Do not leak non-permission mode bits in mq_open(2).Ayush Ranjan
As caught by syzkaller, we were leaking non-permission bits while passing the user generated mode. DynamicBytesFile panics in this case. Reported-by: syzbot+5abe52d47d56a5a98c89@syzkaller.appspotmail.com PiperOrigin-RevId: 405481392
2021-10-25Add support for containerd 1.5Fabricio Voznika
"cri.runtimeoptions.v1" moved to "runtimeoptions.v1" and containerd configuration format version 2 is required. Updates #6449 PiperOrigin-RevId: 405474653
2021-10-23initialize hostFeatureSet from init functionJing Chen
2021-10-23fix the failed test target //pkg/cpuid:cpuid_test on arm64.Jing Chen
2021-10-21Merge pull request #6345 from sudo-sturbia:mq/syscallsgVisor bot
PiperOrigin-RevId: 404901660
2021-10-21Add an integration test for istio like redirect.Bhasker Hariharan
Updates #6441,#6317 PiperOrigin-RevId: 404872327
2021-10-20Report correct error when restore failsFabricio Voznika
When file corruption is detected, report vfs.ErrCorruption to distinguish corruption error from other restore errors. Updates #1035 PiperOrigin-RevId: 404588445
2021-10-19Always parse Transport headersGhanan Gowripalan
..including ICMP headers before delivering them to the TransportDispatcher. Updates #3810. PiperOrigin-RevId: 404404002
2021-10-19Fix typo in FIXMEFabricio Voznika
PiperOrigin-RevId: 404400399
2021-10-19Do not return non-nil *lisafs.Inode to doCreateAt on error.Ayush Ranjan
lisafs.ClientFile.MkdirAt is allowed to return a non-nil Inode and a non-nil error on an RPC error. The caller must not use the returned (invalid) Inode on error. But a code path in the gofer client does end up using it. More specifically, when the Mkdir RPC fails and we end up creating a synthetic dentry for a mountpoint, we end up returning the (invalid) non-nil Inode to filesystem.doCreateAt implementation which thinks that a remote file was created. But that non-nil Inode is actually invalid because the RPC failed. Things go downhill from there. Update client to not use childDirInode if RPC failed. PiperOrigin-RevId: 404396573
2021-10-19Continue reaping bucket after reaping a tupleGhanan Gowripalan
Reaping an expired tuple removes it from its bucket so we need to grab the succeeding tuple in the bucket before reaping the expired tuple. Before this change, only the first expired tuple in a bucket was reaped per reaper run on the bucket. This change just allows more connections to be reaped. PiperOrigin-RevId: 404392925
2021-10-19Stub cpuset cgroup control files.Rahat Mahmood
PiperOrigin-RevId: 404382475
2021-10-18conntrack: update state of un-NATted connectionsKevin Krakauer
This prevents reaping connections unnecessarily early. This change both moves the state update to the beginning of handlePacket and fixes a bug where un-finalized connections could become un-reapable. Fixes #6748 PiperOrigin-RevId: 404141012
2021-10-18conntrack: use tcpip.Clock instead of time.TimeKevin Krakauer
- We should be using a monotonic clock - This will make future testing easier Updates #6748. PiperOrigin-RevId: 404072318
2021-10-18Report ramdiskfs usage correctlyFabricio Voznika
Updates #1035 PiperOrigin-RevId: 404072231
2021-10-18Support distinction for RWMutex and read-only locks.Adin Scannell
Fixes #6590 PiperOrigin-RevId: 404007524
2021-10-15Satisfy nogoGhanan Gowripalan
PiperOrigin-RevId: 403479257
2021-10-15Implement WriteRawPacket for pipeTony Gong
Implement WriteRawPacket for pipe by calling `DeliverNetworkPacket` on the other end with empty values for the route and protocol number, and relies on the `NetworkDispatcher` to decapsulate the link layer header from the raw packet itself. PiperOrigin-RevId: 403461448
2021-10-14Report total memory based on limit or hostFabricio Voznika
gVisor was previously reporting the lower of cgroup limit or 2GB as total memory. This may cause applications to make bad decisions based on amount of memory available to them when more than 2GB is required. This change makes the lower of cgroup limit or the host total memory to be reported inside the sandbox. This also is more inline with docker which always reports host total memory. Note that reporting cgroup limit is strictly better than host total memory when there is a limit set. Fixes #5608 PiperOrigin-RevId: 403241608
2021-10-14Add a size parameterChong Cai
PiperOrigin-RevId: 403214414
2021-10-13Minor fixes to sharedmem.Bhasker Hariharan
Use route/protocol from packetbuffer. Sharedmem implementation should use the EgressRoute/NetworkProtocolNumber embedded in the packetbuffer rather than what is passed as parameters to Write(Raw)Packet(s). PiperOrigin-RevId: 402934171
2021-10-13add create-only raw socketsKevin Krakauer
These can be used by applications to manipulate iptables rules without enabling arbitrary reads from and writes to the underlying packet socket. PiperOrigin-RevId: 402924733
2021-10-13Represent direction with booleanGhanan Gowripalan
...since direction can only hold one of two possible values. PiperOrigin-RevId: 402855698
2021-10-12Support Twice NATGhanan Gowripalan
This CL allows both SNAT and DNAT targets to be performed on the same packet. Fixes #5696. PiperOrigin-RevId: 402714738
2021-10-12Create constants for Keepalive defaults.Bhasker Hariharan
Fixes #6725 PiperOrigin-RevId: 402683244
2021-10-12Separate DNAT and SNAT manip statesGhanan Gowripalan
This change also refactors the conntrack packet handling code to not perform the actual rewriting of the packet while holding the lock. This change prepares for a followup CL that adds support for twice-NAT. Updates #5696. PiperOrigin-RevId: 402671685
2021-10-12Make cgroup creation/deletion more robustFabricio Voznika
- Don't attempt to create directory is controller is not present in the system - Ensure that all files being written exist in cgroupfs - Attempt to delete directories during Uninstall even if other deletions have failed Fixes #6446 PiperOrigin-RevId: 402614820
2021-10-12Remove state:"nosave"/"zerovalue" annotations from all waiter.Queues.Jamie Liu
Prior to cl/318010298, //pkg/state couldn't handle pointers to struct fields, which meant that it couldn't handle intrusive linked lists, which meant that it couldn't handle waiter.Queue, which meant that it couldn't handle epoll. As a result, VFS1 unregisters all epoll waiters before saving and re-registers them after loading, and waitable VFS1 file implementations tag their waiter.Queues state:"nosave" (causing them to be skipped by the save/restore machinery) or state:"zerovalue" (causing them to only be checked for zero-value-equality on save). VFS2 required cl/318010298 to support save/restore (due to the Impl inheritance pattern used by vfs.FileDescription, vfs.Dentry, etc.); correspondingly, VFS2 epoll assumes that waiter.Queues *will be* saved and loaded correctly, and VFS2 file implementations do not tag waiter.Queues. Some waiter.Queues, e.g. pipe.Pipe.Queue and kernel.Task.signalQueue, are used by both VFS1 and VFS2 (the latter via signalfd); as a result of the above, tagging these Queues state:"nosave" or state:"zerovalue" breaks VFS2 epoll. Remove VFS1 epoll unregistration before saving (bringing it in line with VFS2), and remove these tags from all waiter.Queues. Also clean up after the epoll test added by cl/402323053, which implied this issue (by instantiating DisableSave in the new test) without reporting it. PiperOrigin-RevId: 402596216
2021-10-11Support DNAT targetGhanan Gowripalan
PiperOrigin-RevId: 402468096
2021-10-11Create subcontainer cgroups for compatibilityFabricio Voznika
Tools (e.g. cAdvisor) watches for changes inside /sys/fs/cgroup to detect when containers are created and deleted. With gVisor, container cgroups were not created because the containers are not visible to the host. This change enables the creation of [empty] subcontainer cgroups that can be used by tools to detect creation/deletion of subcontainers. This change required a new annotation to be added so that the shim can communicate the pod cgroup path to runsc, so pod and container cgroups can be identified, Fixes #6500 PiperOrigin-RevId: 402392291
2021-10-11Add unit test for Redirect targetGhanan Gowripalan
We already have integration tests `make iptables-tests` that tests the REDIRECT target, but unit tests are a lot faster and easier to run than the integration test. PiperOrigin-RevId: 402365412
2021-10-11Support IP_PKTINFO and IPV6_RECVPKTINFO on raw socketsGhanan Gowripalan
Updates #1584, #3556. PiperOrigin-RevId: 402354066
2021-10-11Merge pull request #6428 from dillanzhou:fix_epoll_vfs2gVisor bot
PiperOrigin-RevId: 402323053
2021-10-08Remove ring0 floating point save/load functions on amd64.Jamie Liu
ring0.Save/LoadFloatingPoint() are only usable if the caller can ensure that Go will not clobber floating point registers before/after calling them respectively. Due to regabig in Go 1.17, this is no longer the case; regabig (among other things) maintains a zeroed XMM15 during ABIInternal execution, including by zeroing it after ABI0-to-ABIInternal transitions. In ring0.sysenter/exception, this happens in ring0.kernelSyscall/kernelException.abi0 respectively; in ring0.CPU.SwitchToUser, this happens after returning from ring0.sysret/iret.abi0. Delete these functions and do floating point save/load in assembly. While arm64 doesn't appear to be immediately affected (so this CL permits us to resume usage of Go 1.17), its use of Save/LoadFloatingPoint() still seems to be incorrect for the same fundamental reason (Go code can't sanely assume what registers the Go compiler will or won't use) and should be fixed eventually. PiperOrigin-RevId: 401895658
2021-10-08Remove redundant slice copy in lisafs gofer client.Ayush Ranjan
listXattr() was doing redundant work. Remove it. PiperOrigin-RevId: 401871315
2021-10-08Disallow "trusted" namespace xattr in VFS2 gofer client.Ayush Ranjan
Allowing this namespace makes way for a lot of GetXattr RPCs to the gofer process when the gofer filesystem is the lower layer of an overlay. The overlay filesystem aggressively queries for "trusted.overlay.opaque" which in practice is never found in the lower layer gofer. But leads to a lot of wasted work. A consequence is that mutable gofer upper layer is not supported anymore but that is still consistent with VFS1. We can revisit when need arises. PiperOrigin-RevId: 401860585
2021-10-07add convenient wrapper for eventfdKevin Krakauer
The same create/write/read pattern is copied around several places. It's easier to understand in a package with names and comments, and we can reuse the smart blocking code in package rawfile. PiperOrigin-RevId: 401647108
2021-10-07Add a new metric to detect the number of spurious loss recoveries.Nayana Bidari
- Implements RFC 3522 (Eifel detection algorithm) to detect if the connection entered loss recovery unnecessarily. - Added a new metric to count the total number of spurious loss recoveries. - Added tests to verify the new metric. PiperOrigin-RevId: 401637359
2021-10-07tests: use a proper path to the kvm deviceAndrei Vagin
PiperOrigin-RevId: 401624134