summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/socket
AgeCommit message (Collapse)Author
2020-10-23Rewrite reference leak checker without finalizers.Dean Deng
Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. - Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
2020-10-15sockets: ignore io.EOF from view.ReadAtAndrei Vagin
Reported-by: syzbot+5466463b7604c2902875@syzkaller.appspotmail.com PiperOrigin-RevId: 337451896
2020-10-14Fix SCM Rights reference leaks.Dean Deng
Control messages should be released on Read (which ignores the control message) or zero-byte Send. Otherwise, open fds sent through the control messages will be leaked. PiperOrigin-RevId: 337110774
2020-09-30ip6tables: redirect supportKevin Krakauer
Adds support for the IPv6-compatible redirect target. Redirection is a limited form of DNAT, where the destination is always the localhost. Updates #3549. PiperOrigin-RevId: 334698344
2020-09-30Make all Target.Action implementation pointer receiversKevin Krakauer
PiperOrigin-RevId: 334652998
2020-09-29iptables: remove unused min/max NAT range fieldsKevin Krakauer
PiperOrigin-RevId: 334531794
2020-09-29Don't allow broadcast/multicast source addressGhanan Gowripalan
As per relevant IP RFCS (see code comments), broadcast (for IPv4) and multicast addresses are not allowed. Currently checks for these are done at the transport layer, but since it is explicitly forbidden at the IP layers, check for them there. This change also removes the UDP.InvalidSourceAddress stat since there is no longer a need for it. Test: ip_test.TestSourceAddressValidation PiperOrigin-RevId: 334490971
2020-09-29iptables: refactor to make targets extendableKevin Krakauer
Like matchers, targets should use a module-like register/lookup system. This replaces the brittle switch statements we had before. The only behavior change is supporing IPT_GET_REVISION_TARGET. This makes it much easier to add IPv6 redirect in the next change. Updates #3549. PiperOrigin-RevId: 334469418
2020-09-29Merge pull request #3875 from btw616:fix/issue-3874gVisor bot
PiperOrigin-RevId: 334428344
2020-09-28Don't leak dentries returned by sockfs.NewDentry().Jamie Liu
PiperOrigin-RevId: 334263322
2020-09-24Add basic stateify annotations.Adin Scannell
Updates #1663 PiperOrigin-RevId: 333539293
2020-09-24Fix socket record leak in VFS2Tiwei Bie
VFS2 socket record is not removed from the system-wide socket table when the socket is released, which will lead to a memory leak. This patch fixes this issue. Fixes: #3874 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-20Merge pull request #3651 from ianlewis:ip-forwardinggVisor bot
PiperOrigin-RevId: 332760843
2020-09-18Count packets dropped by iptables in IPStatsKevin Krakauer
PiperOrigin-RevId: 332486383
2020-09-17ip6tables: filter table supportKevin Krakauer
`ip6tables -t filter` is now usable. NAT support will come in a future CL. #3549 PiperOrigin-RevId: 332381801
2020-09-17{Set,Get} SO_LINGER on all endpoints.Nayana Bidari
SO_LINGER is a socket level option and should be stored on all endpoints even though it is used to linger only for TCP endpoints. PiperOrigin-RevId: 332369252
2020-09-17Complete vfs2 implementation of fallocate.Dean Deng
This change includes overlay, special regular gofer files, and hostfs. Fixes #3589. PiperOrigin-RevId: 332330860
2020-09-17Return ENOPROTOOPT for all SOL_PACKET options.Bhasker Hariharan
This is required to make tcpdump work. tcpdump falls back to not using things like PACKET_RX_RING if setsockopt returns ENOPROTOOPT. This used to be the case before https://github.com/google/gvisor/commit/6f8fb7e0db2790ff1f5ba835780c03fe245e437f. Fixes #3981 PiperOrigin-RevId: 332326517
2020-09-16Automated rollback of changelist 329526153Nayana Bidari
PiperOrigin-RevId: 332097286
2020-09-11Move the 'marshal' and 'primitive' packages to the 'pkg' directory.Rahat Mahmood
PiperOrigin-RevId: 331256608
2020-09-08Improve type safety for transport protocol optionsGhanan Gowripalan
The existing implementation for TransportProtocol.{Set}Option take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for transport protocol options that may be set or queried which transport protocol option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. RELNOTES: n/a PiperOrigin-RevId: 330559811
2020-09-02Fix Accept to not return error for sockets in accept queue.Bhasker Hariharan
Accept on gVisor will return an error if a socket in the accept queue was closed before Accept() was called. Linux will return the new fd even if the returned socket is already closed by the peer say due to a RST being sent by the peer. This seems to be intentional in linux more details on the github issue. Fixes #3780 PiperOrigin-RevId: 329828404
2020-09-01Automated rollback of changelist 328350576Nayana Bidari
PiperOrigin-RevId: 329526153
2020-08-28Fix kernfs.Dentry reference leak.Nicolas Lacasse
PiperOrigin-RevId: 329036994
2020-08-27Improve type safety for socket optionsGhanan Gowripalan
The existing implementation for {G,S}etSockOpt take arguments of an empty interface type which all types (implicitly) implement; any type may be passed to the functions. This change introduces marker interfaces for socket options that may be set or queried which socket option types implement to ensure that invalid types are caught at compile time. Different interfaces are used to allow the compiler to enforce read-only or set-only socket options. Fixes #3714. RELNOTES: n/a PiperOrigin-RevId: 328832161
2020-08-27Add function to get error from a tcpip.EndpointGhanan Gowripalan
In an upcoming CL, socket option types are made to implement a marker interface with pointer receivers. Since this results in calling methods of an interface with a pointer, we incur an allocation when attempting to get an Endpoint's last error with the current implementation. When calling the method of an interface, the compiler is unable to determine what the interface implementation does with the pointer (since calling a method on an interface uses virtual dispatch at runtime so the compiler does not know what the interface method will do) so it allocates on the heap to be safe incase an implementation continues to hold the pointer after the functioon returns (the reference escapes the scope of the object). In the example below, the compiler does not know what b.foo does with the reference to a it allocates a on the heap as the reference to a may escape the scope of a. ``` var a int var b someInterface b.foo(&a) ``` This change removes the opportunity for that allocation. RELNOTES: n/a PiperOrigin-RevId: 328796559
2020-08-27ip6tables: (de)serialize ip6tables structsKevin Krakauer
More implementation+testing to follow. #3549. PiperOrigin-RevId: 328770160
2020-08-25Use new reference count utility throughout gvisor.Dean Deng
This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377
2020-08-25remove iptables sockopt special casesKevin Krakauer
iptables sockopts were kludged into an unnecessary check, this properly relegates them to the {get,set}SockOptIP functions. PiperOrigin-RevId: 328395135
2020-08-25Support SO_LINGER socket option.Nayana Bidari
When SO_LINGER option is enabled, the close will not return until all the queued messages are sent and acknowledged for the socket or linger timeout is reached. If the option is not set, close will return immediately. This option is mainly supported for connection oriented protocols such as TCP. PiperOrigin-RevId: 328350576
2020-08-25Fix TCP_LINGER2 behavior to match linux.Bhasker Hariharan
We still deviate a bit from linux in how long we will actually wait in FIN-WAIT-2. Linux seems to cap it with TIME_WAIT_LEN and it's not completely obvious as to why it's done that way. For now I think we can ignore that and fix it if it really is an issue. PiperOrigin-RevId: 328324922
2020-08-20Skip listening TCP ports when trying to bind a free port.Bhasker Hariharan
PiperOrigin-RevId: 327686558
2020-08-19ip6tables: move ipv4-specific logic into its own fileKevin Krakauer
A later change will introduce the equivalent IPv6 logic. #3549 PiperOrigin-RevId: 327499064
2020-08-17Merge branch 'master' into ip-forwardingIan Lewis
- Merges aleksej-paschenko's with HEAD - Adds vfs2 support for ip_forward
2020-08-17Remove weak references from unix sockets.Dean Deng
The abstract socket namespace no longer holds any references on sockets. Instead, TryIncRef() is used when a socket is being retrieved in BoundEndpoint(). Abstract sockets are now responsible for removing themselves from the namespace they are in, when they are destroyed. Updates #1486. PiperOrigin-RevId: 327064173
2020-08-13Migrate to PacketHeader API for PacketBuffer.Ting-Yu Wang
Formerly, when a packet is constructed or parsed, all headers are set by the client code. This almost always involved prepending to pk.Header buffer or trimming pk.Data portion. This is known to prone to bugs, due to the complexity and number of the invariants assumed across netstack to maintain. In the new PacketHeader API, client will call Push()/Consume() method to construct/parse an outgoing/incoming packet. All invariants, such as slicing and trimming, are maintained by the API itself. NewPacketBuffer() is introduced to create new PacketBuffer. Zero value is no longer valid. PacketBuffer now assumes the packet is a concatenation of following portions: * LinkHeader * NetworkHeader * TransportHeader * Data Any of them could be empty, or zero-length. PiperOrigin-RevId: 326507688
2020-08-10ip6tables: move target-specific code to targets.goKevin Krakauer
This is purely moving code, no changes. netfilter.go is cluttered and targets.go is a good place for this. #3549 PiperOrigin-RevId: 325879965
2020-08-05Add loss recovery option for TCP.Nayana Bidari
/proc/sys/net/ipv4/tcp_recovery is used to enable RACK loss recovery in TCP. PiperOrigin-RevId: 325157807
2020-08-03Plumbing context.Context to DecRef() and Release().Nayana Bidari
context is passed to DecRef() and Release() which is needed for SO_LINGER implementation. PiperOrigin-RevId: 324672584
2020-07-31iptables: support SO_ORIGINAL_DSTKevin Krakauer
Envoy (#170) uses this to get the original destination of redirected packets.
2020-07-28Redirect TODO to GitHub issuesFabricio Voznika
PiperOrigin-RevId: 323715260
2020-07-24Enable automated marshalling for netstack.Ayush Ranjan
PiperOrigin-RevId: 322954792
2020-07-23Merge pull request #3142 from tanjianfeng:fix-3141gVisor bot
PiperOrigin-RevId: 322937495
2020-07-23Marshallable socket opitons.Ayush Ranjan
Socket option values are now required to implement marshal.Marshallable. Co-authored-by: Rahat Mahmood <rahat@google.com> PiperOrigin-RevId: 322831612
2020-07-22iptables: replace maps with arraysKevin Krakauer
For iptables users, Check() is a hot path called for every packet one or more times. Let's avoid a bunch of map lookups. PiperOrigin-RevId: 322678699
2020-07-22Support for receiving outbound packets in AF_PACKET.Bhasker Hariharan
Updates #173 PiperOrigin-RevId: 322665518
2020-07-16Add support to return protocol in recvmsg for AF_PACKET.Bhasker Hariharan
Updates #173 PiperOrigin-RevId: 321690756
2020-07-15Fix minor bugs in a couple of interface IOCTLs.Bhasker Hariharan
gVisor incorrectly returns the wrong ARP type for SIOGIFHWADDR. This breaks tcpdump as it tries to interpret the packets incorrectly. Similarly, SIOCETHTOOL is used by tcpdump to query interface properties which fails with an EINVAL since we don't implement it. For now change it to return EOPNOTSUPP to indicate that we don't support the query rather than return EINVAL. NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities and NIC flags. In Linux all 3 exist eg. ARPHRD types are stored in dev->type field while NIC capabilities are more like the device features which can be queried using SIOCETHTOOL but not modified and NIC Flags are fields that can be modified from user space. eg. NIC status (UP/DOWN/MULTICAST/BROADCAST) etc. Updates #2746 PiperOrigin-RevId: 321436525
2020-07-15hostinet: fix fd leak in fdnotifier for VFS2Tiwei Bie
When we failed to create the new socket after adding the fd to fdnotifier, we should remove the fd from fdnotifier, because we are going to close the fd directly. Fixes: #3241 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-07-11Stub out SO_DETACH_FILTER.Bhasker Hariharan
Updates #2746 PiperOrigin-RevId: 320757963