Age | Commit message (Collapse) | Author |
|
Running multiple instances of netstack in the same network namespace can
cause collisions when enabling packet fanout for fdbased endpoints. The
only bulletproof fix is to run in different network namespaces, but by
using `getpid()` instead of 0 as the fanout ID starting point we can
avoid collisions in the common case, particularly when
testing/experimenting.
Addresses #6124
|
|
- Unused constants
- Unused functions
- Unused arguments
- Unkeyed literals
- Unnecessary conversions
PiperOrigin-RevId: 375253464
|
|
...as all GSO capable endpoints must implement GSOEndpoint.
PiperOrigin-RevId: 371804175
|
|
Co-Author: ayushranjan
PiperOrigin-RevId: 370785009
|
|
With this change, GSO options no longer needs to be passed around as
a function argument in the write path.
This change is done in preparation for a later change that defers
segmentation, and may change GSO options for a packet as it flows
down the stack.
Updates #170.
PiperOrigin-RevId: 369774872
|
|
On Linux these are meant to be equivalent to POLLIN/POLLOUT. Rather
than hack these on in sys_poll etc it felt cleaner to just cleanup
the call sites to notify for both events. This is what linux does
as well.
Fixes #5544
PiperOrigin-RevId: 364859977
|
|
PiperOrigin-RevId: 364859173
|
|
Doing so involved breaking dependencies between //pkg/tcpip and the rest
of gVisor, which are discouraged anyways.
Tested on the Go branch via:
gvisor.dev/gvisor/pkg/tcpip/...
Addresses #1446.
PiperOrigin-RevId: 363081778
|
|
- Implement Stringer for it so that we can improve error messages.
- Use TCPFlags through the code base. There used to be a mixed usage of byte,
uint8 and int as TCP flags.
PiperOrigin-RevId: 361940150
|
|
One of the preparation to decouple underlying buffer implementation.
There are still some methods that tie to VectorisedView, and they will be
changed gradually in later CLs.
This CL also introduce a new ICMPv6ChecksumParams to replace long list of
parameters when calling ICMPv6Checksum, aiming to be more descriptive.
PiperOrigin-RevId: 360778149
|
|
The syscall package has been deprecated in favor of golang.org/x/sys.
Note that syscall is still used in the following places:
- pkg/sentry/socket/hostinet/stack.go: some netlink related functionalities
are not yet available in golang.org/x/sys.
- syscall.Stat_t is still used in some places because os.FileInfo.Sys() still
returns it and not unix.Stat_t.
Updates #214
PiperOrigin-RevId: 360701387
|
|
- Use atomic add rather than CAS in every Gate method, which is slightly
faster in most cases.
- Implement Close wakeup using gopark/goready to avoid channel allocation.
New benchmarks:
name old time/op new time/op delta
GateEnterLeave-12 16.7ns ± 1% 10.3ns ± 1% -38.44% (p=0.000 n=9+8)
GateClose-12 50.2ns ± 8% 42.4ns ± 6% -15.44% (p=0.000 n=10+10)
GateEnterLeaveAsyncClose-12 972ns ± 2% 640ns ± 7% -34.15% (p=0.000 n=9+10)
PiperOrigin-RevId: 359336344
|
|
These are bumped to allow early testing of Go 1.17. Use will be audited closer
to the 1.17 release.
PiperOrigin-RevId: 358278615
|
|
Before this change, packets were delivered asynchronously to the remote
end of a pipe. This was to avoid a deadlock during link resolution where
the stack would attempt to double-lock a mutex (see removed comments in
the parent commit for details).
As of https://github.com/google/gvisor/commit/4943347137, we do not hold
locks while sending link resolution probes so the deadlock will no
longer occur.
PiperOrigin-RevId: 356066224
|
|
This makes it possible to add data to types that implement tcpip.Error.
ErrBadLinkEndpoint is removed as it is unused.
PiperOrigin-RevId: 354437314
|
|
PiperOrigin-RevId: 353755271
|
|
Test: integration_test.TestWritePacketsLinkResolution
Fixes #4458.
PiperOrigin-RevId: 353108826
|
|
fdbased and qdisc layers expect these fields to already be
populated before being reached.
PiperOrigin-RevId: 353099492
|
|
Test: integration_test.TestGetLinkAddress
PiperOrigin-RevId: 352119404
|
|
stack.Route is used to send network packets and resolve link addresses.
A LinkEndpoint does not need to do either of these and only needs the
route's fields at the time of the packet write request.
Since LinkEndpoints only need the route's fields when writing packets,
pass a stack.RouteInfo instead.
PiperOrigin-RevId: 352108405
|
|
We loop over the list of packets anyways so setting these aren't
expensive.
Now that they are populated only by the link endpoint that uses them,
TCP does not need to.
PiperOrigin-RevId: 352090853
|
|
Whether the variable was found is already returned by syscall.Getenv.
os.Getenv drops this value while os.Lookupenv passes it along.
PiperOrigin-RevId: 351674032
|
|
These are primarily simplification and lint mistakes. However, minor
fixes are also included and tests added where appropriate.
PiperOrigin-RevId: 351425971
|
|
Ethernet frames are usually filtered at the hardware-level so there is
no need to filter the frames in software.
For test purposes, a new link endpoint was introduced to filter frames
based on their destination.
PiperOrigin-RevId: 350422941
|
|
This condition was inverted in 360006d.
PiperOrigin-RevId: 348679088
|
|
Removes the period of time in which subseqeuent traffic to a Failed neighbor
immediately fails with ErrNoLinkAddress. A Failed neighbor is one in which
address resolution fails; or in other words, the neighbor's IP address cannot
be translated to a MAC address.
This means removing the Failed state for linkAddrCache and allowing transitiong
out of Failed into Incomplete for neighborCache. Previously, both caches would
transition entries to Failed after address resolution fails. In this state, any
subsequent traffic requested within an unreachable time would immediately fail
with ErrNoLinkAddress. This does not follow RFC 4861 section 7.3.3:
If address resolution fails, the entry SHOULD be deleted, so that subsequent
traffic to that neighbor invokes the next-hop determination procedure again.
Invoking next-hop determination at this point ensures that alternate default
routers are tried.
The API for getting a link address for a given address, whether through the link
address cache or the neighbor table, is updated to optionally take a callback
which will be called when address resolution completes. This allows `Route` to
handle completing link resolution internally, so callers of (*Route).Resolve
(e.g. endpoints) don’t have to keep track of when it completes and update the
Route accordingly.
This change also removes the wakers from LinkAddressCache, NeighborCache, and
Route in favor of the callbacks, and callers that previously used a waker can
now just pass a callback to (*Route).Resolve that will notify the waker on
resolution completion.
Fixes #4796
Startblock:
has LGTM from sbalana
and then
add reviewer ghanan
PiperOrigin-RevId: 348597478
|
|
fdbased endpoint was enabling fragment reassembly on the host AF_PACKET socket
to ensure that fragments are delivered inorder to the right dispatcher. But this
prevents fragments from being delivered to gvisor at all and makes testing of
gvisor's fragment reassembly code impossible.
The potential impact from this is minimal since IP Fragmentation is not really
that prevelant and in cases where we do get fragments we may deliver the
fragment out of order to the TCP layer as multiple network dispatchers may
process the fragments and deliver a reassembled fragment after the next packet
has been delivered to the TCP endpoint. While not desirable I believe the impact
from this is minimal due to low prevalence of fragmentation.
Also removed PktType and Hatype fields when binding the socket as these are not
used when binding. Its just confusing to have them specified.
See: https://man7.org/linux/man-pages/man7/packet.7.html
"Fields used for binding are
sll_family (should be AF_PACKET), sll_protocol, and sll_ifindex."
Fixes #5055
PiperOrigin-RevId: 346919439
|
|
Currently we rely on the user to take the lock on the endpoint that owns the
route, in order to modify it safely. We can instead move
`Route.RemoteLinkAddress` under `Route`'s mutex, and allow non-locking and
thread-safe access to other fields of `Route`.
PiperOrigin-RevId: 345461586
|
|
Multiple goroutines may use the same stack.Route concurrently so
the stack.Route should make sure that any functions called on it
are thread-safe.
Fixes #4073
PiperOrigin-RevId: 344320491
|
|
Redefine stack.WritePacket into stack.WritePacketToRemote which lets the NIC
decide whether to append link headers.
PiperOrigin-RevId: 343071742
|
|
A prefix associated with a sniffer instance can help debug situations where
more than one NIC (i.e. more than one sniffer) exists.
PiperOrigin-RevId: 342950027
|
|
- Make AddressableEndpoint optional for NetworkEndpoint.
Not all NetworkEndpoints need to support addressing (e.g. ARP), so
AddressableEndpoint should only be implemented for protocols that
support addressing such as IPv4 and IPv6.
With this change, tcpip.ErrNotSupported will be returned by the stack
when attempting to modify addresses on a network endpoint that does
not support addressing.
Now that packets are fully handled at the network layer, and (with this
change) addresses are optional for network endpoints, we no longer need
the workaround for ARP where a fake ARP address was added to each NIC
that performs ARP so that packets would be delivered to the ARP layer.
PiperOrigin-RevId: 342722547
|
|
This lets us avoid treating a value of 0 as one reference. All references
using the refsvfs2 template must call InitRefs() before the reference is
incremented/decremented, or else a panic will occur. Therefore, it should be
pretty easy to identify missing InitRef calls during testing.
Updates #1486.
PiperOrigin-RevId: 341411151
|
|
PiperOrigin-RevId: 341135083
|
|
Our current reference leak checker uses finalizers to verify whether an object
has reached zero references before it is garbage collected. There are multiple
problems with this mechanism, so a rewrite is in order.
With finalizers, there is no way to guarantee that a finalizer will run before
the program exits. When an unreachable object with a finalizer is garbage
collected, its finalizer will be added to a queue and run asynchronously. The
best we can do is run garbage collection upon sandbox exit to make sure that
all finalizers are enqueued.
Furthermore, if there is a chain of finalized objects, e.g. A points to B
points to C, garbage collection needs to run multiple times before all of the
finalizers are enqueued. The first GC run will register the finalizer for A but
not free it. It takes another GC run to free A, at which point B's finalizer
can be registered. As a result, we need to run GC as many times as the length
of the longest such chain to have a somewhat reliable leak checker.
Finally, a cyclical chain of structs pointing to one another will never be
garbage collected if a finalizer is set. This is a well-known issue with Go
finalizers (https://github.com/golang/go/issues/7358). Using leak checking on
filesystem objects that produce cycles will not work and even result in memory
leaks.
The new leak checker stores reference counted objects in a global map when
leak check is enabled and removes them once they are destroyed. At sandbox
exit, any remaining objects in the map are considered as leaked. This provides
a deterministic way of detecting leaks without relying on the complexities of
finalizers and garbage collection.
This approach has several benefits over the former, including:
- Always detects leaks of objects that should be destroyed very close to
sandbox exit. The old checker very rarely detected these leaks, because it
relied on garbage collection to be run in a short window of time.
- Panics if we forgot to enable leak check on a ref-counted object (we will try
to remove it from the map when it is destroyed, but it will never have been
added).
- Can store extra logging information in the map values without adding to the
size of the ref count struct itself. With the size of just an int64, the ref
count object remains compact, meaning frequent operations like IncRef/DecRef
are more cache-efficient.
- Can aggregate leak results in a single report after the sandbox exits.
Instead of having warnings littered in the log, which were
non-deterministically triggered by garbage collection, we can print all
warning messages at once. Note that this could also be a limitation--the
sandbox must exit properly for leaks to be detected.
Some basic benchmarking indicates that this change does not significantly
affect performance when leak checking is enabled, which is understandable
since registering/unregistering is only done once for each filesystem object.
Updates #1486.
PiperOrigin-RevId: 338685972
|
|
PiperOrigin-RevId: 338168977
|
|
Before this change, if a link header was included in an incoming packet
that is forwarded, the packet that gets sent out will take the original
packet and add a link header to it while keeping the old link header.
This would make the sent packet look like:
OUTGOING LINK HDR | INCOMING LINK HDR | NETWORK HDR | ...
Obviously this is incorrect as we should drop the incoming link header
and only include the outgoing link header. This change fixes this bug.
Test: integration_test.TestForwarding
PiperOrigin-RevId: 337571447
|
|
PiperOrigin-RevId: 336339194
|
|
PiperOrigin-RevId: 336304024
|
|
When a response needs to be sent to an incoming packet, the stack should
consult its neighbour table to determine the remote address's link
address.
When an entry does not exist in the stack's neighbor table, the stack
should queue the packet while link resolution completes. See comments.
PiperOrigin-RevId: 336185457
|
|
Extract parsing utilities so they can be used by the sniffer.
Fixes #3930
PiperOrigin-RevId: 332401880
|
|
Neither POSIX.1 nor Linux defines an upperbound for errno.
PiperOrigin-RevId: 332085017
|
|
This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and
vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted
soon anyway.
The following structs now use the new tool, with leak check enabled:
devpts:rootInode
fuse:inode
kernfs:Dentry
kernfs:dir
kernfs:readonlyDir
kernfs:StaticDirectory
proc:fdDirInode
proc:fdInfoDirInode
proc:subtasksInode
proc:taskInode
proc:tasksInode
vfs:FileDescription
vfs:MountNamespace
vfs:Filesystem
sys:dir
kernel:FSContext
kernel:ProcessGroup
kernel:Session
shm:Shm
mm:aioMappable
mm:SpecialMappable
transport:queue
And the following use the template, but because they currently are not leak
checked, a TODO is left instead of enabling leak check in this patch:
kernel:FDTable
tun:tunEndpoint
Updates #1486.
PiperOrigin-RevId: 328460377
|
|
This enables pre-release testing with 1.16. The intention is to replace these
with a nogo check before the next release.
PiperOrigin-RevId: 328193911
|
|
Formerly, when a packet is constructed or parsed, all headers are set by the
client code. This almost always involved prepending to pk.Header buffer or
trimming pk.Data portion. This is known to prone to bugs, due to the complexity
and number of the invariants assumed across netstack to maintain.
In the new PacketHeader API, client will call Push()/Consume() method to
construct/parse an outgoing/incoming packet. All invariants, such as slicing
and trimming, are maintained by the API itself.
NewPacketBuffer() is introduced to create new PacketBuffer. Zero value is no
longer valid.
PacketBuffer now assumes the packet is a concatenation of following portions:
* LinkHeader
* NetworkHeader
* TransportHeader
* Data
Any of them could be empty, or zero-length.
PiperOrigin-RevId: 326507688
|
|
PiperOrigin-RevId: 326129258
|
|
context is passed to DecRef() and Release() which is
needed for SO_LINGER implementation.
PiperOrigin-RevId: 324672584
|
|
Updates #173
PiperOrigin-RevId: 322665518
|
|
Now it calls pkt.Data.ToView() when writing the packet. This may require
copying when the packet is large, which puts the worse case in an even worse
situation.
This sent out in a separate preparation change as it requires syscall filter
changes. This change will be followed by the change for the adoption of the new
PacketHeader API.
PiperOrigin-RevId: 321447003
|
|
gVisor incorrectly returns the wrong ARP type for SIOGIFHWADDR. This breaks
tcpdump as it tries to interpret the packets incorrectly.
Similarly, SIOCETHTOOL is used by tcpdump to query interface properties which
fails with an EINVAL since we don't implement it. For now change it to return
EOPNOTSUPP to indicate that we don't support the query rather than return
EINVAL.
NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities
and NIC flags. In Linux all 3 exist eg. ARPHRD types are stored in dev->type
field while NIC capabilities are more like the device features which can be
queried using SIOCETHTOOL but not modified and NIC Flags are fields that can
be modified from user space. eg. NIC status (UP/DOWN/MULTICAST/BROADCAST) etc.
Updates #2746
PiperOrigin-RevId: 321436525
|