Age | Commit message (Collapse) | Author |
|
Fixes #3779.
PiperOrigin-RevId: 330057268
|
|
stack.cleanupEndpoints is protected by the stack.mu but that can cause
contention as the stack mutex is already acquired in a lot of hot paths during
new endpoint creation /cleanup etc. Moving this to a fine grained mutex should
reduce contention on the stack.mu.
PiperOrigin-RevId: 330026151
|
|
b/166980357#comment56 shows:
- 837 goroutines blocked in:
gvisor/pkg/sync/sync.(*RWMutex).Lock
gvisor/pkg/tcpip/stack/stack.(*Stack).StartTransportEndpointCleanup
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop
- 695 goroutines blocked in:
gvisor/pkg/sync/sync.(*RWMutex).Lock
gvisor/pkg/tcpip/stack/stack.(*Stack).CompleteTransportEndpointCleanup
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).cleanupLocked
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).completeWorkerLocked
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop.func1
gvisor/pkg/tcpip/transport/tcp/tcp.(*endpoint).protocolMainLoop
- 3882 goroutines blocked in:
gvisor/pkg/sync/sync.(*RWMutex).Lock
gvisor/pkg/tcpip/stack/stack.(*Stack).GetTCPProbe
gvisor/pkg/tcpip/transport/tcp/tcp.newEndpoint
gvisor/pkg/tcpip/transport/tcp/tcp.(*protocol).NewEndpoint
gvisor/pkg/tcpip/stack/stack.(*Stack).NewEndpoint
All of these are contending on Stack.mu. Stack.StartTransportEndpointCleanup()
and Stack.CompleteTransportEndpointCleanup() insert/delete TransportEndpoints
in a map (Stack.cleanupEndpoints), and the former also does endpoint
unregistration while holding Stack.mu, so it's not immediately clear how
feasible it is to replace the map with a mutex-less implementation or how much
doing so would help. However, Stack.GetTCPProbe() just reads a function object
(Stack.tcpProbeFunc) that is almost always nil (as far as I can tell,
Stack.AddTCPProbe() is only called in tests), and it's called for every new TCP
endpoint. So converting it to an atomic.Value should significantly reduce
contention on Stack.mu, improving TCP endpoint creation latency and allowing
TCP endpoint cleanup to proceed.
PiperOrigin-RevId: 330004140
|
|
blaze test <test_name>_fuchsia_test will run the corresponding packetimpact
test against fuchsia.
PiperOrigin-RevId: 329835290
|
|
Accept on gVisor will return an error if a socket in the accept queue was closed
before Accept() was called. Linux will return the new fd even if the returned
socket is already closed by the peer say due to a RST being sent by the peer.
This seems to be intentional in linux more details on the github issue.
Fixes #3780
PiperOrigin-RevId: 329828404
|
|
PiperOrigin-RevId: 329825497
|
|
Updates #1199
PiperOrigin-RevId: 329802274
|
|
PiperOrigin-RevId: 329801584
|
|
- Make sync.SeqCountEpoch not a struct. This allows sync.SeqCount.BeginRead()
to be inlined.
- Mark sync.SeqAtomicLoad<T> nosplit to mitigate the Go compiler's refusal to
inline it. (Best I could get was "cost 92 exceeds budget 80".)
- Use runtime-guided spinning in SeqCount.BeginRead().
Benchmarks:
name old time/op new time/op delta
pkg:pkg/sync/sync goos:linux goarch:amd64
SeqCountWriteUncontended-12 8.24ns ± 0% 11.40ns ± 0% +38.35% (p=0.000 n=10+10)
SeqCountReadUncontended-12 0.33ns ± 0% 0.14ns ± 3% -57.77% (p=0.000 n=7+8)
pkg:pkg/sync/seqatomictest/seqatomic goos:linux goarch:amd64
SeqAtomicLoadIntUncontended-12 0.64ns ± 1% 0.41ns ± 1% -36.40% (p=0.000 n=10+8)
SeqAtomicTryLoadIntUncontended-12 0.18ns ± 4% 0.18ns ± 1% ~ (p=0.206 n=10+8)
AtomicValueLoadIntUncontended-12 0.27ns ± 3% 0.27ns ± 0% -1.77% (p=0.000 n=10+8)
(atomic.Value.Load is, of course, inlined. We would expect an uncontended
inline SeqAtomicLoad<int> to perform identically to SeqAtomicTryLoad<int>.) The
"regression" in BenchmarkSeqCountWriteUncontended, despite this CL changing
nothing in that path, is attributed to microarchitectural subtlety; the
benchmark loop is unchanged except for its address:
Before this CL:
:0 0x4e62d1 48ffc2 INCQ DX
:0 0x4e62d4 48399110010000 CMPQ DX, 0x110(CX)
:0 0x4e62db 7e26 JLE 0x4e6303
:0 0x4e62dd 90 NOPL
:0 0x4e62de bb01000000 MOVL $0x1, BX
:0 0x4e62e3 f00fc118 LOCK XADDL BX, 0(AX)
:0 0x4e62e7 ffc3 INCL BX
:0 0x4e62e9 0fbae300 BTL $0x0, BX
:0 0x4e62ed 733a JAE 0x4e6329
:0 0x4e62ef 90 NOPL
:0 0x4e62f0 bb01000000 MOVL $0x1, BX
:0 0x4e62f5 f00fc118 LOCK XADDL BX, 0(AX)
:0 0x4e62f9 ffc3 INCL BX
:0 0x4e62fb 0fbae300 BTL $0x0, BX
:0 0x4e62ff 73d0 JAE 0x4e62d1
After this CL:
:0 0x4e6361 48ffc2 INCQ DX
:0 0x4e6364 48399110010000 CMPQ DX, 0x110(CX)
:0 0x4e636b 7e26 JLE 0x4e6393
:0 0x4e636d 90 NOPL
:0 0x4e636e bb01000000 MOVL $0x1, BX
:0 0x4e6373 f00fc118 LOCK XADDL BX, 0(AX)
:0 0x4e6377 ffc3 INCL BX
:0 0x4e6379 0fbae300 BTL $0x0, BX
:0 0x4e637d 733a JAE 0x4e63b9
:0 0x4e637f 90 NOPL
:0 0x4e6380 bb01000000 MOVL $0x1, BX
:0 0x4e6385 f00fc118 LOCK XADDL BX, 0(AX)
:0 0x4e6389 ffc3 INCL BX
:0 0x4e638b 0fbae300 BTL $0x0, BX
:0 0x4e638f 73d0 JAE 0x4e6361
PiperOrigin-RevId: 329754148
|
|
This is to cover the common pattern: open->read/write->close,
where SetAttr needs to be called to update atime/mtime before
the file is closed.
Benchmark results:
BM_OpenReadClose/10240 CPU
setattr+clunk: 63783 ns
VFS2: 68109 ns
VFS1: 72507 ns
Updates #1198
PiperOrigin-RevId: 329628461
|
|
On receiving an ACK with unacceptable ACK number, in a closing state,
TCP, needs to reply back with an ACK with correct seq and ack numbers and
remain in same state. This change is as per RFC793 page 37, but with a
difference that it does not apply to ESTABLISHED state, just as in Linux.
Also add more tests to check for OTW sequence number and unacceptable
ack numbers in these states.
Fixes #3785
PiperOrigin-RevId: 329616283
|
|
Updates #2972
PiperOrigin-RevId: 329584905
|
|
PiperOrigin-RevId: 329572337
|
|
PiperOrigin-RevId: 329564614
|
|
PiperOrigin-RevId: 329526153
|
|
As documented for gofer.dentry.hostFD.
PiperOrigin-RevId: 329372319
|
|
Implement walk directories in gvisor verity file system. For each step,
the child dentry is verified against a verified parent root hash.
PiperOrigin-RevId: 329358747
|
|
PiperOrigin-RevId: 329349158
|
|
PiperOrigin-RevId: 329036994
|
|
An earlier change considered the loopback bound to all addresses in an
assigned subnet. This should have only be done for IPv4 to maintain
compatability with Linux:
```
$ ip addr show dev lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group ...
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
$ ping 2001:db8::1
PING 2001:db8::1(2001:db8::1) 56 data bytes
^C
--- 2001:db8::1 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3062ms
$ ping 2001:db8::2
PING 2001:db8::2(2001:db8::2) 56 data bytes
^C
--- 2001:db8::2 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2030ms
$ sudo ip addr add 2001:db8::1/64 dev lo
$ ping 2001:db8::1
PING 2001:db8::1(2001:db8::1) 56 data bytes
64 bytes from 2001:db8::1: icmp_seq=1 ttl=64 time=0.055 ms
64 bytes from 2001:db8::1: icmp_seq=2 ttl=64 time=0.074 ms
64 bytes from 2001:db8::1: icmp_seq=3 ttl=64 time=0.073 ms
64 bytes from 2001:db8::1: icmp_seq=4 ttl=64 time=0.071 ms
^C
--- 2001:db8::1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3075ms
rtt min/avg/max/mdev = 0.055/0.068/0.074/0.007 ms
$ ping 2001:db8::2
PING 2001:db8::2(2001:db8::2) 56 data bytes
From 2001:db8::1 icmp_seq=1 Destination unreachable: No route
From 2001:db8::1 icmp_seq=2 Destination unreachable: No route
From 2001:db8::1 icmp_seq=3 Destination unreachable: No route
From 2001:db8::1 icmp_seq=4 Destination unreachable: No route
^C
--- 2001:db8::2 ping statistics ---
4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3070ms
```
Test: integration_test.TestLoopbackAcceptAllInSubnet
PiperOrigin-RevId: 329011566
|
|
This mainly involved enabling kernfs' client filesystems to provide a
StatFS implementation.
Fixes #3411, #3515.
PiperOrigin-RevId: 329009864
|
|
The existing implementation for NetworkProtocol.{Set}Option take
arguments of an empty interface type which all types (implicitly)
implement; any type may be passed to the functions.
This change introduces marker interfaces for network protocol options
that may be set or queried which network protocol option types implement
to ensure that invalid types are caught at compile time. Different
interfaces are used to allow the compiler to enforce read-only or
set-only socket options.
PiperOrigin-RevId: 328980359
|
|
Also, add corresponding EOF tests for splice/sendfile.
Discovered by syzkaller.
PiperOrigin-RevId: 328975990
|
|
Reported-by: syzbot+074ec22c42305725b79f@syzkaller.appspotmail.com
PiperOrigin-RevId: 328963899
|
|
This change was already done as of
https://github.com/google/gvisor/commit/1736b2208f but
https://github.com/google/gvisor/commit/a174aa7597 conflicted with that
change and it was missed in reviews.
This change fixes the conflict.
PiperOrigin-RevId: 328920372
|
|
PiperOrigin-RevId: 328863725
|
|
Fixes *.sh Java runtime tests, where splice()-ing from a pipe to /dev/zero
would not actually empty the pipe.
There was no guarantee that the data would actually be consumed on a splice
operation unless the output file's implementation of Write/PWrite actually
called VFSPipeFD.CopyIn. Now, whatever bytes are "written" are consumed
regardless of whether CopyIn is called or not.
Furthermore, the number of bytes in the IOSequence for reads is now capped at
the amount of data actually available. Before, splicing to /dev/zero would
always return the requested splice size without taking the actual available
data into account.
This change also refactors the case where an input file is spliced into an
output pipe so that it follows a similar pattern, which is arguably cleaner
anyway.
Updates #3576.
PiperOrigin-RevId: 328843954
|
|
PiperOrigin-RevId: 328843560
|
|
PiperOrigin-RevId: 328839759
|
|
The existing implementation for {G,S}etSockOpt take arguments of an
empty interface type which all types (implicitly) implement; any
type may be passed to the functions.
This change introduces marker interfaces for socket options that may be
set or queried which socket option types implement to ensure that invalid
types are caught at compile time. Different interfaces are used to allow
the compiler to enforce read-only or set-only socket options.
Fixes #3714.
RELNOTES: n/a
PiperOrigin-RevId: 328832161
|
|
In an upcoming CL, socket option types are made to implement a marker
interface with pointer receivers. Since this results in calling methods
of an interface with a pointer, we incur an allocation when attempting
to get an Endpoint's last error with the current implementation.
When calling the method of an interface, the compiler is unable to
determine what the interface implementation does with the pointer
(since calling a method on an interface uses virtual dispatch at runtime
so the compiler does not know what the interface method will do) so it
allocates on the heap to be safe incase an implementation continues to
hold the pointer after the functioon returns (the reference escapes the
scope of the object).
In the example below, the compiler does not know what b.foo does with
the reference to a it allocates a on the heap as the reference to a may
escape the scope of a.
```
var a int
var b someInterface
b.foo(&a)
```
This change removes the opportunity for that allocation.
RELNOTES: n/a
PiperOrigin-RevId: 328796559
|
|
More implementation+testing to follow.
#3549.
PiperOrigin-RevId: 328770160
|
|
Use reflection and tags to provide automatic conversion from
Config to flags. This makes adding new flags less error-prone,
skips flags using default values (easier to read), and makes
tests correctly use default flag values for test Configs.
Updates #3494
PiperOrigin-RevId: 328662070
|
|
Signed-off-by: Bin Lu <bin.lu@arm.com>
|
|
This immediately revealed an escape analysis violation (!), where
the sync.Map was being used in a context that escapes were not
allowed. This is a relatively minor fix and is included.
PiperOrigin-RevId: 328611237
|
|
PiperOrigin-RevId: 328583461
|
|
This is needed to support the overlay opaque attribute.
PiperOrigin-RevId: 328552985
|
|
This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and
vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted
soon anyway.
The following structs now use the new tool, with leak check enabled:
devpts:rootInode
fuse:inode
kernfs:Dentry
kernfs:dir
kernfs:readonlyDir
kernfs:StaticDirectory
proc:fdDirInode
proc:fdInfoDirInode
proc:subtasksInode
proc:taskInode
proc:tasksInode
vfs:FileDescription
vfs:MountNamespace
vfs:Filesystem
sys:dir
kernel:FSContext
kernel:ProcessGroup
kernel:Session
shm:Shm
mm:aioMappable
mm:SpecialMappable
transport:queue
And the following use the template, but because they currently are not leak
checked, a TODO is left instead of enabling leak check in this patch:
kernel:FDTable
tun:tunEndpoint
Updates #1486.
PiperOrigin-RevId: 328460377
|
|
This does not implement accepting or enforcing any size limit, which will be
more complex and has performance implications; it just returns a fixed non-zero
size.
Updates #1936
PiperOrigin-RevId: 328428588
|
|
In Linux, a kernel configuration is set that compiles the kernel with a
custom function that is called at the beginning of every basic block, which
updates the memory-mapped coverage information. The Go coverage tool does not
allow us to inject arbitrary instructions into basic blocks, but it does
provide data that we can convert to a kcov-like format and transfer them to
userspace through a memory mapping.
Note that this is not a strict implementation of kcov, which is especially
tricky to do because we do not have the same coverage tools available in Go
that that are available for the actual Linux kernel. In Linux, a kernel
configuration is set that compiles the kernel with a custom function that is
called at the beginning of every basic block to write program counters to the
kcov memory mapping. In Go, however, coverage tools only give us a count of
basic blocks as they are executed. Every time we return to userspace, we
collect the coverage information and write out PCs for each block that was
executed, providing userspace with the illusion that the kcov data is always
up to date. For convenience, we also generate a unique synthetic PC for each
block instead of using actual PCs. Finally, we do not provide thread-specific
coverage data (each kcov instance only contains PCs executed by the thread
owning it); instead, we will supply data for any file specified by --
instrumentation_filter.
Also, fix issue in nogo that was causing pkg/coverage:coverage_nogo
compilation to fail.
PiperOrigin-RevId: 328426526
|
|
Test:
- TestV4UnknownDestination
- TestV6UnknownDestination
PiperOrigin-RevId: 328424137
|
|
PiperOrigin-RevId: 328415633
|
|
PiperOrigin-RevId: 328410065
|
|
The actual values used for this field in Netstack are actually EtherType values
of the protocol in an Ethernet frame. Eg. header.IPv4ProtocolNumber is 0x0800
and not the number of the IPv4 Protocol Number itself which is 4. Similarly
header.IPv6ProtocolNumber is set to 0x86DD whereas the IPv6 protocol number is
41.
See:
- https://www.iana.org/assignments/ieee-802-numbers/ieee-802-numbers.xhtml (For EtherType)
- https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml (For ProtocolNumbers)
PiperOrigin-RevId: 328407293
|
|
iptables sockopts were kludged into an unnecessary check, this properly
relegates them to the {get,set}SockOptIP functions.
PiperOrigin-RevId: 328395135
|
|
Updates #3374
PiperOrigin-RevId: 328378700
|
|
PiperOrigin-RevId: 328374775
|
|
This change adds an option to replace the current implementation of ARP through
linkAddrCache, with an implementation of NUD through neighborCache. Switching
to using NUD for both ARP and NDP is beneficial for the reasons described by
RFC 4861 Section 3.1:
"[Using NUD] significantly improves the robustness of packet delivery in the
presence of failing routers, partially failing or partitioned links, or nodes
that change their link-layer addresses. For instance, mobile nodes can move
off-link without losing any connectivity due to stale ARP caches."
"Unlike ARP, Neighbor Unreachability Detection detects half-link failures and
avoids sending traffic to neighbors with which two-way connectivity is
absent."
Along with these changes exposes the API for querying and operating the
neighbor cache. Operations include:
- Create a static entry
- List all entries
- Delete all entries
- Remove an entry by address
This also exposes the API to change the NUD protocol constants on a per-NIC
basis to allow Neighbor Discovery to operate over links with widely varying
performance characteristics. See [RFC 4861 Section 10][1] for the list of
constants.
Finally, an API for subscribing to NUD state changes is exposed through
NUDDispatcher. See [RFC 4861 Appendix C][3] for the list of edges.
Tests:
pkg/tcpip/network/arp:arp_test
+ TestDirectRequest
pkg/tcpip/network/ipv6:ipv6_test
+ TestLinkResolution
+ TestNDPValidation
+ TestNeighorAdvertisementWithTargetLinkLayerOption
+ TestNeighorSolicitationResponse
+ TestNeighorSolicitationWithSourceLinkLayerOption
+ TestRouterAdvertValidation
pkg/tcpip/stack:stack_test
+ TestCacheWaker
+ TestForwardingWithFakeResolver
+ TestForwardingWithFakeResolverManyPackets
+ TestForwardingWithFakeResolverManyResolutions
+ TestForwardingWithFakeResolverPartialTimeout
+ TestForwardingWithFakeResolverTwoPackets
+ TestIPv6SourceAddressSelectionScopeAndSameAddress
[1]: https://tools.ietf.org/html/rfc4861#section-10
[2]: https://tools.ietf.org/html/rfc4861#appendix-C
Fixes #1889
Fixes #1894
Fixes #1895
Fixes #1947
Fixes #1948
Fixes #1949
Fixes #1950
PiperOrigin-RevId: 328365034
|
|
When SO_LINGER option is enabled, the close will not return until all the
queued messages are sent and acknowledged for the socket or linger timeout is
reached. If the option is not set, close will return immediately. This option
is mainly supported for connection oriented protocols such as TCP.
PiperOrigin-RevId: 328350576
|
|
We still deviate a bit from linux in how long we will actually wait in
FIN-WAIT-2. Linux seems to cap it with TIME_WAIT_LEN and it's not completely
obvious as to why it's done that way. For now I think we can ignore that and
fix it if it really is an issue.
PiperOrigin-RevId: 328324922
|