Age | Commit message (Collapse) | Author |
|
For iptables users, Check() is a hot path called for every packet one or more
times. Let's avoid a bunch of map lookups.
PiperOrigin-RevId: 322678699
|
|
Updates #173
PiperOrigin-RevId: 322665518
|
|
Updates #173
PiperOrigin-RevId: 321690756
|
|
gVisor incorrectly returns the wrong ARP type for SIOGIFHWADDR. This breaks
tcpdump as it tries to interpret the packets incorrectly.
Similarly, SIOCETHTOOL is used by tcpdump to query interface properties which
fails with an EINVAL since we don't implement it. For now change it to return
EOPNOTSUPP to indicate that we don't support the query rather than return
EINVAL.
NOTE: ARPHRD types for link endpoints are distinct from NIC capabilities
and NIC flags. In Linux all 3 exist eg. ARPHRD types are stored in dev->type
field while NIC capabilities are more like the device features which can be
queried using SIOCETHTOOL but not modified and NIC Flags are fields that can
be modified from user space. eg. NIC status (UP/DOWN/MULTICAST/BROADCAST) etc.
Updates #2746
PiperOrigin-RevId: 321436525
|
|
When we failed to create the new socket after adding the fd to
fdnotifier, we should remove the fd from fdnotifier, because we
are going to close the fd directly.
Fixes: #3241
Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
|
|
Updates #2746
PiperOrigin-RevId: 320757963
|
|
RFC-1122 (and others) specify that UDP should not receive
datagrams that have a source address that is a multicast address.
Packets should never be received FROM a multicast address.
See also, RFC 768: 'User Datagram Protocol'
J. Postel, ISI, 28 August 1980
A UDP datagram received with an invalid IP source address
(e.g., a broadcast or multicast address) must be discarded
by UDP or by the IP layer (see rfc 1122 Section 3.2.1.3).
This CL does not address TCP or broadcast which is more complicated.
Also adds a test for both ipv6 and ipv4 UDP.
Fixes #3154
PiperOrigin-RevId: 320547674
|
|
Updates #2746
Fixes #3158
PiperOrigin-RevId: 320497190
|
|
PiperOrigin-RevId: 319283715
|
|
SO_NO_CHECK is used to skip the UDP checksum generation on a TX socket
(UDP checksum is optional on IPv4).
Test:
- TestNoChecksum
- SoNoCheckOffByDefault (UdpSocketTest)
- SoNoCheck (UdpSocketTest)
Fixes #3055
PiperOrigin-RevId: 318575215
|
|
- Split connTrackForPacket into 2 functions instead of switching on flag
- Replace hash with struct keys.
- Remove prefixes where possible
- Remove unused connStatus, timeout
- Flatten ConnTrack struct a bit - some intermediate structs had no meaning
outside of the context of their parent.
- Protect conn.tcb with a mutex
- Remove redundant error checking (e.g. when is pkt.NetworkHeader valid)
- Clarify that HandlePacket and CreateConnFor are the expected entrypoints for
ConnTrack
PiperOrigin-RevId: 318407168
|
|
Linux controls socket send/receive buffers using a few sysctl variables
- net.core.rmem_default
- net.core.rmem_max
- net.core.wmem_max
- net.core.wmem_default
- net.ipv4.tcp_rmem
- net.ipv4.tcp_wmem
The first 4 control the default socket buffer sizes for all sockets
raw/packet/tcp/udp and also the maximum permitted socket buffer that can be
specified in setsockopt(SOL_SOCKET, SO_(RCV|SND)BUF,...).
The last two control the TCP auto-tuning limits and override the default
specified in rmem_default/wmem_default as well as the max limits.
Netstack today only implements tcp_rmem/tcp_wmem and incorrectly uses it
to limit the maximum size in setsockopt() as well as uses it for raw/udp
sockets.
This changelist introduces the other 4 and updates the udp/raw sockets to use
the newly introduced variables. The values for min/max match the current
tcp_rmem/wmem values and the default value buffers for UDP/RAW sockets is
updated to match the linux value of 212KiB up from the really low current value
of 32 KiB.
Updates #3043
Fixes #3043
PiperOrigin-RevId: 318089805
|
|
Test:
- TestIncrementChecksumErrors
Fixes #2943
PiperOrigin-RevId: 317348158
|
|
It accesses e.receiver which is protected by the endpoint lock.
WARNING: DATA RACE
Write at 0x00c0006aa2b8 by goroutine 189:
pkg/sentry/socket/unix/transport.(*connectionedEndpoint).Connect.func1()
pkg/sentry/socket/unix/transport/connectioned.go:359 +0x50
pkg/sentry/socket/unix/transport.(*connectionedEndpoint).BidirectionalConnect()
pkg/sentry/socket/unix/transport/connectioned.go:327 +0xa3c
pkg/sentry/socket/unix/transport.(*connectionedEndpoint).Connect()
pkg/sentry/socket/unix/transport/connectioned.go:363 +0xca
pkg/sentry/socket/unix.(*socketOpsCommon).Connect()
pkg/sentry/socket/unix/unix.go:420 +0x13a
pkg/sentry/socket/unix.(*SocketOperations).Connect()
<autogenerated>:1 +0x78
pkg/sentry/syscalls/linux.Connect()
pkg/sentry/syscalls/linux/sys_socket.go:286 +0x251
Previous read at 0x00c0006aa2b8 by goroutine 270:
pkg/sentry/socket/unix/transport.(*baseEndpoint).Connected()
pkg/sentry/socket/unix/transport/unix.go:789 +0x42
pkg/sentry/socket/unix/transport.(*connectionedEndpoint).State()
pkg/sentry/socket/unix/transport/connectioned.go:479 +0x2f
pkg/sentry/socket/unix.(*socketOpsCommon).State()
pkg/sentry/socket/unix/unix.go:714 +0xc3e
pkg/sentry/socket/unix.(*socketOpsCommon).SendMsg()
pkg/sentry/socket/unix/unix.go:466 +0xc44
pkg/sentry/socket/unix.(*SocketOperations).SendMsg()
<autogenerated>:1 +0x173
pkg/sentry/syscalls/linux.sendTo()
pkg/sentry/syscalls/linux/sys_socket.go:1121 +0x4c5
pkg/sentry/syscalls/linux.SendTo()
pkg/sentry/syscalls/linux/sys_socket.go:1134 +0x87
Reported-by: syzbot+c2be37eedc672ed59a86@syzkaller.appspotmail.com
PiperOrigin-RevId: 317236996
|
|
Metadata was useful for debugging and safety, but enough tests exist that we
should see failures when (de)serialization is broken. It made stack
initialization more cumbersome and it's also getting in the way of ip6tables.
PiperOrigin-RevId: 317210653
|
|
Updates #2972
PiperOrigin-RevId: 317113059
|
|
Updates #173,#6
Fixes #2888
PiperOrigin-RevId: 317087652
|
|
- Change FileDescriptionImpl Lock/UnlockPOSIX signature to
take {start,length,whence}, so the correct offset can be
calculated in the implementations.
- Create PosixLocker interface to make it possible to share
the same locking code from different implementations.
Closes #1480
PiperOrigin-RevId: 316910286
|
|
TCP_KEEPCNT is used to set the maximum keepalive probes to be
sent before dropping the connection.
WANT_LGTM=jchacon
PiperOrigin-RevId: 315758094
|
|
In case of SOCK_SEQPACKET, it has to be ignored.
In case of SOCK_STREAM, EISCONN or EOPNOTSUPP has to be returned.
PiperOrigin-RevId: 315755972
|
|
LockFD is the generic implementation that can be embedded in
FileDescriptionImpl implementations. Unique lock ID is
maintained in vfs.FileDescription and is created on demand.
Updates #1480
PiperOrigin-RevId: 315604825
|
|
Netstack has traditionally parsed headers on-demand as a packet moves up the
stack. This is conceptually simple and convenient, but incompatible with
iptables, where headers can be inspected and mangled before even a routing
decision is made.
This changes header parsing to happen early in the incoming packet path, as soon
as the NIC gets the packet from a link endpoint. Even if an invalid packet is
found (e.g. a TCP header of insufficient length), the packet is passed up the
stack for proper stats bookkeeping.
PiperOrigin-RevId: 315179302
|
|
For TCP sockets gVisor incorrectly returns EAGAIN when no ephemeral ports are
available to bind during a connect. Linux returns EADDRNOTAVAIL. This change
fixes gVisor to return the correct code and adds a test for the same.
This change also fixes a minor bug for ping sockets where connect() would fail
with EINVAL unless the socket was bound first.
Also added tests for testing UDP Port exhaustion and Ping socket port
exhaustion.
PiperOrigin-RevId: 314988525
|
|
IPTables.connections contains a sync.RWMutex. Copying it will trigger copylocks
analysis. Tested by manually enabling nogo tests.
sync.RWMutex is added to IPTables for the additional race condition discovered.
PiperOrigin-RevId: 314817019
|
|
Historically we've been passing PacketBuffer by shallow copying through out
the stack. Right now, this is only correct as the caller would not use
PacketBuffer after passing into the next layer in netstack.
With new buffer management effort in gVisor/netstack, PacketBuffer will
own a Buffer (to be added). Internally, both PacketBuffer and Buffer may
have pointers and shallow copying shouldn't be used.
Updates #2404.
PiperOrigin-RevId: 314610879
|
|
PiperOrigin-RevId: 314450191
|
|
|
|
On native Linux, calling recv/read right after send/write sometimes returns
EWOULDBLOCK, if the data has not made it to the receiving socket (even though
the endpoints are on the same host). Poll before reading to avoid this.
Making this change also uncovered a hostinet bug (gvisor.dev/issue/2726),
which is noted in this CL.
PiperOrigin-RevId: 312320587
|
|
* Aggregate architecture Overview in "What is gVisor?" as it makes more sense
in one place.
* Drop "user-space kernel" and use "application kernel". The term "user-space
kernel" is confusing when some platform implementation do not run in
user-space (instead running in guest ring zero).
* Clear up the relationship between the Platform page in the user guide and the
Platform page in the architecture guide, and ensure they are cross-linked.
* Restore the call-to-action quick start link in the main page, and drop the
GitHub link (which also appears in the top-right).
* Improve image formatting by centering all doc and blog images, and move the
image captions to the alt text.
PiperOrigin-RevId: 311845158
|
|
This change adds support for TCP_SYNCNT and TCP_WINDOW_CLAMP options
in GetSockOpt/SetSockOpt. This change does not really change any
behaviour in Netstack and only stores/returns the stored value.
Actual honoring of these options will be added as required.
Fixes #2626, #2625
PiperOrigin-RevId: 311453777
|
|
PiperOrigin-RevId: 311203776
|
|
PiperOrigin-RevId: 311181084
|
|
kernel.Task.Block() requires that the caller is running on the task goroutine.
netstack.SocketOperations.Write() uses kernel.TaskFromContext() to call
kernel.Task.Block() even if it's not running on the task goroutine. Stop doing
that.
PiperOrigin-RevId: 311178335
|
|
- Added support for matching gid owner and invert flag for uid
and gid.
$ iptables -A OUTPUT -p tcp -m owner --gid-owner root -j ACCEPT
$ iptables -A OUTPUT -p tcp -m owner ! --uid-owner root -j ACCEPT
$ iptables -A OUTPUT -p tcp -m owner ! --gid-owner root -j DROP
- Added tests for uid, gid and invert flags.
|
|
We weren't properly checking whether the inserted default rule was
unconditional.
|
|
Enables commands with -o (--out-interface) for iptables rules.
$ iptables -A OUTPUT -o eth0 -j ACCEPT
PiperOrigin-RevId: 310642286
|
|
Updates #1197, #1198, #1672
PiperOrigin-RevId: 310432006
|
|
Three updates:
- Mark all vfs2 socket syscalls as supported.
- Use the same dev number and ino number generator for all types of sockets,
unlike in VFS1.
- Do not use host fd for hostinet metadata.
Fixes #1476, #1478, #1484, 1485, #2017.
PiperOrigin-RevId: 309994579
|
|
Connection tracking is used to track packets in prerouting and
output hooks of iptables. The NAT rules modify the tuples in
connections. The connection tracking code modifies the packets by
looking at the modified tuples.
|
|
PiperOrigin-RevId: 309491861
|
|
All three follow the same pattern:
1. Refactor VFS1 sockets into socketOpsCommon, so that most of the methods can
be shared with VFS2.
2. Create a FileDescriptionImpl with the corresponding socket operations,
rewriting the few that cannot be shared with VFS1.
3. Set up a VFS2 socket provider that creates a socket by setting up a dentry
in the global Kernel.socketMount and connecting it with a new
FileDescription.
This mostly completes the work for porting sockets to VFS2, and many syscall
tests can be enabled as a result.
There are several networking-related syscall tests that are still not passing:
1. net gofer tests
2. socketpair gofer tests
2. sendfile tests (splice is not implemented in VFS2 yet)
Updates #1478, #1484, #1485
PiperOrigin-RevId: 309457331
|
|
The netfilter package uses logs to make debugging the (de)serialization of
structs easier. This generates a lot of (usually irrelevant) logs. Logging is
now hidden behind a debug flag.
PiperOrigin-RevId: 309087115
|
|
Enforce write permission checks in BoundEndpointAt, which corresponds to the
permission checks in Linux (net/unix/af_unix.c:unix_find_other).
Also, create bound socket files with the correct permissions in VFS2.
Fixes #2324.
PiperOrigin-RevId: 308949084
|
|
PiperOrigin-RevId: 308932254
|
|
Named pipes and sockets can be represented in two ways in gofer fs:
1. As a file on the remote filesystem. In this case, all file operations are
passed through 9p.
2. As a synthetic file that is internal to the sandbox. In this case, the
dentry stores an endpoint or VFSPipe for sockets and pipes respectively,
which replaces interactions with the remote fs through the gofer.
In gofer.filesystem.MknodAt, we attempt to call mknod(2) through 9p,
and if it fails, fall back to the synthetic version.
Updates #1200.
PiperOrigin-RevId: 308828161
|
|
The FileDescription implementation for hostfs sockets uses the standard Unix
socket implementation (unix.SocketVFS2), but is also tied to a hostfs dentry.
Updates #1672, #1476
PiperOrigin-RevId: 308716426
|
|
PiperOrigin-RevId: 308674219
|
|
Fixes #1477.
PiperOrigin-RevId: 308317511
|
|
These methods let users eaily break the VectorisedView abstraction, and
allowed netstack to slip into pseudo-enforcement of the "all headers are
in the first View" invariant. Removing them and replacing with PullUp(n)
breaks this reliance and will make it easier to add iptables support and
rework network buffer management.
The new View.PullUp(n) method is low cost in the common case, when when
all the headers fit in the first View.
PiperOrigin-RevId: 308163542
|
|
Sentry metrics with nanoseconds units are labeled as such, and non-cumulative
sentry metrics are supported.
PiperOrigin-RevId: 307621080
|