gvisor - Container Runtime Sandbox

Age	Commit message (Collapse)	Author
2020-02-24	Add support for tearing down protocol dispatchers and TIME_WAIT endpoints.	Ian Gudger
	Protocol dispatchers were previously leaked. Bypassing TIME_WAIT is required to test this change. Also fix a race when a socket in SYN-RCVD is closed. This is also required to test this change. PiperOrigin-RevId: 296922548
2020-02-05	Add notes to relevant tests.	Adin Scannell
	These were out-of-band notes that can help provide additional context and simplify automated imports. PiperOrigin-RevId: 293525915
2020-02-05	recv() on a closed TCP socket returns ENOTCONN	Eyal Soha
	From RFC 793 s3.9 p58 Event Processing: If RECEIVE Call arrives in CLOSED state and the user has access to such a connection, the return should be "error: connection does not exist" Fixes #1598 PiperOrigin-RevId: 293494287
2020-01-30	Fix for panic in endpoint.Close().	Bhasker Hariharan
	When sending a RST on shutdown we need to double check the state after acquiring the work mutex as the endpoint could have transitioned out of a connected state from the time we checked it and we acquired the workMutex. I added two tests but sadly neither reproduce the panic. I am going to leave the tests in as they are good to have anyway. PiperOrigin-RevId: 292393800
2020-01-29	Add support for TCP_DEFER_ACCEPT.	Bhasker Hariharan
	PiperOrigin-RevId: 292233574
2020-01-21	Add a new TCP stat for current open connections.	Mithun Iyer
	Such a stat accounts for all connections that are currently established and not yet transitioned to close state. Also fix bug in double increment of CurrentEstablished stat. Fixes #1579 PiperOrigin-RevId: 290827365
2020-01-15	Bugfix to terminate the protocol loop on StateError.	Bhasker Hariharan
	The change to introduce worker goroutines can cause the endpoint to transition to StateError and we should terminate the loop rather than let the endpoint transition to a CLOSED state as we do in case the endpoint enters TIME-WAIT/CLOSED. Moving to a closed state would cause the actual error to not be propagated to any read() calls etc. PiperOrigin-RevId: 289923568
2020-01-14	Changes TCP packet dispatch to use a pool of goroutines.	Bhasker Hariharan
	All inbound segments for connections in ESTABLISHED state are delivered to the endpoint's queue but for every segment delivered we also queue the endpoint for processing to a selected processor. This ensures that when there are a large number of connections in ESTABLISHED state the inbound packets are all handled by a small number of goroutines and significantly reduces the amount of work the goscheduler has to perform. We let connections in other states follow the current path where the endpoint's goroutine directly handles the segments. Updates #231 PiperOrigin-RevId: 289728325
2020-01-13	Allow dual stack sockets to operate on AF_INET	Tamir Duberstein
	Fixes #1490 Fixes #1495 PiperOrigin-RevId: 289523250
2020-01-09	New sync package.	Ian Gudger
	* Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387
2020-01-09	Merge pull request #1523 from majek:fix-1522-silly-window-rx	gVisor bot
	PiperOrigin-RevId: 289019953
2020-01-09	Change BindToDeviceOption to store NICID	Eyal Soha
	This makes it possible to call the sockopt from go even when the NIC has no name. PiperOrigin-RevId: 288955236
2020-01-08	Introduce tcpip.SockOptBool	Tamir Duberstein
	...and port V6OnlyOption to it. PiperOrigin-RevId: 288789451
2020-01-08	Rename tcpip.SockOpt{,Int}	Tamir Duberstein
	PiperOrigin-RevId: 288772878
2020-01-08	Fix #1522 - implement silly window sydrome protection on rx side	Marek Majkowski
	Before, each of small read()'s that raises window either from zero or above threshold of aMSS, would generate an ACK. In a classic silly-window-syndrome scenario, we can imagine a pessimistic case when small read()'s generate a stream of ACKs. This PR fixes that, essentially treating window size < aMSS as zero. We send ACK exactly in a moment when window increases to >= aMSS or half of receive buffer size (whichever smaller).
2020-01-07	#1398 - send ACK when available buffer space gets larger than 1 MSS	Marek Majkowski
	When receiving data, netstack avoids sending spurious acks. When user does recv() should netstack send ack telling the sender that the window was increased? It depends. Before this patch, netstack _will_ send the ack in the case when window was zero or window >> scale was zero. Basically - when recv space increased from zero. This is not working right with silly-window-avoidance on the sender side. Some network stacks refuse to transmit segments, that will fill the window but are below MSS. Before this patch, this confuses netstack. On one hand if the window was like 3 bytes, netstack will _not_ send ack if the window increases. On the other hand sending party will refuse to transmit 3-byte packet. This patch changes that, making netstack will send an ACK when the available buffer size increases to or above 1*MSS. This will inform other party buffer is large enough, and hopefully uncork it. Signed-off-by: Marek Majkowski <marek@cloudflare.com>
2019-12-18	net/tcp: allow to call listen without bind	Andrei Vagin
	When listen(2) is called on an unbound socket, the socket is automatically bound to a random free port with the local address set to INADDR_ANY. PiperOrigin-RevId: 286305906
2019-12-11	Add support for TCP_USER_TIMEOUT option.	Bhasker Hariharan
	The implementation follows the linux behavior where specifying a TCP_USER_TIMEOUT will cause the resend timer to honor the user specified timeout rather than the default rto based timeout. Further it alters when connections are timedout due to keepalive failures. It does not alter the behavior of when keepalives are sent. This is as per the linux behavior. PiperOrigin-RevId: 285099795
2019-12-09	Add UDP SO_REUSEADDR support to the port manager.	Ian Gudger
	Next steps include adding support to the transport demuxer and the UDP endpoint. PiperOrigin-RevId: 284652151
2019-11-22	Store SO_BINDTODEVICE state at bind.	Ian Gudger
	This allows us to ensure that the correct port reservation is released. Fixes #1217 PiperOrigin-RevId: 282048155
2019-11-07	Add support for TIME_WAIT timeout.	Bhasker Hariharan
	This change adds explicit support for honoring the 2MSL timeout for sockets in TIME_WAIT state. It also adds support for the TCP_LINGER2 option that allows modification of the FIN_WAIT2 state timeout duration for a given socket. It also adds an option to modify the Stack wide TIME_WAIT timeout but this is only for testing. On Linux this is fixed at 60s. Further, we also now correctly process RST's in CLOSE_WAIT and close the socket similar to linux without moving it to error state. We also now handle SYN in ESTABLISHED state as per RFC5961#section-4.1. Earlier we would just drop these SYNs. Which can result in some tests that pass on linux to fail on gVisor. Netstack now honors TIME_WAIT correctly as well as handles the following cases correctly. - TCP RSTs in TIME_WAIT are ignored. - A duplicate TCP FIN during TIME_WAIT extends the TIME_WAIT and a dup ACK is sent in response to the FIN as the dup FIN indicates potential loss of the original final ACK. - An out of order segment during TIME_WAIT generates a dup ACK. - A new SYN w/ a sequence number > the highest sequence number in the previous connection closes the TIME_WAIT early and opens a new connection. Further to make the SYN case work correctly the ISN (Initial Sequence Number) generation for Netstack has been updated to be as per RFC. Its not a pure random number anymore and follows the recommendation in https://tools.ietf.org/html/rfc6528#page-3. The current hash used is not a cryptographically secure hash function. A separate change will update the hash function used to Siphash similar to what is used in Linux. PiperOrigin-RevId: 279106406
2019-11-06	Rename nicid to nicID to follow go-readability initialisms	Ghanan Gowripalan
	https://github.com/golang/go/wiki/CodeReviewComments#initialisms This change does not introduce any new functionality. It just renames variables from `nicid` to `nicID`. PiperOrigin-RevId: 278992966
2019-11-06	Internal change.	gVisor bot
	PiperOrigin-RevId: 278979065
2019-11-06	Use PacketBuffers, rather than VectorisedViews, in netstack.	Kevin Krakauer
	PacketBuffers are analogous to Linux's sk_buff. They hold all information about a packet, headers, and payload. This is important for: * iptables to access various headers of packets * Preventing the clutter of passing different net and link headers along with VectorisedViews to packet handling functions. This change only affects the incoming packet path, and a future change will change the outgoing path. Benchmark Regular PacketBufferPtr PacketBufferConcrete -------------------------------------------------------------------------------- BM_Recvmsg 400.715MB/s 373.676MB/s 396.276MB/s BM_Sendmsg 361.832MB/s 333.003MB/s 335.571MB/s BM_Recvfrom 453.336MB/s 393.321MB/s 381.650MB/s BM_Sendto 378.052MB/s 372.134MB/s 341.342MB/s BM_SendmsgTCP/0/1k 353.711MB/s 316.216MB/s 322.747MB/s BM_SendmsgTCP/0/2k 600.681MB/s 588.776MB/s 565.050MB/s BM_SendmsgTCP/0/4k 995.301MB/s 888.808MB/s 941.888MB/s BM_SendmsgTCP/0/8k 1.517GB/s 1.274GB/s 1.345GB/s BM_SendmsgTCP/0/16k 1.872GB/s 1.586GB/s 1.698GB/s BM_SendmsgTCP/0/32k 1.017GB/s 1.020GB/s 1.133GB/s BM_SendmsgTCP/0/64k 475.626MB/s 584.587MB/s 627.027MB/s BM_SendmsgTCP/0/128k 416.371MB/s 503.434MB/s 409.850MB/s BM_SendmsgTCP/0/256k 323.449MB/s 449.599MB/s 388.852MB/s BM_SendmsgTCP/0/512k 243.992MB/s 267.676MB/s 314.474MB/s BM_SendmsgTCP/0/1M 95.138MB/s 95.874MB/s 95.417MB/s BM_SendmsgTCP/0/2M 96.261MB/s 94.977MB/s 96.005MB/s BM_SendmsgTCP/0/4M 96.512MB/s 95.978MB/s 95.370MB/s BM_SendmsgTCP/0/8M 95.603MB/s 95.541MB/s 94.935MB/s BM_SendmsgTCP/0/16M 94.598MB/s 94.696MB/s 94.521MB/s BM_SendmsgTCP/0/32M 94.006MB/s 94.671MB/s 94.768MB/s BM_SendmsgTCP/0/64M 94.133MB/s 94.333MB/s 94.746MB/s BM_SendmsgTCP/0/128M 93.615MB/s 93.497MB/s 93.573MB/s BM_SendmsgTCP/0/256M 93.241MB/s 95.100MB/s 93.272MB/s BM_SendmsgTCP/1/1k 303.644MB/s 316.074MB/s 308.430MB/s BM_SendmsgTCP/1/2k 537.093MB/s 584.962MB/s 529.020MB/s BM_SendmsgTCP/1/4k 882.362MB/s 939.087MB/s 892.285MB/s BM_SendmsgTCP/1/8k 1.272GB/s 1.394GB/s 1.296GB/s BM_SendmsgTCP/1/16k 1.802GB/s 2.019GB/s 1.830GB/s BM_SendmsgTCP/1/32k 2.084GB/s 2.173GB/s 2.156GB/s BM_SendmsgTCP/1/64k 2.515GB/s 2.463GB/s 2.473GB/s BM_SendmsgTCP/1/128k 2.811GB/s 3.004GB/s 2.946GB/s BM_SendmsgTCP/1/256k 3.008GB/s 3.159GB/s 3.171GB/s BM_SendmsgTCP/1/512k 2.980GB/s 3.150GB/s 3.126GB/s BM_SendmsgTCP/1/1M 2.165GB/s 2.233GB/s 2.163GB/s BM_SendmsgTCP/1/2M 2.370GB/s 2.219GB/s 2.453GB/s BM_SendmsgTCP/1/4M 2.005GB/s 2.091GB/s 2.214GB/s BM_SendmsgTCP/1/8M 2.111GB/s 2.013GB/s 2.109GB/s BM_SendmsgTCP/1/16M 1.902GB/s 1.868GB/s 1.897GB/s BM_SendmsgTCP/1/32M 1.655GB/s 1.665GB/s 1.635GB/s BM_SendmsgTCP/1/64M 1.575GB/s 1.547GB/s 1.575GB/s BM_SendmsgTCP/1/128M 1.524GB/s 1.584GB/s 1.580GB/s BM_SendmsgTCP/1/256M 1.579GB/s 1.607GB/s 1.593GB/s PiperOrigin-RevId: 278940079
2019-10-30	Store endpoints inside multiPortEndpoint in a sorted order	Andrei Vagin
	It is required to guarantee the same order of endpoints after save/restore. PiperOrigin-RevId: 277598665
2019-10-29	Add endpoint tracking to the stack.	Ian Gudger
	In the future this will replace DanglingEndpoints. DanglingEndpoints must be kept for now due to issues with save/restore. This is arguably a cleaner design and allows the stack to know which transport endpoints might still be using its link endpoints. Updates #837 PiperOrigin-RevId: 277386633
2019-10-29	Allow waiting for Endpoint worker goroutines to finish.	Ian Gudger
	Updates #837 PiperOrigin-RevId: 277325162
2019-10-28	Use the user supplied TCP MSS when creating a new active socket	Ghanan Gowripalan
	This change supports using a user supplied TCP MSS for new active TCP connections. Note, the user supplied MSS must be less than or equal to the maximum possible MSS for a TCP connection's route. If it is greater than the maximum possible MSS, the maximum possible MSS will be used as the connection's MSS instead. This change does not use this user supplied MSS for connections accepted from listening sockets - that will come in a later change. Test: Test that outgoing TCP SYN segments contain a TCP MSS option with the user supplied MSS if it is not greater than the maximum possible MSS for the route. PiperOrigin-RevId: 277185125
2019-10-25	Convert DelayOption to the newer/faster SockOpt int type.	Ian Gudger
	DelayOption is set on all new endpoints in gVisor. PiperOrigin-RevId: 276746791
2019-10-23	Merge pull request #641 from tanjianfeng:master	gVisor bot
	PiperOrigin-RevId: 276380008
2019-10-22	netstack/tcp: software segmentation offload	Andrei Vagin
	Right now, we send each tcp packet separately, we call one system call per-packet. This patch allows to generate multiple tcp packets and send them by sendmmsg. The arguable part of this CL is a way how to handle multiple headers. This CL adds the next field to the Prepandable buffer. Nginx test results: Server Software: nginx/1.15.9 Server Hostname: 10.138.0.2 Server Port: 8080 Document Path: /10m.txt Document Length: 10485760 bytes w/o gso: Concurrency Level: 5 Time taken for tests: 5.491 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 18.21 [#/sec] (mean) Time per request: 274.525 [ms] (mean) Time per request: 54.905 [ms] (mean, across all concurrent requests) Transfer rate: 186508.03 [Kbytes/sec] received sw-gso: Concurrency Level: 5 Time taken for tests: 3.852 seconds Complete requests: 100 Failed requests: 0 Total transferred: 1048600200 bytes HTML transferred: 1048576000 bytes Requests per second: 25.96 [#/sec] (mean) Time per request: 192.576 [ms] (mean) Time per request: 38.515 [ms] (mean, across all concurrent requests) Transfer rate: 265874.92 [Kbytes/sec] received w/o gso: $ ./tcp_benchmark --client --duration 15 --ideal [SUM] 0.0-15.1 sec 2.20 GBytes 1.25 Gbits/sec software gso: $ tcp_benchmark --client --duration 15 --ideal --gso $((1<<16)) --swgso [SUM] 0.0-15.1 sec 3.99 GBytes 2.26 Gbits/sec PiperOrigin-RevId: 276112677
2019-10-15	netstack: add counters for tcp CurrEstab and EstabResets	Jianfeng Tan
	Signed-off-by: Jianfeng Tan <henry.tjf@antfin.com>
2019-10-14	Internal change.	gVisor bot
	PiperOrigin-RevId: 274700093
2019-10-09	Internal change.	gVisor bot
	PiperOrigin-RevId: 273861936
2019-10-07	Implement IP_TTL.	Ian Gudger
	Also change the default TTL to 64 to match Linux. PiperOrigin-RevId: 273430341
2019-09-30	Fix bugs in PickEphemeralPort for TCP.	Bhasker Hariharan
	Netstack always picks a random start point everytime PickEphemeralPort is called. While this is required for UDP so that DNS requests go out through a randomized set of ports it is not required for TCP. Infact Linux explicitly hashes the (srcip, dstip, dstport) and a one time secret initialized at start of the application to get a random offset. But to ensure it doesn't start from the same point on every scan it uses a static hint that is incremented by 2 in every call to pick ephemeral ports. The reason for 2 is Linux seems to split the port ranges where active connects seem to use even ones while odd ones are used by listening sockets. This CL implements a similar strategy where we use a hash + hint to generate the offset to start the search for a free Ephemeral port. This ensures that we cycle through the available port space in order for repeated connects to the same destination and significantly reduces the chance of picking a recently released port. PiperOrigin-RevId: 272058370
2019-09-27	Implement SO_BINDTODEVICE sockopt	gVisor bot
	PiperOrigin-RevId: 271644926
2019-09-23	netstack: convert more socket options to {Set,Get}SockOptInt	Andrei Vagin
	PiperOrigin-RevId: 270763208
2019-09-12	Implement splice methods for pipes and sockets.	Adin Scannell
	This also allows the tee(2) implementation to be enabled, since dup can now be properly supported via WriteTo. Note that this change necessitated some minor restructoring with the fs.FileOperations splice methods. If the *fs.File is passed through directly, then only public API methods are accessible, which will deadlock immediately since the locking is already done by fs.Splice. Instead, we pass through an abstract io.Reader or io.Writer, which elide locks and use the underlying fs.FileOperations directly. PiperOrigin-RevId: 268805207
2019-08-21	Use tcpip.Subnet in tcpip.Route	Tamir Duberstein
	This is the first step in replacing some of the redundant types with the standard library equivalents. PiperOrigin-RevId: 264706552
2019-08-16	netstack: disconnect an unix socket only if the address family is AF_UNSPEC	Andrei Vagin
	Linux allows to call connect for ANY and the zero port. PiperOrigin-RevId: 263892534
2019-08-15	netstack: move resumption logic into *_state.go	Tamir Duberstein
	13a98df rearranged some of this code in a way that broke compilation of the netstack-only export at github.com/google/netstack because _state.go files are not included in that export. This commit moves resumption logic back into _state.go, fixing the compilation breakage. PiperOrigin-RevId: 263601629
2019-08-14	Replace uinptr with int64 when returning lengths	Tamir Duberstein
	This is in accordance with newer parts of the standard library. PiperOrigin-RevId: 263449916
2019-08-14	Improve SendMsg performance.	Bhasker Hariharan
	SendMsg before this change would copy all the data over into a new slice even if the underlying socket could only accept a small amount of data. This is really inefficient with non-blocking sockets and under high throughput where large writes could get ErrWouldBlock or if there was say a timeout associated with the sendmsg() syscall. With this change we delay copying bytes in till they are needed and only copy what can be potentially sent/held in the socket buffer. Reducing the need to repeatedly copy data over. Also a minor fix to change state FIN-WAIT-1 when shutdown(..., SHUT_WR) is called instead of when we transmit the actual FIN. Otherwise the socket could remain in CONNECTED state even though the user has called shutdown() on the socket. Updates #627 PiperOrigin-RevId: 263430505
2019-08-08	netstack: Don't start endpoint goroutines too soon on restore.	Rahat Mahmood
	Endpoint protocol goroutines were previously started as part of loading the endpoint. This is potentially too soon, as resources used by these goroutine may not have been loaded. Protocol goroutines may perform meaningful work as soon as they're started (ex: incoming connect) which can cause them to indirectly access resources that haven't been loaded yet. This CL defers resuming all protocol goroutines until the end of restore. PiperOrigin-RevId: 262409429
2019-08-06	Fix for a panic due to writing to a closed accept channel.	Bhasker Hariharan
	This can happen because endpoint.Close() closes the accept channel first and then drains/resets any accepted but not delivered connections. But there can be connections that are connected but not delivered to the channel as the channel was full. But closing the channel can cause these writes to fail with a write to a closed channel. The correct solution is to abort any connections in SYN-RCVD state and drain/abort all completed connections before closing the accept channel. PiperOrigin-RevId: 261951132
2019-08-02	Plumbing for iptables sockopts.	Kevin Krakauer
	PiperOrigin-RevId: 261413396
2019-08-02	Automated rollback of changelist 261191548	Rahat Mahmood
	PiperOrigin-RevId: 261373749
2019-08-01	Implement getsockopt(TCP_INFO).	Rahat Mahmood
	Export some readily-available fields for TCP_INFO and stub out the rest. PiperOrigin-RevId: 261191548
2019-07-23	Deduplicate EndpointState.connected some	Tamir Duberstein
	This fixes a bug introduced in cl/251934850 that caused connect-accept-close-connect races to result in the second connect call failiing when it should have succeeded. PiperOrigin-RevId: 259584525