summaryrefslogtreecommitdiffhomepage
path: root/pkg
AgeCommit message (Collapse)Author
2019-04-04Set fixed field in CPUID function 2Michael Pratt
From the SDM: "The least-significant byte in register EAX (register AL) will always return 01H. Software should ignore this value and not interpret it as an informational descriptor." Unfortunately, online docs [1] [2] (likely based on an old version of the SDM) say: "The least-significant byte in register EAX (register AL) indicates the number of times the CPUID instruction must be executed with an input value of 2 to get a complete description of the processor's caches and TLBs." dlang uses this second interpretation [3] and will loop 2^32 times if we return zero. Fix this by specifying the fixed value of one. We still don't support exposing the actual cache information, leaving all other bytes empty. A zero byte means: "Null descriptor, this byte contains no information." [1] http://www.sandpile.org/x86/cpuid.htm#level_0000_0002h [2] https://c9x.me/x86/html/file_module_x86_id_45.html [3] https://github.com/dlang/druntime/blob/424640864c2aa001731467e96f637bd3e704e481/src/core/cpuid.d#L533-L534 PiperOrigin-RevId: 242046629 Change-Id: Ic0f0a5f974b20f71391cb85645bdcd4003e5fe88
2019-04-04gvisor: Add support for the MS_NOEXEC mount optionAndrei Vagin
https://github.com/google/gvisor/issues/145 PiperOrigin-RevId: 242044115 Change-Id: I8f140fe05e32ecd438b6be218e224e4b7fe05878
2019-04-04Remove defer from trivial ThreadID methodsMichael Pratt
In particular, ns.IDOfTask and tg.ID are used for gettid and getpid, respectively, where removing defer saves ~100ns. This may be a small improvement to application logging, which may call gettid/getpid frequently. PiperOrigin-RevId: 242039616 Change-Id: I860beb62db3fe077519835e6bafa7c74cba6ca80
2019-04-04BUILD: Add useful go_path targetAdin Scannell
Change-Id: Ibd6d8a1a63826af6e62a0f0669f8f0866c8091b4 PiperOrigin-RevId: 242037969
2019-04-03Internal change.Googler
PiperOrigin-RevId: 241867632 Change-Id: I29459f2758ac4835882b491ff25c6aca9a37d41d
2019-04-03Only CopyOut CPU when it changesMichael Pratt
This will save copies when preemption is not caused by a CPU migration. PiperOrigin-RevId: 241844399 Change-Id: I2ba3b64aa377846ab763425bd59b61158f576851
2019-04-03Don't release d.mu in checks for child-existence.Nicolas Lacasse
Dirent.exists() is called in Create to check whether a child with the given name already exists. Dirent.exists() calls walk(), and before this CL allowed walk() to drop d.mu while calling d.Inode.Lookup. During this existence check, a racing Rename() can acquire d.mu and create a new child of the dirent with the same name. (Note that the source and destination of the rename must be in the same directory, otherwise renameMu will be taken preventing the race.) In this case, d.exists() can return false, even though a child with the same name actually does exist. This CL changes d.exists() so that it does not release d.mu while walking, thus preventing the race with Rename. It also adds comments noting that lockForRename may not take renameMu if the source and destination are in the same directory, as this is a bit surprising (at least it was to me). PiperOrigin-RevId: 241842579 Change-Id: I56524870e39dfcd18cab82054eb3088846c34813
2019-04-03Cache ThreadGroups in PIDNamespaceMichael Pratt
If there are thousands of threads, ThreadGroupsAppend becomes very expensive as it must iterate over all Tasks to find the ThreadGroup leaders. Reduce the cost by maintaining a map of ThreadGroups which can be used to grab them all directly. The one somewhat visible change is to convert PID namespace init children zapping to a group-directed SIGKILL, as Linux did in 82058d668465 "signal: Use group_send_sig_info to kill all processes in a pid namespace". In a benchmark that creates N threads which sleep for two minutes, we see approximately this much CPU time in ThreadGroupsAppend: Before: 1 thread: 0ms 1024 threads: 30ms - 9130ms 4096 threads: 50ms - 2000ms 8192 threads: 18160ms 16384 threads: 17210ms After: 1 thread: 0ms 1024 threads: 0ms 4096 threads: 0ms 8192 threads: 0ms 16384 threads: 0ms The profiling is actually extremely noisy (likely due to cache effects), as some runs show almost no samples at 1024, 4096 threads, but obviously this does not scale to lots of threads. PiperOrigin-RevId: 241828039 Change-Id: I17827c90045df4b3c49b3174f3a05bca3026a72c
2019-04-03Fix index out of bounds in tty implementation.Kevin Krakauer
The previous implementation revolved around runes instead of bytes, which caused weird behavior when converting between the two. For example, peekRune would read the byte 0xff from a buffer, convert it to a rune, then return it. As rune is an alias of int32, 0xff was 0-padded to int32(255), which is the hex code point for ?. However, peekRune also returned the length of the byte (1). When calling utf8.EncodeRune, we only allocated 1 byte, but tried the write the 2-byte character ?. tl;dr: I apparently didn't understand runes when I wrote this. PiperOrigin-RevId: 241789081 Change-Id: I14c788af4d9754973137801500ef6af7ab8a8727
2019-04-03Addresses data race in tty implementation.Kevin Krakauer
Also makes the safemem reading and writing inline, as it makes it easier to see what locks are held. PiperOrigin-RevId: 241775201 Change-Id: Ib1072f246773ef2d08b5b9a042eb7e9e0284175c
2019-04-03Add syscall annotations for unimplemented syscallsIan Lewis
Added syscall annotations for unimplemented syscalls for later generation into reference docs. Annotations are of the form: @Syscall(<name>, <key:value>, ...) Supported args and values are: - arg: A syscall option. This entry only applies to the syscall when given this option. - support: Indicates support level - UNIMPLEMENTED: Unimplemented (implies returns:ENOSYS) - PARTIAL: Partial support. Details should be provided in note. - FULL: Full support - returns: Indicates a known return value. Values are syscall errors. This is treated as a string so you can use something like "returns:EPERM or ENOSYS". - issue: A Github issue number. - note: A note Example: // @Syscall(mmap, arg:MAP_PRIVATE, support:FULL, note:Private memory fully supported) // @Syscall(mmap, arg:MAP_SHARED, support:UNIMPLEMENTED, issue:123, note:Shared memory not supported) // @Syscall(setxattr, returns:ENOTSUP, note:Requires file system support) Annotations should be placed as close to their implementation as possible (preferrably as part of a supporting function's Godoc) and should be updated as syscall support changes. PiperOrigin-RevId: 241697482 Change-Id: I7a846135db124e1271dc5057d788cba82ca312d4
2019-04-02Set options on the correct Task in PTRACE_SEIZE.Jamie Liu
$ docker run --rm --runtime=runsc -it --cap-add=SYS_PTRACE debian bash -c "apt-get update && apt-get install strace && strace ls" ... Setting up strace (4.15-2) ... execve("/bin/ls", ["ls"], [/* 6 vars */]) = 0 brk(NULL) = 0x5646d8c1e000 uname({sysname="Linux", nodename="114ef93d2db3", ...}) = 0 ... PiperOrigin-RevId: 241643321 Change-Id: Ie4bce27a7fb147eef07bbae5895c6ef3f529e177
2019-04-02Add test that symlinking over a directory returns EEXIST.Nicolas Lacasse
Also remove comments in InodeOperations that required that implementation of some Create* operations ensure that the name does not already exist, since these checks are all centralized in the Dirent. PiperOrigin-RevId: 241637335 Change-Id: Id098dc6063ff7c38347af29d1369075ad1e89a58
2019-04-02Fix more data races in shm debug messages.Rahat Mahmood
PiperOrigin-RevId: 241630409 Change-Id: Ie0df5f5a2f20c2d32e615f16e2ba43c88f963181
2019-04-02device: fix device major/minorWei Zhang
Current gvisor doesn't give devices a right major and minor number. When testing golang supporting of gvisor, I run the test case below: ``` $ docker run -ti --runtime runsc golang:1.12.1 bash -c "cd /usr/local/go/src && ./run.bash " ``` And it reports some errors, one of them is: "--- FAIL: TestDevices (0.00s) --- FAIL: TestDevices//dev/null_1:3 (0.00s) dev_linux_test.go:45: for /dev/null Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/null Minor(0x0) == 0, want 3 dev_linux_test.go:51: for /dev/null Mkdev(1, 3) == 0x103, want 0x0 --- FAIL: TestDevices//dev/zero_1:5 (0.00s) dev_linux_test.go:45: for /dev/zero Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/zero Minor(0x0) == 0, want 5 dev_linux_test.go:51: for /dev/zero Mkdev(1, 5) == 0x105, want 0x0 --- FAIL: TestDevices//dev/random_1:8 (0.00s) dev_linux_test.go:45: for /dev/random Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/random Minor(0x0) == 0, want 8 dev_linux_test.go:51: for /dev/random Mkdev(1, 8) == 0x108, want 0x0 --- FAIL: TestDevices//dev/full_1:7 (0.00s) dev_linux_test.go:45: for /dev/full Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/full Minor(0x0) == 0, want 7 dev_linux_test.go:51: for /dev/full Mkdev(1, 7) == 0x107, want 0x0 --- FAIL: TestDevices//dev/urandom_1:9 (0.00s) dev_linux_test.go:45: for /dev/urandom Major(0x0) == 0, want 1 dev_linux_test.go:48: for /dev/urandom Minor(0x0) == 0, want 9 dev_linux_test.go:51: for /dev/urandom Mkdev(1, 9) == 0x109, want 0x0 " So I think we'd better assign to them correct major/minor numbers following linux spec. Signed-off-by: Wei Zhang <zhangwei198900@gmail.com> Change-Id: I4521ee7884b4e214fd3a261929e3b6dac537ada9 PiperOrigin-RevId: 241609021
2019-04-02Add a raw socket transport endpoint and use it for raw ICMP sockets.Kevin Krakauer
Having raw socket code together will make it easier to add support for other raw network protocols. Currently, only ICMP uses the raw endpoint. However, adding support for other protocols such as UDP shouldn't be much more difficult than adding a few switch cases. PiperOrigin-RevId: 241564875 Change-Id: I77e03adafe4ce0fd29ba2d5dfdc547d2ae8f25bf
2019-04-01Save/restore simple devices.Rahat Mahmood
We weren't saving simple devices' last allocated inode numbers, which caused inode number reuse across S/R. PiperOrigin-RevId: 241414245 Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-04-01Don't expand COW-break on executable VMAs.Jamie Liu
PiperOrigin-RevId: 241403847 Change-Id: I4631ca05734142da6e80cdfa1a1d63ed68aa05cc
2019-04-01gvisor: convert ilist to ilist:generic_listAndrei Vagin
ilist:generic_list works faster (cl/240185278) and the code looks cleaner without type casting. PiperOrigin-RevId: 241381175 Change-Id: I8487ab1d73637b3e9733c253c56dce9e79f0d35f
2019-03-29Use kernel.Task.CopyScratchBuffer in syscalls/linux where possible.Jamie Liu
PiperOrigin-RevId: 241072126 Change-Id: Ib4d9f58f550732ac4c5153d3cf159a5b1a9749da
2019-03-29Treat fsync errors during save as SaveRejection errors.Nicolas Lacasse
PiperOrigin-RevId: 241055485 Change-Id: I70259e9fef59bdf9733b35a2cd3319359449dd45
2019-03-29Drop reference on shared anon mappableMichael Pratt
We call NewSharedAnonMappable simply to use it for Mappable/MappingIdentity for shared anon mmap. From MMapOpts.MappingIdentity: "If MMapOpts is used to successfully create a memory mapping, a reference is taken on MappingIdentity." mm.createVMALocked (below) takes this additional reference, so we don't need the reference returned by NewSharedAnonMappable. Holding it leaks the mappable. PiperOrigin-RevId: 241038108 Change-Id: I78ee3af78e0cc7aac4063b274b30d0e41eb5677d
2019-03-29Return srclen in proc.idMapFileOperations.Write.Jamie Liu
PiperOrigin-RevId: 241037926 Change-Id: I4b0381ac1c7575e8b861291b068d3da22bc03850
2019-03-29Treat ENOSPC as a state-file error during save.Nicolas Lacasse
PiperOrigin-RevId: 241028806 Change-Id: I770bf751a2740869a93c3ab50370a727ae580470
2019-03-29Fix incorrect checksums in TCP and UDP tests.Bhasker Hariharan
PiperOrigin-RevId: 241025361 Change-Id: I292e7aea9a4b294b11e4f736e107010d9524586b
2019-03-28Fix Panic in SACKScoreboard.Delete.Bhasker Hariharan
The panic was caused by modifying the tree while iterating which invalidated the iterator. Also fixes another bug in SACKScoreboard.Insert() which was causing blocks to be merged incorrectly. PiperOrigin-RevId: 240895053 Change-Id: Ia72b8244297962df5c04283346da5226434740af
2019-03-28set task's name when forkchris.zn
When fork a child process, the name filed of TaskContext is not set. It results in that when we cat /proc/{pid}/status, the name filed is null. Like this: Name: State: S (sleeping) Tgid: 28 Pid: 28 PPid: 26 TracerPid: 0 FDSize: 8 VmSize: 89712 kB VmRSS: 6648 kB Threads: 1 CapInh: 00000000a93d35fb CapPrm: 0000000000000000 CapEff: 0000000000000000 CapBnd: 00000000a93d35fb Seccomp: 0 Change-Id: I5d469098c37cedd19da16b7ffab2e546a28a321e PiperOrigin-RevId: 240893304
2019-03-28Setting timestamps should trigger an inotify event.Nicolas Lacasse
PiperOrigin-RevId: 240850187 Change-Id: I1458581b771a1031e47bba439e480829794927b8
2019-03-28Add ICMP statsBert Muthalaly
PiperOrigin-RevId: 240848882 Change-Id: I23dd4599f073263437aeab357c3f767e1a432b82
2019-03-28Internal change.Googler
PiperOrigin-RevId: 240842801 Change-Id: Ibbd6f849f9613edc1b1dd7a99a97d1ecdb6e9188
2019-03-28Clean up gofer handle caching.Jamie Liu
- Document fsutil.CachedFileObject.FD() requirements on access permissions, and change gofer.inodeFileState.FD() to honor them. Fixes #147. - Combine gofer.inodeFileState.readonly and gofer.inodeFileState.readthrough, and simplify handle caching logic. - Inline gofer.cachePolicy.cacheHandles into gofer.inodeFileState.setSharedHandles, because users with access to gofer.inodeFileState don't necessarily have access to the fs.Inode (predictably, this is a save/restore problem). Before this CL: $ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash root@34d51017ed67:/# /root/repro/runsc-b147 mmap: 0x7f3c01e45000 Segmentation fault After this CL: $ docker run --runtime=runsc-d -v $(pwd)/gvisor/repro:/root/repro -it ubuntu bash root@d3c3cb56bbf9:/# /root/repro/runsc-b147 mmap: 0x7f78987ec000 o PiperOrigin-RevId: 240818413 Change-Id: I49e1d4a81a0cb9177832b0a9f31a10da722a896b
2019-03-28netstack/fdbased: add generic segmentation offload (GSO) supportAndrei Vagin
The linux packet socket can handle GSO packets, so we can segment packets to 64K instead of the MTU which is usually 1500. Here are numbers for the nginx-1m test: runsc: 579330.01 [Kbytes/sec] received runsc-gso: 1794121.66 [Kbytes/sec] received runc: 2122139.06 [Kbytes/sec] received and for tcp_benchmark: $ tcp_benchmark --duration 15 --ideal [ 4] 0.0-15.0 sec 86647 MBytes 48456 Mbits/sec $ tcp_benchmark --client --duration 15 --ideal [ 4] 0.0-15.0 sec 2173 MBytes 1214 Mbits/sec $ tcp_benchmark --client --duration 15 --ideal --gso 65536 [ 4] 0.0-15.0 sec 19357 MBytes 10825 Mbits/sec PiperOrigin-RevId: 240809103 Change-Id: I2637f104db28b5d4c64e1e766c610162a195775a
2019-03-27Add rsslim field in /proc/pid/stat.Nicolas Lacasse
PiperOrigin-RevId: 240681675 Change-Id: Ib214106e303669fca2d5c744ed5c18e835775161
2019-03-27Avoid mutating memory passed to DeliverTransportPacketTamir Duberstein
PiperOrigin-RevId: 240642903 Change-Id: I16625015123a827d267d60b328a202057264bbd6
2019-03-27Add start time to /proc/<pid>/stat.Nicolas Lacasse
The start time is the number of clock ticks between the boot time and application start time. PiperOrigin-RevId: 240619475 Change-Id: Ic8bd7a73e36627ed563988864b0c551c052492a5
2019-03-27Dev device methods should take pointer receiver.Nicolas Lacasse
PiperOrigin-RevId: 240600504 Change-Id: I7dd5f27c8da31f24b68b48acdf8f1c19dbd0c32d
2019-03-27Convert []byte to string without copying in usermem.CopyStringIn.Jamie Liu
This is the same technique used by Go's strings.Builder (https://golang.org/src/strings/builder.go#L45), and for the same reason. (We can't just use strings.Builder because there's no way to get the underlying []byte to pass to usermem.IO.CopyIn.) PiperOrigin-RevId: 240594892 Change-Id: Ic070e7e480aee53a71289c7c120850991358c52c
2019-03-26Remove polling from ICMP testTamir Duberstein
PiperOrigin-RevId: 240483396 Change-Id: Ie75d3ae38af83f1d92f167ff9ba58fa10f5b372b
2019-03-26Automated rollback of changelist 234892473Michael Pratt
PiperOrigin-RevId: 240462667 Change-Id: I3d1c5c0d80a3badced963ae1d450c20ed8a767ed
2019-03-26netstack: Don't exclude length when a pseudo-header checksum is calculatedAndrei Vagin
This is a preparation for GSO changes (cl/234508902). RELNOTES[gofers]: Refactor checksum code to include length, which it already did, but in a convoluted way. Should be a no-op. PiperOrigin-RevId: 240460794 Change-Id: I537381bc670b5a9f5d70a87aa3eb7252e8f5ace2
2019-03-26Implement memfd_create.Rahat Mahmood
Memfds are simply anonymous tmpfs files with no associated mounts. Also implementing file seals, which Linux only implements for memfds at the moment. PiperOrigin-RevId: 240450031 Change-Id: I31de78b950101ae8d7a13d0e93fe52d98ea06f2f
2019-03-26Remove echoReplierTamir Duberstein
Mirror the ICMPv6 echo implementation in ICMPv4 echo. This removes unnecessary asynchrony, reduces copying, and reduces complexity. PiperOrigin-RevId: 240394525 Change-Id: If8f53254154f86772f5e51159765aa23b3b328b8
2019-03-25Resolve stringer TODOTamir Duberstein
PiperOrigin-RevId: 240224782 Change-Id: Iab4e4e7047b2d022f15e807c2348685d8e972020
2019-03-25Call memmap.Mappable.Translate with more conservative usermem.AccessType.Jamie Liu
MM.insertPMAsLocked() passes vma.maxPerms to memmap.Mappable.Translate (although it unsets AccessType.Write if the vma is private). This somewhat simplifies handling of pmas, since it means only COW-break needs to replace existing pmas. However, it also means that a MAP_SHARED mapping of a file opened O_RDWR dirties the file, regardless of the mapping's permissions and whether or not the mapping is ever actually written to with I/O that ignores permissions (e.g. ptrace(PTRACE_POKEDATA)). To fix this: - Change the pma-getting path to request only the permissions that are required for the calling access. - Change memmap.Mappable.Translate to take requested permissions, and return allowed permissions. This preserves the existing behavior in the common cases where the memmap.Mappable isn't fsutil.CachingInodeOperations and doesn't care if the translated platform.File pages are written to. - Change the MM.getPMAsLocked path to support permission upgrading of pmas outside of copy-on-write. PiperOrigin-RevId: 240196979 Change-Id: Ie0147c62c1fbc409467a6fa16269a413f3d7d571
2019-03-25epoll: use ilist:generic_list instead of ilist:ilistAndrei Vagin
ilist:generic_list works faster than ilist:ilist. Here is a beanchmark test to measure performance of epoll_wait, when readyList isn't empty. It shows about 30% better performance with these changes. Benchmark Time(ns) CPU(ns) Iterations Before: BM_EpollAllEvents 46725 46899 14286 After: BM_EpollAllEvents 33167 33300 18919 PiperOrigin-RevId: 240185278 Change-Id: I3e33f9b214db13ab840b91613400525de5b58d18
2019-03-22lstat should resolve the final path component if it ends in a slash.Nicolas Lacasse
PiperOrigin-RevId: 239896221 Change-Id: I0949981fe50c57131c5631cdeb10b225648575c0
2019-03-22Implement PTRACE_SEIZE, PTRACE_INTERRUPT, and PTRACE_LISTEN.Jamie Liu
PiperOrigin-RevId: 239803092 Change-Id: I42d612ed6a889e011e8474538958c6de90c6fcab
2019-03-21Allow BP and OF can be called from user spaceYong He
Change the DPL from 0 to 3 for Breakpoint and Overflow, then user space could trigger Breakpoint and Overflow as excepected. Change-Id: Ibead65fb8c98b32b7737f316db93b3a8d9dcd648 PiperOrigin-RevId: 239736648
2019-03-21Replace manual pty copies to/from userspace with safemem operations.Kevin Krakauer
Also, changing queue.writeBuf from a buffer.Bytes to a [][]byte should reduce copying and reallocating of slices. PiperOrigin-RevId: 239713547 Change-Id: I6ee5ff19c3ee2662f1af5749cae7b73db0569e96
2019-03-21Clear msghdr flags on successful recvmsg.Ian Gudger
.net sets these flags to -1 and then uses their result, especting it to be zero. Does not set actual flags (e.g. MSG_TRUNC), but setting to zero is more correct than what we did before. PiperOrigin-RevId: 239657951 Change-Id: I89c5f84bc9b94a2cd8ff84e8ecfea09e01142030