summaryrefslogtreecommitdiffhomepage
path: root/pkg/sentry/kernel/BUILD
AgeCommit message (Collapse)Author
2021-06-29[syserror] Change syserror to linuxerr for E2BIG, EADDRINUSE, and EINVALZach Koopmans
Remove three syserror entries duplicated in linuxerr. Because of the linuxerr.Equals method, this is a mere change of return values from syserror to linuxerr definitions. Done with only these three errnos as CLs removing all grow to a significantly large size. PiperOrigin-RevId: 382173835
2021-06-22[syserror] Add conversions to linuxerr with temporary Equals method.Zach Koopmans
Add Equals method to compare syserror and unix.Errno errors to linuxerr errors. This will facilitate removal of syserror definitions in a followup, and finding needed conversions from unix.Errno to linuxerr. PiperOrigin-RevId: 380909667
2021-06-16[syserror] Refactor linuxerr and error package.Zach Koopmans
Move Error struct to pkg/errors package for use in multiple places. Move linuxerr static definitions under pkg/errors/linuxerr. Add a lookup list for quick lookup of *errors.Error by errno. This is useful when converting syserror errors and unix.Errno/syscall.Errrno values to *errors.Error. Update benchmarks routines to include conversions. The below benchmarks show *errors.Error usage to be comparable to using unix.Errno. BenchmarkAssignUnix BenchmarkAssignUnix-32 787875022 1.284 ns/op BenchmarkAssignLinuxerr BenchmarkAssignLinuxerr-32 1000000000 1.209 ns/op BenchmarkAssignSyserror BenchmarkAssignSyserror-32 759269229 1.429 ns/op BenchmarkCompareUnix BenchmarkCompareUnix-32 1000000000 1.310 ns/op BenchmarkCompareLinuxerr BenchmarkCompareLinuxerr-32 1000000000 1.241 ns/op BenchmarkCompareSyserror BenchmarkCompareSyserror-32 147196165 8.248 ns/op BenchmarkSwitchUnix BenchmarkSwitchUnix-32 373233556 3.664 ns/op BenchmarkSwitchLinuxerr BenchmarkSwitchLinuxerr-32 476323929 3.294 ns/op BenchmarkSwitchSyserror BenchmarkSwitchSyserror-32 39293408 29.62 ns/op BenchmarkReturnUnix BenchmarkReturnUnix-32 1000000000 0.5042 ns/op BenchmarkReturnLinuxerr BenchmarkReturnLinuxerr-32 1000000000 0.8152 ns/op BenchmarkConvertUnixLinuxerr BenchmarkConvertUnixLinuxerr-32 739948875 1.547 ns/op BenchmarkConvertUnixLinuxerrZero BenchmarkConvertUnixLinuxerrZero-32 977733974 1.489 ns/op PiperOrigin-RevId: 379806801
2021-06-01Move sync generics to their own packagesTamir Duberstein
The presence of multiple packages in a single directory sometimes confuses `go mod`, producing output like: go: downloading gvisor.dev/gvisor v0.0.0-20210601174640-77dc0f5bc94d $GOMODCACHE/gvisor.dev/gvisor@v0.0.0-20210601174640-77dc0f5bc94d/pkg/linewriter/linewriter.go:21:2: found packages sync (aliases.go) and seqatomic (generic_atomicptr_unsafe.go) in $GOMODCACHE/gvisor.dev/gvisor@v0.0.0-20210601174640-77dc0f5bc94d/pkg/sync imports.go:67:2: found packages tcp (accept.go) and rcv (rcv_test.go) in $GOMODCACHE/gvisor.dev/gvisor@v0.0.0-20210601174640-77dc0f5bc94d/pkg/tcpip/transport/tcp PiperOrigin-RevId: 376956213
2021-04-02Implement cgroupfs.Rahat Mahmood
A skeleton implementation of cgroupfs. It supports trivial cpu and memory controllers with no support for hierarchies. PiperOrigin-RevId: 366561126
2021-03-29[syserror] Split usermem packageZach Koopmans
Split usermem package to help remove syserror dependency in go_marshal. New hostarch package contains code not dependent on syserror. PiperOrigin-RevId: 365651233
2021-03-03[op] Replace syscall package usage with golang.org/x/sys/unix in pkg/.Ayush Ranjan
The syscall package has been deprecated in favor of golang.org/x/sys. Note that syscall is still used in the following places: - pkg/sentry/socket/hostinet/stack.go: some netlink related functionalities are not yet available in golang.org/x/sys. - syscall.Stat_t is still used in some places because os.FileInfo.Sys() still returns it and not unix.Stat_t. Updates #214 PiperOrigin-RevId: 360701387
2021-01-22Implement F_GETLK fcntl.Dean Deng
Fixes #5113. PiperOrigin-RevId: 353313374
2020-11-13Check for misuse of kernel.Task as context.Context.Jamie Liu
Checks in Task.block() and Task.Value() are conditional on race detection being enabled, since these functions are relatively hot. Checks in Task.SleepStart() and Task.UninterruptibleSleepStart() are enabled unconditionally, since these functions are not thought to lie on any critical paths, and misuse of these functions is required for b/168241471 to manifest. PiperOrigin-RevId: 342342175
2020-11-12Rename kernel.TaskContext to kernel.TaskImage.Jamie Liu
This reduces confusion with context.Context (which is also relevant to kernel.Tasks) and is consistent with existing function kernel.LoadTaskImage(). PiperOrigin-RevId: 342167298
2020-10-23Rewrite reference leak checker without finalizers.Dean Deng
Our current reference leak checker uses finalizers to verify whether an object has reached zero references before it is garbage collected. There are multiple problems with this mechanism, so a rewrite is in order. With finalizers, there is no way to guarantee that a finalizer will run before the program exits. When an unreachable object with a finalizer is garbage collected, its finalizer will be added to a queue and run asynchronously. The best we can do is run garbage collection upon sandbox exit to make sure that all finalizers are enqueued. Furthermore, if there is a chain of finalized objects, e.g. A points to B points to C, garbage collection needs to run multiple times before all of the finalizers are enqueued. The first GC run will register the finalizer for A but not free it. It takes another GC run to free A, at which point B's finalizer can be registered. As a result, we need to run GC as many times as the length of the longest such chain to have a somewhat reliable leak checker. Finally, a cyclical chain of structs pointing to one another will never be garbage collected if a finalizer is set. This is a well-known issue with Go finalizers (https://github.com/golang/go/issues/7358). Using leak checking on filesystem objects that produce cycles will not work and even result in memory leaks. The new leak checker stores reference counted objects in a global map when leak check is enabled and removes them once they are destroyed. At sandbox exit, any remaining objects in the map are considered as leaked. This provides a deterministic way of detecting leaks without relying on the complexities of finalizers and garbage collection. This approach has several benefits over the former, including: - Always detects leaks of objects that should be destroyed very close to sandbox exit. The old checker very rarely detected these leaks, because it relied on garbage collection to be run in a short window of time. - Panics if we forgot to enable leak check on a ref-counted object (we will try to remove it from the map when it is destroyed, but it will never have been added). - Can store extra logging information in the map values without adding to the size of the ref count struct itself. With the size of just an int64, the ref count object remains compact, meaning frequent operations like IncRef/DecRef are more cache-efficient. - Can aggregate leak results in a single report after the sandbox exits. Instead of having warnings littered in the log, which were non-deterministically triggered by garbage collection, we can print all warning messages at once. Note that this could also be a limitation--the sandbox must exit properly for leaks to be detected. Some basic benchmarking indicates that this change does not significantly affect performance when leak checking is enabled, which is understandable since registering/unregistering is only done once for each filesystem object. Updates #1486. PiperOrigin-RevId: 338685972
2020-10-19[vfs2] Fix fork reference leaks.Dean Deng
PiperOrigin-RevId: 337919424
2020-10-14Fix shm reference leak.Dean Deng
All shm segments in an IPC namespace should be released once that namespace is destroyed. Add reference counting to IPCNamespace so that once the last task with a reference on it exits, we can trigger a destructor that will clean up all shm segments that have not been explicitly freed by the application. PiperOrigin-RevId: 337032977
2020-10-02Convert uses of the binary package in kernel to go-marshal.Rahat Mahmood
PiperOrigin-RevId: 335077195
2020-09-24Fix socket record leak in VFS2Tiwei Bie
VFS2 socket record is not removed from the system-wide socket table when the socket is released, which will lead to a memory leak. This patch fixes this issue. Fixes: #3874 Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-24Rename kernel.SocketEntry to kernel.SocketRecordTiwei Bie
SocketEntry can be confusing with the template types as the 'Entry' is usually used as a suffix for list element types, e.g. socketEntry in the same package. Suggested by Dean (@dean-deng). Signed-off-by: Tiwei Bie <tiwei.btw@antgroup.com>
2020-09-15Enable automated marshalling for the syscall package.Rahat Mahmood
PiperOrigin-RevId: 331940975
2020-09-11Move the 'marshal' and 'primitive' packages to the 'pkg' directory.Rahat Mahmood
PiperOrigin-RevId: 331256608
2020-08-25Use new reference count utility throughout gvisor.Dean Deng
This uses the refs_vfs2 template in vfs2 as well as objects common to vfs1 and vfs2. Note that vfs1-only refcounts are not replaced, since vfs1 will be deleted soon anyway. The following structs now use the new tool, with leak check enabled: devpts:rootInode fuse:inode kernfs:Dentry kernfs:dir kernfs:readonlyDir kernfs:StaticDirectory proc:fdDirInode proc:fdInfoDirInode proc:subtasksInode proc:taskInode proc:tasksInode vfs:FileDescription vfs:MountNamespace vfs:Filesystem sys:dir kernel:FSContext kernel:ProcessGroup kernel:Session shm:Shm mm:aioMappable mm:SpecialMappable transport:queue And the following use the template, but because they currently are not leak checked, a TODO is left instead of enabling leak check in this patch: kernel:FDTable tun:tunEndpoint Updates #1486. PiperOrigin-RevId: 328460377
2020-08-25Expose basic coverage information to userspace through kcov interface.Dean Deng
In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block, which updates the memory-mapped coverage information. The Go coverage tool does not allow us to inject arbitrary instructions into basic blocks, but it does provide data that we can convert to a kcov-like format and transfer them to userspace through a memory mapping. Note that this is not a strict implementation of kcov, which is especially tricky to do because we do not have the same coverage tools available in Go that that are available for the actual Linux kernel. In Linux, a kernel configuration is set that compiles the kernel with a custom function that is called at the beginning of every basic block to write program counters to the kcov memory mapping. In Go, however, coverage tools only give us a count of basic blocks as they are executed. Every time we return to userspace, we collect the coverage information and write out PCs for each block that was executed, providing userspace with the illusion that the kcov data is always up to date. For convenience, we also generate a unique synthetic PC for each block instead of using actual PCs. Finally, we do not provide thread-specific coverage data (each kcov instance only contains PCs executed by the thread owning it); instead, we will supply data for any file specified by -- instrumentation_filter. Also, fix issue in nogo that was causing pkg/coverage:coverage_nogo compilation to fail. PiperOrigin-RevId: 328426526
2020-08-17Remove weak references from unix sockets.Dean Deng
The abstract socket namespace no longer holds any references on sockets. Instead, TryIncRef() is used when a socket is being retrieved in BoundEndpoint(). Abstract sockets are now responsible for removing themselves from the namespace they are in, when they are destroyed. Updates #1486. PiperOrigin-RevId: 327064173
2020-07-23Add task work mechanism.Dean Deng
Like task_work in Linux, this allows us to register callbacks to be executed before returning to userspace. This is needed for kcov support, which requires coverage information to be up-to-date whenever we are in user mode. We will provide coverage data through the kcov interface to enable coverage-directed fuzzing in syzkaller. One difference from Linux is that task work cannot queue work before the transition to userspace that it precedes; queued work will be picked up before the next transition. PiperOrigin-RevId: 322889984
2020-06-23Support for saving pointers to fields in the state package.Adin Scannell
Previously, it was not possible to encode/decode an object graph which contained a pointer to a field within another type. This was because the encoder was previously unable to disambiguate a pointer to an object and a pointer within the object. This CL remedies this by constructing an address map tracking the full memory range object occupy. The encoded Refvalue message has been extended to allow references to children objects within another object. Because the encoding process may learn about object structure over time, we cannot encode any objects under the entire graph has been generated. This CL also updates the state package to use standard interfaces intead of reflection-based dispatch in order to improve performance overall. This includes a custom wire protocol to significantly reduce the number of allocations and take advantage of structure packing. As part of these changes, there are a small number of minor changes in other places of the code base: * The lists used during encoding are changed to use intrusive lists with the objectEncodeState directly, which required that the ilist Len() method is updated to work properly with the ElementMapper mechanism. * A bug is fixed in the list code wherein Remove() called on an element that is already removed can corrupt the list (removing the element if there's only a single element). Now the behavior is correct. * Standard error wrapping is introduced. * Compressio was updated to implement the new wire.Reader and wire.Writer inteface methods directly. The lack of a ReadByte and WriteByte caused issues not due to interface dispatch, but because underlying slices for a Read or Write call through an interface would always escape to the heap! * Statify has been updated to support the new APIs. See README.md for a description of how the new mechanism works. PiperOrigin-RevId: 318010298
2020-06-16Port aio to VFS2.Nicolas Lacasse
In order to make sure all aio goroutines have stopped during S/R, a new WaitGroup was added to TaskSet, analagous to runningGoroutines. This WaitGroup is incremented with each aio goroutine, and waited on during kernel.Pause. The old VFS1 aio code was changed to use this new WaitGroup, rather than fs.Async. The only uses of fs.Async are now inode and mount Release operations, which do not call fs.Async recursively. This fixes a lock-ordering violation that can cause deadlocks. Updates #1035. PiperOrigin-RevId: 316689380
2020-05-14Port memfd_create to vfs2 and finish implementation of file seals.Nicolas Lacasse
Closes #2612. PiperOrigin-RevId: 311548074
2020-05-07Move pkg/sentry/vfs/{eventfd,timerfd} to new packages in pkg/sentry/fsimpl.Nicolas Lacasse
They don't depend on anything in VFS2, so they should be their own packages. PiperOrigin-RevId: 310416807
2020-04-16Implement pipe(2) and pipe2(2) for VFS2.Jamie Liu
Updates #1035 PiperOrigin-RevId: 306968644
2020-04-03Add FileDescriptionImpl for Unix sockets.Dean Deng
This change involves several steps: - Refactor the VFS1 unix socket implementation to share methods between VFS1 and VFS2 where possible. Re-implement the rest. - Override the default PRead, Read, PWrite, Write, Ioctl, Release methods in FileDescriptionDefaultImpl. - Add functions to create and initialize a new Dentry/Inode and FileDescription for a Unix socket file. Updates #1476 PiperOrigin-RevId: 304689796
2020-03-31Add socket filesystem and global disconnected socket mount for VFS2.Dean Deng
A socket mount where anonymous sockets will reside is added to the VirtualFilesystem. Socketfs is built on top of kernfs. Updates #1476, #1478, #1484, #1485. PiperOrigin-RevId: 304095251
2020-02-14Enable automated marshalling for struct stat.gVisor bot
This requires fixing a few build issues for non-am64 platforms. PiperOrigin-RevId: 295196922
2020-02-14Plumb VFS2 inside the SentrygVisor bot
- Added fsbridge package with interface that can be used to open and read from VFS1 and VFS2 files. - Converted ELF loader to use fsbridge - Added VFS2 types to FSContext - Added vfs.MountNamespace to ThreadGroup Updates #1623 PiperOrigin-RevId: 295183950
2020-02-05Add notes to relevant tests.Adin Scannell
These were out-of-band notes that can help provide additional context and simplify automated imports. PiperOrigin-RevId: 293525915
2020-01-28Add vfs.FileDescription to FD tableFabricio Voznika
FD table now holds both VFS1 and VFS2 types and uses the correct one based on what's set. Parts of this CL are just initial changes (e.g. sys_read.go, runsc/main.go) to serve as a template for the remaining changes. Updates #1487 Updates #1623 PiperOrigin-RevId: 292023223
2020-01-27Update package locations.Adin Scannell
Because the abi will depend on the core types for marshalling (usermem, context, safemem, safecopy), these need to be flattened from the sentry directory. These packages contain no sentry-specific details. PiperOrigin-RevId: 291811289
2020-01-27Standardize on tools directory.Adin Scannell
PiperOrigin-RevId: 291745021
2020-01-09New sync package.Ian Gudger
* Rename syncutil to sync. * Add aliases to sync types. * Replace existing usage of standard library sync package. This will make it easier to swap out synchronization primitives. For example, this will allow us to use primitives from github.com/sasha-s/go-deadlock to check for lock ordering violations. Updates #1472 PiperOrigin-RevId: 289033387
2019-11-21Import and structure cleanup.Adin Scannell
PiperOrigin-RevId: 281795269
2019-10-16Reorder BUILD license and load functions in gvisor.Kevin Krakauer
PiperOrigin-RevId: 275139066
2019-09-23internal BUILD file cleanup.gVisor bot
PiperOrigin-RevId: 270680704
2019-09-19Job control: controlling TTYs and foreground process groups.Kevin Krakauer
Adresses a deadlock with the rolled back change: https://github.com/google/gvisor/commit/b6a5b950d28e0b474fdad160b88bc15314cf9259 Creating a session from an orphaned process group was causing a lock to be acquired twice by a single goroutine. This behavior is addressed, and a test (OrphanRegression) has been added to pty.cc. Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 270088599
2019-09-12Remove go_test from go_stateify and go_marshalMichael Pratt
They are no-ops, so the standard rule works fine. PiperOrigin-RevId: 268776264
2019-08-30Automated rollback of changelist 261387276Bhasker Hariharan
PiperOrigin-RevId: 266491264
2019-08-02Job control: controlling TTYs and foreground process groups.Kevin Krakauer
(Don't worry, this is mostly tests.) Implemented the following ioctls: - TIOCSCTTY - set controlling TTY - TIOCNOTTY - remove controlling tty, maybe signal some other processes - TIOCGPGRP - get foreground process group. Also enables tcgetpgrp(). - TIOCSPGRP - set foreground process group. Also enabled tcsetpgrp(). Next steps are to actually turn terminal-generated control characters (e.g. C^c) into signals to the proper process groups, and to send SIGTTOU and SIGTTIN when appropriate. PiperOrigin-RevId: 261387276
2019-07-09build: add nogo for static validationAdin Scannell
PiperOrigin-RevId: 257297820
2019-07-02Remove map from fd_map, change to fd_table.Adin Scannell
This renames FDMap to FDTable and drops the kernel.FD type, which had an entire package to itself and didn't serve much use (it was freely cast between types, and served as more of an annoyance than providing any protection.) Based on BenchmarkFDLookupAndDecRef-12, we can expect 5-10 ns per lookup operation, and 10-15 ns per concurrent lookup operation of savings. This also fixes two tangential usage issues with the FDMap. Namely, non-atomic use of NewFDFrom and associated calls to Remove (that are both racy and fail to drop the reference on the underlying file.) PiperOrigin-RevId: 256285890
2019-06-13Update canonical repository.Adin Scannell
This can be merged after: https://github.com/google/gvisor-website/pull/77 or https://github.com/google/gvisor-website/pull/78 PiperOrigin-RevId: 253132620
2019-06-10Store more information in the kernel socket table.Rahat Mahmood
Store enough information in the kernel socket table to distinguish between different types of sockets. Previously we were only storing the socket family, but this isn't enough to classify sockets. For example, TCPv4 and UDPv4 sockets are both AF_INET, and ICMP sockets are SOCK_DGRAM sockets with a particular protocol. Instead of creating more sub-tables, flatten the socket table and provide a filtering mechanism based on the socket entry. Also generate and store a socket entry index ("sl" in linux) which allows us to output entries in a stable order from procfs. PiperOrigin-RevId: 252495895
2019-04-01Save/restore simple devices.Rahat Mahmood
We weren't saving simple devices' last allocated inode numbers, which caused inode number reuse across S/R. PiperOrigin-RevId: 241414245 Change-Id: I964289978841ef0a57d2fa48daf8eab7633c1284
2019-03-14Decouple filemem from platform and move it to pgalloc.MemoryFile.Jamie Liu
This is in preparation for improved page cache reclaim, which requires greater integration between the page cache and page allocator. PiperOrigin-RevId: 238444706 Change-Id: Id24141b3678d96c7d7dc24baddd9be555bffafe4
2019-02-20Make some ptrace commands x86-onlyHaibo Xu
Signed-off-by: Haibo Xu <haibo.xu@arm.com> Change-Id: I9751f859332d433ca772d6b9733f5a5a64398ec7 PiperOrigin-RevId: 234877624