From ffa6a51c58e7880a330e2bc3da22e5e5174e2100 Mon Sep 17 00:00:00 2001 From: Maria Matejka Date: Wed, 22 Dec 2021 15:35:29 +0100 Subject: Thread documentation: Chapter 4 on memory management --- doc/threads/03_coroutines.md | 4 +- doc/threads/04_memory_management.md | 223 ++++++++++++++++++++++++++++++++++++ 2 files changed, 225 insertions(+), 2 deletions(-) create mode 100644 doc/threads/04_memory_management.md (limited to 'doc') diff --git a/doc/threads/03_coroutines.md b/doc/threads/03_coroutines.md index 94ea3eda..fccc3bbf 100644 --- a/doc/threads/03_coroutines.md +++ b/doc/threads/03_coroutines.md @@ -235,5 +235,5 @@ Anyway, for now it is good enough as it is. *It's still a long road to the version 2.1. This series of texts should document what is needed to be changed, why we do it and how. The [previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/) -showed how the route export had to change to allow parallel execution. In the next chapter, we're most likely going -to show performance difference between BIRD v2.0.8 and the parallelized implementation. Stay tuned!* +showed how the route export had to change to allow parallel execution. In the next chapter, some memory management +details are to be explained together with the reasons why memory management matters. Stay tuned!* diff --git a/doc/threads/04_memory_management.md b/doc/threads/04_memory_management.md new file mode 100644 index 00000000..24ef89d3 --- /dev/null +++ b/doc/threads/04_memory_management.md @@ -0,0 +1,223 @@ +# BIRD Journey to Threads. Chapter 4: Memory and other resource management. + +BIRD is mostly a large specialized database engine, storing mega/gigabytes of +Internet routing data in memory. To keep accounts of every byte of allocated data, +BIRD has its own resource management system which must be adapted to the +multithreaded environment. The resource system has not changed much, yet it +deserves a short chapter. + +BIRD is a fast, robust and memory-efficient routing daemon designed and +implemented at the end of 20th century. We're doing a significant amount of +BIRD's internal structure changes to make it run in multiple threads in parallel. + +## Resources + +Inside BIRD, (almost) every piece of allocated memory is a resource. To achieve this, +every such memory block includes a generic `struct resource` header. The node +is enlisted inside a linked list of a *resource pool* (see below), the class +pointer defines basic operations done on resources. + +``` +typedef struct resource { + node n; /* Inside resource pool */ + struct resclass *class; /* Resource class */ +} resource; + +struct resclass { + char *name; /* Resource class name */ + unsigned size; /* Standard size of single resource */ + void (*free)(resource *); /* Freeing function */ + void (*dump)(resource *); /* Dump to debug output */ + resource *(*lookup)(resource *, unsigned long); /* Look up address (only for debugging) */ + struct resmem (*memsize)(resource *); /* Return size of memory used by the resource, may be NULL */ +}; + +void *ralloc(pool *, struct resclass *); +``` + +Resource cycle begins with an allocation of a resource. To do that, you should call `ralloc()`, +passing the parent pool and the appropriate resource class as arguments. BIRD +allocates a memory block of size given by the given class member `size`. +Beginning of the block is reserved for `struct resource` itself and initialized +by the given arguments. Therefore, you may sometimes see an idiom where a structure +has a first member `struct resource r;`, indicating that this item should be +allocated as a resource. + +The counterpart is resource freeing. This may be implicit (by resource pool +freeing) or explicit (by `rfree()`). In both cases, the `free()` function of +the appropriate class is called to cleanup the resource before final freeing. + +To account for `dump` and `memsize` calls, there are CLI commands `dump +resources` and `show memory`, using these to dump resources or show memory +usage as perceived by BIRD. + +The last, `lookup`, is quite an obsolete way to identify a specific pointer +from a debug interface. You may call `rlookup(pointer)` and BIRD should dump +that resource to the debug output. This mechanism is probably incomplete as no +developer uses it actively for debugging. + +Resources can be also moved between pools by `rmove` when needed. + +## Resource pools + +The first internal resource class is a recursive resource – a resource pool. In +the singlethreaded version, this is just a simple structure: + +``` +struct pool { + resource r; + list inside; + struct birdloop *loop; /* In multithreaded version only */ + const char *name; +}; +``` + +Resource pools are used for grouping resources together. There are pools everywhere +and it is a common idiom inside BIRD to just `rfree` the appropriate pool when +e.g. a protocol or table is going down. Everything left there is cleaned up. + +There are anyway several classes which must be freed with care. In the +singlethreaded version, the *slab* allocator (see below) must be empty before +it may be freed and this is kept to the multithreaded version while other +restrictions have been added. + +There is also a global pool, `root_pool`, containing every single resource BIRD +knows about, either directly or via another resource pool. + +### Thread safety in resource pools + +In the multithreaded version, every resource pool is bound to a specific IO +loop and therefore includes an IO loop pointer. This is important for allocations +as the resource list inside the pool is thread-unsafe. All pool operations +therefore require the IO loop to be entered to do anything with them, if possible. +(In case of `rfree`, the pool data structure is not accessed at all so no +assert is possible. We're currently relying on the caller to ensure proper locking. +In future, this may change.) + +Each IO loop also has its base resource pool for its allocations. All pools +inside the IO loop pool must belong to the same loop or to a loop with a +subordinate lock (see the previous chapter for lock ordering). If there is a +need for multiple IO loops to access one shared data structure, it must be +locked by another lock and allocated in such a way that is independent on these +accessor loops. + +The pool structure should follow the locking order. Any pool should belong to +either the same loop as its parent or its loop lock should be after its parent +loop lock in the locking order. This is not enforced explicitly, yet it is +virtually impossible to write some working code violating this recommendation. + +### Resource pools in the wilderness + +Root pool contains (among others): + +* route attributes and sources +* routing tables +* protocols +* interfaces +* configuration data + +Each table has its IO loop and uses the loop base pool for allocations. +The same holds for protocols. Each protocol has its pool; it is either its IO +loop base pool or an ordinary pool bound to main loop. + +## Memory allocators + +BIRD stores data in memory blocks allocated by several allocators. There are 3 +of them: simple memory blocks, linear pools and slabs. + +### Simple memory block + +When just a chunk of memory is needed, `mb_alloc()` or `mb_allocz()` is used +to get it. The first with `malloc()` semantics, the other is also zeroed. +There is also `mb_realloc()` available, `mb_free()` to explicitly free such a +memory and `mb_move()` to move that memory to another pool. + +Simple memory blocks consume a fixed amount of overhead memory (32 bytes on +systems with 64-bit pointers) so they are suitable mostly for big chunks, +taking advantage of the default *stdlib* allocator which is used by this +allocation strategy. There are anyway some parts of BIRD (in all versions) +where this allocator is used for little blocks. This will be fixed some day. + +### Linear pools + +Sometimes, memory is allocated temporarily. When the data may just sit on +stack, we put it there. Anyway, many tasks need more structured execution where +stack allocation is incovenient or even impossible (e.g. when callbacks from +parsers are involved). For such a case, a *linpool* is the best choice. + +This data structure allocates memory blocks of requested size with negligible +overhead in functions `lp_alloc()` (uninitialized) or `lp_allocz()` (zeroed). +There is anyway no `realloc` and no `free` call; to have a larger chunk, you +need to allocate another block. All this memory is freed at once by `lp_flush()` +when it is no longer needed. + +You may see linpools in parsers (BGP, Linux netlink, config) or in filters. + +In the multithreaded version, linpools have received an update, allocating +memory pages directly by `mmap()` instead of calling `malloc()`. More on memory +pages below. + +### Slabs + +To allocate lots of same-sized objects, a [slab allocator](https://en.wikipedia.org/wiki/Slab_allocation) +is an ideal choice. In versions until 2.0.8, our slab allocator used blocks +allocated by `malloc()`, every object included a *slab head* pointer and free objects +were linked into a single-linked list. This led to memory inefficiency and to +contra-intuitive behavior where a use-after-free bug could do lots of damage +before finally crashing. + +Versions from 2.0.9, and also all the multithreaded versions, are coming with +slabs using directly allocated memory pages and usage bitmaps instead of +single-linking the free objects. This approach however relies on the fact that +pointers returned by `mmap()` are always divisible by page size. Freeing of a +slab object involves zeroing (mostly) 13 least significant bits of its pointer +to get the page pointer where the slab head resides. + +This update helps with memory consumption by about 5% compared to previous +versions; exact numbers depend on the usage pattern. + +## Raw memory pages + +Until 2.0.8 (incl.), BIRD allocated all memory by `malloc()`. This method is +suitable for lots of use cases, yet when gigabytes of memory should be +allocated by little pieces, BIRD uses its internal allocators to keep track +about everything. This brings some ineffectivity as stdlib allocator has its +own overhead and doesn't allocate aligned memory unless asked for. + +Slabs and linear pools are backed by blocks of memory of kilobyte sizes. As a +typical memory page size is 4 kB, it is a logical step to drop stdlib +allocation from these allocators and to use `mmap()` directly. This however has +some drawbacks, most notably the need of a syscall for every memory mapping and +unmapping. For allocations, this is not much a case and the syscall time is typically +negligible compared to computation time. When freeing memory, this is much +worse as BIRD sometimes frees gigabytes of data in a blink of eye. + +To minimize the needed number of syscalls, there is a per-thread page cache, +keeping pages for future use: + +* When a new page is requested, first the page cache is tried. +* When a page is freed, the per-thread page cache keeps it without telling the kernel. +* When the number of pages in any per-thread page cache leaves a pre-defined range, + a cleanup routine is scheduled to free excessive pages or request more in advance. + +This method gives the multithreaded BIRD not only faster memory management than +ever before but also almost immediate shutdown times as the cleanup routine is +not scheduled on shutdown at all. + +## Other resources + +Some objects are not only a piece of memory; notable items are sockets, owning +the underlying mechanism of I/O, and *object locks*, owning *the right to use a +specific I/O*. This ensures that collisions on e.g. TCP port numbers and +addresses are resolved in a predictable way. + +All these resources should be used with the same locking principles as the +memory blocks. There aren't many checks inside BIRD code to ensure that yet, +nevertheless violating this recommendation may lead to multiple-access issues. + +*It's still a long road to the version 2.1. This series of texts should document +what is needed to be changed, why we do it and how. The +[previous chapter](TODO) +showed the locking system and how the parallel execution is done. +The next chapter will cover a bit more detailed explanation about route sources +and route attributes and how lockless data structures are employed there. Stay tuned!* -- cgit v1.2.3