| author | Maria Matejka <mq@ucw.cz> | 2023-04-18 09:43:06 +0200 |
|---|---|---|
| committer | Maria Matejka <mq@ucw.cz> | 2023-04-18 09:43:06 +0200 |
| commit | d975827f5f3af5127f12c7c45c99b89c869d0f95 (patch) | |
| tree | 77d6fd13e60d7663976b05f5e25a524977630326 /doc/threads/03_coroutines.md | |
| parent | 787fb56da37bfe9da4ea8d1a5a0d0cab02c9523a (diff) | |
| parent | 7404a215806972fb5e7205aaa99a3ff7b00ede16 (diff) | |

Merge tag '3.0-alpha0' into HEAD

3.0-alpha0

Diffstat (limited to 'doc/threads/03_coroutines.md')

| -rw-r--r-- | doc/threads/03_coroutines.md | 235 |

1 file changed, 235 insertions, 0 deletions
diff --git a/doc/threads/03_coroutines.md b/doc/threads/03_coroutines.md
new file mode 100644
index 00000000..8d8e6c96
--- /dev/null
+++ b/doc/threads/03_coroutines.md
@@ -0,0 +1,235 @@

# BIRD Journey to Threads. Chapter 3: Parallel execution and message passing.

Parallel execution in BIRD uses an underlying mechanism of dedicated IO loops
and hierarchical locks. The original event scheduling module has been converted
to do message passing in a multithreaded environment. These mechanisms are
crucial for understanding what happens inside BIRD and how its internal API changes.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're making significant changes to
BIRD's internal structure to make it run in multiple threads in parallel.

## Locking and deadlock prevention

Most of BIRD's data structures and algorithms are thread-unsafe and not even
reentrant. Checking and possibly updating all of these would take an
unreasonable amount of time, so the multithreaded version uses standard mutexes
to lock all the parts which have not been checked and updated yet.

The authors of the original BIRD concepts wisely chose a highly modular structure
which allows a lock hierarchy to be created. The main chokepoint was between
protocols and tables, and it has been removed by implementing asynchronous exports
as described in the [previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/).

Locks in BIRD (called domains, as they always lock some defined part of BIRD)
are partially ordered. Every *domain* has its *type* and all threads are
strictly required to lock the domains in the order of their respective types.
The full order is defined in `lib/locking.h`. It's forbidden to lock more than
one domain of a type (these domains are incomparable) and recursive locking is
forbidden as well.

The locking hierarchy is (roughly; as of February 2022) like this:

1. The BIRD Lock (for everything not yet checked and/or updated)
2. Protocols (as of February 2022, these are BFD, RPKI, Pipe and BGP)
3. Routing tables
4. Global route attribute cache
5. Message passing
6. Internals and memory management

There are heavy checks to ensure proper locking and to help debug any
problem when code violates the hierarchy rules. This impedes performance,
depending on how contended the domain is, and in some cases I have already
implemented lockless (or partially lockless) data structures to overcome this.

You may ask why these heavy checks are kept even in production builds.
Risks arising from dropping some locking checks include:

* deadlocks; these are deadly in BIRD anyway, so it should just fail with a meaningful message, or
* data corruption; it either kills BIRD anyway, or it results in a slow and vicious death,
  leaving undebuggable corefiles behind.

To be honest, I believe in principles like *"every nontrivial piece of software has at least one bug"*
and I also don't trust my future self or anybody else to always write bugless code when
it comes to proper locking. I also believe that if a lock becomes a bottleneck,
then we should think about what is locked inside and how to optimize that,
possibly implementing a lockless or wait-free data structure instead of dropping
thorough consistency checks, especially in a multithreaded environment.
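
To make the ordering rule more concrete, here is a minimal sketch of how partially
ordered lock domains can be enforced at runtime. Everything in it (the enum values,
the structure layout, the helper names and the fixed-size stack) is invented for this
example; the real definitions live in `lib/locking.h` and are considerably more involved.

```c
/* A toy model of ordered lock domains; invented for illustration,
 * not the real BIRD implementation. */
#include <assert.h>
#include <pthread.h>

enum lock_order {               /* lower value = must be locked earlier */
  ORDER_THE_BIRD_LOCK = 1,
  ORDER_PROTO,
  ORDER_RTABLE,
  ORDER_ATTRS,
  ORDER_EVENT,
  ORDER_RESOURCE,
};

struct domain {
  pthread_mutex_t mutex;
  enum lock_order order;        /* the domain's type */
};

/* Per-thread stack of currently held domain types. */
static _Thread_local enum lock_order held[8];
static _Thread_local int n_held;

void domain_lock(struct domain *d)
{
  assert(n_held < 8);
  /* Every newly locked domain must be of a strictly higher type than
   * anything this thread already holds; this rules out both recursive
   * locking and holding two domains of the same type. */
  assert(!n_held || (d->order > held[n_held - 1]));
  pthread_mutex_lock(&d->mutex);
  held[n_held++] = d->order;
}

void domain_unlock(struct domain *d)
{
  /* Domains are released strictly in reverse order of acquisition. */
  assert(n_held && (held[n_held - 1] == d->order));
  n_held--;
  pthread_mutex_unlock(&d->mutex);
}
```

With such a rule in place, a deadlock cannot form by construction: a thread may only
ever wait for a domain of a higher type than everything it already holds, so no cycle
of mutually waiting threads is possible.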

### Choosing the right locking order

When considering the locking order of protocols and route tables, the answer
was quite easy. We had to make either import or export asynchronous (or both).
Major reasons for asynchronous export have been stated in the previous chapter,
therefore it makes little sense to allow entering protocol context from table code.

As I explain later in this text, even accessing table context from protocol
code leads to contention on table locks, yet for now, it is good enough and the
lock order places routing tables after protocols to make the multithreading
goal easier to achieve.

The major lock level is still The BIRD Lock, covering not only the
not-yet-converted protocols (like Babel, OSPF or RIP) but also the processing of CLI
commands and reconfiguration. This involves an awful lot of direct access into
other contexts which would be unnecessarily complicated to implement by message
passing. Therefore, this lock is simply *"the director"*, sitting on top
in its own category.

The lower lock levels under routing tables are mostly for shared global data
structures accessed from everywhere. We'll address some of these later.

## IO Loop

There has been a protocol, BFD, running in its own thread since 2013. This
separation has a good reason: it needs low latency, while the main BIRD loop just
walks round-robin over all the available sockets, and one round trip may take
a long time (even more than a minute with large configurations). BFD had its
own IO loop implementation and simple message passing routines. This code could
be easily updated for general use, so I did it.

To understand the internal principles, we should say that in the `master`
branch, there is a big loop centered around a `poll()` call, dispatching and
executing everything as needed. In the `sark` branch, there are multiple loops
of this kind. BIRD has several ways to get something dispatched from a loop:

1. Requesting to read from a **socket** makes the main loop call your hook when some data is received.
   The same happens when a socket refuses to write data. Then the data is buffered and you are called back
   when the buffer is free enough to continue writing. There is also a third callback, an error hook, for obvious reasons.

2. Requesting to be called back after a given amount of time. This is called a **timer**.
   As is common with all timers, they aren't precise and the callback may be
   delayed significantly. This was also the reason to have the BFD loop separate
   since the very beginning, yet now the abundance of threads may lead to
   problems with BFD latency in large-scale configurations. We haven't tested
   this yet.

3. Requesting to be called back from a clean context when possible. This is
   useful to run anything not reentrant which might mess with the caller's
   data, e.g. when a protocol decides to shut down due to some inconsistency
   in received data. This is called an **event**.

4. Requesting to do some work when possible. These are also events; the only
   difference is where the event is enqueued. In the main loop, there is a
   special *work queue* with an execution limit, allowing sockets and timers to be
   handled with a reasonable latency while still doing all the work needed
   (see the sketch after this list). Other loops don't have designated work
   queues (we may add them later).
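
As a rough illustration of the event machinery, here is a self-contained sketch of an
event queue with idempotent scheduling and a per-run execution limit. The names and
fields are made up for this text and do not match BIRD's sources exactly; the sketch
only shows the two properties relied upon later: re-scheduling an already scheduled
event is a no-op, and a loop can bound how much queued work it runs before returning
to its sockets and timers.

```c
/* Simplified event queue, invented for illustration. */
#include <stddef.h>

struct event {
  void (*hook)(void *data);
  void *data;
  struct event *next;
  int scheduled;                /* nonzero while waiting in a queue */
};

struct event_queue {
  struct event *head, **tail;   /* singly-linked FIFO */
};

void queue_init(struct event_queue *q)
{
  q->head = NULL;
  q->tail = &q->head;
}

/* Idempotent scheduling: re-sending an already queued event is a no-op,
 * which is what makes frequent announcements cheap to send. */
void ev_send(struct event_queue *q, struct event *e)
{
  if (e->scheduled)
    return;
  e->scheduled = 1;
  e->next = NULL;
  *q->tail = e;
  q->tail = &e->next;
}

/* Run at most `limit` events so that sockets and timers keep reasonable latency;
 * returns how many events were actually executed. */
int ev_run_limited(struct event_queue *q, int limit)
{
  int done = 0;
  while (q->head && (done < limit))
  {
    struct event *e = q->head;
    q->head = e->next;
    if (!q->head)
      q->tail = &q->head;
    e->scheduled = 0;           /* the hook may legally re-schedule itself */
    e->hook(e->data);
    done++;
  }
  return done;
}
```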

All of these (sockets, timers and events) are tightly bound to some domain.
Sockets typically belong to a protocol, timers and events to a protocol or table.
With the modular structure of BIRD, the easy and convenient approach to multithreading
is to get more IO loops, each bound to a specific domain, running their events, timers and
socket hooks in their threads.

## Message passing and loop entering

To request some work in another module, the standard way is to pass a message.
For this purpose, events have been modified to be sent to a given loop without
locking that loop's domain. In fact, every event queue has its own lock with a
low priority, allowing messages to be passed from almost any part of BIRD, and also
an assigned loop which executes the enqueued events. When a message is passed
to a queue executed by another loop, that target loop must be woken up, so we
have to know which loop to wake to avoid unnecessary delays. Then the target
loop opens its mailbox and processes the task in its own context.

The other way is direct access to another domain. This approach blocks the
appropriate loop from doing anything, and we call it *entering a birdloop* to
remember that the task must be fast and *leave the birdloop* as soon as possible.
Route import is done via direct access from protocols to tables; in large
setups with fast filters, this is a major point of contention (after filters
have been parallelized) and will be addressed in future optimization efforts.
Reconfiguration and interface updates also use direct access; more on that later.
In general, this approach should be avoided unless there are good reasons to use it.

Even though direct access is bad, sending lots of messages may be even worse.
Imagine one thousand post(wo)men, coming one by one every minute, ringing your
doorbell and delivering one letter each to you. Horrible! Asynchronous message
passing works exactly this way. After queuing the message, the source sends a
byte to a pipe to wake up the target loop to process the task. We could also
poll for messages periodically instead of waking up the targets, yet that would
add quite a lot of latency, which we also don't like.
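
The "byte to a pipe" wakeup can be sketched as follows. This is a simplified,
self-contained illustration with invented names, not the actual BIRD code; the read
end of the pipe is just one more file descriptor in the target loop's `poll()` set.

```c
/* Minimal wakeup pipe between loops; invented for illustration. */
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

struct wakeup {
  int fds[2];   /* fds[0]: polled by the target loop, fds[1]: written by senders */
};

int wakeup_init(struct wakeup *w)
{
  if (pipe(w->fds) < 0)
    return -1;
  /* Nonblocking so draining never blocks and pings never stall the sender. */
  fcntl(w->fds[0], F_SETFL, fcntl(w->fds[0], F_GETFL) | O_NONBLOCK);
  fcntl(w->fds[1], F_SETFL, fcntl(w->fds[1], F_GETFL) | O_NONBLOCK);
  return 0;
}

/* Sender side: called after enqueuing a message for the target loop. */
void wakeup_ping(struct wakeup *w)
{
  char c = 0;
  (void) write(w->fds[1], &c, 1);
}

/* Target side: wait for a ping (or a timeout), then drain and check the mailbox. */
int wakeup_wait(struct wakeup *w, int timeout_ms)
{
  struct pollfd pfd = { .fd = w->fds[0], .events = POLLIN };
  return poll(&pfd, 1, timeout_ms);
}

void wakeup_drain(struct wakeup *w)
{
  char buf[64];
  while (read(w->fds[0], buf, sizeof(buf)) > 0)
    ;
}
```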
Messages in BIRD typically don't suffer from this problem of sheer volume, and the
overhead is negligible compared to the overall CPU consumption. With one notable
exception: route import/export.

### Route export message passing

If we had to send a ping for every route we import to every exporting channel,
we'd spend more time pinging than doing anything else. Been there, seen
those unbelievable 80%-like figures in Perf output. Never again.

Route update is quite a complicated process. BIRD must handle large-scale
configurations with lots of importers and exporters. Therefore, a
triple-indirect delayed route announcement is employed:

1. First, when a channel imports a route by entering a loop, it sends an event
   to its own loop (no ping needed in that case). This operation is idempotent,
   thus for several routes in a row, only one event is enqueued. This reduces
   several route import announcements (even hundreds in case of massive BGP
   withdrawals) to one single event.
2. When the channel is done importing (or at least takes a coffee break and
   checks its mailbox), the scheduled event in its own loop is run, sending
   another event to the table's loop, saying basically *"Hey, table, I've just
   imported something."*. This event is also idempotent and further reduces
   route import announcements from multiple sources to one single event.
3. The table's announcement event is then executed from its loop, enqueuing export
   events for all connected channels, finally initiating route exports. As we
   already know, imports are done by direct access, therefore if protocols keep
   importing, export announcements are slowed down.
4. The actual data on what has been updated is stored in a table journal. This
   peculiar technique is used only for informing the exporting channels that
   *"there is something to do"*.

This may seem overly complicated, yet it should work and it seems to work. In
case of low load, all these notifications just come through smoothly. In case
of high load, it's common that multiple updates come for the same destination.
Delaying the exports allows the updates to settle down and exports just the
final result, reducing CPU load and export traffic.

## Cork

Route propagation is involved in yet another problem which has to be addressed.
In the old versions with synchronous route propagation, all the buffering
happened after exporting routes to BGP. When a packet arrived, all the work was
done in the BGP receive hook – parsing, importing into a table, running all the
filters and possibly sending to the peers. No more routes until the previous one
was done. This self-regulating mechanism doesn't work any more.

Route table import now returns immediately after inserting the route into a
table, creating a buffer there. These buffers have to be processed by other protocols'
export events. In large-scale configurations, one route import has to be
processed by hundreds, even thousands of exports. Unlimited imports are a major
cause of buffer bloating. This is even worse in configurations with pipes,
as these multiply the exports by propagating them all the way down to other
tables, eventually eating about twice as much memory as the single-threaded version.

There is therefore a cork to make this stop. Every table checks how many
exports it has pending, and when adding a new export to the queue, it may request
a cork, saying simply "please stop the flow for a while". When the export buffer
size drops low enough, the table uncorks.

On the other side, there are events and sockets with a cork assigned. When
trying to enqueue an event while the cork is applied, the event is instead put
into the cork's queue and released only when the cork is released. In case of
sockets, when read is indicated or when `poll` arguments are recalculated,
the corked socket is simply not checked for received packets, effectively
keeping them in the TCP queue and slowing down the flow until the cork is released.

The cork implementation is quite crude, rough and fragile. It may get some
rework while stabilizing the multi-threaded version of BIRD or we may even
completely drop it for some better mechanism. One of these candidates is this
kind of API:

* (table to protocol) please do not import
* (table to protocol) you may resume imports
* (protocol to table) not processing any exports
* (protocol to table) resuming export processing

Anyway, the cork works as intended in most cases, at least for now.
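
Very roughly, the current cork can be pictured as a counter of pending exports with
two watermarks. The sketch below is only a mental model; the names and limits are
invented here, and the real implementation is spread across the table and loop code
and, as said above, is likely to change.

```c
/* A toy model of the cork; invented for illustration only. */
#include <stdbool.h>

#define CORK_HIGH 8192   /* request a cork above this many pending exports */
#define CORK_LOW  2048   /* uncork again once the backlog drops below this */

struct cork {
  unsigned pending;      /* exports waiting in the table journal */
  bool applied;
};

/* Called by the table when it enqueues one more export. */
void cork_account_export(struct cork *c)
{
  if (++c->pending >= CORK_HIGH)
    c->applied = true;   /* "please stop the flow for a while" */
}

/* Called when an exporting channel has processed one export. */
void cork_account_done(struct cork *c)
{
  if (c->pending && (--c->pending <= CORK_LOW))
    c->applied = false;  /* safe to resume imports and socket reads */
}

/* Importers, event enqueuing and socket polling consult this before doing more work. */
bool cork_blocked(const struct cork *c)
{
  return c->applied;
}
```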

*It's a long road to version 2.1. This series of texts should document what
is changing, why we do it and how. The
[previous chapter](https://en.blog.nic.cz/2021/06/14/bird-journey-to-threads-chapter-2-asynchronous-route-export/)
shows how the route export had to change to allow parallel execution.
In the next chapter, some memory management details will be explained,
together with the reasons why memory management matters. Stay tuned!*