diff options
Diffstat (limited to 'doc/threads/02_asynchronous_export.md')
-rw-r--r-- | doc/threads/02_asynchronous_export.md | 437 |
1 files changed, 437 insertions, 0 deletions
diff --git a/doc/threads/02_asynchronous_export.md b/doc/threads/02_asynchronous_export.md new file mode 100644 index 00000000..5a10ce5a --- /dev/null +++ b/doc/threads/02_asynchronous_export.md @@ -0,0 +1,437 @@ +# BIRD Journey to Threads. Chapter 2: Asynchronous route export + +Route export is a core algorithm of BIRD. This chapter covers how we are making +this procedure multithreaded. Desired outcomes are mostly lower latency of +route import, flap dampening and also faster route processing in large +configurations with lots of export from one table. + +BIRD is a fast, robust and memory-efficient routing daemon designed and +implemented at the end of 20th century. We're doing a significant amount of +BIRD's internal structure changes to make it possible to run in multiple +threads in parallel. + +## How routes are propagated through BIRD + +In the [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/), you could learn how the route import works. We should +now extend that process by the route export. + +1. (In protocol code.) Create the route itself and propagate it through the + right channel by calling `rte_update`. +2. The channel runs its import filter. +3. New best route is selected. +4. For each channel: + 1. The channel runs its preexport hook and export filter. + 2. (Optionally.) The channel merges the nexthops to create an ECMP route. + 3. The channel calls the protocol's `rt_notify` hook. +5. After all exports are finished, the `rte_update` call finally returns and + the source protocol may do anything else. + +Let's imagine that all the protocols are running in parallel. There are two +protocols with a route prepared to import. One of those wins the table lock, +does the import and then the export touches the other protocol which must +either: + +* store the route export until it finishes its own imports, or +* have independent import and export parts. + +Both of these conditions are infeasible for common use. Implementing them would +make protocols much more complicated with lots of new code to test and release +at once and also quite a lot of corner cases. Risk of deadlocks is also worth +mentioning. + +## Asynchronous route export + +We decided to make it easier for protocols and decouple the import and export +this way: + +1. The import is done. +2. Best route is selected. +3. Resulting changes are stored. + +Then, after the importing protocol returns, the exports are processed for each +exporting channel. In future, this will be possible in parallel: Some protocols +may process the export directly after it is stored, other protocols will wait +until they finish another job. + +This eliminates the risk of deadlocks and all protocols' `rt_notify` hooks can +rely on their independence. There is only one question. How to store the changes? + +## Route export modes + +To find a good data structure for route export storage, we shall first know the +readers. The exporters may request different modes of route export. + +### Export everything + +This is the most simple route export mode. The exporter wants to know about all +the routes as they're changing. We therefore simply store the old route until +the change is fully exported and then we free the old stored route. + +To manage this, we can simply queue the changes one after another and postpone +old route cleanup after all channels have exported the change. The queue member +would look like this: + +``` +struct { + struct rte_storage *new; + struct rte_storage *old; +}; +``` + +### Export best + +This is another simple route export mode. We check whether the best route has +changed; if not, no export happens. Otherwise, the export is propagated as the +old best route changing to the new best route. + +To manage this, we could use the queue from the previous point by adding new +best and old best pointers. It is guaranteed that both the old best and new +best pointers are always valid in time of export as all the changes in them +must be stored in future changes which have not been exported yet by this +channel and therefore not freed yet. + +``` +struct { + struct rte_storage *new; + struct rte_storage *new_best; + struct rte_storage *old; + struct rte_storage *old_best; +}; +``` + +Anyway, we're getting to the complicated export modes where this simple +structure is simply not enough. + +### Export merged + +Here we're getting to some kind of problems. The exporting channel requests not +only the best route but also all routes that are good enough to be considered +ECMP-eligible (we call these routes *mergable*). The export is then just one +route with just the nexthops merged. Export filters are executed before +merging and if the best route is rejected, nothing is exported at all. + +To achieve this, we have to re-evaluate export filters any time the best route +or any mergable route changes. Until now, the export could just do what it wanted +as there was only one thread working. To change this, we need to access the +whole route list and process it. + +### Export first accepted + +In this mode, the channel runs export filters on a sorted list of routes, best first. +If the best route gets rejected, it asks for the next one until it finds an +acceptable route or exhausts the list. This export mode requires a sorted table. +BIRD users will know this export mode as `secondary` in BGP. + +For now, BIRD stores two bits per route for each channel. The *export bit* is set +if the route has been really exported to that channel. The *reject bit* is set +if the route was rejected by the export filter. + +When processing a route change for accepted, the algorithm first checks the +export bit for the old route. If this bit is set, the old route is that one +exported so we have to find the right one to export. Therefore the sorted route +list is walked best to worst to find a new route to export, using the reject +bit to evaluate only routes which weren't rejected in previous runs of this +algorithm. + +If the old route bit is not set, the algorithm walks the sorted route list best +to worst, checking the position of new route with respect to the exported route. +If the new route is worse, nothing happens, otherwise the new route is sent to +filters and finally exported if passes. + +### Export by feed + +To resolve problems arising from previous two export modes (merged and first accepted), +we introduce a way to process a whole route list without locking the table +while export filters are running. To achieve this, we follow this algorithm: + +1. The exporting channel sees a pending export. +2. *The table is locked.* +3. The channel counts all the routes for the given destination. +4. The channel stores pointers to that routes to a local array. +5. *The table is unlocked.* +6. The channel processes the local array of route pointers. +7. The pending export is marked as processed by this channel. + +After unlocking the table, the pointed-to routes are implicitly guarded by the +sole fact that the pending export has not yet been processed by all channels +and the cleanup routine never frees any resource related to a pending export. + +## Pending export data structure + +As the two complicated export modes use the export-by-feed algorithm, the +pending export data structure may be quite minimalistic. + +``` +struct rt_pending_export { + struct rt_pending_export * _Atomic next; /* Next export for the same destination */ + struct rte_storage *new; /* New route */ + struct rte_storage *new_best; /* New best route in unsorted table */ + struct rte_storage *old; /* Old route */ + struct rte_storage *old_best; /* Old best route in unsorted table */ + _Atomic u64 seq; /* Sequential ID (table-local) of the pending export */ +}; +``` + +To allow for squashing outdated pending exports (e.g. for flap dampening +purposes), there is a `next` pointer to the next export for the same +destination. This is also needed for the export-by-feed algorithm: if there are +several exports for one net at once, all of them are processed by one +export-by-feed automatically marked as done. + +We should also add several items into `struct channel`. + +``` + struct coroutine *export_coro; /* Exporter and feeder coroutine */ + struct bsem *export_sem; /* Exporter and feeder semaphore */ + struct rt_pending_export * _Atomic last_export; /* Last export processed */ + struct bmap export_seen_map; /* Keeps track which exports were already processed */ + u64 flush_seq; /* Table export seq when the channel announced flushing */ +``` + +To run the exports in parallel, `export_coro` is run and `export_sem` is +used for signalling new exports to it. The exporter coroutine also marks all +seen sequential IDs in its `export_seen_map` to make it possible to skip over +them if seen again. + +There is also a table cleaner routine +(see [previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/)) +which must cleanup also the pending exports after all the channels are finished with them. +To signal that, there is `last_export` working as a release point: the channel +guarantees that it doesn't touch the pointed-to pending export, nor any data +from it. + +The last tricky point is channel flushing. When any channel stops, all its +routes are automatically freed and withdrawals are exported if appropriate. +Until now, the routes could be flushed synchronously, anyway now flush has +several phases: + +1. Flush started (here the channel stores the `seq` of last current pending export). +2. All routes unlinked yet not freed, withdrawals pending. +3. Pending withdrawals processed and cleaned up. Channel may safely stop and free its structures. + +Finally, some additional information has to be stored in tables: + +``` + _Atomic byte export_used; /* Export journal cleanup scheduled */ \ + struct rt_pending_export * _Atomic first_export; /* First export to announce */ \ + byte export_scheduled; /* Export is scheduled */ + list pending_exports; /* List of packed struct rt_pending_export */ + struct fib export_fib; /* Auxiliary fib for storing pending exports */ + u64 next_export_seq; /* The next export will have this ID */ +``` + +The exports are: +1. Assigned the `next_export_seq` sequential ID, incrementing this item by one. +2. Put into `pending_exports` and `export_fib` for both sequential and by-destination access. +3. Signalled by setting `export_scheduled` and also `first_export` if not `NULL`. + +After processing several exports, `export_used` is set and route table maintenance +coroutine is woken up to possibly do cleanup. + +The `struct rt_pending_export` seems to be best allocated by requesting a whole +memory page, containing a common list node, a simple header and packed all the +structures in the rest of the page. This may save a significant amount of memory. +In case of congestion, there will be lots of exports and every spare kilobyte +counts. If BIRD is almost idle, the optimization does nothing on the overall performance. + +## Export algorithm + +As we have explained at the beginning, the current export algorithm is +table-driven. The table walks the channel list and propagates the update. +The now export algorithm is channel-driven. The table just indicates that it +has something new in export queue and the channel decides what to do with that and when. + +### Pushing an export + +When a table has something to export, it enqueues an appropriate instance of +`struct rt_pending_export`. Then it pings its maintenance coroutine +(`rt_event`) to notify the exporting channels about a new route. This coroutine +runs with only the table locked so the protocol may e.g. prepare the next route inbetween. +The maintenance coroutine, when it wakes up, walks the list of channels and +wakes their export coroutines. + +These two levels of asynchronicity are here for two reasons. + +1. There may be lots of channels (hundreds of them) and we don't want to + inefficiently iterate the list for each export. +2. The notification is going to wait until the route author finishes, hopefully + importing more routes and therefore allowing more exports to be processed at once. + +The table also stores the first and last exports for every destination. These +come handy mostly for cleanup purposes and for channel feeding. + +### Processing an export + +After these two pings, the channel finally knows that there is an export pending. +The channel checks whether there is a `last_export` stored. If yes, it proceeds +with the next one, otherwise it takes `first_export` from the table (which is +atomic for this reason). + +1. The channel checks its `export_seen_map` whether this export has been + already processed. If so, it skips to the next export. No action is needed. +2. The export chain is scanned for the current first and last export. This is + done by following the `next` pointer in the exports as accessing the + auxiliary export fib needs locking. +3. In the best-only and all-routes export mode, the changes are processed + without further table locking. +4. If export-by-feed is used, the current state of routes in table are fetched. + Then the table is unlocked and the changes are processed without further locking. +5. All processed exports are marked as seen. +6. The channel stores the `last_export` and returns to beginning.to wait for next export. + +## The full route life-cycle + +Until now, we're always assuming that the channels *just exist*. In real life, +any channel may go up or down and we must handle it, flushing the routes +appropriately and freeing all the memory just in time to avoid both +use-after-free and memory leaks. It should be noted here explicitly that BIRD +is written in C which has no garbage collector or other modern features alike. + +### Protocols and channels as viewed from a route + +BIRD consists effectively of protocols and tables. **Protocols** are active parts, +kind-of subprocesses manipulating routes and other data. **Tables** are passive, +serving as a database of routes. To connect a protocol to a table, a +**channel** is created. + +Every route has its `sender` storing the channel which has put the route into +the current table. This comes handy when the channel goes down to know which +routes to flush. + +Every route also has its `src`, a route source allocated by the protocol which +originated it first. This is kept when a route is passed through a *pipe*. + +Both `src` and `sender` must point to active protocols and channels as inactive +protocols and channels may be deleted in any time. + +### Protocol and channel lifecycle + +In the beginning, all channels and protocols are down. Until they fully start, +no route from them is allowed to any table. When the protocol and channel is up, +they may originate routes freely. However, the transitions are worth mentioning. + +### Channel startup and feed + +When protocols and channels start, they need to get the current state of the +appropriate table. Therefore, after a protocol and channel start, also the +export-feed coroutine is initiated. + +Tables can contain millions of routes. It may lead to long import latency if a channel +was feeding itself in one step. The table structure is (at least for now) too +complicated to be implemented as lockless, thus even read access needs locking. +To mitigate this, the feeds are split to allow for regular route propagation +with a reasonable latency. + +When the exports were synchronous, we simply didn't care and just announced the +exports to the channels from the time they started feeding. When making exports +asynchronous, it was crucial to avoid most of the possible race conditions +which could arise from simultaneous feed and export. As the feeder routines had +to be rewritten, it was a good opportunity to make this precise. + +Therefore, when a channel goes up, it also starts exports: + +1. Start the feed-export coroutine. +2. *Lock the table.* +3. Store the last export in queue. +4. Read a limited number of routes to local memory together with their pending exports. +5. If there are some routes to process: + 1. *Unlock the table.* + 2. Process the loaded routes. + 3. Set the appropriate pending exports as seen. + 4. *Lock the table* + 5. Go to 4. to continue feeding. +6. If there was a last export stored, load the next one to be processed. Otherwise take the table's `first_export`. +7. *Unlock the table.* +8. Run the exporter loop. + +*Note: There are some nuances not mentioned here how to do things in right +order to avoid missing some events while changing state. Precise description +would be too detailed for this text.* + +When the feeder loop finishes, it continues smoothly to process all the exports +that have been queued while the feed was running. Step 5.3 ensures that already +seen exports are skipped, steps 3 and 6 ensure that no export is missed. + +### Channel flush + +Protocols and channels need to stop for a handful of reasons, All of these +cases follow the same routine. + +1. (Maybe.) The protocol requests to go down or restart. +2. The channel requests to go down or restart. +3. The channel requests to stop export. +4. In the feed-export coroutine: + 1. At a designated cancellation point, check cancellation. + 2. Clean up local data. + 3. *Lock main BIRD context locked* + 4. If shutdown requested, switch the channel to *flushing* state and request table maintenance. +5. In the table maintenance coroutine: + 1. Walk across all channels and check them for *flushing* state. + 2. Walk across the table (split to allow for low latency updates) and + generate a withdrawal for each route sent by the flushing channels. + 3. Wait until all the withdrawals are processed. + 4. Mark the flushing channels as *down* and eventually proceed to the protocol shutdown or restart. + +There is also a separate routine that handles bulk cleanup of `src`'s which +contain a pointer to the originating protocol. This routine may get reworked in +future, yet for now it is good enough. + +### Route export cleanup + +Last but not least is the export cleanup routine. Until now, the withdrawn +routes were exported synchronously and freed directly after the import was +done. This is not possible anymore. The export is stored and the import returns +to let the importing protocol continue its work. We therefore need a routine to +cleanup the withdrawn routes and also the processed exports. + +First of all, this routine refuses to cleanup when any export is feeding or +shutting down. This may change in future, anyway for now we aren't sure about +possible race conditions. + +Anyway, when all the exports are in a steady state, the routine works as follows: + +1. Walk the active exports and find a minimum (oldest export) between their `last_export` values. +2. If there is nothing to clear between the actual oldest export and channels' oldest export, do nothing. +3. Find the table's new `first_export` and set it. Now there is nobody pointing to the old exports. +4. Free the withdrawn routes. +5. Free the old exports, removing them also from the first-last list of exports for the same destination. + +## Results of these changes + +This step is a first major step to move forward. Using just this version may be +still as slow as the single-threaded version, at least if your export filters are trivial. +Anyway, the main purpose of this step is not an immediate speedup. It is more +of a base for the next steps: + +* Unlocking of pipes should enable parallel execution of all the filters on + pipes, limited solely by the principle *one thread for every direction of + pipe*. +* Conversion of CLI's `show route` to the new feed-export coroutines should + enable faster table queries. Moreover, this approach will allow for + better splitting of model and view in CLI with a good opportunity to + implement more output formats, e.g. JSON. +* Unlocking of kernel route synchronization should fix latency issues induced + by long-lasting kernel queries. +* Partial unlocking of BGP packet processing should finally allow for parallel + execution in almost all phases of BGP route propagation. + +The development is now being done mostly in the branch `alderney`. If you asked +why such strange branch names like `jersey`, `guernsey` and `alderney`, here is +a kind-of reason. Yes, these names could be named `mq-async-export`, +`mq-async-export-new`, `mq-async-export-new-new`, `mq-another-async-export` and +so on. That's so ugly, isn't it? + +Also why so many branches? The development process is quite messy. BIRD's code +heavily depends on single-threaded approach. This is exceptionally good for +performance, as long as you have one thread only. On the other hand, lots of +these assumptions are not documented so in many cases one desired +change yields a chain of other unforeseen changes which must precede. This brings +lots of backtracking, branch rebasing and other Git magic. There is always a +can of worms somewhere in the code. + +*It's still a long road to the version 2.1. This series of texts should document +what is needed to be changed, why we do it and how. The +[previous chapter](https://en.blog.nic.cz/2021/03/23/bird-journey-to-threads-chapter-1-the-route-and-its-attributes/) +showed the necessary changes in route storage. In the next chapter, we're going +to describe how the coroutines are implemented and what kind of locking system +are we employing to prevent deadlocks. Stay tuned!* |