author | Maria Matejka <mq@ucw.cz> | 2022-02-07 22:27:42 +0100 |
---|---|---|
committer | Maria Matejka <mq@ucw.cz> | 2022-02-07 22:35:41 +0100 |
commit | a6fc31f153f4d6ad2fa7f63a4ba137fc263f4200 (patch) | |
tree | 233e4d6a9374b2e68cf2f857c3526bf775990f06 /doc/threads/03b_performance.md | |
parent | 3f6462ad35ca49dc4898546bba59147792481b25 (diff) |
Blogpost about performance + data.
Diffstat (limited to 'doc/threads/03b_performance.md')
-rw-r--r-- | doc/threads/03b_performance.md | 153 |
1 files changed, 153 insertions, 0 deletions
diff --git a/doc/threads/03b_performance.md b/doc/threads/03b_performance.md
new file mode 100644
index 00000000..07fd5bb0
--- /dev/null
+++ b/doc/threads/03b_performance.md
@@ -0,0 +1,153 @@

# BIRD Journey to Threads. Chapter 3½: Route server performance

All the work on multithreading has to be justified by performance improvements.
This chapter compares the times reached by versions 3.0-alpha0 and 2.0.8,
showing some data and thinking about them.

BIRD is a fast, robust and memory-efficient routing daemon designed and
implemented at the end of the 20th century. We're making significant changes to
BIRD's internal structure to make it run in multiple threads in parallel.

## Testing setup

There are two machines in one rack. One of them simulates the peers of
a route server, the other runs BIRD in a route server configuration. First, the
peers are launched, then the route server is started and one of the peers
measures the convergence time until routes are fully propagated. The other peers
drop all incoming routes.

There are four configurations: *Single*, where all BGPs are directly
connected to the main table; *Multi*, where every BGP has its own table and
filters are applied on pipes between them; and finally *Imex* and *Mulimex*,
which are effectively *Single* and *Multi* with the auxiliary import and export
tables also enabled for all BGPs.

All of these use the same short dummy filter for route import to provide a
consistent load. This filter includes no meaningful logic; it's just some dummy
data to keep the CPU busy with no memory contention. Real filters also do not
suffer from memory contention, with the exception of ROA checks. Optimization of
ROA is a task for another day.

There is also other stuff in BIRD waiting for performance assessment. As the
(by far) most demanding setup of BIRD is a route server in an IXP, we chose to
optimize and measure BGP and filters first.

The hardware used for testing is an Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
with 8 physical cores, two hyperthreads on each. Memory is 32 GB RAM.

## Test parameters and statistics

A BIRD setup may scale on two major axes: the number of peers and the number of
routes / destinations. *(There are more axes, e.g. complexity of filters,
routes / destinations ratio, topology size in IGP.)*

Scaling the test on route count is easy, just by adding more routes to the
testing peers. Currently, the largest test data I feed BIRD with is about 2M
routes for around 800K destinations, due to memory limitations. The routes /
destinations ratio is around 2.5 in this testing setup, trying to get close to
real-world route servers.[^1]

[^1]: BIRD can handle much more in real life; the actual software limit is currently
    a 32-bit unsigned route counter in the table structure. Hardware capabilities
    are already there and checking how BIRD handles more than 4G routes is
    certainly going to be a real thing soon.

Scaling the test on peer count is easy, until you get to higher numbers. When I
was setting up the test, I configured one Linux network namespace for each peer,
connecting them by virtual links to a bridge and by a GRE tunnel to the other
machine. This works well for 10 peers but setting up and removing 1000 network
namespaces takes more than 15 minutes in total. (Note to myself: try this with
a newer Linux kernel than 4.9.)

Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
With 1000 peers, version 3.0-alpha0 generates more than 600 Mbps of traffic at
peak, which is just about the bandwidth of the whole setup. I'm planning to
design a better test setup with fewer chokepoints in the future.

## Hypothesis

There are two versions subjected to the test. One of these is `2.0.8` as an
initial testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP
is implemented there.

The major problem of large-scale BIRD setups is convergence time on startup. We
assume that a multithreaded version should reduce the overall convergence time,
at most by a factor equal to the number of cores involved. Here we have 16
hyperthreads, so in theory we should reduce the times up to 16-fold, yet this is
almost impossible as a non-negligible amount of time is spent in bottleneck
code like best route selection or some cleanup routines. This code has become a
bottleneck because the other parts were made parallel.

## Data

Four charts are included here, one for each setup. All axes have a
logarithmic scale. The route count on the X axis is the total route count in
the tested BIRD; different color shades belong to different versions and peer
counts. Time is plotted on the Y axis.

Raw data is available in Git, as well as the chart generator. Strange results
caused by testbed bugs are already omitted.

There is also a line drawn at the 2-second mark. Convergence is checked by
periodically requesting `birdc show route count` on one of the peers, and BGP
peers also have a 1-second connect delay time (default is 5 seconds). All
measured times shorter than 2 seconds are highly unreliable.

![Plotted data for Single](03b_stats_2d_single.png)
[Plotted data for Single in PDF](03b_stats_2d_single.pdf)

The single-table setup has times reduced to about 1/8 when comparing 3.0-alpha0
to 2.0.8. Speedup for the 10-peer setup is slightly worse than expected and
there is still some room for improvement, yet 8-fold speedup on 8 physical cores
and 16 hyperthreads is good for me now.

The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
configuration converges after almost two hours, with the speed of
route processing steadily decreasing until only several routes per second are
done. Version 3.0-alpha0 is memory-bloating for some non-obvious reason and
couldn't fit into 32G RAM. There is definitely some work ahead to stabilize
BIRD behavior with extreme setups.

![Plotted data for Multi](03b_stats_2d_multi.png)
[Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)

The multi-table setup got the same speedup as the single-table setup, no big
surprise. The largest cases were not tested at all as they don't fit well into
32G RAM even with 2.0.8.

![Plotted data for Imex](03b_stats_2d_imex.png)
[Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)

![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
[Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)

Setups with import / export tables are also sped up, by a factor of
about 6-8. Data on the largest setups (2M routes) show some strangely
inefficient behaviour. Considering that both single-table and multi-table
setups yield similar performance data, there is probably some unwanted
inefficiency in the auxiliary table code.

## Conclusion

BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
some speedup in every testcase and the code is stable enough to handle typical
use cases. Some test scenarios ran out of available memory and there is
definitely a lot of work to do to stabilize this, yet for now it makes no sense
to postpone this alpha version any more.

We don't recommend upgrading a production machine to this version
yet; however, if you have a test setup, getting version 3.0-alpha0 there and
reporting bugs is very welcome.

Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its number of
threads. It will spawn at least one thread for every BGP, RPKI and Pipe
protocol, one thread for every routing table (including auxiliary tables) and
possibly several more. It's up to the machine administrator to set up a limit on
CPU core usage by BIRD. When running with many threads and protocols, you may
also need to raise the file descriptor limit: BIRD uses 2 file descriptors per
thread for internal messaging.

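
The thread and file descriptor counts from the notice above can be turned into a rough lower bound. The following is a minimal sketch in Python, not anything shipped with BIRD. The function names `min_threads` and `min_messaging_fds` are hypothetical, and the example setup assumes one import and one export table per BGP, which the text implies but does not spell out. It counts only what the notice states, so the "possibly several more" threads and the descriptors used for sockets and logs are left out. Treat the result as a floor.

```python
def min_threads(bgp: int, rpki: int, pipes: int, tables: int) -> int:
    """Lower bound on threads: one per BGP, RPKI and Pipe protocol,
    plus one per routing table (auxiliary tables included)."""
    return bgp + rpki + pipes + tables


def min_messaging_fds(threads: int) -> int:
    """Lower bound on file descriptors used for internal messaging:
    two per thread, as stated in the notice."""
    return 2 * threads


if __name__ == "__main__":
    # Hypothetical Imex-like setup: 1000 BGP peers on one main table,
    # each BGP assumed to have one import and one export table.
    bgp = 1000
    tables = 1 + 2 * bgp
    threads = min_threads(bgp=bgp, rpki=0, pipes=0, tables=tables)
    print(f"threads >= {threads}, messaging fds >= {min_messaging_fds(threads)}")
```

Comparing the second number with the output of `ulimit -n` in the shell that starts BIRD gives a quick hint whether the file descriptor limit needs raising.
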

*It's a long road to version 3. By releasing this alpha version, we'd like
to encourage every user to try this preview. If you want to know more about
what is being done and why, you may also check the full
[blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*