summaryrefslogtreecommitdiff
path: root/doc/threads/03b_performance.md
diff options
context:
space:
mode:
authorMaria Matejka <mq@ucw.cz>2022-02-07 22:27:42 +0100
committerMaria Matejka <mq@ucw.cz>2022-02-07 22:35:41 +0100
commita6fc31f153f4d6ad2fa7f63a4ba137fc263f4200 (patch)
tree233e4d6a9374b2e68cf2f857c3526bf775990f06 /doc/threads/03b_performance.md
parent3f6462ad35ca49dc4898546bba59147792481b25 (diff)
Blogpost about performance + data.
Diffstat (limited to 'doc/threads/03b_performance.md')
-rw-r--r--doc/threads/03b_performance.md153
1 files changed, 153 insertions, 0 deletions
diff --git a/doc/threads/03b_performance.md b/doc/threads/03b_performance.md
new file mode 100644
index 00000000..07fd5bb0
--- /dev/null
+++ b/doc/threads/03b_performance.md
@@ -0,0 +1,153 @@
+# BIRD Journey to Threads. Chapter 3½: Route server performance
+
+All the work on multithreading shall be justified by performance improvements.
+This chapter tries to compare times reached by version 3.0-alpha0 and 2.0.8,
+showing some data and thinking about them.
+
+BIRD is a fast, robust and memory-efficient routing daemon designed and
+implemented at the end of 20th century. We're doing a significant amount of
+BIRD's internal structure changes to make it run in multiple threads in parallel.
+
+## Testing setup
+
+There are two machines in one rack. One of these simulates the peers of
+a route server, the other runs BIRD in a route server configuration. First, the
+peers are launched, then the route server is started and one of the peers
+measures the convergence time until routes are fully propagated. Other peers
+drop all incoming routes.
+
+There are four configurations. *Single* where all BGPs are directly
+connected to the main table, *Multi* where every BGP has its own table and
+filters are done on pipes between them, and finally *Imex* and *Mulimex* which are
+effectively *Single* and *Multi* where all BGPs have also their auxiliary
+import and export tables enabled.
+
+All of these use the same short dummy filter for route import to provide a
+consistent load. This filter includes no meaningful logic, it's just some dummy
+data to run the CPU with no memory contention. Real filters also do not suffer from
+memory contention, with an exception of ROA checks. Optimization of ROA is a
+task for another day.
+
+There is also other stuff in BIRD waiting for performance assessment. As the
+(by far) most demanding setup of BIRD is route server in IXP, we chose to
+optimize and measure BGP and filters first.
+
+Hardware used for testing is Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz with 8
+physical cores, two hyperthreads on each. Memory is 32 GB RAM.
+
+## Test parameters and statistics
+
+BIRD setup may scale on two major axes. Number of peers and number of routes /
+destinations. *(There are more axes, e.g.: complexity of filters, routes /
+destinations ratio, topology size in IGP)*
+
+Scaling the test on route count is easy, just by adding more routes to the
+testing peers. Currently, the largest test data I feed BIRD with is about 2M
+routes for around 800K destinations, due to memory limitations. The routes /
+destinations ratio is around 2.5 in this testing setup, trying to get close to
+real-world routing servers.[^1]
+
+[^1]: BIRD can handle much more in real life, the actual software limit is currently
+ a 32-bit unsigned route counter in the table structure. Hardware capabilities
+ are already there and checking how BIRD handles more than 4G routes is
+ certainly going to be a real thing soon.
+
+Scaling the test on peer count is easy, until you get to higher numbers. When I
+was setting up the test, I configured one Linux network namespace for each peer,
+connecting them by virtual links to a bridge and by a GRE tunnel to the other
+machine. This works well for 10 peers but setting up and removing 1000 network
+namespaces takes more than 15 minutes in total. (Note to myself: try this with
+a newer Linux kernel than 4.9.)
+
+Another problem of test scaling is bandwidth. With 10 peers, everything is OK.
+With 1000 peers, version 3.0-alpha0 does more than 600 Mbps traffic in peak
+which is just about the bandwidth of the whole setup. I'm planning to design a
+better test setup with less chokepoints in future.
+
+## Hypothesis
+
+There are two versions subjected to the test. One of these is `2.0.8` as an
+initial testpoint. The other is version 3.0-alpha0, named `bgp` as parallel BGP
+is implemented there.
+
+The major problem of large-scale BIRD setups is convergence time on startup. We
+assume that a multithreaded version should reduce the overall convergence time,
+at most by a factor equal to number of cores involved. Here we have 16
+hyperthreads, in theory we should reduce the times up to 16-fold, yet this is
+almost impossible as a non-negligible amount of time is spent in bottleneck
+code like best route selection or some cleanup routines. This has become a
+bottleneck by making other parts parallel.
+
+## Data
+
+Four charts are included here, one for each setup. All axes have a
+logarithmic scale. The route count on X scale is the total route count in
+tested BIRD, different color shades belong to different versions and peer
+counts. Time is plotted on Y scale.
+
+Raw data is available in Git, as well as the chart generator. Strange results
+caused by testbed bugs are already omitted.
+
+There is also a line drawn on a 2-second mark. Convergence is checked by
+periodically requesting `birdc show route count` on one of the peers and BGP
+peers have also a 1-second connect delay time (default is 5 seconds). All
+measured times shorter than 2 seconds are highly unreliable.
+
+![Plotted data for Single](03b_stats_2d_single.png)
+[Plotted data for Single in PDF](03b_stats_2d_single.pdf)
+
+Single-table setup has times reduced to about 1/8 when comparing 3.0-alpha0 to
+2.0.8. Speedup for 10-peer setup is slightly worse than expected and there is
+still some room for improvement, yet 8-fold speedup on 8 physical cores and 16
+hyperthreads is good for me now.
+
+The most demanding case with 2M routes and 1k peers failed. On 2.0.8, my
+configuration converges after almost two hours on 2.0.8, with the speed of
+route processing steadily decreasing until only several routes per second are
+done. Version 3.0-alpha0 is memory-bloating for some non-obvious reason and
+couldn't fit into 32G RAM. There is definitely some work ahead to stabilize
+BIRD behavior with extreme setups.
+
+![Plotted data for Multi](03b_stats_2d_multi.png)
+[Plotted data for Multi in PDF](03b_stats_2d_multi.pdf)
+
+Multi-table setup got the same speedup as single-table setup, no big
+surprise. Largest cases were not tested at all as they don't fit well into 32G
+RAM even with 2.0.8.
+
+![Plotted data for Imex](03b_stats_2d_imex.png)
+[Plotted data for Imex in PDF](03b_stats_2d_imex.pdf)
+
+![Plotted data for Mulimex](03b_stats_2d_mulimex.png)
+[Plotted data for Mulimex in PDF](03b_stats_2d_mulimex.pdf)
+
+Setups with import / export tables are also sped up by a factor
+about 6-8. Data on largest setups (2M routes) are showing some strangely
+ineffective behaviour. Considering that both single-table and multi-table
+setups yield similar performance data, there is probably some unwanted
+inefficiency in the auxiliary table code.
+
+## Conclusion
+
+BIRD 3.0-alpha0 is a good version for preliminary testing in IXPs. There is
+some speedup in every testcase and code stability is enough to handle typical
+use cases. Some test scenarios went out of available memory and there is
+definitely a lot of work to stabilize this, yet for now it makes no sense to
+postpone this alpha version any more.
+
+We don't recommend upgrading a production machine to this version
+yet, anyway if you have a test setup, getting version 3.0-alpha0 there and
+reporting bugs is much welcome.
+
+Notice: Multithreaded BIRD, at least in version 3.0-alpha0, doesn't limit its number of
+threads. It will spawn at least one thread per every BGP, RPKI and Pipe
+protocol, one thread per every routing table (including auxiliary tables) and
+possibly several more. It's up to the machine administrator to setup a limit on
+CPU core usage by BIRD. When running with many threads and protocols, you may
+need also to raise the filedescriptor limit: BIRD uses 2 filedescriptors per
+every thread for internal messaging.
+
+*It's a long road to the version 3. By releasing this alpha version, we'd like
+to encourage every user to try this preview. If you want to know more about
+what is being done and why, you may also check the full
+[blogpost series about multithreaded BIRD](https://en.blog.nic.cz/2021/03/15/bird-journey-to-threads-chapter-0-the-reason-why/). Thank you for your ongoing support!*