This is part 3 of the series of posts on llcpp. If you’ve just started here, I suggest you read the intro post for some context.

Benchmarks, what are we measuring?

As mentioned in the previous posts, the main selling point of most of the logging frameworks I’ve considered is performance. From their repositories, we can learn how they benchmark their performance and hopefully understand the consequences of using each candidate. The first step is to decide on a test case, which involves the following variables:

  • What logging framework are we benchmarking?
  • Which features of the framework are we benchmarking?
  • What underlying abstractions are we building upon?
  • What exactly are we measuring?

The first point is pretty straightforward: pick a group of frameworks you wish to compare. Usually this includes the author’s own framework. Unfortunately, I have not found a single (up-to-date) benchmark authored by anyone who is not themselves the author of a logging framework. This creates a very competitive atmosphere and discussion; and while that is a positive thing when it leads to innovation, it often leads to biased discussion at best - and to forgetting the actual customers at worst. I admit that by entering this discussion with my own framework, I’m now part of the problem too. But I hope someone reading this will be encouraged to create a neutral comparison that will drive the framework authors forward.

After we choose our candidates, we decide which features we are benchmarking. This step is, unfortunately, often overlooked, even though it is crucial to understand the differences in features between the frameworks we’re comparing. One key example is a benchmark I saw which compares two frameworks that handle the log’s timestamp differently: one converts it to a human-readable format (with localtime(3)) and the other leaves it as an integer. This difference may seem subtle, but the extra work done by localtime(3) can be the reason for not-so-subtle differences in performance. This can lead to questionable comparisons - obviously, the author of the localtime(3)-formatting framework can remove it (or make it optional) very easily, which may invalidate the results. Another example is mixing asynchronous frameworks that allow passing non-literal strings as arguments (which means they have to copy them) with those that don’t. Users may pick the framework that runs faster, only to realize later that they’re missing a feature that’s essential for them.
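To make that cost difference concrete, here is a minimal sketch (not taken from any of the benchmarked frameworks; the function names are illustrative) contrasting the two timestamp strategies on Linux: converting to a human-readable string via localtime(3) and strftime(3), versus writing the raw integer value.

```cpp
#include <cstddef>
#include <cstdio>
#include <ctime>

// Strategy A: human-readable date/time (extra work: timezone lookup + string formatting).
std::size_t format_human_readable(char* buf, std::size_t len, std::time_t now) {
    std::tm tm_buf{};
    localtime_r(&now, &tm_buf);                        // timezone conversion
    return std::strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm_buf);
}

// Strategy B: raw integer timestamp (a plain integer-to-string conversion).
int format_raw(char* buf, std::size_t len, std::time_t now) {
    return std::snprintf(buf, len, "%lld", static_cast<long long>(now));
}
```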

As I’ve mentioned earlier, understanding the underlying abstractions for each framework and test case is extremely important for a benchmark. While it’s very hard to catch every little side effect, we should actively try to find the ones that have the biggest impact. My key example here revolves around choosing the filesystem that stores the log files. Some benchmarks write to your current working directory, some write to /tmp (which may be on the same filesystem as your CWD) and others write to /dev/null. While it may be relevant to benchmark any or all of these options, you should try to understand how this decision affects each framework. Some may seem faster on /dev/null because they’re optimized for a “happy flow” where I/O calls rarely block, but on actual real-life workloads they may not be better.

After we have all of the above figured out, we need to decide what we are measuring. The benchmarks used by the frameworks’ authors measured at least one of the following scenarios (a minimal measurement sketch follows the list):

  • For a given work-load (X total number of logs, Y producer threads, async/sync method of operation), measure the total time to completion.
  • For a given work-load, measure the average log call site latency.
  • For a given work-load, measure the worst log call site latency.
  • For a given work-load, measure a histogram of the log call site latency (and pick the higher percentiles).

While some hold a strong opinion that only worst-case latency should be measured, it really depends on the requirements and design of the application. Obviously, hard real-time systems must make precise predictions about this property, but most systems don’t have such constraints. In fact, most operating systems will work against this measurement because of their preemptive nature. I have tried getting around this preemption issue by playing with the Linux kernel scheduler parameters (sched_setscheduler(..., SCHED_FIFO, ...)) and by measuring with clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...), but both showed inconsistent results.
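For reference, this is roughly what the CLOCK_THREAD_CPUTIME_ID measurement looks like: it counts only CPU time consumed by the calling thread, so time spent preempted is excluded. This is an illustrative sketch (Linux/POSIX), not the benchmark’s actual code.

```cpp
#include <cstdio>
#include <time.h>

// Per-thread CPU time in nanoseconds; time spent preempted is not counted.
static long long thread_cpu_ns() {
    timespec ts{};
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main() {
    long long start = thread_cpu_ns();
    // ... the log call(s) being measured would go here ...
    long long end = thread_cpu_ns();
    std::printf("thread CPU time: %lld ns\n", end - start);
}
```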

An important thing to remember about call site latency benchmarks is that they do not reflect the total CPU time the framework has used. Async loggers show relatively low call site latency, but if you measure the total time to completion (that is, including actually writing the log) you don’t necessarily get better performance from them. Therefore, if your issue is overall system performance (i.e. high CPU load over extended periods of time) rather than specific call site latency (i.e. quick bursts of high computational load), you should give proper weight to the first scenario over the last three.

My Benchmark, Results and Comparison

While going over the benchmarks in the candidate frameworks, the one implemented in NanoLog was my favorite (you can find it here). I liked it because it compared the most frameworks (NanoLog, spdlog, g3log and reckless) and showed the most information on call site latency (50th, 75th, 90th, 99th and 99.9th percentiles, worst case and average). Furthermore, it benchmarks four different workloads by running the benchmark with 1, 2, 3 and 4 threads at a time. There were a couple of things missing, which I’ve added for my benchmarks:

  • Running them under time(1) to measure the total CPU time.
  • Adding MAL log.
  • (Obviously) Adding llcpp.

One important thing to note about MAL log is that it prints the log’s timestamp rather than a human-readable date/time. This means it doesn’t have to call localtime or gmtime, which reduces some overhead. While we’re on the subject of time formatting, note that NanoLog uses gmtime (the faster version, no timezone conversion), spdlog lets you choose between the two, and g3log uses localtime. In llcpp I’ve implemented two prefixes: time_format_prefix, which lets you choose between gmtime and localtime, and nanosec_time_prefix, which outputs a nanosecond timestamp. In my comparison I’ve used llcpp both with time_format_prefix (localtime mode) and with nanosec_time_prefix, so you can get a sense of both options.

When planning my benchmarking, I wanted to see how each framework handles the NanoLog benchmark under two different filesystem scenarios: /dev/shm/ (a tmpfs; I would rather have used /dev/null, but not all frameworks allow that) and /tmp (which, in Ubuntu, is mapped to your disk; in my case an SSD). I did this for two reasons: I wanted to get a better sense of how much non-filesystem-related work each framework is doing, and I wanted to see whether that work actually matters when I/O is the bottleneck. I compiled the benchmark with clang 5.0.1 and the --std=c++1z -flto -O3 flags. Unfortunately, I could not get reckless to compile. Times are in nanoseconds, unless stated otherwise. I’m using NanoLog’s guaranteed logger.

1 Thread Logging To /tmp

| Logger | 50th | 75th | 90th | 99th | 99.9th | Worst | Average | Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|
| llcpp (timestamp) | 165 | 171 | 211 | 2058 | 6313 | 149432 | 230.360544 | 0.42 |
| llcpp (date/time) | 184 | 195 | 216 | 2153 | 3718 | 1099523 | 260.166040 | 0.53 |
| mini-async-logger | 286 | 335 | 361 | 1348 | 2884 | 6204357 | 324.418935 | 1.32 |
| NanoLog | 211 | 248 | 291 | 1099 | 3809 | 2493347 | 798.895457 | 3.01 |
| spdlog | 729 | 751 | 780 | 973 | 3277 | 1057879 | 744.825963 | 0.95 |
| g3log | 2300 | 2819 | 3230 | 5603 | 28176 | 2249642 | 2490.365190 | 3.86 |

4 Threads Logging To /tmp

| Logger | 50th | 75th | 90th | 99th | 99.9th | Worst | Average | Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|
| llcpp (timestamp) | 173 | 191 | 242 | 7501 | 36031 | 2668978 | 458.101174 | 1.61 |
| llcpp (date/time) | 220 | 227 | 270 | 15515 | 46522 | 2162221 | 567.410498 | 1.69 |
| mini-async-logger | 414 | 581 | 863 | 2872 | 24384 | 5858754 | 612.069366 | 5.52 |
| NanoLog | 338 | 404 | 1120 | 1859 | 24906 | 1599487 | 845.320061 | 12.81 |
| spdlog | 1100 | 1193 | 1238 | 1373 | 14022 | 1312104 | 1032.874598 | 1.91 |
| g3log | 1904 | 2452 | 3171 | 9600 | 37283 | 6174296 | 2428.927793 | 16.29 |

1 Thread Logging To /dev/shm

| Logger | 50th | 75th | 90th | 99th | 99.9th | Worst | Average | Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|
| llcpp (timestamp) | 163 | 168 | 209 | 1262 | 2357 | 153668 | 200.760008 | 0.38 |
| llcpp (date/time) | 224 | 234 | 287 | 1409 | 3064 | 166864 | 277.915158 | 0.47 |
| mini-async-logger | 330 | 418 | 465 | 1389 | 4967 | 115102 | 368.982075 | 1.34 |
| NanoLog | 211 | 234 | 274 | 1077 | 1938 | 296433 | 259.825371 | 2.2 |
| spdlog | 747 | 776 | 882 | 1081 | 9157 | 156415 | 787.803761 | 1.04 |
| g3log | 2436 | 2883 | 3318 | 5120 | 28752 | 240359 | 2590.587821 | 2.84 |

4 Threads Logging To /dev/shm

| Logger | 50th | 75th | 90th | 99th | 99.9th | Worst | Average | Total Time (seconds) |
|---|---|---|---|---|---|---|---|---|
| llcpp (timestamp) | 169 | 173 | 209 | 2626 | 31151 | 6346309 | 340.374203 | 1.01 |
| llcpp (date/time) | 346 | 705 | 1183 | 3747 | 25537 | 6118893 | 709.675622 | 1.37 |
| mini-async-logger | 434 | 695 | 978 | 2528 | 20945 | 6228416 | 670.505850 | 5.37 |
| NanoLog | 350 | 410 | 1048 | 1806 | 20061 | 10900055 | 741.737527 | 8.86 |
| spdlog | 870 | 1160 | 1217 | 1392 | 12203 | 3034379 | 983.986584 | 1.87 |
| g3log | 1967 | 2674 | 3273 | 9856 | 36719 | 277528 | 2488.065324 | 11.14 |

Worst Case Latency

I feel that it’s important to discuss the worst case statistic. Measuring the worst case latency for log calls can be very important for many applications. However, it’s a very tricky thing to measure. When I ran these benchmarks, I could not get a consistent worst case number. Each framework showed between 10,000 ns and 1,000,000 ns in different runs. The numbers I present above are the worst cases over several benchmark runs, but I’m not sure they mean that much.

I thought the huge variance could occur due to some hardware or filesystem state which I can’t control, so I tried benchmarking against /dev/shm and /dev/null. Unfortunately, that didn’t decrease the variance either! The next thing I thought of was OS preemption, which could account for the variance when the system is under load (which it is). This led me to try running the tests with Linux’s FIFO scheduler; I set the process’s priority to 99 and sched_rt_runtime_us to 990000 (-1 crashed the system). To my surprise, all of the frameworks except llcpp took far longer to complete than with the regular scheduler. I did get a smaller variance in worst case latencies for llcpp, but without anything to compare it to, it’s probably counter-productive to show it.
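For completeness, this is roughly the scheduler setup described above: switching the current process to SCHED_FIFO with priority 99 (sched_rt_runtime_us was adjusted separately via /proc/sys/kernel/sched_rt_runtime_us). It’s an illustrative sketch, not the actual benchmark harness, and it requires root privileges to run.

```cpp
#include <cstdio>
#include <sched.h>

int main() {
    sched_param param{};
    param.sched_priority = 99;                         // highest real-time FIFO priority
    // 0 = the calling process; requires CAP_SYS_NICE / root.
    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0) {
        std::perror("sched_setscheduler");
        return 1;
    }
    // ... run the benchmark workload under the FIFO scheduler ...
    return 0;
}
```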

I have to admit that I’m very puzzled by this issue. If you think I’ve missed anything or have some thoughts about this - please let me know!

Some Thoughts On The Results

First of all, I’d like to encourage you to make your own benchmarks. My tests (and any test by a framework author) are subjective. Also, regardless of their objectivity, they may not even measure the things you care about or the scenarios your application faces.

I was actually very surprised to see how well llcpp managed to perform. Without async support, I was expecting to see worse performance across the board on call latency. In terms of total CPU time, I thought I’d see a difference, but I wasn’t expecting such a dramatic decrease. In this sense, spdlog is doing a very good job as well.

While llcpp, NanoLog and MAL are doing very well in the mid-to-upper percentiles, the 99th and 99.9th percentiles aren’t very promising for any candidate. I feel like the issues with the worst case latency measurements appear here as well, just more subtly.

Afterword

In the next post I’d like to share some thoughts on future ideas for llcpp and what I’ve learned from this experiment.

You can read it here: Part 4.