
How to write open source SIMD code used by millions of people: a gameplan

The overall strategy was fairly simple and the same for all the projects I contributed to:

  0. See where you can make improvements and whether a winning algorithm already exists.
  1. Get familiar with the project and read its code.
  2. Write scalar functions.
  3. Write a lot of tests (and update them as needed).
  4. Write benchmarks.
  5. Write SIMD code.

I think most of it is self-explanatory but would still like to add a couple of details:

Step #0: See Where You Can Make Improvements

I term it step zero because I had little to do with it myself. The idea is to find something where we can make improvements and to check whether there is already an algorithm we can leverage. Daniel suggested these projects to me, and I thought they were really cool.

It can be a very old library. In the case of our Latin1 ⇔ UTFX transcoding, the most popular libraries were ICU and iconv, neither of which used SIMD instructions.[1]

But it can also be something that is already SIMD-accelerated, as was the case with our C# projects. There wasn't a direct equivalent to Latin1 ⇔ UTFX transcoding, but a lot of the code from the UTF8 ⇔ UTF16 part of simdutf could be adapted to fit the new project.

The algorithms used in SimdUnicode and SimdBase64 already had fairly direct C++ implementations and had already proven successful in other languages. These algorithms were developed over several years by multiple people; porting them still wasn't easy, but having them made things significantly easier.

We didn't know at the beginning whether these projects were going to work; the .NET Runtime in particular already made pretty heavy use of SIMD instructions. But we knew it was well worth a try. The next step after that would be:

Step #1: Reading Other People’s Code

I was mostly a hobbyist programmer prior to starting this, and it was the first time I dealt with large, mature codebases used in production. As a warm-up, Daniel asked me to see if I could improve the command line utility of the ada project. I ended up adding the ability to pipe files to it, which is simple compared to what was to come, but my way of going about it did not change appreciably.

The first step is to ascertain the objective and get a general feel for the codebase; it takes too long to read a large codebase from start to finish. I had more latitude with simdutf, as it was internal to our team, but with the C# projects I had to make sure our functions could be used as drop-in replacements for existing ones in the Runtime, so the code had to be as conventional C# as possible.

The second step is to determine a point of entry: latch on to the one function that seems closest to the desired functionality and trace the program's flow from there.

The rest involves an ad hoc mixture of reading documentation, asking others, Googling, using Visual Studio Code’s search function, reading code, making guesses, reading past GitHub discussions, and applying general programming knowledge. In Visual Studio Code, right-clicking and selecting “Go to definition” is very helpful. However, in C++, this feature does not work if you hit a macro, so a debugger can be useful for tracing the program flow.

Step #2: Writing a Scalar Version

This is to provide a sanity check: it is much easier to reason about and debug a scalar function.
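To make this concrete, here is a minimal sketch of the kind of scalar reference function I have in mind, using Latin1 → UTF-8 transcoding as an example. The class and method names are mine for illustration, not the actual APIs of simdutf or the C# projects:

```csharp
using System;

// A minimal scalar Latin-1 -> UTF-8 transcoder, meant as a reference
// implementation. Every Latin-1 byte maps to one or two UTF-8 bytes,
// so the whole function is a loop that is easy to reason about and debug.
public static class ScalarReference
{
    // Returns the number of bytes written to 'output'.
    // Assumes 'output' holds at least 2 * input.Length bytes.
    public static int Latin1ToUtf8(ReadOnlySpan<byte> input, Span<byte> output)
    {
        int pos = 0;
        foreach (byte b in input)
        {
            if (b < 0x80)
            {
                // ASCII range: one byte in, one byte out.
                output[pos++] = b;
            }
            else
            {
                // 0x80..0xFF: becomes a two-byte UTF-8 sequence.
                output[pos++] = (byte)(0xC0 | (b >> 6));
                output[pos++] = (byte)(0x80 | (b & 0x3F));
            }
        }
        return pos;
    }
}
```

Once the SIMD version exists, a function like this doubles as the oracle that the tests compare against.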

Step #3: Writing Tests

For all SIMD projects, I wrote both targeted tests for known edge cases and brute-force tests comparing the SIMD output against the scalar version on large numbers of random inputs.

While it's not the most exciting topic, a significant portion of my time was spent writing tests and debugging. For each architecture, we might try a dozen plausible solutions. If only the brute-force test failed, the cause was usually some obscure edge case I had forgotten about by the time I reached the third plausible solution. Writing good tests is the programming equivalent of "A pint of sweat will save a gallon of blood."

Debuggers are not ergonomic for SIMD, making tests even more crucial.
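As a sketch of the brute-force idea, the following feeds random inputs of random lengths to both the scalar reference above and a candidate implementation, and demands identical output. The Transcoder delegate and this harness are placeholders of mine, not the actual test code of any of these projects:

```csharp
using System;

public static class BruteForceTest
{
    // Placeholder signature for any Latin-1 -> UTF-8 transcoder under test.
    public delegate int Transcoder(ReadOnlySpan<byte> input, Span<byte> output);

    // Compare a candidate against the scalar reference on many random inputs,
    // e.g. BruteForceTest.Run(MySimdVersion.Latin1ToUtf8);
    public static void Run(Transcoder candidate, int iterations = 100_000)
    {
        var rng = new Random(42); // fixed seed so any failure is reproducible
        for (int i = 0; i < iterations; i++)
        {
            // Vary the length so the SIMD tail-handling paths get exercised too.
            var input = new byte[rng.Next(0, 1024)];
            rng.NextBytes(input);

            var expected = new byte[2 * input.Length];
            var actual = new byte[2 * input.Length];
            int expectedLen = ScalarReference.Latin1ToUtf8(input, expected);
            int actualLen = candidate(input, actual);

            if (expectedLen != actualLen ||
                !expected.AsSpan(0, expectedLen).SequenceEqual(actual.AsSpan(0, actualLen)))
            {
                throw new Exception($"Mismatch at iteration {i}, input length {input.Length}");
            }
        }
    }
}
```

The fixed seed matters: when a run fails, you want to replay the exact input that broke the kernel rather than chase a different random failure each time.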

Step #4: Writing Benchmarks

This step is self-explanatory but not trivial. There are many potential pitfalls. The BenchmarkDotNet homepage provides a good overview:

“Reliable benchmarks always include a lot of boilerplate code. Let's think about what you should do in a typical case. First, you should perform a pilot experiment and determine the best number of method invocations. Next, you should execute several warm-up iterations and ensure that your benchmark achieved a steady state. After that, you should execute the main iterations and calculate some basic statistics. If you calculate some values in your benchmark, you should use them somehow to prevent dead code elimination. If you use loops, you should care about the effect of loop unrolling on your results (which may depend on the processor architecture). Once you get results, you should check for special properties of the obtained performance distribution, like multimodality or extremely high outliers. You should also evaluate the overhead of your infrastructure and deduct it from your results. If you want to test several environments, you should perform the measurements in each of them and manually aggregate the results. If you write this code from scratch, it's easy to make a mistake and spoil your measurements. Note that it's a shortened version of the full checklist that you should follow during benchmarking: there are a lot of additional hidden pitfalls that should be handled appropriately.”

Thankfully, I did not have to write much of that boilerplate code. The benchmarking infrastructure was already in place for simdutf. I only needed to extend and modify it. For C#, we used the BenchmarkDotNet library.
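For flavor, here is roughly what a minimal BenchmarkDotNet harness for this kind of comparison looks like; the measured functions reuse the placeholder names from the sketches above rather than any real project code:

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Minimal benchmark comparing a scalar baseline to a SIMD candidate.
// BenchmarkDotNet takes care of the warm-up, iteration counts, statistics,
// and dead-code-elimination pitfalls quoted above.
public class TranscodingBenchmark
{
    private byte[] _input = Array.Empty<byte>();
    private byte[] _output = Array.Empty<byte>();

    [Params(1_000, 1_000_000)] // input sizes in bytes
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        var rng = new Random(42);
        _input = new byte[N];
        rng.NextBytes(_input);
        _output = new byte[2 * N];
    }

    [Benchmark(Baseline = true)]
    public int Scalar() => ScalarReference.Latin1ToUtf8(_input, _output);

    // Stand-in: in a real project this would call the SIMD implementation.
    [Benchmark]
    public int Simd() => ScalarReference.Latin1ToUtf8(_input, _output);
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<TranscodingBenchmark>();
}
```

Reporting results relative to a baseline, across several input sizes, makes it obvious whether the SIMD path actually pays for its overhead on short inputs.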

Step #5: Writing SIMD Code

Step #5 deserves a section of its own and is the subject of my next article.


[1] This is actually more common than it seems. For example, no one bothered to check whether a very popular Python data science library was actually SIMD-optimized until last year.