How to learn C++ SIMD programming in 30-ish days::What book publishers are afraid to tell you
Before applying, Daniel sent me this via email:
“[I have to warn you], be mindful that I do very advanced programming. You will be challenged (massively) if you work with me. I am a nice person, but I don't do easy work. :-) You have to be ready to be confused and puzzled, and to work hard.”
According to Peter Norvig, it takes 10 years to become a master programmer.
Well, I did not have 10 years. I had never written a line of code with other people before, nor anything beyond 1400 LOC in high level langages but that was hardly a head start.
I had one month with respect to C++, a language which I knew little about while at the same time doing renovations. While there wasn’t a definite timetable for our C# projects, I felt it was in bad taste to spend one whole month learning C# on the university’s dime, so I watched a few YouTube videos and dove right in.
I did however had one thing going for me : very specific goals.This is not a perfect solution but this is how I did it.
Day 0
First, I had to decide what to cut as learning everything under the sun would have been a fool`s errand.
From previous experience, I knew ChatGPT was good for things like syntax (“What’s the C++ intrinsic for a shift left by X bytes?”), features (“Does C# have equivalents to templates?”) and features so I decided to not focus too much on either or try to memorize them.
At first, I could not understand the papers nor the code, but I could tell that it did not use any multithreading, so I cut that from my training schedule. Template metaprogramming sounded really complicated, so I just hoped it wasn’t going to be on the test (good call: we used templates but only for simple uses).
I did not understand any of the terminology related to hardware, so I set out to understand how a computer works.
Week 1
I chose to emulate the venerable NES 6502 CPU in C++ using a Rust tutorial. I stopped when I had a few working instructions.
Good call: The intuition behind the 6502 also translates, with some caveats, to modern ones. There are a lot more instructions, more CPU cores, more memory, and the software stack has ballooned, but modern hardware works pretty much the same way. Emulating the whole thing would probably have squandered precious time.
Week 2
I then moved on to the C++ core guidelines. This was a good investment, but reading it cover to cover was a bit overkill. It was good to know what idiomatic C++ should look like, but in retrospect, I would probably have skimmed more and used it as a reference as needed (which turned out to be almost never).
C++ is an enormous language, and the guidelines try to be as comprehensive as possible. I would not use the majority of the features.
With Simdutf, I already had performant code as inspiration. Even if I wanted to, I would not have been able to stray very far from the rest of the codebase in the cases covered by the guidelines. In the cases where we needed to squeeze every last bit of performance with SIMD intrinsics, I had to be a little more creative.
Weeks 3 and 4
Next on the reading list was Agner Fog’s book on C++ performance. This turned out to be a great choice to understand what made a computer tick.
I wished it went more in-depth as to how compilers worked, as I often ended up working around their limitations.
It is no wonder that “trust the compiler” is common advice. They can be quite clever. However, a compiler can outperform humans in certain circumstances but struggle in others. For example, with bitwise operations, some compilers don’t recognize that NOT (NOT A) = A.1
It is more Rainman than John Von Neumann.As such, it often helps the compiler if we’re really explicit. For a more ‘real-world’ example, unrolling is a relatively low-effort technique that is often worth trying.
Another example: my next task was to port Simdutf’s UTF8 validation function to C#. After a rather uneventful start writing the tests, benchmark, and scalar function, I got stuck trying to optimize this piece of code:
public static class Vector256Extensions
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector256 Prev1(this Vector256 current, Vector256 prev)
{
Vector256 shuffle = Avx2.Permute2x128(prev, current, 0x21);
return Avx2.AlignRight(current, shuffle, (byte)(16 - 1));
}
public static Vector256 Prev2(this Vector256 current, Vector256 prev)
{
Vector256 shuffle = Avx2.Permute2x128(prev, current, 0x21);
return Avx2.AlignRight(current, shuffle, (byte)(16 - 2));
}
public static Vector256 Prev3(this Vector256 current, Vector256 prev)
{
Vector256 shuffle = Avx2.Permute2x128(prev, current, 0x21);
return Avx2.AlignRight(current, shuffle, (byte)(16 - 3));
}
public static Vector256 Lookup16(this Vector256 source, Vector256 lookupTable)
{
return Avx2.Shuffle(lookupTable, source);
}
public static Vector256 ShiftRightLogical4(this Vector256 vector)
{
Vector256 extended = vector.AsUInt16();
Vector256 shifted = Avx2.ShiftRightLogical(extended, 4);
Vector256 narrowed = shifted.AsByte();
return narrowed;
}
}
namespace SimdUnicode
{
public static unsafe class Utf8Utility
{
public static byte* GetPointerToFirstInvalidByte(byte* pInputBuffer, int inputLength)
{
int processedLength = 0;
if (pInputBuffer == null || inputLength <= 0)
{
return pInputBuffer;
}
while (processedLength + 64 <= inputLength)
{
SIMDGetPointerToFirstInvalidByte(pInputBuffer, processedLength);
Utf8Validation.utf8_checker.CheckEof();
if (Utf8Validation.utf8_checker.Errors())
{
return SimdUnicode.UTF8.RewindAndValidateWithErrors(pInputBuffer + processedLength, inputLength - processedLength);
}
processedLength += 64;
}
if (processedLength < inputLength)
{
byte* invalidBytePointer = SimdUnicode.UTF8.GetPointerToFirstInvalidByte(pInputBuffer + processedLength, inputLength - processedLength);
if (invalidBytePointer != pInputBuffer + inputLength)
{
return invalidBytePointer;
}
}
return pInputBuffer + inputLength;
}
public static byte* SIMDGetPointerToFirstInvalidByte(byte* pInputBuffer, int processedLength)
{
Vector256 currentBlock = Avx.LoadVector256(pInputBuffer + processedLength);
Utf8Validation.utf8_checker.CheckNextInput(currentBlock);
processedLength += 32;
currentBlock = Avx.LoadVector256(pInputBuffer + processedLength);
Utf8Validation.utf8_checker.CheckNextInput(currentBlock);
processedLength += 32;
return pInputBuffer + processedLength;
}
}
public struct Utf8Validation
{
public struct utf8_checker
{
static Vector256 error = Vector256.Zero;
static Vector256 prev_input_block = Vector256.Zero;
static Vector256 prev_incomplete = Vector256.Zero;
public utf8_checker()
{
error = Vector256.Zero;
prev_input_block = Vector256.Zero;
prev_incomplete = Vector256.Zero;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void CheckNextInput(Vector256 input)
{
Vector256 inputSBytes = input.AsSByte(); int mask = Avx2.MoveMask(inputSBytes.AsByte());
if (mask != 0)
{
CheckUtf8Bytes(input);
prev_incomplete = IsIncomplete(input);
}
prev_input_block = input;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static void CheckUtf8Bytes(Vector256 input)
{
Vector256 prev1 = input.Prev1(prev_input_block);
Vector256 sc = CheckSpecialCases(input, prev1);
error = Avx2.Or(error, CheckMultibyteLengths(input, prev_input_block, sc));
}
public static bool Errors()
{
return !Avx2.TestZ(error, error);
}
public static void CheckEof()
{
error = Avx2.Or(error, prev_incomplete);
}
const byte TOO_SHORT = 1 << 0;
const byte TOO_LONG = 1 << 1;
const byte OVERLONG_3 = 1 << 2;
const byte SURROGATE = 1 << 4;
const byte OVERLONG_2 = 1 << 5;
const byte TWO_CONTS = 1 << 7;
const byte TOO_LARGE = 1 << 3;
const byte TOO_LARGE_1000 = 1 << 6;
const byte OVERLONG_4 = 1 << 6;
const byte CARRY = TOO_SHORT | TOO_LONG | TWO_CONTS;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256 CheckSpecialCases(Vector256 input, Vector256 prev1)
{
Vector256 shuf1 = Vector256.Create(TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
TWO_CONTS, TWO_CONTS, TWO_CONTS, TWO_CONTS,
TOO_SHORT | OVERLONG_2,
TOO_SHORT,
TOO_SHORT | OVERLONG_3 | SURROGATE,
TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4,
TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
TOO_LONG, TOO_LONG, TOO_LONG, TOO_LONG,
TWO_CONTS, TWO_CONTS, TWO_CONTS, TWO_CONTS,
TOO_SHORT | OVERLONG_2,
TOO_SHORT,
TOO_SHORT | OVERLONG_3 | SURROGATE,
TOO_SHORT | TOO_LARGE | TOO_LARGE_1000 | OVERLONG_4);
Vector256 shuf2 = Vector256.Create(CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
CARRY | OVERLONG_2,
CARRY,
CARRY,
CARRY | TOO_LARGE,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | OVERLONG_3 | OVERLONG_2 | OVERLONG_4,
CARRY | OVERLONG_2,
CARRY,
CARRY,
CARRY | TOO_LARGE,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000 | SURROGATE,
CARRY | TOO_LARGE | TOO_LARGE_1000,
CARRY | TOO_LARGE | TOO_LARGE_1000);
Vector256 shuf3 = Vector256.Create(TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,
TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE,
TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT,
TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE_1000 | OVERLONG_4,
TOO_LONG | OVERLONG_2 | TWO_CONTS | OVERLONG_3 | TOO_LARGE,
TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
TOO_LONG | OVERLONG_2 | TWO_CONTS | SURROGATE | TOO_LARGE,
TOO_SHORT, TOO_SHORT, TOO_SHORT, TOO_SHORT);
Vector256 byte_1_high = prev1.ShiftRightLogical4().Lookup16(shuf1);
Vector256 byte_1_low = (prev1 & Vector256.Create((byte)0x0F)).Lookup16(shuf2);
Vector256 byte_2_high = input.ShiftRightLogical4().Lookup16(shuf3);
return Avx2.And(Avx2.And(byte_1_high, byte_1_low), byte_2_high);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256 CheckMultibyteLengths(Vector256 input, Vector256 prev_input, Vector256 sc)
{
Vector256 prev2 = input.Prev2(prev_input);
Vector256 prev3 = input.Prev3(prev_input);
Vector256 must23 = MustBe23Continuation(prev2, prev3);
Vector256 must23_80 = Avx2.And(must23, Vector256.Create((byte)0x80));
return Avx2.Xor(must23_80, sc);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256 MustBe23Continuation(Vector256 prev2, Vector256 prev3)
{
Vector256 is_third_byte = Avx2.SubtractSaturate(prev2, Vector256.Create((byte)(0b11100000u - 0x80)));
Vector256 is_fourth_byte = Avx2.SubtractSaturate(prev3, Vector256.Create((byte)(0b11110000u - 0x80)));
Vector256 combined = Avx2.Or(is_third_byte, is_fourth_byte);
return combined;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256 IsIncomplete(Vector256 input)
{
Vector256 maxValue = Vector256.Create(255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 255, 255, 255,
255, 255, 255, 255, 255, 0b11110000 - 1, 0b11100000 - 1, 0b11000000 - 1);
Vector256 result = SaturatingSubtractUnsigned(input, maxValue);
return result;
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static Vector256 SaturatingSubtractUnsigned(Vector256 left, Vector256 right)
{
if (!Avx2.IsSupported)
{
throw new PlatformNotSupportedException("AVX2 is not supported on this processor.");
}
Vector256 leftUShorts = left.AsUInt16();
Vector256 rightUShorts = right.AsUInt16();
Vector256 subtractionResult = Avx2.SubtractSaturate(leftUShorts, rightUShorts);
return subtractionResult.AsByte();
}
}
}
It’s basically a one-to-one port of the C++ code. It works, but it was slower than the runtime, and I could not figure out why.
Daniel came up with this solution.
In a nutshell, in the original C++ code, there was no issue with taking small pieces of code and putting them into functions like so:
public static Vector256<byte> Prev1(this Vector256<byte> current, Vector256<byte> prev)
{
Vector256<byte> shuffle = Avx2.Permute2x128(prev, current, 0x21);
return Avx2.AlignRight(current, shuffle, (byte)(16 - 1)); // shifts right by a certain amount
}
But in this particular case, a bunch of helper functions in C# that appears only in one place in a very long function is not a zero-cost abstraction. The hints to agressively inline were ignored by the compiler and the solution was to manually inline everything.
There were only maybe a few times when I had to look at long pieces of assembly code; that’s when I was likely to feel stuck. I can’t help but wonder if knowing more about compiler design would have saved me some time in the long run, if only to better isolate which part of the assembly would have been more relevant.
Week 4
This is the week where I tried to get hands-on practice: I learned more about SIMD, and wrote a program that did matrix dot multiplication. I then benchmarked it against a SIMD version of itself.
So that’s it in a nutshell. I wanted to do more, but I ended up learning on the job via the Intel Intrinsics Guide.
I make no claim that my path was optimal—it’s just how I did it, and it was enough to get started. In retrospect, I would have skimmed more overall, read more on simdutf itself, and included more readings on compiler design.