Porting SIMD C++ code to C#
For this article, I will use the Simdbase64 codebase as a source of examples. I highly recommend you read Daniel’s blog post on Base64 decoding and encoding.
The switch from C++ to C# is straightforward if you already have a good grasp of SIMD programming, but there are a few differences, and they are the subject of this article. There are already many articles on the differences between the two languages, so I will focus on the SIMD part.
In C#, there isn’t always a 1-1 mapping between methods and intrinsics. Although I mostly stuck close to the C++ code, a few times I had to check that a given call really compiled to the instruction I had in mind. Thankfully, the API is intuitive, and a bit of trial and error, along with some elbow grease, resolved the issues.
C# encourages different methods for certain operations (e.g., masking), and SIMD code in C# is often easier to read compared to its C++ counterpart.
Comparing C++ and C# Code
We can take a particular function as an example:
C++ AVX2 Implementation
static inline void compress(__m128i data, uint16_t mask, char *output) {
  // No bytes flagged: store the vector unchanged.
  if (mask == 0) {
    _mm_storeu_si128(reinterpret_cast<__m128i *>(output), data);
    return;
  }
  // Split the 16-bit mask into two 8-bit halves, one per 8 bytes of data.
  uint8_t mask1 = uint8_t(mask);
  uint8_t mask2 = uint8_t(mask >> 8);
  // Build the shuffle indices for both halves from a lookup table.
  __m128i shufmask = _mm_set_epi64x(tables::base64::thintable_epi8[mask2],
                                    tables::base64::thintable_epi8[mask1]);
  // Offset the indices of the upper half by 8.
  shufmask = _mm_add_epi8(shufmask, _mm_set_epi32(0x08080808, 0x08080808, 0, 0));
  __m128i pruned = _mm_shuffle_epi8(data, shufmask);
  // Second shuffle: join the two halves into one contiguous run.
  int pop1 = tables::base64::BitsSetTable256mul2[mask1];
  __m128i compactmask = _mm_loadu_si128(reinterpret_cast<const __m128i *>(
      tables::base64::pshufb_combine_table + pop1 * 8));
  __m128i answer = _mm_shuffle_epi8(pruned, compactmask);
  _mm_storeu_si128(reinterpret_cast<__m128i *>(output), answer);
}
C# Implementation
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe void Compress(Vector128<byte> data, ushort mask, byte* output, byte* tablePtr) {
    // No bytes flagged: store the vector unchanged.
    if (mask == 0) {
        Sse2.Store(output, data);
        return;
    }
    // Split the 16-bit mask into two 8-bit halves.
    byte mask1 = (byte)mask;
    byte mask2 = (byte)(mask >> 8);
    ulong value1 = Tables.GetThintableEpi8(mask1);
    ulong value2 = Tables.GetThintableEpi8(mask2);
    Vector128<sbyte> shufmask = Vector128.Create(value2, value1).AsSByte();
    shufmask = Sse2.Add(shufmask, Vector128.Create(0x08080808, 0x08080808, 0, 0).AsSByte());
    Vector128<sbyte> pruned = Ssse3.Shuffle(data.AsSByte(), shufmask);
    int pop1 = Tables.GetBitsSetTable256mul2(mask1);
    Vector128<byte> compactmask = Sse2.LoadVector128(tablePtr + pop1 * 8);
    Vector128<byte> answer = Ssse3.Shuffle(pruned.AsByte(), compactmask);
    Sse2.Store(output, answer);
}
The most obvious difference is the lack of shorthands in the C# code. The C++ intrinsics are C-style functions without overloading, so you end up with names like _mm_shuffle_epi32 (where _mm_ marks a 128-bit SSE-family operation, shuffle is the operation, and epi32 indicates how the input vector is treated: in this case, as a vector of signed 32-bit integers).
In C#, you don’t need such detailed shorthands. The type of the arguments usually dictates what intrinsic is used, making the code easier to read.
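To make the role of the suffix concrete, here is a small scalar C++ sketch (no intrinsics, so it does not depend on the CPU) showing how the same 16 bytes add differently depending on the lane width. The names Bytes16, add_lanes8 and add_lanes32 are made up for this illustration; they only model what _mm_add_epi8 and _mm_add_epi32 compute and are not part of any library.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Scalar model of _mm_add_epi8: sixteen independent 8-bit additions.
Bytes16 add_lanes8(const Bytes16 &a, const Bytes16 &b) {
  Bytes16 r{};
  for (std::size_t i = 0; i < 16; i++) {
    // Each sum wraps inside its own byte; no carry reaches the neighbour.
    r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
  }
  return r;
}

// Scalar model of _mm_add_epi32: four independent 32-bit additions.
// Lanes are assembled in little-endian byte order, as on x86.
Bytes16 add_lanes32(const Bytes16 &a, const Bytes16 &b) {
  Bytes16 r{};
  for (std::size_t i = 0; i < 16; i += 4) {
    std::uint32_t x = 0, y = 0;
    for (std::size_t k = 0; k < 4; k++) {
      x |= static_cast<std::uint32_t>(a[i + k]) << (8 * k);
      y |= static_cast<std::uint32_t>(b[i + k]) << (8 * k);
    }
    std::uint32_t s = x + y; // carries propagate within the 32-bit lane
    for (std::size_t k = 0; k < 4; k++) {
      r[i + k] = static_cast<std::uint8_t>(s >> (8 * k));
    }
  }
  return r;
}
```

Adding a vector whose first byte is 0xFF to one whose first byte is 0x01 gives a zero first byte in both models, but only the 32-bit version carries a 1 into the second byte. In C++ the suffix selects this behaviour; in C#, the element type of the Vector128 argument does.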
Challenges with AVX-512
In these projects, one of the challenges in porting from AVX-512 to AVX-2 was working around the fact that AVX-2 has fewer instructions than AVX-512. The main difference that matters here is that AVX-512 introduces masking and compress instructions.
In the AVX-2 code above, there is a scalar ushort variable mask that indicates which bytes to act on. In AVX-512, dedicated mask registers hold such masks. However, not all AVX-512 instructions are exposed in C#.
In AVX-512, the compress function and related code could be replaced by a single intrinsic: _mm512_maskz_compress_epi8.
In particular, _mm512_maskz_compress_epi8 was missing in .NET 9, so I adapted the AVX-2 code above nearly verbatim for AVX-512. This took more instructions and turned out to be significantly slower than both our AVX-2 implementation and the runtime.
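For readers unfamiliar with the instruction, the following scalar C++ sketch models what a zero-masking byte compress computes: bytes whose mask bit is set are kept and packed to the front in order, and the remaining output bytes are zeroed. The real _mm512_maskz_compress_epi8 works on 64 bytes with a 64-bit mask; this sketch uses 16 bytes to stay close to the function shown earlier. The name compress_keep_scalar is hypothetical, and the sketch assumes a set bit means "keep", as in the intrinsic; it makes no claim about the polarity of the mask in the codebase.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: scalar model of a zero-masking byte compress on a
// 16-byte vector. A set bit i in keep_mask means "keep data[i]".
std::array<std::uint8_t, 16>
compress_keep_scalar(const std::array<std::uint8_t, 16> &data,
                     std::uint16_t keep_mask) {
  std::array<std::uint8_t, 16> out{}; // unused tail stays zero
  std::size_t pos = 0;
  for (std::size_t i = 0; i < 16; i++) {
    if ((keep_mask >> i) & 1) {
      out[pos++] = data[i]; // kept bytes are packed contiguously, in order
    }
  }
  return out;
}
```

Without the instruction, the same effect has to be built from table lookups and shuffles, which is what the two-step Compress function above does for 16 bytes at a time.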
In our case, this wasn’t a huge issue: our AVX-2 implementation was already faster than the runtime, and the missing instructions should be exposed in the next version of .NET, so we’re heading for a happy ending.