A year and a half doing SIMD programming

Porting SIMD C++ code to C#

For this article, I will use the Simdbase64 codebase as a source of examples. I highly recommend you read Daniel’s blog post on Base64 decoding and encoding.

The switch from C++ to C# is straightforward if you already have a good grasp of SIMD programming, but there are a few differences, which are the subject of this article. There are already plenty of articles on the differences between the two languages, so I will focus on the SIMD part.

In C#, there isn’t always a one-to-one mapping between methods and intrinsics. Although I mostly stuck close to the C++ code, a few times I had to check that a given call really used the instruction I had in mind. Thankfully, the API is intuitive, and a bit of trial and error, along with some elbow grease, resolved the issues.

C# encourages different methods for certain operations (e.g., masking), and SIMD code in C# is often easier to read than its C++ counterpart.

Comparing C++ and C# Code

We can take a particular function as an example:

C++ AVX2 Implementation


        static inline void compress(__m128i data, uint16_t mask, char *output) {
          if (mask == 0) {
            _mm_storeu_si128(reinterpret_cast<__m128i *>(output), data);
            return;
          }
          uint8_t mask1 = uint8_t(mask);      // least significant 8 bits
          uint8_t mask2 = uint8_t(mask >> 8); // most significant 8 bits
        
          __m128i shufmask = _mm_set_epi64x(tables::base64::thintable_epi8[mask2],
                                            tables::base64::thintable_epi8[mask1]);
          shufmask = _mm_add_epi8(shufmask, _mm_set_epi32(0x08080808, 0x08080808, 0, 0));
          __m128i pruned = _mm_shuffle_epi8(data, shufmask);
          int pop1 = tables::base64::BitsSetTable256mul2[mask1];
        
          __m128i compactmask = _mm_loadu_si128(reinterpret_cast<const __m128i *>(
                                               tables::base64::pshufb_combine_table + pop1 * 8));
          __m128i answer = _mm_shuffle_epi8(pruned, compactmask);
        
          _mm_storeu_si128(reinterpret_cast<__m128i *>(output), answer);
        }

C# Implementation


        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        private static unsafe void Compress(Vector128<byte> data, ushort mask, byte* output, byte* tablePtr) {
            if (mask == 0) {
                Sse2.Store(output, data);
                return;
            }
        
            byte mask1 = (byte)mask;        // least significant 8 bits
            byte mask2 = (byte)(mask >> 8); // most significant 8 bits
        
            ulong value1 = Tables.GetThintableEpi8(mask1);
            ulong value2 = Tables.GetThintableEpi8(mask2);
        
            Vector128<sbyte> shufmask = Vector128.Create(value2, value1).AsSByte();
        
            shufmask = Sse2.Add(shufmask, Vector128.Create(0x08080808, 0x08080808, 0, 0).AsSByte());
        
            Vector128<sbyte> pruned = Ssse3.Shuffle(data.AsSByte(), shufmask);
            int pop1 = Tables.GetBitsSetTable256mul2(mask1);
        
            Vector128<byte> compactmask = Sse2.LoadVector128(tablePtr + pop1 * 8);
        
            Vector128<byte> answer = Ssse3.Shuffle(pruned.AsByte(), compactmask);
            Sse2.Store(output, answer);
        }
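
The data flow shared by both listings can be modelled in scalar code. The following is a minimal sketch in portable C++, not the library's implementation: the name compress_scalar is hypothetical, and it assumes that a set bit in mask marks a byte to drop, which is consistent with the early return that stores data unchanged when mask is 0. The first shuffle prunes each 8-byte half on its own; the second moves the survivors of the upper half so that they directly follow the survivors of the lower half.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of the two-step compress (hypothetical helper, not library code).
// Assumption: a set bit in `mask` marks a byte to DROP, which matches the
// early return that stores `data` unchanged when mask == 0.
// Writes 16 bytes to `out` and returns how many of them are kept input bytes.
inline std::size_t compress_scalar(const std::array<std::uint8_t, 16> &data,
                                   std::uint16_t mask, std::uint8_t *out) {
  assert(out != nullptr);
  // Step 1 (first shuffle): prune each 8-byte half independently.
  std::array<std::uint8_t, 16> pruned{};
  std::size_t kept_low = 0, kept_high = 0;
  for (std::size_t i = 0; i < 8; i++) {
    if (((mask >> i) & 1) == 0) pruned[kept_low++] = data[i];
    if (((mask >> (i + 8)) & 1) == 0) pruned[8 + kept_high++] = data[i + 8];
  }
  // Step 2 (second shuffle): move the upper half's survivors so that they
  // directly follow the lower half's survivors.
  std::size_t n = 0;
  for (std::size_t i = 0; i < kept_low; i++) out[n++] = pruned[i];
  for (std::size_t i = 0; i < kept_high; i++) out[n++] = pruned[8 + i];
  const std::size_t kept = n;
  // The vector code always stores 16 bytes; this model zero-fills the tail,
  // which is a simplification rather than a claim about the real tables.
  while (n < 16) out[n++] = 0;
  return kept;
}
```

In the vector versions, the number of bytes kept in the lower half is what selects the row of the combine table, which is why only mask1 is used for the second lookup.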

The most obvious difference is the lack of shorthands in the C# code. The C++ intrinsics are C-compatible functions, so they are not overloaded, and every integer vector has the same type (__m128i here) whatever it contains. You therefore end up with names like _mm_shuffle_epi32, where _mm_ marks a 128-bit (SSE-family) operation, shuffle is the operation, and epi32 indicates how the input vector is treated: in this case, as a vector of packed 32-bit integers.

In C#, you don’t need such detailed shorthands. The type of the arguments usually dictates what intrinsic is used, making the code easier to read.
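
The effect of typed vectors can be imitated in portable C++. The sketch below uses two hypothetical wrapper types (Bytes16 and Words8, not part of any library) and plain loops instead of real intrinsics, purely to show that once the lane width lives in the type, a single overloaded name is enough. With __m128i, both calls would receive the same argument type, which is why the real intrinsics are spelled _mm_add_epi8 and _mm_add_epi16.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper types, for illustration only: each one fixes the lane
// width in the type, much like Vector128<byte> and Vector128<ushort> in C#.
struct Bytes16 { std::array<std::uint8_t, 16> lane; };
struct Words8  { std::array<std::uint16_t, 8> lane; };

// One name, two overloads: the argument type selects the lane width, so the
// call site needs no _epi8 or _epi16 suffix.
inline Bytes16 add(const Bytes16 &a, const Bytes16 &b) {
  Bytes16 r{};
  for (std::size_t i = 0; i < 16; i++)
    r.lane[i] = static_cast<std::uint8_t>(a.lane[i] + b.lane[i]);  // wraps per byte
  return r;
}

inline Words8 add(const Words8 &a, const Words8 &b) {
  Words8 r{};
  for (std::size_t i = 0; i < 8; i++)
    r.lane[i] = static_cast<std::uint16_t>(a.lane[i] + b.lane[i]);  // wraps per 16-bit word
  return r;
}
```

The lane width matters for the result, not just for readability: adding 1 to 0xFF wraps to 0 in a byte lane, while 0x00FF plus 1 gives 0x0100 in a 16-bit lane.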

Challenges with AVX-512

In these projects, one challenge in porting from AVX-512 to AVX2 was working around the fact that AVX2 has fewer instructions than AVX-512. For this code, the difference that matters most is that AVX-512 introduces masking and compress instructions.

In the AVX2 code above, a scalar ushort variable, mask, indicates which bytes to act on. In AVX-512, dedicated mask registers (k0 to k7) hold such masks. However, not all AVX-512 instructions are exposed in C#.

In AVX-512, the compress function and related code could be replaced by a single intrinsic: _mm512_maskz_compress_epi8.
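
As a rough scalar model of what that intrinsic computes, here is a minimal sketch assuming its documented behaviour: lanes whose mask bit is set are packed contiguously at the front of the result, and the remaining lanes are zeroed. The name maskz_compress_model is hypothetical. Note that a set bit means keep here, whereas the AVX2 routine above appears to treat a set bit as drop, so a port would presumably pass the complemented mask.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of a zero-masked byte compress over 64 lanes (hypothetical
// helper): lanes whose bit is set in `keep` are packed at the front of the
// result in their original order, and every remaining lane is zero.
inline std::array<std::uint8_t, 64>
maskz_compress_model(std::uint64_t keep, const std::array<std::uint8_t, 64> &src) {
  std::array<std::uint8_t, 64> dst{};  // zero-initialised
  std::size_t n = 0;
  for (std::size_t i = 0; i < 64; i++) {
    if ((keep >> i) & 1) dst[n++] = src[i];
  }
  return dst;
}
```

With a drop-style mask such as the one used by the 128-bit routine, the call would look like maskz_compress_model(~drop, src).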

In particular, _mm512_maskz_compress_epi8 was missing from .NET 9, so I adapted the AVX2 code above nearly verbatim for AVX-512. This took more instructions and turned out to be significantly slower than both the AVX2 implementation and the .NET runtime’s own implementation.

In our case, this wasn’t a huge issue: our AVX2 implementation was already faster than the runtime, and the missing instructions should be exposed in the next version of .NET, so we’re heading for a happy ending.