Nick Nuon's Home Page

Lookup Tables & Time Management

My next task was to implement the Latin1 ⇔ UTF8/UTF16/UTF32 conversions.

Latin1 → UTF32/UTF16/UTF8 were the simpler conversions.

The converse, in particular, UTF8 → Latin1 (which is the subject of this article), turned out to be a tad more complicated.

Vectorized operations are all well and good, but they can sometimes feel constricting: what happens if you need to be a little more surgical, e.g., when a UTF8 character straddles two vectors? Or, in the case of UTF8 → Latin1, when you want to check that a two-byte UTF8 character fits into the Latin1 range?
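For a concrete taste of that last check: a two-byte UTF8 sequence encodes code points U+0080 to U+07FF, and it fits Latin1 only when the code point is at most 0xFF, which happens exactly when the lead byte is 0xC2 or 0xC3. A minimal scalar sketch (a hypothetical helper for illustration, not simdutf's actual code):

```cpp
#include <cstdint>

// Returns true when a two-byte UTF8 sequence (lead, cont) encodes a
// code point that fits into Latin1 (i.e., <= 0xFF).
// A two-byte sequence stores the code point as:
//   110xxxxx 10yyyyyy  ->  ((lead & 0x1F) << 6) | (cont & 0x3F)
// which is <= 0xFF exactly when the lead byte is 0xC2 or 0xC3.
bool two_byte_fits_latin1(uint8_t lead, uint8_t cont) {
  uint32_t code_point = ((lead & 0x1Fu) << 6) | (cont & 0x3Fu);
  return lead >= 0xC2 && lead <= 0xC3 && (cont & 0xC0u) == 0x80u &&
         code_point <= 0xFFu;
}
```

For example, 0xC3 0xA9 ("é", U+00E9) passes, while 0xC4 0x80 (U+0100) does not.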

It becomes a bit like trying to pick up a pebble with a shovel. It’s a hassle. Now, this is where lookup tables can be useful.

To make a very long story short: the existing AVX2/SSE routines that transcoded between the different UTF encodings used bit shifts to check against previous bytes and less-than/greater-than comparisons to mask certain properties (such as the presence of continuation bytes), after which the various outputs were checked against precomputed lookup tables.
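To illustrate the comparison trick: continuation bytes (0x80 to 0xBF) fall in the range [-128, -65] when reinterpreted as signed bytes, so a single signed less-than comparison flags them. Here is a scalar sketch of building such a mask over a 16-byte block (the SSE version would do this in one shot with `_mm_cmpgt_epi8` and `_mm_movemask_epi8`; this is an illustration, not simdutf's code):

```cpp
#include <cstdint>

// Builds a 16-bit mask with bit i set when block[i] is a UTF8
// continuation byte (0x80..0xBF). Reinterpreted as a signed byte,
// a continuation byte is strictly less than -64, which is the
// property the vectorized signed compare exploits.
uint16_t continuation_mask(const uint8_t block[16]) {
  uint16_t mask = 0;
  for (int i = 0; i < 16; i++) {
    if (static_cast<int8_t>(block[i]) < -64) {
      mask |= uint16_t(1) << i;
    }
  }
  return mask;
}
```

Note that 0xC0 (a lead byte) is -64 as a signed byte, so it is correctly excluded.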

I have to admit that I spent the middle of July stuck: I had first tried to build new lookup tables specifically for Latin1, and I got bogged down debugging them. After a while, I conferred with Daniel about my lack of progress.

He suggested taking the existing UTF8 → UTF16 tables and treating Latin1 as a special case of UTF16. He acknowledged that even though the UTF16 tables were bigger and not a 100% fit for the problem, redesigning them would take more development time than it was likely worth. After polishing and testing another of his sketches, we ended up with something like this for the SSE routine:

GitHub Pull Request #263

It would need further minor fixes, but the results turned out not to be too shabby. It might not be the perfect solution, but it beat the baseline set by ICU and iconv.
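The gist of Daniel's suggestion, reusing the UTF8 → UTF16 machinery and treating Latin1 as the subset of UTF16 whose code units are all at most 0xFF, can be sketched in scalar form (this illustrates the idea only; it is not the SSE code from the PR, where the range check and narrowing are done with vector compares and a saturating pack):

```cpp
#include <cstdint>
#include <vector>

// Narrows UTF16 code units to Latin1 bytes. Returns false if any
// code unit is outside the Latin1 range (> 0xFF); in that case the
// input was valid UTF16 but not representable in Latin1.
bool utf16_to_latin1(const std::vector<uint16_t> &in,
                     std::vector<uint8_t> &out) {
  out.clear();
  for (uint16_t unit : in) {
    if (unit > 0xFF) return false;
    out.push_back(static_cast<uint8_t>(unit));
  }
  return true;
}
```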

Moral of the story: reuse whatever you have on hand if it's good enough. Do the simplest possible thing first. Also, creating lookup tables from scratch is a pain.

Writing Across Different Architectures

One difficulty with SIMD is that it is harder to write code that works across different architectures and setups: I wrote separate code for SSE, AVX2, and AVX-512 for each project.

A fast routine in AVX-512 might become sluggish on older architectures, and vice versa.

In the case of lookup tables, the bigger the registers, the more values you can fit in them and the fewer round trips you need. This comes back to my point that there are always tradeoffs when doing performance optimization.
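To make the register-size point concrete: SSE's `pshufb` (`_mm_shuffle_epi8`) looks up 16 bytes in a 16-entry table at once, while AVX-512's `vpermb` can index a 64-entry table in a single instruction, so a table that needs several SSE shuffles can sometimes be handled in one step. A scalar model of the `pshufb` semantics (for illustration only):

```cpp
#include <cstdint>

// Emulates the semantics of SSE's pshufb: each output byte is
// table[index & 0x0F], or zero when the index's high bit is set.
// With 16-byte registers the table holds only 16 entries; wider
// registers allow bigger tables with fewer lookups.
void pshufb_emulated(const uint8_t table[16], const uint8_t idx[16],
                     uint8_t out[16]) {
  for (int i = 0; i < 16; i++) {
    out[i] = (idx[i] & 0x80) ? 0 : table[idx[i] & 0x0F];
  }
}
```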

While there is a lot of overlap between these instruction sets, the newer ones contain more numerous and more specialized instructions. Some ingenuity is typically necessary to replicate missing instructions on older architectures (or, conversely, to take advantage of newer, more performant ones).

SSE and AVX2 are relatively similar and, for the most part, differ mostly by the size of their vectors. But AVX-512 is a different beast.

To illustrate, my next task was to incorporate these blog posts into simdutf:

And, respectively, the corresponding routines:

As you can see, we use completely different routines.

Indeed, while newer SIMD instructions operate on more values at once, they are also heavier. Reusing the same AVX2 tables, modified for AVX-512, turned out not to be quite as efficient.

The saving grace and the most notable feature of AVX-512 is the introduction of masking and compress instructions, which aren’t present in either SSE or AVX2. Finding ways to dance around missing AVX-512 instructions will prove to be a recurring point later on.