The Problem with RISC-V V Mask Bits
Storing mask bits in the low lanes of a regular vector register seems to create more problems than it solves.
I filed the concern described in this article on the RISC-V V GitHub if you would like to follow along. I also passed on the concern personally to a RISC-V V community liaison, and I will post a follow-up article in the future with any responses I receive.
RISC-V is an instruction set architecture designed to compete with ARM. Like ARM, the base instruction set for RISC-V does not include any SIMD instructions. Instead, both ISAs have “extensions” which define these instructions for chips which need them.
In ARM’s case, there are extensions for packed (NEON) and vector (SVE) SIMD instructions. For RISC-V, although there was consideration of a “P” extension for packed SIMD, the main extension gaining traction is the vector extension, RISC-V “V”.
The RISC-V V design is relatively uncontroversial, but one aspect stands out as a problem for high-performance chips: the mask bits for predicated instructions are required to be visible as the low-order bits of the v0 vector register. For reasons I'll explain in detail later, this means high-performance chips have no option other than to create complex schemes for dealing with the dual-use of this register.
This unusual choice is unlike other common instruction sets that provide predicated vector instructions. ARM SVE, AVX-512, and early supercomputers like the CRAY-1 all use dedicated mask registers, avoiding this problem. Even packed instruction sets like SSE and AVX2, which lack dedicated mask registers, avoid this problem by defining the mask bits as the high bit of each lane, so they are already “in the right place” when they are used.
Although this may seem like a small issue, once code is written for an ISA, it becomes a permanent burden on the ISA to support forevermore. Problems like this which require workarounds in high-performance parts of the chip should be avoided. They not only create bloat in current designs, but that bloat must remain in all future designs that support the ISA.
Even though future RISC-V extensions might make this unusual choice obsolete - for example, if a future extension adds real mask registers - RISC-V chips will still be forced to include legacy workarounds just in case they encounter code using the old mask paradigm. Once it's in the ISA, it never really goes away, as experience with x86/64 has amply shown.
That’s about all there is to say. If you already know a lot about vector instruction sets, you now know the gravamen: using the low bits of v0 as both actual vector lanes and as the mask bits for all vector lanes forces high-performance chips to include unnecessary cruft, now and for as long as the RISC-V V ISA remains relevant (assuming it ever becomes relevant in the first place!).
Is there an alternative?
It certainly seems like there is: just have a single dedicated mask register, like the CRAY-1. If the mask was always a separate register that could not be targeted by normal vector instructions, then RISC-V V chips could be designed knowing that they never had to use the value of v0 in two ways at once - problem solved.
One objection to this solution might have been, “because we didn’t want to add a bunch of mask-specific instructions”. A reasonable answer, to be sure, but not one that applies to RISC-V. It already has a bunch of mask-specific instructions! The spec includes no fewer than eight custom instructions just for working with masks: vmand, vmnand, vmandn, vmxor, vmor, vmnor, vmorn, vmxnor.
If you’re already going to add custom instructions as if you have mask registers, it is hard to see why the very little additional work necessary to have at least one dedicated mask register would be sufficient reason to introduce the burdensome and unusual dual-use v0 scheme present in RISC-V V.
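To make the point concrete, here is a minimal Python sketch of what those eight mask instructions compute, modeling a mask as an integer with one bit per element. The operand order (result = vs2 op vs1) follows my reading of the spec; the function names and the NBITS width are purely illustrative and not part of any real API.

```python
# Model a mask as an integer holding one bit per element.
NBITS = 8                  # assumed number of elements, for illustration only
ALL = (1 << NBITS) - 1     # used to keep results within NBITS bits

def vmand(vs2, vs1):  return vs2 & vs1
def vmnand(vs2, vs1): return ~(vs2 & vs1) & ALL
def vmandn(vs2, vs1): return vs2 & ~vs1 & ALL     # vs2 AND (NOT vs1)
def vmxor(vs2, vs1):  return vs2 ^ vs1
def vmor(vs2, vs1):   return vs2 | vs1
def vmnor(vs2, vs1):  return ~(vs2 | vs1) & ALL
def vmorn(vs2, vs1):  return (vs2 | ~vs1) & ALL   # vs2 OR (NOT vs1)
def vmxnor(vs2, vs1): return ~(vs2 ^ vs1) & ALL

a = 0b11001100
b = 0b10101010
```

Nothing here cares whether the bits live in a vector register or somewhere else, which is why these already look like instructions for a mask register that the ISA doesn't actually have.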
So why did they do it this way?
I honestly don’t know, and that’s why I’m writing the article. To me, this design looks like a bit of a special-case hack, perhaps intended to make it easier to implement RISC-V V slowly on small or low-performance chips, or something like that.
If I’m right about that, I think it’s a bad precedent to put that in the ISA proper. Vector instructions are specifically for performance. The whole point of having SIMD is to build fast chips. So the RISC-V V extension seems a poor place to make a tradeoff that penalizes high-performance designs.
Furthermore, the design seems very un-RISC. The thing that makes RISC attractive - to the extent that it is attractive - is simplicity. There’s not supposed to be a lot of “sometimes the registers work one way, sometimes they work another way” kinds of stuff going on, because that makes it more complicated to design a chip. While there’s no hard-and-fast rule about what can be in a RISC chip and what can’t, it certainly seems against the spirit of it.
Can it be changed?
The RISC-V V spec recently hit v1.0, and it is now frozen “for public comment”. I am hopeful that means that public comments, like this one, might lead to a spec revision before a significant amount of effort is expended supporting the “v0 as both vector and mask” design.
While the relative scarcity of hardware implementing RISC-V V makes it hard to have much practical experience with it, it otherwise seems like a reasonable spec, at least on paper. Yes, it probably needs multiple mask registers eventually, but figuring out some way to add multiple mask registers at a later date is fine, and requires no weird hardware workarounds. There’s no need to worry about that now.
Having the mask bits stored in the low part of v0, on the other hand, is something that needs to be worried about now, because that duality will have to be mimicked by all future RISC-V V chips forever if they actually want to be compatible with code written now.
So, I think it’s an issue worth pressing. And this is me pressing it. Press, PRESS!!!
Appendixeseseses
If you don't know a lot about vector instruction sets, and would like more explanation, here is a tiny primer on the parts of vector processing relevant to this article:
What are "vector instructions", and why do they exist?
High-performance CPU and GPU chips try to do large amounts of computation as quickly as possible. While the circuitry for the calculation itself costs some amount of chip real estate, each instruction also has other fixed costs that have nothing to do with the computation itself.
Take as an example a floating point multiply instruction. We can think of the instruction as requiring two different types of "work".
The first type of work is just the circuitry to literally compute the multiplication. There is no way to "save" on this work by doing more multiplies - if you wanted to multiply more numbers together at the same time, you would have to add more exact copies of this circuitry.
The second type of work is everything else. This is all the bookkeeping and routing that must be done to know what to do: decoding the instruction from the instruction stream, determining that it's a multiply, finding the source data and figuring out if it's ready, scheduling it to be executed on the multiply circuitry, retrieving the results once it's done, etc.
SIMD instructions - short for “single instruction, multiple data” - are a way to amortize the second type of work. Instead of the multiply instruction taking one float and multiplying it by one other float, you define the multiply to take many floats and multiply them by many other floats. This is essentially equivalent to taking many multiply instructions with independent inputs and combining them into a single instruction.
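As a rough illustration of that amortization, the sketch below counts abstract "work units" for multiplying n floats. The cost numbers are made up for the example; real overheads depend entirely on the chip, and the function name is hypothetical.

```python
import math

def total_cost(n, width, overhead=4, compute=1):
    """Hypothetical cost model: every instruction pays `overhead` once
    (decode, scheduling, etc.), plus `compute` for each element it processes."""
    instructions = math.ceil(n / width)
    return instructions * overhead + n * compute

scalar = total_cost(1024, width=1)   # one multiply per instruction
simd8 = total_cost(1024, width=8)    # eight multiplies per instruction
```

Under these assumed numbers, the compute portion is identical in both cases (1024 units); only the per-instruction overhead shrinks, from 4096 units to 512.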
Because high-performance computation almost always involves performing similar operations on large amounts of data, SIMD instructions have proven to be a popular way to reduce instruction overhead. All modern consumer CPUs and GPUs use SIMD processing for their core math operations: x64 has SSE, AVX, and AVX-512; ARM has NEON and SVE; and NVIDIA/AMD/Intel GPUs all have their own internal SIMD instruction sets.
But why did I say “vector instructions”, instead of “SIMD instructions”, in the title of this section? Well, technically RISC-V V is a particular type of SIMD instruction set called a vector instruction set.
There are two common kinds of instruction sets that leverage SIMD:
"Packed" instruction sets, like AVX2 and NEON, use fixed-length registers, so instructions are defined to operate on a specific number of items. When code is written (or compiled) for the ISA, the stride of loops is fixed at compile time. If a loop is compiled for the 4-wide version of the instruction set, then it will always run 4-wide, even if it is later run on a CPU that is 8-wide. Code must be rewritten (or re-compiled) to target a new version of the ISA each time the width is expanded if you want to take advantage of the newer, wider CPUs.
"Vector" instruction sets like ARM SVE and RISC-V V, by contrast, do not specify the exact number of items in the ISA. They instead provide instructions you can use when writing loops to adjust the stride of the loop based on how many items that particular CPU can handle at a time. So the same instruction sequence, on the same ISA, will run at different widths on different CPUs. Running on a 4-wide CPU and an 8-wide CPU, the exact same loop code will run 4-wide and 8-wide, respectively.
Although packed vs. vector is an important architectural difference to consider in general, it can be safely ignored with respect to the mask bit problem described in this article. The same problem would exist in either design.
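For readers who want to see the vector-style loop described above, here is a minimal sketch in Python rather than assembly. The function set_vector_length is a hypothetical, simplified stand-in for the kind of instruction the ISA provides (vsetvli in RISC-V V): it is told how many items remain and answers with how many the hardware will process this pass. The real instruction has more options than this model shows.

```python
def set_vector_length(remaining, hw_width):
    # Hypothetical stand-in for the ISA's "how many items this pass?" instruction.
    return min(remaining, hw_width)

def add_arrays(a, b, hw_width):
    """The same loop code for any hardware width; only the stride changes."""
    out = []
    strides = []                       # recorded only to show what happened
    i = 0
    while i < len(a):
        vl = set_vector_length(len(a) - i, hw_width)
        out.extend(x + y for x, y in zip(a[i:i + vl], b[i:i + vl]))
        strides.append(vl)
        i += vl
    return out, strides

a = list(range(10))
b = [1] * 10
out4, strides4 = add_arrays(a, b, hw_width=4)   # a "4-wide CPU"
out8, strides8 = add_arrays(a, b, hw_width=8)   # an "8-wide CPU"
```

Both calls produce the same sums; the only difference is that the first takes strides of 4, 4, 2 and the second takes strides of 8, 2.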
What are "mask bits" and "predicated instructions"?
An obvious problem arises as soon as you have instructions operating on multiple pieces of data at the same time: some operations need to be applied only to a subset of the data.
As a simple example, suppose you wanted to add 5 to any input that was greater than 0. In a scalar loop, you are accustomed to using a branch to jump around the addition if the input value isn't greater than 0. This works because there is only one value per iteration of the loop, so there's only one answer to the question "is the input greater than 0?"
However, when you are operating on many values at the same time, you are no longer guaranteed a single answer. Perhaps some of the inputs are greater than 0, but some are not. What do you do?
The solution to this problem in SIMD instruction sets is to make comparison instructions produce a full bit pattern - called a mask - rather than a single-bit answer used in a branch. Each bit in the mask holds the result of the comparison for one of the inputs. Using the mask, each instruction that performs an operation on multiple inputs - such as our "add 5" - can use the corresponding bit to know whether to perform the operation or not on each of the inputs individually.
This can be done in one of three ways:
using additional "and", "andnot", and "or" instructions to merge results together based on the bit pattern,
using a single special-purpose “select” instruction that does the merge in a single instruction, or
intrinsically, by SIMD hardware that supports predicated instructions.
Predicated instructions read the mask bits as part of the instruction, and leave untouched those lanes where the mask is zero. This obviates the need to manually merge results.
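Here is a small Python sketch of the "add 5 to any input greater than 0" example done all three ways. It models lanes as list elements and is only meant to show that the three approaches produce the same result; the function names are invented for illustration.

```python
def compare_gt_zero(xs):
    # The comparison yields one mask bit per input instead of a single branch decision.
    return [1 if x > 0 else 0 for x in xs]

def add5_and_andnot_or(xs, mask):
    # Way 1: compute on every lane, then merge with "and", "andnot", and "or".
    out = []
    for x, m in zip(xs, mask):
        full = -1 if m else 0          # all-ones or all-zeros (two's complement)
        out.append(((x + 5) & full) | (x & ~full))
    return out

def add5_select(xs, mask):
    # Way 2: compute on every lane, then one "select" picks new or old per lane.
    return [x + 5 if m else x for x, m in zip(xs, mask)]

def add5_predicated(xs, mask):
    # Way 3: the add itself only happens where the mask bit is set.
    out = list(xs)
    for i, m in enumerate(mask):
        if m:
            out[i] += 5
    return out

xs = [3, -2, 0, 7]
mask = compare_gt_zero(xs)
```

With these inputs the mask is [1, 0, 0, 1], and all three functions return [8, -2, 0, 12].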
RISC-V V has predicated instruction support in the ISA. Every vector instruction that performs an operation has two versions: one that operates on the entire set of inputs, and one that uses a mask.
Where and how are RISC-V mask bits stored?
Unlike ARM SVE and AVX-512, RISC-V V lacks support for mask registers. It does not define an additional set of registers for holding these mask values. And, unlike the CRAY-1, it doesn’t define a single additional vector mask register, either.
Instead, the RISC-V V spec defines mask bits as always coming from the first vector register, v0. Any time an instruction uses a mask, it will use v0 as the source of the mask. This doesn't mean you can't have multiple masks, but it does mean that you must always move your masks into v0 before issuing masked instructions.
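A toy Python model may help picture this. Everything in it is an assumption made for illustration: the class, its method names, the 4-lane width, and the choice to leave inactive lanes unchanged. The only part taken from the spec is that the masked operation has no mask operand and always reads its mask from the low-order bits of v0.

```python
LANES = 4  # assumed vector width for this toy model

class ToyVectorUnit:
    def __init__(self):
        # 32 vector registers, each holding LANES integer elements.
        self.v = [[0] * LANES for _ in range(32)]

    def mask_bit(self, i):
        # The mask is read from the low-order bits of v0. With only 4 lanes,
        # all of the mask bits fit inside element 0.
        return (self.v[0][0] >> i) & 1

    def move(self, dst, src):
        self.v[dst] = list(self.v[src])

    def add_masked(self, dst, a, b):
        # No mask operand: active lanes are chosen by v0, inactive lanes keep dst.
        for i in range(LANES):
            if self.mask_bit(i):
                self.v[dst][i] = self.v[a][i] + self.v[b][i]

vu = ToyVectorUnit()
vu.v[1] = [0b0101, 0, 0, 0]   # first mask, parked in v1
vu.v[2] = [0b1000, 0, 0, 0]   # second mask, parked in v2
vu.v[3] = [1, 2, 3, 4]
vu.v[4] = [10, 20, 30, 40]

vu.move(0, 1)                 # the mask must be in v0 before the masked op
vu.add_masked(5, 3, 4)
vu.move(0, 2)                 # switching masks means another move into v0
vu.add_masked(6, 3, 4)
```

Both masks can stay live in v1 and v2 the whole time; the cost is the extra move before each masked instruction.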
Although this may seem like a significant drawback, it's actually not that alarming, at least not to me. While it's true that you must insert various "move mask to v0" instructions that wouldn't otherwise be there, it's important to remember that these will not really be actual computation instructions. Moves from one vector register to another will always be simple register renames handled by the front end of any high-performance chip, and I would consider it highly unlikely that you would change masks so frequently as to overburden the front end.
Thus, although not having mask registers does pose a problem for register pressure (since your masks are now taking up registers you’d otherwise have for your actual values, etc.), that problem is one that would be completely alleviated by a future mask register spec. There is no obvious, large backwards-compatibility cost to the chip design for this interim solution existing, so I probably wouldn't take the time to write a whole article like this if that were the only issue.
What is the problem with the RISC-V mask bits?
The problem isn't the lack of mask registers, or the limitation that it has to be v0. Instead, the problem is where the bits are defined to be stored in v0.
If you imagine how a physical CPU or GPU has to be constructed in order to do large multi-input operations - such as multiplying 64 floats at a time, a not-unreasonable number for something like a GPU - there are hard constraints on where the bits of the inputs come from. You can imagine these inputs as being in "lanes" that are arranged across the chip such that the inputs to each lane are stored near the lane. Getting a 64-float multiply to run quickly means you can’t shuttle all 64 floats from some central location all the way to each of the 64 multiply units. Such an operation would be more like an L1 cache fetch, as opposed to direct use of a register.
So for example, when you think of a single vector register, rather than thinking of it as one big blob containing something like 64 floats, you instead have to think of it as 64 separate floats, each stored near its respective lane, such that when it is used as the input to a multiply, the floats are each close to the lane multiplier they actually need to use. In other words, it's more like 64 sets of single-float registers feeding 64 single-float multipliers than it is a big 64-float register stored in one place and a big 64-float multiplier in another.
When you use dedicated mask registers, the chip designer is free to spread those mask bits across all the lanes to ensure that the storage for the mask register keeps each bit close to the lane where it will be used as a predicate - just like they do with the vector registers themselves. However, the RISC-V V spec makes an unreasonable demand on the organization of the mask register bits: it specifically says that the bits of v0 that will be used for masking are the lowest n bits of the entire register.
This means, for example, if you had the 64-float registers hypothesized above, you would store the corresponding 64-bit mask in just the first two (low-order) float lanes of v0 (and the other 62 lanes are all empty). The 32 bits of the first lane would be the mask bits for the first 32 lanes, and the 32 bits of the second lane would be the mask bits for the second 32 lanes.
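The layout in that example can be written down as a couple of lines of arithmetic. This is just a sketch of the 64-lane, 32-bit-float case described above; the function names are invented, and the "empty" lanes are modeled as zero.

```python
LANE_BITS = 32  # width of one float lane, as in the example above

def mask_bit_location(i):
    """Return (lane, bit) within v0 where the mask bit for element i is stored."""
    return i // LANE_BITS, i % LANE_BITS

def pack_mask(bits, num_lanes):
    """Pack per-element mask bits into the lanes of a v0-like register."""
    lanes = [0] * num_lanes
    for i, b in enumerate(bits):
        lane, bit = mask_bit_location(i)
        lanes[lane] |= (b & 1) << bit
    return lanes

# An all-active mask for 64 elements, packed into a 64-lane register.
v0 = pack_mask([1] * 64, num_lanes=64)
```

Element 31's mask bit ends up at lane 0, bit 31, and element 32's at lane 1, bit 0; every lane from 2 upward holds no mask bits at all.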
This puts the chip designer in a bind: if an instruction calls for v0 to be used as a vector input, then the low-order bits of v0 would want to be stored near the low lanes. But if an instruction calls for v0 to be used as the mask, suddenly those same bits want to spread across all the lanes! The ISA basically wants the designer to have the same bit in two different places at once.
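One crude way to quantify the mismatch is to count how many mask bits are stored in a different lane from the one that needs them as a predicate. The sketch below does only that bookkeeping; it is not a model of wiring cost or of any real chip, and the function name is made up.

```python
def cross_lane_mask_bits(num_lanes, lane_bits=32):
    """Count mask bits whose storage lane differs from the lane that consumes them."""
    count = 0
    for i in range(num_lanes):
        stored_in = i // lane_bits   # lane holding bit i when v0 is treated as a vector
        used_by = i                  # lane that needs bit i as its predicate
        if stored_in != used_by:
            count += 1
    return count

mismatched = cross_lane_mask_bits(64)
```

For the 64-lane example, 63 of the 64 mask bits are in the "wrong" lane; only the bit for lane 0 happens to already be where it is used.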
And, although I’m not knowledgeable enough about chip design to know what the best work-around is here, that’s basically what they’re going to have to do. I’d imagine most high-performance designs will resort to some kind of shadowing where they try to keep the bits stored as if they were a mask register when they come out of mask instructions, as if they were a vector register when they come out of vector instructions, and then have some hacks in there to work around cases where the instruction stream does something unexpected and the bits aren’t in the right place.
i'm very happy you've started a longer-form, text-based, programming-centric, habit of communication, instead of making me parse your twitter threads. thanks, casey!
Found this interesting video that talks about an actual implementation (in hardware, as far as I can tell), and about exactly this problem (timestamp link): https://youtu.be/WzID6kk8RNs?t=567
The presentation is by Roger Espasa, from what I saw on the mailing list he's the co-chair of the RISC-V Vector work group / committee / thingy. He does mention the extra wiring, connecting the lanes, and a lot of complication needed for an out of order core, but it's hard (for me) to gauge the actual complexity from the presentation (like, did they work for 3 years on just this problem or was it work as usual)