4 Example Applications
This chapter contains a collection of short code snippets and algorithms using the Bitmanip extension. It also contains some examples of bit manipulation code that doesn’t require any extension beyond the base ISA.
4.1 Basic Bitmanipulation
4.1.1 Loading constants
On RV32 any arbitrary constant can be loaded using LUI+ADDI.
lui a0, ((v - (v << 20 >> 20)) >> 12) & 0xfffff
addi a0, a0, (v << 20 >> 20)
(Assuming signed 32-bit arithmetic for the expression (v << 20 >> 20).)
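The same split can be written as a C helper; the following is a minimal sketch (the function name is ours, not part of any API) mirroring the expressions in the operands above:
#include <stdint.h>

void split_const32(int32_t v, uint32_t *lui_imm, int32_t *addi_imm)
{
    int32_t lo = (int32_t)((uint32_t)v << 20) >> 20; // (v << 20 >> 20)
    *addi_imm = lo;                                  // sign-extended low 12 bits
    *lui_imm = ((uint32_t)(v - lo) >> 12) & 0xfffff; // remaining upper 20 bits
}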
On RV64, the following sequence will yield the 32-bit constant in sign-extended form.
lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
addiw a0, a0, (v << 52 >> 52)
(addiw is needed instead of addi to correctly handle the cases that have bits 11-30 of the constant set to one.)
Using addiwu instead of addiw produces a zero-extended version of the same constant, iff any of the bits 11-31 of the constant is zero.
lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
addiwu a0, a0, (v << 52 >> 52)
In the remaining cases, with bits 11-31 all set, addi+pack can be used to produce the constant:
addi a0, zero, v
pack a0, a0, zero
64-bit constants that consist of R ones, followed by S zeros, followed by T ones (from MSB to LSB, with R + S + T = 64) can be constructed using sloi and rori:
sloi a0, zero, R+T
rori a0, a0, R
Likewise, constructing a constant of R zeros, followed by S ones, followed by T zeros:
sloi a0, zero, S
slli a0, a0, T
Any constant that is just a repeating 16-bit pattern can be constructed in two instructions with lui and orc16. For example, constructing 0xABCD_ABCD_ABCD_ABCD:
lui a0, 0x0BCDA
orc16 a0, a0
Finally, any arbitrary 64-bit constant can be created using the following 5-instruction pattern and one spill register:
lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
addiw a0, a0, (v << 52 >> 52)
lui a1, ((v - (v << 20 >> 20)) >> 44) & 0xfffff
addiw a1, a1, (v << 20 >> 52)
pack a0, a0, a1
4.1.2 Bitfield extract
Extracting a bit field of length len at position pos can be done using two shift operations.
slli a0, a0, (XLEN-len-pos)
srli a0, a0, (XLEN-len)
Or using srai for a signed bit-field:
slli a0, a0, (XLEN-len-pos)
srai a0, a0, (XLEN-len)
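For reference, the equivalent C, with XLEN fixed at 32 for concreteness (a sketch; it requires 1 <= len and pos+len <= 32):
#include <stdint.h>

uint32_t bfext_u32(uint32_t x, int pos, int len)
{
    return (x << (32 - len - pos)) >> (32 - len);          // slli + srli
}

int32_t bfext_s32(uint32_t x, int pos, int len)
{
    return (int32_t)(x << (32 - len - pos)) >> (32 - len); // slli + srai
}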
4.1.3 Packing bit-fields
There are different ways of packing bit-fields with the help of RISC-V BitManip instructions.
For example, packing a 16-bit RGB value in 5:6:5 format from the bytes in a0, a1, and a2, using pack[h] and bext:
li t0, 0x00f8fcf8
packh a0, a0, a1
pack a0, a0, a2
bext a0, a0, t0
Or using funnel shifts (assuming a2 is already in zero-extended form and/or the upper bits of the return value do not matter):
srli a2, a2, 3
slli a1, a1, XLEN-8
fsli a1, a2, a1, 6
slli a0, a0, XLEN-8
fsli a0, a1, a0, 5
Using only base-ISA instructions, at least 7 instructions are needed to pack a 5:6:5 RGB value (assuming a0 is already in zero-extended form):
andi a2, a2, 0xf8
slli a2, a2, 9
andi a1, a1, 0xfc
slli a1, a1, 3
srli a0, a0, 3
or a1, a1, a2
or a0, a0, a1
Another example of packing bit fields is generating IEEE floats using only integer instructions (aka "soft float"). For example, generating a 32-bit float in the range [-1.0, +1.0) from a signed 16-bit integer:
short2float:
neg a1, a0
max a1, a1, a0
clz a2, a1
srli a0, a0, 31
sll a3, a1, a2
srli a3, a3, 15
neg a2, a2
addi a2, a2, 143
packh a0, a2, a0
pack a0, a3, a0
slli a0, a0, 7
orc a1, a1
and a0, a0, a1
ret
Or using funnel shifts:
short2float:
neg a1, a0
max a1, a1, a0
clz a2, a1
sll a3, a1, a2
slli a3, a3, 1
neg a2, a2
addi a2, a2, 143
fsri a3, a3, a2, 8
srli a0, a0, 31
fsri a0, a3, a0, 1
cmov a0, a1, a0, zero
ret
4.1.4 Parity check
The parity of a word (xor of all bits) is the LSB of the population count.
pcnt a0, a0
andi a0, a0, 1
4.1.5 Average of two integers
The following four instructions calculate the average of the unsigned
integers in a0
and a1
, with compensation for overflow:
and a2, a0, a1
xor a0, a0, a1
srli a0, a0, 1
add a0, a0, a2
And likewise the average of two signed integers:
and a2, a0, a1
xor a0, a0, a1
srai a0, a0, 1
add a0, a0, a2
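Both sequences correspond to the well-known overflow-free average idiom in C (a sketch; the signed variant assumes an arithmetic right shift on signed operands):
#include <stdint.h>

uint32_t avg_unsigned(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1); // common bits plus half the differing bits
}

int32_t avg_signed(int32_t a, int32_t b)
{
    return (a & b) + ((a ^ b) >> 1); // rounds towards negative infinity
}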
With fsri the unsigned case can be accomplished in just three instructions:
add a0, a0, a1
sltu a1, a0, a1
fsri a0, a0, a1, 1
4.1.6 Detecting integer overflow
Overflow in unsigned addition can be detected in two instructions:
add a0, a1, a2
bltu a0, a1, overflow
For signed addition, if the sign of one operand is known, for example because it is constant:
addi a0, a1, +imm
blt a0, a1, overflow
addi a0, a1, -imm
bgt a0, a1, overflow
And for signed addition in the general case:
add a0, a1, a2
slti a3, a1, 0
slt a4, a0, a2
bne a3, a4, overflow
And finally, generating the carry flag for an addition:
add a0, a1, a2
sltu a3, a0, a1
Thus, adding a0, a1, and a2, with the result in a0 and carry-out in a1:
add a0, a0, a1
sltu a1, a0, a1
add a0, a0, a2
sltu a2, a0, a2
add a1, a1, a2
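This building block extends naturally to multi-word arithmetic. The following is a hedged C sketch of an n-word bigint addition using the same carry recurrence (the function name and the uint_xlen_t stand-in are ours):
#include <stdint.h>

typedef uint32_t uint_xlen_t; // stand-in for an XLEN-wide unsigned type

void bigint_add(uint_xlen_t *r, const uint_xlen_t *a,
                const uint_xlen_t *b, int n)
{
    uint_xlen_t carry = 0;
    for (int i = 0; i < n; i++) {
        uint_xlen_t t = a[i] + carry;
        uint_xlen_t c = t < carry;  // carry out of the first add (0 or 1)
        r[i] = t + b[i];
        carry = c | (r[i] < b[i]);  // at most one of the two can be set
    }
}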
4.1.7 Fuse-able sequences for logic operations
RISC-V has dedicated instructions for branching on equal/not-equal. But C code such as the following would require set-equal and set-not-equal instructions, similar to slt.
int is_equal = (a == b);
int is_noteq = (c != d);
Those can be implemented using the following fuse-able sequences:
sub rd, rs1, rs2
sltiu rd, rd, 1
sub rd, rs1, rs2
sltu rd, zero, rd
Likewise for logic OR:
int logic_or = (c || d);
or rd, rs1, rs2
sltu rd, zero, rd
And for logic AND, if rd != rs2:
int logic_and = (c && d);
orc rd, rs1
and rd, rd, rs2
sltu rd, zero, rd
4.1.8 Rotate shift of bytes and half-words
Rotate right shift of the byte in a0 by the shift amount in a1, assuming a0 is stored in zero-extended form:
orc8 a0, a0
ror a0, a0, a1
andi a0, a0, 255
And rotate right shift of the 16-bit half-word in a0 by the shift amount in a1, assuming a0 is stored in zero-extended form:
orc16 a0, a0
ror a0, a0, a1
pack[w] a0, a0, zero
4.1.9 Rank and select
Rank and select are fundamental operations in succinct data structures [SelectX86].
select(a0, a1) returns the position of the a1-th set bit in a0. It can be implemented efficiently using bdep and ctz:
select:
sbset a1, zero, a1
bdep a0, a1, a0
ctz a0, a0
ret
rank(a0, a1) returns the number of set bits in a0 up to and including position a1.
rank:
not a1, a1
sll a0, a0, a1
pcnt a0, a0
ret
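Expressed in C with intrinsic-style names (a sketch; _rv_bdep, _rv_pcnt, and the XLEN constant are assumptions here, following the naming style of the _rv_ctz intrinsic used later in this chapter; ctz of zero is taken to return XLEN):
// Position of the n-th set bit of x, counting from 0 at the LSB end.
int select(uint_xlen_t x, int n)
{
    return _rv_ctz(_rv_bdep((uint_xlen_t)1 << n, x));
}

// Number of set bits in x up to and including position pos.
int rank(uint_xlen_t x, int pos)
{
    return _rv_pcnt(x << (~pos & (XLEN - 1))); // shift amount = XLEN-1-pos
}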
4.1.10 OR/AND/XOR-reduce in byte vectors
OR-ing the bytes in a register and returning the resulting byte is easy with GORC:
gorci a0, a0, -8
andi a0, a0, 255
AND-ing can be accomplished by applying De Morgan's laws:
not a0, a0
gorci a0, a0, -8
not a0, a0
andi a0, a0, 255
XOR-ing can be accomplished with CLMUL (see also section 1.7).
addi a1, zero, 0x80
gorci a1, a1, -8
clmulr a0, a0, a1
andi a0, a0, 255
Here the first two instructions (addi+gorci) just create the constant 0x8080..8080.
Finally, on RV64, XOR-ing the bytes in a register can also be accomplished with BMATXOR:
addi a1, zero, 0xff
bmatxor a0, a1, a0
4.1.11 Counting trailing non-zero bytes
Counting the trailing (LSB-end) non-zero bytes in a word is a helpful operation in optimized implementations of strlen() and strcpy():
int count_trailing_nonzero_bytes(long x)
{
return _rv_ctz(~_rv_orc_b(x)) >> 3;
}
4.1.12 Finding bytes of certain values
Finding zero bytes is a useful operation for strchr() and memchr():
bool check_zero_bytes(long x)
{
return ~_rv_orc_b(x) != 0;
}
To find other bytes we simply XOR the value with a mask of the byte value we are looking for:
bool check_byte(long x, unsigned char c)
{
return ~_rv_orc_b(x ^ _rv_orc8(c)) != 0;
}
These schemes can easily be extended with ctz and pcnt to perform operations such as counting the number of bytes of a certain value within a word, or finding the position of the first such byte, as sketched below.
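For example, a hedged sketch of the position-finding variant (the helper name is ours; since ctz of zero returns XLEN, the function returns XLEN/8 when no byte matches):
int find_byte_position(long x, unsigned char c)
{
    long match = ~_rv_orc_b(x ^ _rv_orc8(c)); // 0xff in each matching byte
    return _rv_ctz(match) >> 3;               // offset of first match, in bytes
}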
4.1.13 Fill right of most significant set bit
The “fill right” or “fold right” operation is a pattern commonly used in bit manipulation code. [MAGIC]
The straight-forward RV64 implementation requires 12 instructions:
uint64_t rfill(uint64_t x)
{
x |= x >> 1; // SRLI, OR
x |= x >> 2; // SRLI, OR
x |= x >> 4; // SRLI, OR
x |= x >> 8; // SRLI, OR
x |= x >> 16; // SRLI, OR
x |= x >> 32; // SRLI, OR
return x;
}
With clz it can be implemented in only 4 instructions. Notice the handling of the case x = 0 using sltiu+addi.
uint64_t rfill_clz(uint64_t x)
{
uint64_t t;
t = clz(x); // CLZ
x = (!x)-1; // SLTIU, ADDI
x = x >> (t & 63); // SRL
return x;
}
Alternatively, a Trailing Bit Manipulation (TBM) code pattern can be used together with rev to implement this function in 4 instructions:
uint64_t rfill_rev(uint64_t x)
{
x = rev(x); // GREVI
x = x | ~(x - 1); // ADDI, ORN
x = rev(x); // GREVI
return x;
}
Finally, there is another implementation in 4 instructions using BMATOR, if we do not count the extra instructions for loading utility matrices.
uint64_t rfill_bmat(uint64_t x)
{
uint64_t m0, m1, m2, t;
m0 = 0xFF7F3F1F0F070301LL; // LD
m1 = bmatflip(m0 << 8); // SLLI, BMATFLIP
m2 = -1LL; // ADDI
t = bmator(x, m0); // BMATOR
x = bmator(x, m2); // BMATOR
x = bmator(m1, x); // BMATOR
x |= t; // OR
return x;
}
4.1.14 Round to next power of two
One common application of rfill() is rounding up to the next power of two:
uint64_t round_pow2(uint64_t x)
{
return rfill(x-1)+1;
}
This can also be implemented in just 4 instructions, if we don't care about the case where the above code overflows because x is already larger than the largest power of two representable in a uint64_t.
uint64_t round_pow2(uint64_t x)
{
uint64_t t;
t = clz(x-1); // ADDI, CLZ
x = ror(!!x, t); // SLTU, ROR
return x;
}
Note that this code handles 0 → 0 and 1 → 1 correctly, i.e. it is equivalent to rfill(x-1)+1.
4.2 Packed vectors
4.2.1 Packing bytes
The following RV32 code packs the lower 8 bits from a0, a1, a2, a3 into a 32-bit word returned in a0, ignoring other bits in the input values.
packh a0, a0, a1
packh a1, a2, a3
pack a0, a0, a1
And the following RV64 code packs 8 bytes into a register.
packh a0, a0, a1
packh a1, a2, a3
packh a2, a4, a5
packh a3, a6, a7
packw a0, a0, a1
packw a1, a2, a3
pack a0, a0, a1
4.2.2 Permuting bytes
There are 24 ways of arranging the four bytes in a 32-bit word. ror, grev, and [un]shfl can perform any of those permutations in at most 3 instructions. Table [permbytes] lists those sequences. [Wolf19A]
Bytes | Instructions |
---|---|
A B C D | initial byte order |
A B D C | ROR(24),SHFL(8),ROR(8) |
A C B D | SHFL(8) |
A C D B | ROR(8),GREV(8),SHFL(8) |
A D B C | ROR(16),SHFL(8),ROR(24) |
A D C B | ROR(8),GREV(8) |
B A C D | ROR(8),SHFL(8),ROR(24) |
B A D C | GREV(8) |
B C A D | ROR(16),SHFL(8),ROR(8) |
B C D A | ROR(24) |
B D A C | GREV(8),SHFL(8) |
B D C A | ROR(24),SHFL(8) |
C A B D | ROR(8),GREV(24),SHFL(8) |
C A D B | ROR(16),SHFL(8) |
C B A D | ROR(8),GREV(24) |
C B D A | SHFL(8),ROR(24) |
C D A B | ROR(16) |
C D B A | ROR(8),SHFL(8),ROR(8) |
D A B C | ROR(8) |
D A C B | SHFL(8),ROR(8) |
D B A C | ROR(8),SHFL(8) |
D B C A | GREV(24),SHFL(8) |
D C A B | ROR(24),SHFL(8),ROR(24) |
D C B A | GREV(24) |
4.2.3 Widening and narrowing
The [un]zip instructions can help with widening and narrowing packed vectors. For example, narrowing the bytes in two words into a single word of nibbles, with the values from a0 in the LSB half and the values from a1 in the MSB half:
unzip4 a0, a0
unzip4 a1, a1
pack a0, a0, a1
And widening the nibbles from a0 into bytes in a1 (MSB half) and a0 (LSB half), with zero extension:
srli a1, a0, XLEN/2
pack a0, a0, zero
zip4 a1, a1
zip4 a0, a0
And finally the same widening operation with sign extension:
addi t0, zero, 8
orc4 t0, t0
and t0, t0, a0
orc.n t0, t0
srli t1, t0, XLEN/2
srli a1, a0, XLEN/2
pack a1, a1, t1
pack a0, a0, t0
zip4 a1, a1
zip4 a0, a0
4.2.4 Shifting packed vector elements
Using zip we can re-arrange the bits in a packed vector of N elements so that a shift by k of each element becomes a shift by Nk of the entire new vector. So we zip, shift, and then unzip to shuffle everything back. The number of zip and unzip operations needed is log2(N). This works for all kinds of shift operations. For example, rotating a vector of bytes on RV32 in 6 instructions:
zip a0, a0
zip a0, a0
slli a1, a1, 2
ror a0, a0, a1
unzip a0, a0
unzip a0, a0
Because zip; zip; zip is equal to unzip; unzip on RV32, and equal to unzip; unzip; unzip on RV64, we never need more than 2 [un]zip operations on RV32, or 3 [un]zip operations on RV64.
4.2.5 Adding packed vectors
The following six instructions will add the elements of the two vectors passed in a0 and a1, and return the vector of sums in a0. This expects a mask in a2 that marks the MSB of each vector element. For a vector of bytes this mask would be 0x8080...80 (which can be obtained in two instructions via orc8(0x80)).
xor a3, a0, a1
and a3, a3, a2
andn a0, a0, a2
andn a1, a1, a2
add a0, a0, a1
xor a0, a0, a3
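The same sequence as a C sketch (the function name is ours; msb_mask plays the role of a2, and uint_xlen_t is the XLEN-wide unsigned type used in the C examples of the next section):
uint_xlen_t swar_add(uint_xlen_t a, uint_xlen_t b, uint_xlen_t msb_mask)
{
    uint_xlen_t msb_xor = (a ^ b) & msb_mask;     // MSBs, added without carry
    uint_xlen_t low_sum = (a & ~msb_mask)
                        + (b & ~msb_mask);        // carries stop at the cleared MSBs
    return low_sum ^ msb_xor;                     // fold the MSB sums back in
}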
4.3 Funnel shifts
A funnel shift takes two XLEN registers, concatenates them to a 2 × XLEN word, shifts that by a certain amount, then returns the lower half of the result for a right shift and the upper half of the result for a left shift.
The fsl, fsr, and fsri instructions perform funnel shifts.
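For reference in the following examples, here is a C sketch of the fsl semantics as we read them (not normative; XLEN is fixed at 32 for concreteness). fsr is analogous with the shift direction reversed:
#include <stdint.h>

#define XLEN 32
typedef uint32_t uint_xlen_t;

uint_xlen_t fsl(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
    int shamt = rs2 & (2 * XLEN - 1);  // shift amount modulo 2*XLEN
    uint_xlen_t a = rs1, b = rs3;
    if (shamt >= XLEN) {               // shifting past XLEN swaps the halves
        shamt -= XLEN;
        a = rs3;
        b = rs1;
    }
    return shamt ? (a << shamt) | (b >> (XLEN - shamt)) : a;
}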
4.3.1 Bigint shift
A common application for funnel shifts is shift operations in bigint libraries.
For example, the following functions implement rotate-shift operations for bigints made from n XLEN-wide words.
void bigint_rol(uint_xlen_t data[], int n, int shamt)
{
if (n <= 0)
return;
uint_xlen_t buffer = data[n-1];
for (int i = n-1; i > 0; i--)
data[i] = fsl(data[i], shamt, data[i-1]);
data[0] = fsl(data[0], shamt, buffer);
}
void bigint_ror(uint_xlen_t data[], int n, int shamt)
{
if (n <= 0)
return;
uint_xlen_t buffer = data[0];
for (int i = 0; i < n-1; i++)
data[i] = fsr(data[i], shamt, data[i+1]);
data[n-1] = fsr(data[n-1], shamt, buffer);
}
These versions only work for shift amounts < XLEN. But functions supporting other kinds of shift operations, or shifts ≥ XLEN, can easily be built with fsl and fsr.
4.3.2 Parsing bit-streams of 27-bit words
The following function parses n 27-bit words from a packed array of XLEN words:
void parse_27bit(uint_xlen_t *idata, uint_xlen_t *odata, int n)
{
uint_xlen_t lower = 0, upper = 0;
int reserve = 0;
while (n--) {
if (reserve < 27) {
uint_xlen_t buf = *(idata++);
lower |= sll(buf, reserve);
upper = reserve ? srl(buf, -reserve) : 0;
reserve += XLEN;
}
*(odata++) = lower & ((1 << 27)-1);
lower = fsr(lower, 27, upper);
upper = srl(upper, 27);
reserve -= 27;
}
}
And here is the same thing in RISC-V assembler:
parse_27bit:
li t1, 0 ; lower
li t2, 0 ; upper
li t3, 0 ; reserve
li t4, 27 ; shamt
slo t5, zero, t4 ; mask
beqz a2, endloop ; while (n--)
loop:
addi a2, a2, -1
bge t3, t4, output ; if (reserve < 27)
lw t6, 0(a0) ; buf = *(idata++)
addi a0, a0, 4
sll t0, t6, t3 ; lower |= sll(buf, reserve)
or t1, t1, t0
sub t0, zero, t3 ; upper = reserve ? srl(buf, -reserve) : 0
srl t0, t6, t0
cmov t2, t3, t0, zero
addi t3, t3, 32 ; reserve += XLEN;
output:
and t6, t1, t5 ; *(odata++) = lower & ((1 << 27)-1)
sw t6, 0(a1)
addi a1, a1, 4
fsr t1, t1, t2, t4 ; lower = fsr(lower, 27, upper)
srl t2, t2, t4 ; upper = srl(upper, 27)
sub t3, t3, t4 ; reserve -= 27
bnez a2, loop ; while (n--)
endloop:
ret
A loop iteration without fetch is 9 instructions long, and a loop iteration with fetch is 17 instructions long.
Without ternary operators that would be 13 instructions and 22 instructions, i.e. assuming one cycle per instruction, that function would be about 30% slower without ternary instructions.
4.3.3 Parsing bit-streams of 6-bit words
The following code accepts three 64-bit words in t0, t1, and t2 containing a bit-stream of 32 6-bit words, and outputs these 32 values in the bytes of t0, t1, t2, and t3.
addi a0, zero, 0x3f
orc8 a0, a0 // a0 = 0x3f3f3f3f3f3f3f3f
srli t3, t2, 16
bdep t3, t3, a0
fsli t2, t2, t1, 32
bdep t2, t2, a0
fsli t1, t1, t0, 16
bdep t1, t1, a0
bdep t0, t0, a0
That's 7 instructions, not counting the two instructions for constructing the mask in a0.
Without funnel shift this operation requires 11 instructions:
addi a0, zero, 0x3f
orc8 a0, a0 // a0 = 0x3f3f3f3f3f3f3f3f
srli t3, t2, 16
bdep t3, t3, a0
slli t2, t2, 32
srli a1, t1, 32
or t2, t2, a1
bdep t2, t2, a0
slli t1, t1, 16
srli a1, t0, 48
or t1, t1, a1
bdep t1, t1, a0
bdep t0, t0, a0
Sign-extending the 6-bit values in the bytes of t0, t1, t2, and t3 with bit-matrix multiply:
li a0, 0x80c0e01008040201
bmator t0, t0, a0
bmator t1, t1, a0
bmator t2, t2, a0
bmator t3, t3, a0
Or without bit-matrix multiply:
addi a0, zero, 0x60
orc8 a0, a0 // a0 = 0x6060606060606060
add t0, t0, a0
add t1, t1, a0
add t2, t2, a0
add t3, t3, a0
xor t0, t0, a0
xor t1, t1, a0
xor t2, t2, a0
xor t3, t3, a0
4.3.4 Fixed-point multiply
A fixed-point multiply is simply an integer multiply, followed by a right shift. If the entire dynamic range of XLEN bits should be usable for the factors, then the product before the shift must be 2*XLEN bits wide. Therefore mul+mulh is needed for the multiplication, and funnel shift instructions can help with the final right shift. For fixed-point numbers with N fraction bits:
mul_fracN:
mulh a2, a0, a1
mul a0, a0, a1
fsri a0, a0, a2, N
ret
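In C this is the usual widening multiply followed by a shift. A sketch for RV64 with N = 16 fraction bits, assuming the compiler provides a 128-bit integer type:
#include <stdint.h>

int64_t mul_frac16(int64_t a, int64_t b)
{
    __int128 p = (__int128)a * b; // mul + mulh form the full 128-bit product
    return (int64_t)(p >> 16);    // fsri performs this final right shift
}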
4.4 Arbitrary bit permutations
This section lists code snippets for computing arbitrary bit permutations that are defined by data (as opposed to bit permutations that are known at compile time and can likely be compiled into shift-and-mask operations and/or a few instances of bext/bdep).
4.4.1 Using butterfly operations
The following macro performs a stage-N butterfly operation on the word in a0 using the mask in a1.
grevi a2, a0, (1 << N)
cmix a0, a1, a2, a0
The bitmask in a1 must be preformatted correctly for the selected butterfly stage. A butterfly operation only has an XLEN/2 wide control word. The following macros format the mask, assuming those XLEN/2 bits are in the lower half of a1 on entry:
bfly_msk_0:
pack a1, a1, a1
zip a1, a1
bfly_msk_1:
pack a1, a1, a1
zip2 a1, a1
bfly_msk_2:
pack a1, a1, a1
zip4 a1, a1
...
A sequence of 2 ⋅ log2(XLEN) − 1 butterfly operations can perform any arbitrary bit permutation (Beneš network):
butterfly(LOG2_XLEN-1)
butterfly(LOG2_XLEN-2)
...
butterfly(0)
...
butterfly(LOG2_XLEN-2)
butterfly(LOG2_XLEN-1)
Many permutations arising from real-world applications can be implemented using shorter sequences. For example, any sheep-and-goats operation (SAG, see section 1.4.4) with either the sheep or the goats bit reversed can be implemented in log2(XLEN) butterfly operations.
Reversing a permutation implemented using butterfly operations is as simple as reversing the order of butterfly operations.
4.4.2 Using omega-flip networks
The omega operation is a stage-0 butterfly preceded by a zip operation:
zip a0, a0
grevi a2, a0, 1
cmix a0, a1, a2, a0
The flip operation is a stage-0 butterfly followed by an unzip operation:
grevi a2, a0, 1
cmix a0, a1, a2, a0
unzip a0, a0
A sequence of log2(XLEN) omega operations followed by log2(XLEN) flip operations can implement any arbitrary 32 bit permutation.
As for butterfly networks, permutations arising from real-world applications can often be implemented using a shorter sequence.
4.4.3 Using baseline networks
Another way of implementing arbitrary 32 bit permutations is using a baseline network followed by an inverse baseline network.
A baseline network is a sequence of log2(XLEN) butterfly(0) operations interleaved with unzip operations. For example, a 32-bit baseline network:
butterfly(0)
unzip
butterfly(0)
unzip.h
butterfly(0)
unzip.b
butterfly(0)
unzip.n
butterfly(0)
An inverse baseline network is a sequence of log2(XLEN) butterfly(0) operations interleaved with zip operations. The order is opposite to the order in a baseline network. For example, a 32-bit inverse baseline network:
butterfly(0)
zip.n
butterfly(0)
zip.b
butterfly(0)
zip.h
butterfly(0)
zip
butterfly(0)
A baseline network followed by an inverse baseline network can implement any arbitrary bit permutation.
4.4.4 Using sheep-and-goats
The Sheep-and-goats (SAG) operation is a common operation for bit permutations. It moves all the bits selected by a mask (goats) to the LSB end of the word and all the remaining bits (sheep) to the MSB end of the word, without changing the order of sheep or goats.
The SAG operation can easily be performed using bext (data in a0 and mask in a1):
bext a2, a0, a1
not a1, a1
bext a0, a0, a1
pcnt a1, a1
ror a0, a0, a1
or a0, a0, a2
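The same operation as a C sketch (the intrinsic names _rv_bext, _rv_pcnt, and _rv_ror are assumptions, following the naming style of the other C snippets in this chapter):
uint_xlen_t sag(uint_xlen_t x, uint_xlen_t mask)
{
    uint_xlen_t goats = _rv_bext(x, mask);   // selected bits, at the LSB end
    uint_xlen_t sheep = _rv_bext(x, ~mask);  // remaining bits
    return _rv_ror(sheep, _rv_pcnt(~mask)) | goats; // rotate sheep to the MSB end
}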
Any arbitrary bit permutation can be implemented in log2(XLEN) SAG operations.
Hacker's Delight describes an optimized standard C implementation of the SAG operation. Their algorithm takes 254 instructions (for 32 bit) or 340 instructions (for 64 bit) on their reference RISC instruction set. [Seander05]
4.4.5 Using bit-matrix multiply
bmat[x]or performs a permutation of bits within each byte when used with a permutation matrix in rs2, and performs a permutation of bytes when used with a permutation matrix in rs1.
4.5 Mirroring and rotating bitboards
Bitboards are 64-bit bitmasks that are used to represent part of the game state in chess engines (and other board game AIs). The bits in the bitmask correspond to squares on a 8 × 8 chess board:
56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7
Many bitboard operations are simple straight-forward operations such as bitwise-AND, but mirroring and rotating bitboards can take up to 20 instructions on x86.
4.5.1 Mirroring bitboards
Flipping horizontally or vertically can easily be done with grevi:
Flip horizontal:
63 62 61 60 59 58 57 56 RISC-V Bitmanip:
55 54 53 52 51 50 49 48 rev.b
47 46 45 44 43 42 41 40
39 38 37 36 35 34 33 32
31 30 29 28 27 26 25 24 x86:
23 22 21 20 19 18 17 16 13 operations
15 14 13 12 11 10 9 8
7 6 5 4 3 2 1 0
Flip vertical:
0 1 2 3 4 5 6 7 RISC-V Bitmanip:
8 9 10 11 12 13 14 15 rev8
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
32 33 34 35 36 37 38 39 x86:
40 41 42 43 44 45 46 47 bswap
48 49 50 51 52 53 54 55
56 57 58 59 60 61 62 63
Rotating by 180 (flip horizontal and vertical):
Rotate 180:
7 6 5 4 3 2 1 0 RISC-V Bitmanip:
15 14 13 12 11 10 9 8 rev
23 22 21 20 19 18 17 16
31 30 29 28 27 26 25 24
39 38 37 36 35 34 33 32 x86:
47 46 45 44 43 42 41 40 14 operations
55 54 53 52 51 50 49 48
63 62 61 60 59 58 57 56
4.5.2 Rotating bitboards
Using zip a bitboard can be transposed easily:
Transpose:
7 15 23 31 39 47 55 63 RISC-V Bitmanip:
6 14 22 30 38 46 54 62 zip, zip, zip
5 13 21 29 37 45 53 61
4 12 20 28 36 44 52 60
3 11 19 27 35 43 51 59 x86:
2 10 18 26 34 42 50 58 18 operations
1 9 17 25 33 41 49 57
0 8 16 24 32 40 48 56
A rotation is simply the composition of a flip operation and a transpose operation. This takes 19 operations on x86 [ChessProg]. With Bitmanip the rotate operation only takes 4 operations:
rotate_bitboard:
rev8 a0, a0
zip a0, a0
zip a0, a0
zip a0, a0
4.5.3 Explanation
The bit indices for a 64-bit word are 6 bits wide. Let i[5:0] be the index of a bit in the input, and let i'[5:0] be the index of the same bit after the permutation.
As an example, a rotate left shift by N can be expressed in this notation as i'[5:0] = i[5:0] + N (mod 64).
The GREV operation with shamt N is i'[5:0] = i[5:0] XOR N.
And a SHFL operation corresponds to a rotate left shift by one position of any contiguous region of i[5:0]. For example, zip is a left rotate shift of the entire bit index:
i'[5:0] = {i[4:0], i[5]}
And zip4 performs a left rotate shift on bits 5:2:
i'[5:0] = {i[4:2], i[5], i[1:0]}
In a bitboard, i[2:0] corresponds to the X coordinate of a board position, and i[5:3] corresponds to the Y coordinate.
Therefore flipping the board horizontally is the same as negating bits i[2:0], which is the operation performed by grevi rd, rs, 7 (rev.b).
Likewise flipping the board vertically is done by grevi rd, rs, 56 (rev8).
Finally, transposing corresponds to swapping the lower and upper halves of i[5:0], i.e. rotate shifting i[5:0] by 3 positions. This can easily be done by rotate shifting the entire i[5:0] by one bit position (zip) three times.
4.5.4 Rotating Bitcubes
Let's define a bitcube as a 4 × 4 × 4 cube with x = i[1:0], y = i[3:2], and z = i[5:4]. Using the same methods as described above we can easily rotate a bitcube by 90° around the X-, Y-, and Z-axes:
rotate_x:
rev16 a0, a0
zip4 a0, a0
zip4 a0, a0
rotate_y:
rev.n a0, a0
zip a0, a0
zip a0, a0
zip4 a0, a0
zip4 a0, a0
rotate_z:
rev4.h a0, a0
zip.h a0, a0
zip.h a0, a0
4.6 Manipulating 64x64 Bit Matrices
The bmat[x]or and bmatflip instructions operate on 8x8 bit matrices stored in single 64-bit registers, where each byte of such a 64-bit value represents one row (column) of an 8x8 bit matrix.
Let's assume we have a 64x64 bit matrix in memory, stored as one row (column) per 64-bit value. In order to use bmat[x]or and bmatflip on such a matrix, we must first convert it into an 8x8 block matrix of 64 individual 8x8 matrices, each stored in a 64-bit value. The following function performs this transformation for a single row (column) of the block matrix in 40 instructions.
void conv8x8(const uint64_t x[8], uint64_t y[8])
{
uint64_t x0_x1_31_00 = _rv64_pack (x[0], x[1]);
uint64_t x2_x3_31_00 = _rv64_pack (x[2], x[3]);
uint64_t x4_x5_31_00 = _rv64_pack (x[4], x[5]);
uint64_t x6_x7_31_00 = _rv64_pack (x[6], x[7]);
uint64_t x0_x1_63_32 = _rv64_packu(x[0], x[1]);
uint64_t x2_x3_63_32 = _rv64_packu(x[2], x[3]);
uint64_t x4_x5_63_32 = _rv64_packu(x[4], x[5]);
uint64_t x6_x7_63_32 = _rv64_packu(x[6], x[7]);
uint64_t x0_x1_31_00_z = _rv64_unzip16(x0_x1_31_00);
uint64_t x2_x3_31_00_z = _rv64_unzip16(x2_x3_31_00);
uint64_t x4_x5_31_00_z = _rv64_unzip16(x4_x5_31_00);
uint64_t x6_x7_31_00_z = _rv64_unzip16(x6_x7_31_00);
uint64_t x0_x1_63_32_z = _rv64_unzip16(x0_x1_63_32);
uint64_t x2_x3_63_32_z = _rv64_unzip16(x2_x3_63_32);
uint64_t x4_x5_63_32_z = _rv64_unzip16(x4_x5_63_32);
uint64_t x6_x7_63_32_z = _rv64_unzip16(x6_x7_63_32);
uint64_t x0_x1_x2_x3_15_00 = _rv64_pack (x0_x1_31_00_z, x2_x3_31_00_z);
uint64_t x4_x5_x6_x7_15_00 = _rv64_pack (x4_x5_31_00_z, x6_x7_31_00_z);
uint64_t x0_x1_x2_x3_31_16 = _rv64_packu(x0_x1_31_00_z, x2_x3_31_00_z);
uint64_t x4_x5_x6_x7_31_16 = _rv64_packu(x4_x5_31_00_z, x6_x7_31_00_z);
uint64_t x0_x1_x2_x3_47_32 = _rv64_pack (x0_x1_63_32_z, x2_x3_63_32_z);
uint64_t x4_x5_x6_x7_47_32 = _rv64_pack (x4_x5_63_32_z, x6_x7_63_32_z);
uint64_t x0_x1_x2_x3_63_48 = _rv64_packu(x0_x1_63_32_z, x2_x3_63_32_z);
uint64_t x4_x5_x6_x7_63_48 = _rv64_packu(x4_x5_63_32_z, x6_x7_63_32_z);
uint64_t x0_x1_x2_x3_15_00_z = _rv64_unzip8(x0_x1_x2_x3_15_00);
uint64_t x4_x5_x6_x7_15_00_z = _rv64_unzip8(x4_x5_x6_x7_15_00);
uint64_t x0_x1_x2_x3_31_16_z = _rv64_unzip8(x0_x1_x2_x3_31_16);
uint64_t x4_x5_x6_x7_31_16_z = _rv64_unzip8(x4_x5_x6_x7_31_16);
uint64_t x0_x1_x2_x3_47_32_z = _rv64_unzip8(x0_x1_x2_x3_47_32);
uint64_t x4_x5_x6_x7_47_32_z = _rv64_unzip8(x4_x5_x6_x7_47_32);
uint64_t x0_x1_x2_x3_63_48_z = _rv64_unzip8(x0_x1_x2_x3_63_48);
uint64_t x4_x5_x6_x7_63_48_z = _rv64_unzip8(x4_x5_x6_x7_63_48);
y[0] = _rv64_pack (x0_x1_x2_x3_15_00_z, x4_x5_x6_x7_15_00_z);
y[1] = _rv64_packu(x0_x1_x2_x3_15_00_z, x4_x5_x6_x7_15_00_z);
y[2] = _rv64_pack (x0_x1_x2_x3_31_16_z, x4_x5_x6_x7_31_16_z);
y[3] = _rv64_packu(x0_x1_x2_x3_31_16_z, x4_x5_x6_x7_31_16_z);
y[4] = _rv64_pack (x0_x1_x2_x3_47_32_z, x4_x5_x6_x7_47_32_z);
y[5] = _rv64_packu(x0_x1_x2_x3_47_32_z, x4_x5_x6_x7_47_32_z);
y[6] = _rv64_pack (x0_x1_x2_x3_63_48_z, x4_x5_x6_x7_63_48_z);
y[7] = _rv64_packu(x0_x1_x2_x3_63_48_z, x4_x5_x6_x7_63_48_z);
}
Each of the 5 blocks in this function only consumes the eight outputs of the previous block. Therefore 16 registers are sufficient to run this function in registers only without the need to spill any data on the stack.
Note that this function is its own inverse. Therefore the same function can be used for the conversion from block matrix form back to row (column) major form.
A 64x64 bit matrix in block matrix form can easily be transposed by running bmatflip (or zip; zip; zip) on the blocks of the matrix and then renaming the individual 64-bit variables.
To multiply 64x64 bit matrices in block matrix form, the matrix-matrix product is decomposed in the obvious way into 8 × 8 × 8 = 512 bmat[x]or instructions and 7 × 8 × 8 = 448 [x]or instructions.
4.7 Inverting Xorshift RNGs
Xorshift RNGs are a class of fast RNGs for different bit widths. There are 648 Xorshift RNGs for 32 bits; the following is the one that the author of the original Xorshift RNG paper recommends. [Xorshift]
uint32_t xorshift32(uint32_t x)
{
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
return x;
}
This function of course has been designed and selected so it’s efficient, even without special bit-manipulation instructions. So let’s look at the inverse instead. First, the naïve form of inverting this function:
uint32_t xorshift32_inv(uint32_t x)
{
uint32_t t;
t = x ^ (x << 5);
t = x ^ (t << 5);
t = x ^ (t << 5);
t = x ^ (t << 5);
t = x ^ (t << 5);
x = x ^ (t << 5);
x = x ^ (x >> 17);
t = x ^ (x << 13);
x = x ^ (t << 13);
return x;
}
This translates to 18 RISC-V instructions, not including the function call overhead.
Obviously the C expression x ^ (x >> 17) is already its own inverse (because 17 ≥ XLEN/2) and therefore already has an efficient inverse. But the two other blocks can easily be inverted using a single clmul instruction each:
uint32_t xorshift32_inv(uint32_t x)
{
x = clmul(x, 0x42108421);
x = x ^ (x >> 17);
x = clmul(x, 0x04002001);
return x;
}
That's 8 RISC-V instructions, including 4 instructions for loading the constants, but not including the function call overhead.
An optimizing compiler could easily generate the clmul instructions and the magic constants from the C code for the naïve implementation. (0x04002001 = (1 << 2*13) | (1 << 13) | 1 and 0x42108421 = (1 << 6*5) | (1 << 5*5) | ... | (1 << 5) | 1.)
The obvious remaining question is: if clmul(x, 0x42108421) is the inverse of x ^ (x << 5), what's the inverse of x ^ (x >> 5)? It's clmulr(x, 0x84210842), where 0x84210842 is the bit-reversal of 0x42108421.
A special case of xorshift is x ^ (x >> 1), which is a gray encoder. The corresponding gray decoder is clmulr(x, 0xffffffff).
4.8 Cyclic redundancy checks (CRC)
There are special instructions for performing CRCs using the two most widespread 32-bit CRC polynomials, CRC-32 and CRC-32C.
CRCs with other polynomials can be computed efficiently using CLMUL. The following examples use CRC32Q.
The easiest way of implementing CRC32Q with clmul is using a Barrett reduction. On RV32:
uint32_t crc32q_simple(const uint32_t *data, int length)
{
uint32_t P = 0x814141AB; // CRC polynomial (implicit x^32)
uint32_t mu = 0xFEFF7F62; // x^64 divided by CRC polynomial
uint32_t mu1 = 0xFF7FBFB1; // "mu" with leading 1, shifted right by 1 bit
uint32_t crc = 0;
for (int i = 0; i < length; i++) {
crc ^= rev8(data[i]);
crc = clmulr(crc, mu1);
crc = clmul(crc, P);
}
return crc;
}
The following Python code calculates the value of mu for a given CRC polynomial:
def polydiv(dividend, divisor):
    quotient = 0
    while dividend.bit_length() >= divisor.bit_length():
        i = dividend.bit_length() - divisor.bit_length()
        dividend = dividend ^ (divisor << i)
        quotient |= 1 << i
    return quotient
P = 0x1814141AB
print("0x%X" % (polydiv(1<<64, P))) # prints 0x1FEFF7F62
A more efficient method is the following, which processes 64 bits at a time (RV64):
uint32_t crc32q_fast(const uint64_t *p, int len)
{
uint64_t P = 0x1814141ABLL; // CRC polynomial
uint64_t k1 = 0xA1FA6BECLL; // remainder of x^128 divided by CRC polynomial
uint64_t k2 = 0x9BE9878FLL; // remainder of x^96 divided by CRC polynomial
uint64_t k3 = 0xB1EFC5F6LL; // remainder of x^64 divided by CRC polynomial
uint64_t mu = 0x1FEFF7F62LL; // x^64 divided by CRC polynomial
uint64_t a0, a1, a2, t1, t2;
assert(len >= 2);
a0 = rev8(p[0]);
a1 = rev8(p[1]);
// Main loop: Reduce to 2x 64 bits
for (const uint64_t *t0 = p+2; t0 != p+len; t0++)
{
a2 = rev8(*t0);
t1 = clmulh(a0, k1);
t2 = clmul(a0, k1);
a0 = a1 ^ t1;
a1 = a2 ^ t2;
}
// Reduce to 64 bit, add 32 bit zero padding
t1 = clmulh(a0, k2);
t2 = clmul(a0, k2);
a0 = (a1 >> 32) ^ t1;
a1 = (a1 << 32) ^ t2;
t2 = clmul(a0, k3);
a1 = a1 ^ t2;
// Barrett Reduction
t1 = clmul(a1 >> 32, mu);
t2 = clmul(t1 >> 32, P);
a0 = a1 ^ t2;
return a0;
}
The main idea is to transform an array of arbitrary length into an array with the same CRC that's only two 64-bit elements long. (That's the "Main loop" portion of the above code.)
Then we further reduce it to just 64 bits, and finally use a Barrett reduction to get the final 32-bit result.
The following Python code can be used to calculate the "magic constants" k1, k2, and k3:
def polymod(dividend, divisor):
    while dividend.bit_length() >= divisor.bit_length():
        i = dividend.bit_length() - divisor.bit_length()
        dividend = dividend ^ (divisor << i)
    return dividend
print("0x%X" % (polymod(1<<128, P))) # prints 0xA1FA6BEC
print("0x%X" % (polymod(1<< 96, P))) # prints 0x9BE9878F
print("0x%X" % (polymod(1<< 64, P))) # prints 0xB1EFC5F6
The above example code is taken from [Wolf18A]. A more detailed description of the algorithms employed can be found in [FastCRC].
4.9 Decoding RISC-V Immediates
The following code snippets decode and sign-extend the immediate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice "nothing up my sleeve" examples for real-world bit permutations.
decode_s:
li t0, 0xfe000f80
bext a0, a0, t0
c.slli a0, 20
c.srai a0, 20
ret
decode_b:
li t0, 0xeaa800aa
rori a0, a0, 8
grevi a0, a0, 8
shfli a0, a0, 7
bext a0, a0, t0
c.slli a0, 20
c.srai a0, 19
ret
decode_j:
li t0, 0x800003ff
li t1, 0x800ff000
bext a1, a0, t1
c.slli a1, 23
rori a0, a0, 21
bext a0, a0, t0
c.slli a0, 12
c.or a0, a1
c.srai a0, 11
ret
// variant 1 (with RISC-V Bitmanip)
decode_cj:
li t0, 0x28800001
li t1, 0x000016b8
li t2, 0xb4e00000
li t3, 0x4b000000
bext a1, a0, t1
bdep a1, a1, t2
rori a0, a0, 11
bext a0, a0, t0
bdep a0, a0, t3
c.or a0, a1
c.srai a0, 20
ret
// variant 2 (without RISC-V Bitmanip)
decode_cj:
srli a5, a0, 2
srli a4, a0, 7
c.andi a4, 16
slli a3, a0, 3
c.andi a5, 14
c.add a5, a4
andi a3, a3, 32
srli a4, a0, 1
c.add a5, a3
andi a4, a4, 64
slli a2, a0, 1
c.add a5, a4
andi a2, a2, 128
srli a3, a0, 1
slli a4, a0, 19
c.add a5, a2
andi a3, a3, 768
c.slli a0, 2
c.add a5, a3
andi a0, a0, 1024
c.srai a4, 31
c.add a5, a0
slli a0, a4, 11
c.add a0, a5
ret
// variant 3 (with RISC-V Zbp only)
decode_cj:
shfli a0, a0, 15
rori a0, a0, 28
shfli a0, a0, 2
shfli a0, a0, 14
rori a0, a0, 26
shfli a0, a0, 8
rori a0, a0, 10
unshfli a0, a0, 12
rori a0, a0, 18
unshfli a0, a0, 14
rori a0, a0, 28
shfli a0, a0, 6
rori a0, a0, 28
unshfli a0, a0, 15
slli a0, a0, 21
srai a0, a0, 20
ret