RISC-V Bitmanip Extension

4 Example Applications

This chapter contains a collection of short code snippets and algorithms using the Bitmanip extension. It also contains some examples of bit manipulation code that doesn’t require any extension beyond the base ISA.

4.1 Basic Bitmanipulation

4.1.1 Loading constants

On RV32 any arbitrary constant can be loaded using LUI+ADDI.

  lui a0, ((v - (v << 20 >> 20)) >> 12) & 0xfffff
  addi a0, a0, (v << 20 >> 20)

(Assuming signed 32-bit arithmetic for the expression (v << 20 >> 20).)
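
At build time an assembler performs this split with integer arithmetic; a minimal C sketch (the helper name is illustrative, and it assumes the same wrap-around signed 32-bit arithmetic as the expressions above):

  #include <stdint.h>

  /* Split a 32-bit constant v into the LUI/ADDI pair used above. */
  void split_const32(int32_t v, int32_t *hi20, int32_t *lo12)
  {
    *lo12 = (v << 20) >> 20;                /* sign-extended low 12 bits */
    *hi20 = ((v - *lo12) >> 12) & 0xfffff;  /* LUI immediate */
  }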

On RV64, the following sequence yields the 32-bit constant in sign-extended form.

  lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
  addiw a0, a0, (v << 52 >> 52)

(addiw is needed instead of addi to correctly handle the cases where bits 11-30 of the constant are all set to one.)

Using addiwu instead of addiw produces a zero-extended version of the same constant, iff any of the bits 11-31 of the constant is zero.

  lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
  addiwu a0, a0, (v << 52 >> 52)

In the remaining cases with bits 11-31 all set, addi+pack can be used to produce the constant:

  addi a0, zero, v
  pack a0, a0, zero

64-bit constants consisting of R ones, followed by S zeros, followed by T ones (from MSB to LSB, with R + S + T = 64) can be constructed using sloi and rori:

  sloi a0, zero, R+T
  rori a0, a0, R

Likewise, constructing a constant consisting of R zeros, S ones, and T zeros (MSB to LSB):

  sloi a0, zero, S
  slli a0, a0, T

Any constant that is just a repeating 16-bit pattern can be constructed in two instructions with lui and orc16, which ORs together all 16-bit fields of its operand and writes the result to every field. For example, constructing 0xABCD_ABCD_ABCD_ABCD (lui loads 0x0BCDA000, and 0x0BCD OR 0xA000 = 0xABCD):

  lui a0, 0x0BCDA
  orc16 a0, a0

Finally, any arbitrary 64-bit constant can be created using the following 5-instruction pattern and one spill register:

  lui a0, ((v - (v << 52 >> 52)) >> 12) & 0xfffff
  addiw a0, a0, (v << 52 >> 52)
  lui a1, ((v - (v << 20 >> 20)) >> 44) & 0xfffff
  addiw a1, a1, (v << 20 >> 52)
  pack a0, a0, a1

4.1.2 Bitfield extract

Extracting a bit field of length len at position pos can be done using two shift operations.

  slli a0, a0, (XLEN-len-pos)
  srli a0, a0, (XLEN-len)

Or using srai for a signed bit-field.

  slli a0, a0, (XLEN-len-pos)
  srai a0, a0, (XLEN-len)
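
For reference, the same two extractions as C functions (a sketch assuming XLEN = 32 and pos + len <= 32):

  #include <stdint.h>

  uint32_t bf_extract_u(uint32_t x, int pos, int len)
  {
    return (x << (32 - len - pos)) >> (32 - len);            /* slli; srli */
  }

  int32_t bf_extract_s(uint32_t x, int pos, int len)
  {
    return (int32_t)(x << (32 - len - pos)) >> (32 - len);   /* slli; srai */
  }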

4.1.3 Packing bit-fields

There are different ways of packing bit-fields with the help of RISC-V BitManip instructions.

For example, packing a 16-bit RGB value in 5:6:5 format, from the bytes in a0, a1, and a2, using pack[h] and bext:

  li t0, 0x00f8fcf8

  packh a0, a0, a1
  pack  a0, a0, a2
  bext  a0, a0, t0

Or using funnel shifts (assuming a2 is already in zero-extended form and/or the upper bits of the return value do not matter):

  srli a2, a2, 3
  slli a1, a1, XLEN-8
  fsli a1, a2, a1, 6
  slli a0, a0, XLEN-8
  fsli a0, a1, a0, 5

Using only base-ISA instructions, at least 7 instructions are needed to pack a 5:6:5 RGB value (assuming a0 is already in zero-extended form):

  andi a2, a2, 0xf8
  slli a2, a2, 8
  andi a1, a1, 0xfc
  slli a1, a1, 3
  srli a0, a0, 3
  or a1, a1, a2
  or a0, a0, a1
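
For comparison, the same 5:6:5 packing as a plain C function (r, g, b are the byte values from a0, a1, a2; the result is in the low 16 bits):

  #include <stdint.h>

  uint32_t pack565(uint32_t r, uint32_t g, uint32_t b)
  {
    return ((b & 0xf8) << 8) | ((g & 0xfc) << 3) | (r >> 3);
  }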

Another example of packing bit fields is generating IEEE floats using only integer instructions (aka “soft float”). For example, generating a 32-bit float in the range [-1.0, +1.0) from a signed 16-bit integer:

  short2float:
    neg a1, a0
    max a1, a1, a0
    clz a2, a1
    srli a0, a0, 31
    sll a3, a1, a2
    srli a3, a3, 15
    neg a2, a2
    addi a2, a2, 143
    packh a0, a2, a0
    pack a0, a3, a0
    slli a0, a0, 7
    orc a1, a1
    and a0, a0, a1
    ret

Or using funnel shifts:

  short2float:
    neg a1, a0
    max a1, a1, a0
    clz a2, a1
    sll a3, a1, a2
    slli a3, a3, 1
    neg a2, a2
    addi a2, a2, 143
    fsri a3, a3, a2, 8
    srli a0, a0, 31
    fsri a0, a3, a0, 1
    cmov a0, a1, a0, zero
    ret

4.1.4 Parity check

The parity of a word (xor of all bits) is the LSB of the population count.

  pcnt a0, a0
  andi a0, a0, 1

4.1.5 Average of two integers

The following four instructions calculate the average of the unsigned integers in a0 and a1, with compensation for overflow:

  and  a2, a0, a1
  xor  a0, a0, a1
  srli a0, a0, 1
  add  a0, a0, a2

And likewise the average of two signed integers:

  and  a2, a0, a1
  xor  a0, a0, a1
  srai a0, a0, 1
  add  a0, a0, a2
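
Both sequences are easy to express in C (a sketch; the signed variant relies on arithmetic right shift of a signed value, matching srai):

  #include <stdint.h>

  uint32_t avg_u(uint32_t a, uint32_t b)
  {
    return (a & b) + ((a ^ b) >> 1);   /* and; xor; srli; add */
  }

  int32_t avg_s(int32_t a, int32_t b)
  {
    return (a & b) + ((a ^ b) >> 1);   /* and; xor; srai; add */
  }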

With fsri the unsigned case can be accomplished in just three instructions:

  add  a0, a0, a1
  sltu a1, a0, a1
  fsri a0, a0, a1, 1

4.1.6 Detecting integer overflow

Overflow in unsigned addition can be detected in two instructions:

  add  a0, a1, a2
  bltu a0, a1, overflow

For signed addition, if the sign of one operand is known, for example because it is constant:

  addi a0, a1, +imm
  blt  a0, a1, overflow
  addi a0, a1, -imm
  bgt  a0, a1, overflow

And for signed addition in the general case:

  add  a0, a1, a2
  slti a3, a1, 0
  slt  a4, a0, a2
  bne  a3, a4, overflow

And finally, generating the carry flag for an addition:

  add  a0, a1, a2
  sltu a3, a0, a1

Thus, adding a0, a1, and a2, with the result in a0 and the carry-out in a1:

  add  a0, a0, a1
  sltu a1, a0, a1
  add  a0, a0, a2
  sltu a2, a0, a2
  add  a1, a1, a2
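
A C sketch of this three-operand addition (the function name is illustrative; the carry-out can be 0, 1, or 2):

  #include <stdint.h>

  uint32_t add3(uint32_t a, uint32_t b, uint32_t c, uint32_t *carry)
  {
    uint32_t s = a + b;
    uint32_t cy = (s < b);   /* sltu */
    s += c;                  /* add  */
    cy += (s < c);           /* sltu */
    *carry = cy;
    return s;
  }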

4.1.7 Fuse-able sequences for logic operations

RISC-V has dedicated instructions for branching on equal/not-equal. But C code such as the following would require set-equal and set-not-equal instructions, similar to slt.

  int is_equal = (a == b);
  int is_noteq = (c != d);

Those can be implemented using the following fuse-able sequences:

  sub rd, rs1, rs2
  sltiu rd, rd, 1

  sub rd, rs1, rs2
  sltu rd, zero, rd

Likewise for logic OR:

  int logic_or  = (c || d);

  or rd, rs1, rs2
  sltu rd, zero, rd

And for logic AND, if rd != rs2:

  int logic_and  = (c && d);

  orc rd, rs1
  and rd, rd, rs2
  sltu rd, zero, rd

4.1.8 Rotate shift of bytes and half-words

Rotate right shift of the byte in a0 by the shift amount in a1, assuming a0 is stored in zero-extended form:

  orc8 a0, a0
  ror a0, a0, a1
  andi a0, a0, 255

And rotate right shift of the 16-bit half-word in a0 by the shift amount in a1, assuming a0 is stored in zero-extended form:

  orc16 a0, a0
  ror a0, a0, a1
  pack[w] a0, a0, zero
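
The byte case corresponds to this plain C helper (a sketch; the orc8 replication is what makes the rotate amount effectively mod 8):

  #include <stdint.h>

  uint8_t ror8(uint8_t x, unsigned k)
  {
    k &= 7;
    return (uint8_t)((x >> k) | (x << ((8 - k) & 7)));
  }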

4.1.9 Rank and select

Rank and select are fundamental operations in succinct data structures [SelectX86].

select(a0, a1) returns the position of the a1-th set bit in a0. It can be implemented efficiently using bdep and ctz:

  select:
    sbset a1, zero, a1
    bdep a0, a1, a0
    ctz a0, a0
    ret

rank(a0, a1) returns the number of set bits in a0 up to and including position a1.

  rank:
    not a1, a1
    sll a0, a0, a1
    pcnt a0, a0
    ret
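
As C sketches, using intrinsic names in the style of this chapter (_rv_bdep, _rv_ctz, _rv_pcnt are assumed to map directly to the instructions above; XLEN = 64):

  #include <stdint.h>

  int bit_select(uint64_t x, int n)
  {
    return _rv_ctz(_rv_bdep(1ULL << n, x));   /* sbset; bdep; ctz */
  }

  int bit_rank(uint64_t x, int n)
  {
    return _rv_pcnt(x << (~n & 63));          /* not; sll; pcnt */
  }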

4.1.10 OR/AND/XOR-reduce in byte vectors

OR-ing the bytes in a register and returning the resulting byte is easy with GORC:

  gorci a0, a0, -8
  andi a0, a0, 255

AND-ing can be accomplished by applying De Morgan’s laws:

  not a0, a0
  gorci a0, a0, -8
  not a0, a0
  andi a0, a0, 255

XOR-ing can be accomplished with CLMUL (see also section 1.7).

  addi a1, zero, 0x80
  gorci a1, a1, -8
  clmulr a0, a0, a1
  andi a0, a0, 255

Where the first two instructions (addi+gorci) just create the constant 0x8080..8080.

Finally, on RV64, XOR-ing the bytes in a register can also be accomplished with BMATXOR:

  addi a1, zero, 0xff
  bmatxor a0, a1, a0

4.1.11 Counting trailing non-zero bytes

Counting the trailing (LSB-end) non-zero bytes in a word is a helpful operation in optimized implementations of strlen() and strcpy():

  int count_trailing_nonzero_bytes(long x)
  {
    return _rv_ctz(~_rv_orc_b(x)) >> 3;
  }

4.1.12 Finding bytes of certain values

Finding zero bytes is a useful operation for strchr() and memchr():

  bool check_zero_bytes(long x)
  {
    return ~_rv_orc_b(x) != 0;
  }

To find other bytes we simply XOR the value with a mask of the byte value we are looking for:

  bool check_byte(long x, unsigned char c)
  {
    return ~_rv_orc_b(x ^ _rv_orc8(c)) != 0;
  }

These schemes can easily be extended with ctz and pcnt to perform operations such as counting the number of bytes of a certain value within a word, or finding the position of the first such byte.
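
For example, the byte position of the first occurrence of c, building on check_byte() above (a sketch; it returns XLEN/8 when c does not occur):

  int find_byte(long x, unsigned char c)
  {
    return _rv_ctz(~_rv_orc_b(x ^ _rv_orc8(c))) >> 3;
  }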

4.1.13 Fill right of most significant set bit

The “fill right” or “fold right” operation is a pattern commonly used in bit manipulation code. [MAGIC]

The straight-forward RV64 implementation requires 12 instructions:

  uint64_t rfill(uint64_t x)
  {
    x |= x >> 1;   // SRLI, OR
    x |= x >> 2;   // SRLI, OR
    x |= x >> 4;   // SRLI, OR
    x |= x >> 8;   // SRLI, OR
    x |= x >> 16;  // SRLI, OR
    x |= x >> 32;  // SRLI, OR
    return x;
  }

With clz it can be implemented in only 4 instructions. Notice the handling of the case where x=0 using sltiu+addi.

  uint64_t rfill_clz(uint64_t x)
  {
    uint64_t t;
    t = clz(x);         // CLZ
    x = (!x)-1;         // SLTIU, ADDI
    x = x >> (t & 63);  // SRL
    return x;
  }

Alternatively, a Trailing Bit Manipulation (TBM) code pattern can be used together with rev to implement this function in 4 instructions:

  uint64_t rfill_rev(uint64_t x)
  {
    x = rev(x);         // GREVI
    x = x | ~(x - 1);   // ADDI, ORN
    x = rev(x);         // GREVI
    return x;
  }

Finally, there is another implementation in 4 instructions using BMATOR, if we do not count the extra instructions for loading utility matrices.

  uint64_t rfill_bmat(uint64_t x)
  {
    uint64_t m0, m1, m2, t;

    m0 = 0xFF7F3F1F0F070301LL;  // LD
    m1 = bmatflip(m0 << 8);     // SLLI, BMATFLIP
    m2 = -1LL;                  // ADDI

    t = bmator(x, m0);          // BMATOR
    x = bmator(x, m2);          // BMATOR
    x = bmator(m1, x);          // BMATOR
    x |= t;                     // OR

    return x;
  }

4.1.14 Round to next power of two

One common application of rfill() is rounding up to the next power of two:

  uint64_t round_pow2(uint64_t x)
  {
    return rfill(x-1)+1;
  }

This can also be implemented in just 4 instructions, if we don’t care about the case where the above code overflows because x is already larger than the largest power of two representable in a uint64_t.

  uint64_t round_pow2(uint64_t x)
  {
    uint64_t t;
    t = clz(x-1);     // ADDI, CLZ
    x = ror(!!x, t);  // SLTU, ROR
    return x;
  }

Note that this code handles 0 → 0 and 1 → 1 correctly, i.e. it is equivalent to rfill(x-1)+1.

4.2 Packed vectors

4.2.1 Packing bytes

The following RV32 code packs the lower 8 bits from a0, a1, a2, a3 into a 32-bit word returned in a0, ignoring other bits in the input values.

  packh a0, a0, a1
  packh a1, a2, a3
  pack  a0, a0, a1

And the following RV64 code packs 8 bytes into a register.

  packh a0, a0, a1
  packh a1, a2, a3
  packh a2, a4, a5
  packh a3, a6, a7
  packw a0, a0, a1
  packw a1, a2, a3
  pack  a0, a0, a1

4.2.2 Permuting bytes

There are 24 ways of arranging the four bytes in a 32-bit word. ror, grev, and [un]shfl can perform any of those permutations in at most 3 instructions. Table [permbytes] lists those sequences. [Wolf19A]

Bytes Instructions
A B C D initial byte order
A B D C ROR(24),SHFL(8),ROR(8)
A C B D SHFL(8)
A C D B ROR(8),GREV(8),SHFL(8)
A D B C ROR(16),SHFL(8),ROR(24)
A D C B ROR(8),GREV(8)
B A C D ROR(8),SHFL(8),ROR(24)
B A D C GREV(8)
B C A D ROR(16),SHFL(8),ROR(8)
B C D A ROR(24)
B D A C GREV(8),SHFL(8)
B D C A ROR(24),SHFL(8)
C A B D ROR(8),GREV(24),SHFL(8)
C A D B ROR(16),SHFL(8)
C B A D ROR(8),GREV(24)
C B D A SHFL(8),ROR(24)
C D A B ROR(16)
C D B A ROR(8),SHFL(8),ROR(8)
D A B C ROR(8)
D A C B SHFL(8),ROR(8)
D B A C ROR(8),SHFL(8)
D B C A GREV(24),SHFL(8)
D C A B ROR(24),SHFL(8),ROR(24)
D C B A GREV(24)

4.2.3 Widening and narrowing

The [un]zip instructions can help with widening and narrowing packed vectors. For example, narrowing the bytes in two words into a single word of nibble values, with the values from a0 in the LSB half and the values from a1 in the MSB half:

  unzip4 a0, a0
  unzip4 a1, a1
  pack a0, a0, a1

And widening the nibbles from a0 into bytes in a1 (MSB half) and a0 (LSB half), with zero extension:

  srli a1, a0, XLEN/2
  pack a0, a0, zero
  zip4 a1, a1
  zip4 a0, a0

And finally the same widening operation with sign extension:

  addi t0, zero, 8
  orc4 t0, t0
  and t0, t0, a0
  orc.n t0, t0
  srli t1, t0, XLEN/2
  srli a1, a0, XLEN/2
  pack a1, a1, t1
  pack a0, a0, t0
  zip4 a1, a1
  zip4 a0, a0

4.2.4 Shifting packed vector elements

Using zip we can re-arrange the bits in a packed vector of N elements so that a shift by k of each element becomes a shift by kN of the entire new vector. So we zip, shift, and then unzip to shuffle everything back. The number of zip and unzip operations needed is log2(N). This works for all kinds of shift operations. For example, rotating a vector of bytes on RV32 in 6 instructions:

  zip a0, a0
  zip a0, a0
  slli a1, a1, 2
  ror a0, a0, a1
  unzip a0, a0
  unzip a0, a0

Because zip; zip; zip is equal to unzip; unzip on RV32, and equal to unzip; unzip; unzip on RV64, we never need more than 2 [un]zip operations on RV32, or 3 on RV64.

4.2.5 Adding packed vectors

The following six instructions will add the elements of the two vectors passed in a0 and a1, and return the vector of sums in a0.

This expects a mask in a2 that marks the MSB of each vector element. For a vector of bytes this mask would be 0x8080...80 (which can be obtained in two instructions via orc8(0x80)).

  xor  a3, a0, a1
  and  a3, a3, a2
  andn a0, a0, a2
  andn a1, a1, a2
  add  a0, a0, a1
  xor  a0, a0, a3
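
The same SWAR addition as a C sketch:

  #include <stdint.h>

  uint32_t swar_add(uint32_t a, uint32_t b, uint32_t mask)
  {
    uint32_t msb = (a ^ b) & mask;              /* carry-less sum of the MSBs */
    uint32_t sum = (a & ~mask) + (b & ~mask);   /* add with the MSBs cleared */
    return sum ^ msb;                           /* put the MSB sums back */
  }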

4.3 Funnel shifts

A funnel shift takes two XLEN registers, concatenates them to a 2 × XLEN word, shifts that by a certain amount, then returns the lower half of the result for a right shift and the upper half of the result for a left shift.

The fsl, fsr, and fsri instructions perform funnel shifts.
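
A C model of the RV32 case may make the semantics clearer (a sketch for shift amounts below XLEN; architecturally the shift amount is taken modulo 2 × XLEN):

  #include <stdint.h>

  uint32_t fsl32(uint32_t rs1, uint32_t rs3, unsigned shamt)
  {
    uint64_t a = ((uint64_t)rs1 << 32) | rs3;   /* rs1 is the upper half */
    return (uint32_t)((a << shamt) >> 32);
  }

  uint32_t fsr32(uint32_t rs1, uint32_t rs3, unsigned shamt)
  {
    uint64_t a = ((uint64_t)rs3 << 32) | rs1;   /* rs3 is the upper half */
    return (uint32_t)(a >> shamt);
  }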

4.3.1 Bigint shift

A common application for funnel shifts is shift operations in bigint libraries.

For example, the following functions implement rotate-shift operations for bigints made from n XLEN words.

  void bigint_rol(uint_xlen_t data[], int n, int shamt)
  {
    if (n <= 0)
      return;

    uint_xlen_t buffer = data[n-1];
    for (int i = n-1; i > 0; i--)
      data[i] = fsl(data[i], shamt, data[i-1]);
    data[0] = fsl(data[0], shamt, buffer);
  }

  void bigint_ror(uint_xlen_t data[], int n, int shamt)
  {
    if (n <= 0)
      return;

    uint_xlen_t buffer = data[0];
    for (int i = 0; i < n-1; i++)
      data[i] = fsr(data[i], shamt, data[i+1]);
    data[n-1] = fsr(data[n-1], shamt, buffer);
  }

These versions only work for shift amounts < XLEN. But functions supporting other kinds of shift operations, or shifts ≥ XLEN, can easily be built with fsl and fsr.

4.3.2 Parsing bit-streams of 27-bit words

The following function parses n 27-bit words from a packed array of XLEN words:

  void parse_27bit(uint_xlen_t *idata, uint_xlen_t *odata, int n)
  {
    uint_xlen_t lower = 0, upper = 0;
    int reserve = 0;

    while (n--) {
      if (reserve < 27) {
        uint_xlen_t buf = *(idata++);
        lower |= sll(buf, reserve);
        upper = reserve ? srl(buf, -reserve) : 0;
        reserve += XLEN;
      }
      *(odata++) = lower & ((1 << 27)-1);
      lower = fsr(lower, 27, upper);
      upper = srl(upper, 27);
      reserve -= 27;
    }
  }

And here is the same thing in RISC-V assembly:

  parse_27bit:
    li t1, 0              ; lower
    li t2, 0              ; upper
    li t3, 0              ; reserve
    li t4, 27             ; shamt
    slo t5, zero, t4      ; mask
    beqz a2, endloop      ; while (n--)
  loop:
    addi a2, a2, -1
    bge t3, t4, output       ; if (reserve < 27)
    lw t6, 0(a0)                 ; buf = *(idata++)
    addi a0, a0, 4
    sll t0, t6, t3               ; lower |= sll(buf, reserve)
    or t1, t1, t0
    sub t0, zero, t3             ; upper = reserve ? srl(buf, -reserve) : 0
    srl t0, t6, t0
    cmov t2, t3, t0, zero
    addi t3, t3, 32              ; reserve += XLEN;
  output:
    and t6, t1, t5           ; *(odata++) = lower & ((1 << 27)-1)
    sw t6, 0(a1)
    addi a1, a1, 4
    fsr t1, t1, t2, t4       ; lower = fsr(lower, 27, upper)
    srl t2, t2, t4           ; upper = srl(upper, 27)
    sub t3, t3, t4           ; reserve -= 27
    bnez a2, loop         ; while (n--)
  endloop:
    ret

A loop iteration without fetch is 9 instructions long, and a loop iteration with fetch is 17 instructions long.

Without ternary operators those would be 13 and 22 instructions; assuming one cycle per instruction, the function would thus be about 30% slower without ternary instructions.

4.3.3 Parsing bit-streams of 6-bit words

The following code accepts three 64-bit words in t0, t1, and t2 containing a bit-stream of 32 6-bit words, and outputs these 32 values in the bytes of t0, t1, t2, and t3.

    addi a0, zero, 0x3f
    orc8 a0, a0          // a0 = 0x3f3f3f3f3f3f3f3f

    srli t3, t2, 16
    bdep t3, t3, a0
    fsli t2, t2, t1, 32
    bdep t2, t2, a0
    fsli t1, t1, t0, 16
    bdep t1, t1, a0
    bdep t0, t0, a0

That’s 7 instructions, not counting the two instructions for constructing the mask in a0.

Without funnel shift this operation requires 11 instructions:

    addi a0, zero, 0x3f
    orc8 a0, a0          // a0 = 0x3f3f3f3f3f3f3f3f

    srli t3, t2, 16
    bdep t3, t3, a0
    slli t2, t2, 32
    srli a1, t1, 32
    or t2, t2, a1
    bdep t2, t2, a0
    slli t1, t1, 16
    srli a1, t0, 48
    or t1, t1, a1
    bdep t1, t1, a0
    bdep t0, t0, a0

Sign-extending the 6-bit values in the bytes of t0, t1, t2, and t3 with bit-matrix multiply:

    li a0, 0x80c0e01008040201

    bmator t0, t0, a0
    bmator t1, t1, a0
    bmator t2, t2, a0
    bmator t3, t3, a0

Or without bit-matrix multiply:

    addi a0, zero, 0x60
    orc8 a0, a0          // a0 = 0x6060606060606060

    add t0, t0, a0
    add t1, t1, a0
    add t2, t2, a0
    add t3, t3, a0
    xor t0, t0, a0
    xor t1, t1, a0
    xor t2, t2, a0
    xor t3, t3, a0

4.3.4 Fixed-point multiply

A fixed-point multiply is simply an integer multiply, followed by a right shift. If the entire dynamic range of XLEN bits should be usable for the factors, then the product before the shift must be 2 × XLEN bits wide. Therefore mul+mulh is needed for the multiplication, and funnel shift instructions can help with the final right shift. For fixed-point numbers with N fraction bits:

  mul_fracN:
    mulh a2, a0, a1
    mul a0, a0, a1
    fsri a0, a0, a2, N
    ret
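
In C this is just a double-width product followed by a shift (RV32 sketch; the compiler maps the 64-bit product to mul+mulh):

  #include <stdint.h>

  int32_t mul_frac(int32_t a, int32_t b, int n)
  {
    int64_t p = (int64_t)a * b;   /* mul; mulh */
    return (int32_t)(p >> n);     /* fsri */
  }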

4.4 Arbitrary bit permutations

This section lists code snippets for computing arbitrary bit permutations that are defined by data (as opposed to bit permutations that are known at compile time and can likely be compiled into shift-and-mask operations and/or a few instances of bext/bdep).

4.4.1 Using butterfly operations

The following macro performs a stage-N butterfly operation on the word in a0 using the mask in a1.

  grevi a2, a0, (1 << N)
  cmix a0, a1, a2, a0

The bitmask in a1 must be preformatted correctly for the selected butterfly stage. A butterfly operation only has an XLEN/2-wide control word. The following macros format the mask, assuming those XLEN/2 bits are in the lower half of a1 on entry:

bfly_msk_0:
  pack a1, a1, a1
  zip a1, a1

bfly_msk_1:
  pack a1, a1, a1
  zip2 a1, a1

bfly_msk_2:
  pack a1, a1, a1
  zip4 a1, a1

...

A sequence of 2 ⋅ log2(XLEN) − 1 butterfly operations can perform any arbitrary bit permutation (Beneš network):

  butterfly(LOG2_XLEN-1)
  butterfly(LOG2_XLEN-2)
  ...
  butterfly(0)
  ...
  butterfly(LOG2_XLEN-2)
  butterfly(LOG2_XLEN-1)

Many permutations arising from real-world applications can be implemented using shorter sequences. For example, any sheep-and-goats operation (SAG, see section 1.4.4) with either the sheep or the goats bit reversed can be implemented in log2(XLEN) butterfly operations.

Reversing a permutation implemented using butterfly operations is as simple as reversing the order of butterfly operations.

4.4.2 Using omega-flip networks

The omega operation is a stage-0 butterfly preceded by a zip operation:

  zip a0, a0
  grevi a2, a0, 1
  cmix a0, a1, a2, a0

The flip operation is a stage-0 butterfly followed by an unzip operation:

  grevi a2, a0, 1
  cmix a0, a1, a2, a0
  unzip a0, a0

A sequence of log2(XLEN) omega operations followed by log2(XLEN) flip operations can implement any arbitrary XLEN-bit permutation.

As for butterfly networks, permutations arising from real-world applications can often be implemented using a shorter sequence.

4.4.3 Using baseline networks

Another way of implementing arbitrary XLEN-bit permutations is using a baseline network followed by an inverse baseline network.

A baseline network is a sequence of log2(XLEN) butterfly(0) operations interleaved with unzip operations. For example, a 32-bit baseline network:

  butterfly(0)
  unzip
  butterfly(0)
  unzip.h
  butterfly(0)
  unzip.b
  butterfly(0)
  unzip.n
  butterfly(0)

An inverse baseline network is a sequence of log2(XLEN) butterfly(0) operations interleaved with zip operations. The order is opposite to the order in a baseline network. For example, a 32-bit inverse baseline network:

  butterfly(0)
  zip.n
  butterfly(0)
  zip.b
  butterfly(0)
  zip.h
  butterfly(0)
  zip
  butterfly(0)

A baseline network followed by an inverse baseline network can implement any arbitrary bit permutation.

4.4.4 Using sheep-and-goats

The Sheep-and-goats (SAG) operation is a common operation for bit permutations. It moves all the bits selected by a mask (goats) to the LSB end of the word and all the remaining bits (sheep) to the MSB end of the word, without changing the order of sheep or goats.

The SAG operation can easily be performed using bext (data in a0 and mask in a1):

  bext a2, a0, a1
  not a1, a1
  bext a0, a0, a1
  pcnt a1, a1
  ror a0, a0, a1
  or a0, a0, a2
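
The same sequence as a C sketch, with bext(), ror(), and pcnt() standing for the respective instructions (RV32 shown):

  #include <stdint.h>

  uint32_t sag(uint32_t x, uint32_t mask)
  {
    uint32_t goats = bext(x, mask);           /* selected bits at the LSB end */
    uint32_t sheep = bext(x, ~mask);          /* remaining bits */
    return ror(sheep, pcnt(~mask)) | goats;   /* rotate sheep to the MSB end */
  }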

Any arbitrary bit permutation can be implemented in log2(XLEN) SAG operations.

Hacker’s Delight describes an optimized standard C implementation of the SAG operation. That algorithm takes 254 instructions (for 32 bit) or 340 instructions (for 64 bit) on their reference RISC instruction set. [Seander05]

4.4.5 Using bit-matrix multiply

bmat[x]or performs a permutation of bits within each byte when used with a permutation matrix in rs2, and performs a permutation of bytes when used with a permutation matrix in rs1.

4.5 Mirroring and rotating bitboards

Bitboards are 64-bit bitmasks that are used to represent part of the game state in chess engines (and other board game AIs). The bits in the bitmask correspond to squares on an 8 × 8 chess board:

 56 57 58 59 60 61 62 63
 48 49 50 51 52 53 54 55
 40 41 42 43 44 45 46 47
 32 33 34 35 36 37 38 39
 24 25 26 27 28 29 30 31
 16 17 18 19 20 21 22 23
  8  9 10 11 12 13 14 15
  0  1  2  3  4  5  6  7

Many bitboard operations are simple, straightforward operations such as bitwise AND, but mirroring and rotating bitboards can take up to 20 instructions on x86.

4.5.1 Mirroring bitboards

Flipping horizontally or vertically can easily be done with grevi:

Flip horizontal:
 63 62 61 60 59 58 57 56    RISC-V Bitmanip:
 55 54 53 52 51 50 49 48       rev.b
 47 46 45 44 43 42 41 40
 39 38 37 36 35 34 33 32
 31 30 29 28 27 26 25 24    x86:
 23 22 21 20 19 18 17 16       13 operations
 15 14 13 12 11 10  9  8
  7  6  5  4  3  2  1  0

Flip vertical:
  0  1  2  3  4  5  6  7    RISC-V Bitmanip:
  8  9 10 11 12 13 14 15       rev8
 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31
 32 33 34 35 36 37 38 39    x86:
 40 41 42 43 44 45 46 47       bswap
 48 49 50 51 52 53 54 55
 56 57 58 59 60 61 62 63

Rotating by 180 (flip horizontal and vertical):

Rotate 180:
  7  6  5  4  3  2  1  0    RISC-V Bitmanip:
 15 14 13 12 11 10  9  8       rev
 23 22 21 20 19 18 17 16
 31 30 29 28 27 26 25 24
 39 38 37 36 35 34 33 32    x86:
 47 46 45 44 43 42 41 40       14 operations
 55 54 53 52 51 50 49 48
 63 62 61 60 59 58 57 56

4.5.2 Rotating bitboards

Using zip a bitboard can be transposed easily:

Transpose:
  7 15 23 31 39 47 55 63    RISC-V Bitmanip:
  6 14 22 30 38 46 54 62       zip, zip, zip
  5 13 21 29 37 45 53 61
  4 12 20 28 36 44 52 60
  3 11 19 27 35 43 51 59    x86:
  2 10 18 26 34 42 50 58       18 operations
  1  9 17 25 33 41 49 57
  0  8 16 24 32 40 48 56

A rotation is simply the composition of a flip operation and a transpose operation. This takes 19 operations on x86 [ChessProg]. With Bitmanip the rotate operation only takes 4 operations:

rotate_bitboard:
  rev8 a0, a0
  zip a0, a0
  zip a0, a0
  zip a0, a0

4.5.3 Explanation

The bit indices for a 64-bit word are 6 bits wide. Let i[5:0] be the index of a bit in the input, and let i′[5:0] be the index of the same bit after the permutation.

As an example, a rotate left shift by N can be expressed using this notation as i′[5:0] = (i[5:0] + N) mod 64.

The GREV operation with shamt N is i′[5:0] = i[5:0] XOR N.

And a SHFL operation corresponds to a rotate left shift by one position of any contiguous region of i[5:0]. For example, zip is a left rotate shift of the entire bit index:

i′[5:0] = {i[4:0], i[5]}

And zip4 performs a left rotate shift on bits 5:2:

i′[5:0] = {i[4:2], i[5], i[1:0]}

In a bitboard, i[2:0] corresponds to the X coordinate of a board position, and i[5:3] corresponds to the Y coordinate.

Therefore flipping the board horizontally is the same as inverting bits i[2:0], which is the operation performed by grevi rd, rs, 7 (rev.b).

Likewise flipping the board vertically is done by grevi rd, rs, 56 (rev8).

Finally, transposing corresponds to swapping the lower and upper halves of i[5:0], i.e. rotate shifting i[5:0] by 3 positions. This can easily be done by rotate shifting the entire i[5:0] by one bit position (zip) three times.

4.5.4 Rotating Bitcubes

Let’s define a bitcube as a 4 × 4 × 4 cube with x = i[1:0], y = i[3:2], and z = i[5:4]. Using the same methods as described above we can easily rotate a bitcube by 90° around the X-, Y-, and Z-axes:

rotate_x:
  rev16 a0, a0
  zip4 a0, a0
  zip4 a0, a0
rotate_y:
  rev.n a0, a0
  zip a0, a0
  zip a0, a0
  zip4 a0, a0
  zip4 a0, a0
rotate_z:
  rev4.h a0, a0
  zip.h a0, a0
  zip.h a0, a0

4.6 Manipulating 64x64 Bit Matrices

The bmat[x]or and bmatflip instructions operate on 8x8 bit matrices stored in single 64-bit registers, where each byte of such a 64-bit value represents one row (column) of an 8x8 bit matrix.

Let’s assume we have a 64x64 bit matrix in memory, stored as one row (column) per 64-bit value. In order to use bmat[x]or and bmatflip on such a matrix, we must first convert it into an 8x8 block matrix of 64 individual 8x8 matrices, each stored in a 64-bit value. The following function performs this transformation for a single row (column) of the block matrix in 40 instructions.

void conv8x8(const uint64_t x[8], uint64_t y[8])
{
  uint64_t x0_x1_31_00 = _rv64_pack (x[0], x[1]);
  uint64_t x2_x3_31_00 = _rv64_pack (x[2], x[3]);
  uint64_t x4_x5_31_00 = _rv64_pack (x[4], x[5]);
  uint64_t x6_x7_31_00 = _rv64_pack (x[6], x[7]);
  uint64_t x0_x1_63_32 = _rv64_packu(x[0], x[1]);
  uint64_t x2_x3_63_32 = _rv64_packu(x[2], x[3]);
  uint64_t x4_x5_63_32 = _rv64_packu(x[4], x[5]);
  uint64_t x6_x7_63_32 = _rv64_packu(x[6], x[7]);
  uint64_t x0_x1_31_00_z = _rv64_unzip16(x0_x1_31_00);
  uint64_t x2_x3_31_00_z = _rv64_unzip16(x2_x3_31_00);
  uint64_t x4_x5_31_00_z = _rv64_unzip16(x4_x5_31_00);
  uint64_t x6_x7_31_00_z = _rv64_unzip16(x6_x7_31_00);
  uint64_t x0_x1_63_32_z = _rv64_unzip16(x0_x1_63_32);
  uint64_t x2_x3_63_32_z = _rv64_unzip16(x2_x3_63_32);
  uint64_t x4_x5_63_32_z = _rv64_unzip16(x4_x5_63_32);
  uint64_t x6_x7_63_32_z = _rv64_unzip16(x6_x7_63_32);
  uint64_t x0_x1_x2_x3_15_00 = _rv64_pack (x0_x1_31_00_z, x2_x3_31_00_z);
  uint64_t x4_x5_x6_x7_15_00 = _rv64_pack (x4_x5_31_00_z, x6_x7_31_00_z);
  uint64_t x0_x1_x2_x3_31_16 = _rv64_packu(x0_x1_31_00_z, x2_x3_31_00_z);
  uint64_t x4_x5_x6_x7_31_16 = _rv64_packu(x4_x5_31_00_z, x6_x7_31_00_z);
  uint64_t x0_x1_x2_x3_47_32 = _rv64_pack (x0_x1_63_32_z, x2_x3_63_32_z);
  uint64_t x4_x5_x6_x7_47_32 = _rv64_pack (x4_x5_63_32_z, x6_x7_63_32_z);
  uint64_t x0_x1_x2_x3_63_48 = _rv64_packu(x0_x1_63_32_z, x2_x3_63_32_z);
  uint64_t x4_x5_x6_x7_63_48 = _rv64_packu(x4_x5_63_32_z, x6_x7_63_32_z);
  uint64_t x0_x1_x2_x3_15_00_z = _rv64_unzip8(x0_x1_x2_x3_15_00);
  uint64_t x4_x5_x6_x7_15_00_z = _rv64_unzip8(x4_x5_x6_x7_15_00);
  uint64_t x0_x1_x2_x3_31_16_z = _rv64_unzip8(x0_x1_x2_x3_31_16);
  uint64_t x4_x5_x6_x7_31_16_z = _rv64_unzip8(x4_x5_x6_x7_31_16);
  uint64_t x0_x1_x2_x3_47_32_z = _rv64_unzip8(x0_x1_x2_x3_47_32);
  uint64_t x4_x5_x6_x7_47_32_z = _rv64_unzip8(x4_x5_x6_x7_47_32);
  uint64_t x0_x1_x2_x3_63_48_z = _rv64_unzip8(x0_x1_x2_x3_63_48);
  uint64_t x4_x5_x6_x7_63_48_z = _rv64_unzip8(x4_x5_x6_x7_63_48);
  y[0] = _rv64_pack (x0_x1_x2_x3_15_00_z, x4_x5_x6_x7_15_00_z);
  y[1] = _rv64_packu(x0_x1_x2_x3_15_00_z, x4_x5_x6_x7_15_00_z);
  y[2] = _rv64_pack (x0_x1_x2_x3_31_16_z, x4_x5_x6_x7_31_16_z);
  y[3] = _rv64_packu(x0_x1_x2_x3_31_16_z, x4_x5_x6_x7_31_16_z);
  y[4] = _rv64_pack (x0_x1_x2_x3_47_32_z, x4_x5_x6_x7_47_32_z);
  y[5] = _rv64_packu(x0_x1_x2_x3_47_32_z, x4_x5_x6_x7_47_32_z);
  y[6] = _rv64_pack (x0_x1_x2_x3_63_48_z, x4_x5_x6_x7_63_48_z);
  y[7] = _rv64_packu(x0_x1_x2_x3_63_48_z, x4_x5_x6_x7_63_48_z);
}

Each of the 5 blocks in this function only consumes the eight outputs of the previous block. Therefore 16 registers are sufficient to run this function entirely in registers, without spilling any data to the stack.

Note that this function is its own inverse. Therefore the same function can be used for the conversion from block matrix form back to row (column) major form.

A 64x64 bit matrix in block matrix form can easily be transposed by running bmatflip (or zip; zip; zip) on the blocks of the matrix and then renaming the individual 64-bit variables.

To multiply 64x64 bit matrices in block matrix form, the matrix-matrix product is decomposed in the obvious way into 8 × 8 × 8 = 512 bmat[x]or instructions and 7 × 8 × 8 = 448 [x]or instructions.
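
A sketch of that decomposition, where a[i][k] is the 8x8 block in block-row i, block-column k (the _rv64_bmatxor intrinsic name is assumed here):

  #include <stdint.h>

  void bmat64_mul(const uint64_t a[8][8], const uint64_t b[8][8], uint64_t c[8][8])
  {
    for (int i = 0; i < 8; i++)
      for (int j = 0; j < 8; j++) {
        uint64_t acc = _rv64_bmatxor(a[i][0], b[0][j]);
        for (int k = 1; k < 8; k++)
          acc ^= _rv64_bmatxor(a[i][k], b[k][j]);   /* 512 bmatxor, 448 xor */
        c[i][j] = acc;
      }
  }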

4.7 Inverting Xorshift RNGs

Xorshift RNGs are a class of fast RNGs for different bit widths. There are 648 Xorshift RNGs for 32 bits; the following is the one that the author of the original Xorshift paper recommends. [Xorshift]

  uint32_t xorshift32(uint32_t x)
  {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
  }

This function has of course been designed and selected so that it is efficient even without special bit-manipulation instructions. So let’s look at the inverse instead. First, the naïve way of inverting this function:

  uint32_t xorshift32_inv(uint32_t x)
  {
    uint32_t t;
    t = x ^ (x << 5);
    t = x ^ (t << 5);
    t = x ^ (t << 5);
    t = x ^ (t << 5);
    t = x ^ (t << 5);
    x = x ^ (t << 5);
    x = x ^ (x >> 17);
    t = x ^ (x << 13);
    x = x ^ (t << 13);
    return x;
  }

This translates to 18 RISC-V instructions, not including the function call overhead.

Obviously the C expression x ^ (x >> 17) is already its own inverse (because 17 ≥ XLEN/2), so it needs no further treatment. But the two other blocks can each be inverted using a single clmul instruction:

  uint32_t xorshift32_inv(uint32_t x)
  {
    x = clmul(x, 0x42108421);
    x = x ^ (x >> 17);
    x = clmul(x, 0x04002001);
    return x;
  }

That is 8 RISC-V instructions, including 4 instructions for loading the constants, but not including the function call overhead.

An optimizing compiler could easily generate the clmul instructions and the magic constants from the C code for the naïve implementation. (0x04002001 = (1 << 2*13) | (1 << 13) | 1 and 0x42108421 = (1 << 6*5) | (1 << 5*5) | ... | (1 << 5) | 1)

The obvious remaining question is “if clmul(x, 0x42108421) is the inverse of x ^ (x << 5), what’s the inverse of x ^ (x >> 5)?” It’s clmulr(x, 0x84210842), where 0x84210842 is the bit-reversal of 0x42108421.

A special case of xorshift is x ^ (x >> 1), which is a gray encoder. The corresponding gray decoder is clmulr(x, 0xffffffff).
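
As a pair of C one-liners, with clmulr used as in the CRC examples below (a sketch):

  #include <stdint.h>

  uint32_t gray_encode(uint32_t x) { return x ^ (x >> 1); }
  uint32_t gray_decode(uint32_t x) { return clmulr(x, 0xffffffff); }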

4.8 Cyclic redundancy checks (CRC)

There are special instructions for performing CRCs using the two most widespread 32-bit CRC polynomials, CRC-32 and CRC-32C.

CRCs with other polynomials can be computed efficiently using CLMUL. The following examples use CRC32Q.

The easiest way of implementing CRC32Q with clmul is using a Barrett reduction. On RV32:

  uint32_t crc32q_simple(const uint32_t *data, int length)
  {
    uint32_t P  = 0x814141AB;  // CRC polynomial (implicit x^32)
    uint32_t mu = 0xFEFF7F62;  // x^64 divided by CRC polynomial
    uint32_t mu1 = 0xFF7FBFB1; // "mu" with leading 1, shifted right by 1 bit
    uint32_t crc = 0;

    for (int i = 0; i < length; i++) {
      crc ^= rev8(data[i]);
      crc = clmulr(crc, mu1);
      crc = clmul(crc, P);
    }

    return crc;
  }

The following Python code calculates the value of mu for a given CRC polynomial:

  def polydiv(dividend, divisor):
      quotient = 0
      while dividend.bit_length() >= divisor.bit_length():
          i = dividend.bit_length() - divisor.bit_length()
          dividend = dividend ^ (divisor << i)
          quotient |= 1 << i
      return quotient

  P = 0x1814141AB
  print("0x%X" % (polydiv(1<<64, P)))   # prints 0x1FEFF7F62

A more efficient method is the following, which processes 64 bits at a time (RV64):

  uint32_t crc32q_fast(const uint64_t *p, int len)
  {
    uint64_t P  = 0x1814141ABLL;   // CRC polynomial
    uint64_t k1 =  0xA1FA6BECLL;   // remainder of x^128 divided by CRC polynomial
    uint64_t k2 =  0x9BE9878FLL;   // remainder of x^96 divided by CRC polynomial
    uint64_t k3 =  0xB1EFC5F6LL;   // remainder of x^64 divided by CRC polynomial
    uint64_t mu = 0x1FEFF7F62LL;   // x^64 divided by CRC polynomial

    uint64_t a0, a1, a2, t1, t2;

    assert(len >= 2);
    a0 = rev8(p[0]);
    a1 = rev8(p[1]);
    // Main loop: Reduce to 2x 64 bits

    for (const uint64_t *t0 = p+2; t0 != p+len; t0++)
    {
      a2 = rev8(*t0);
      t1 = clmulh(a0, k1);
      t2 = clmul(a0, k1);
      a0 = a1 ^ t1;
      a1 = a2 ^ t2;
    }
    // Reduce to 64 bit, add 32 bit zero padding

    t1 = clmulh(a0, k2);
    t2 = clmul(a0, k2);

    a0 = (a1 >> 32) ^ t1;
    a1 = (a1 << 32) ^ t2;

    t2 = clmul(a0, k3);
    a1 = a1 ^ t2;
    // Barrett Reduction

    t1 = clmul(a1 >> 32, mu);
    t2 = clmul(t1 >> 32, P);
    a0 = a1 ^ t2;

    return a0;
  }

The main idea is to transform an array of arbitrary length into an array with the same CRC that is only two 64-bit elements long. (That is the “Main loop” portion of the above code.)

Then we further reduce it to just 64 bits, and finally use a Barrett reduction to get the 32-bit result.

The following Python code can be used to calculate the “magic constants” k1, k2, and k3:

  def polymod(dividend, divisor):
      while dividend.bit_length() >= divisor.bit_length():
          i = dividend.bit_length() - divisor.bit_length()
          dividend = dividend ^ (divisor << i)
      return dividend

  print("0x%X" % (polymod(1<<128, P)))   # prints 0xA1FA6BEC
  print("0x%X" % (polymod(1<< 96, P)))   # prints 0x9BE9878F
  print("0x%X" % (polymod(1<< 64, P)))   # prints 0xB1EFC5F6

The above example code is taken from [Wolf18A]. A more detailed description of the algorithms employed can be found in [FastCRC].

4.9 Decoding RISC-V Immediates

The following code snippets decode and sign-extend the immediate from RISC-V S-type, B-type, J-type, and CJ-type instructions. They are nice “nothing up my sleeve” examples for real-world bit permutations.


  decode_s:
    li t0, 0xfe000f80
    bext a0, a0, t0
    c.slli a0, 20
    c.srai a0, 20
    ret
  decode_b:
    li t0, 0xeaa800aa
    rori a0, a0, 8
    grevi a0, a0, 8
    shfli a0, a0, 7
    bext a0, a0, t0
    c.slli a0, 20
    c.srai a0, 19
    ret
  decode_j:
    li t0, 0x800003ff
    li t1, 0x800ff000
    bext a1, a0, t1
    c.slli a1, 23
    rori a0, a0, 21
    bext a0, a0, t0
    c.slli a0, 12
    c.or a0, a1
    c.srai a0, 11
    ret
  // variant 1 (with RISC-V Bitmanip)
  decode_cj:
    li t0, 0x28800001
    li t1, 0x000016b8
    li t2, 0xb4e00000
    li t3, 0x4b000000
    bext a1, a0, t1
    bdep a1, a1, t2
    rori a0, a0, 11
    bext a0, a0, t0
    bdep a0, a0, t3
    c.or a0, a1
    c.srai a0, 20
    ret
  // variant 2 (without RISC-V Bitmanip)
  decode_cj:
    srli a5, a0, 2
    srli a4, a0, 7
    c.andi a4, 16
    slli a3, a0, 3
    c.andi a5, 14
    c.add a5, a4
    andi a3, a3, 32
    srli a4, a0, 1
    c.add a5, a3
    andi a4, a4, 64
    slli a2, a0, 1
    c.add a5, a4
    andi a2, a2, 128
    srli a3, a0, 1
    slli a4, a0, 19
    c.add a5, a2
    andi a3, a3, 768
    c.slli a0, 2
    c.add a5, a3
    andi a0, a0, 1024
    c.srai a4, 31
    c.add a5, a0
    slli a0, a4, 11
    c.add a0, a5
    ret
  // variant 3 (with RISC-V Zbp only)
  decode_cj:
    shfli   a0, a0, 15
    rori    a0, a0, 28
    shfli   a0, a0, 2
    shfli   a0, a0, 14
    rori    a0, a0, 26
    shfli   a0, a0, 8
    rori    a0, a0, 10
    unshfli a0, a0, 12
    rori    a0, a0, 18
    unshfli a0, a0, 14
    rori    a0, a0, 28
    shfli   a0, a0, 6
    rori    a0, a0, 28
    unshfli a0, a0, 15
    slli    a0, a0, 21
    srai    a0, a0, 20
    ret