Five EmbedDev logo Five EmbedDev

An Embedded RISC-V Blog
RISC-V Bitmanip Extension

3 Reference Implementations

3.1 Verilog reference implementations

We have implemented Verilog cores for all instructions proposed in this specification. These cores are permissively licensed under the ISC license and can be obtained from https://github.com/riscv/riscv-bitmanip/tree/master/verilog.

Area of 32-bit Rocket MulDiv core (center) compared to a complete implementation of all 32-bit instructions proposed in this specification (right), and the 32-bit “Zbb” extension (left).

For evaluation purposes we synthesized these cores for RV32 and RV64 to the following mockup ASIC cell library:

image

image

For comparison we also synthesized the rocket-chip MulDiv cores obtained using the following rocket-chip configurations:

  class MulDivConfig64 extends Config(
      new WithFastMulDiv ++
      new DefaultConfig
  )

  class MulDivConfig32 extends Config(
      new WithRV32 ++
      new WithFastMulDiv ++
      new DefaultConfig
  )

The following table lists the verilog reference cores and the instructions they implement:

image

On RV64 these cores also implement all W instruction variants of the above instructions.

Note that rvb_shifter also implements the base ISA sll, srl, and sra instructions. Thus it can replace an existing implementation of the base ISA shift instructions.

Fig. 1.1 shows the area comparison for RV32 and fig. 1.2 shows the comparison for RV64. The area of the red frame surrounding the blue rvb_ modules accurately represents the added area by the rvb_full wrapper module.

Area of 64-bit Rocket MulDiv core (left) compared to a complete implementation of all 64-bit instructions proposed in this specification (right).

Regarding timing we evaluate the longest paths for rvb_full and rocket-chip MulDiv, measured in gate delays:

image

All rvb_ reference cores provide single-cycle implementations of their functions, with the exception of rvb_clmul which requires 4 cycles for a 32-bit carry-less multiply and 8 cycles for a 64-bit carry-less multiply, and rvb_crc which requires 1 cycle for each payload byte.

3.2 Fast C reference implementations

GCC has intrinsics for the bit counting instructions clz, ctz, and pcnt. So a performance-sensitive application (such as an emulator) should probably just use those:

[bextcref-fast-bitcnt]

For processors with BMI2 support GCC has intrinsics for bit extract and bit deposit instructions (compile with -mbmi2 and include <x86intrin.h>):

[bextcref-fast-bext-bmi2]

For other processors we need to provide our own implementations. The following implementation is a good compromise between code complexity and runtime:

[bextcref-fast-bext]

For the other Bitmanip instructions the C reference functions given in Chapter [bext] are already reasonably efficient.

3.3 Bit permutation instructions as bit-index operations

For programs that synthesize sequences of bit permutation instructions it can be useful to describe bit permutation instructions in terms of bit-index operations.

Expressed as bit-index operation, ror is just subtraction and grev is just XOR:

  uint_xlen_t ror(uint_xlen_t a, uint_xlen_t b)
  {
    uint_xlen_t ret = 0;
    for (int i = 0; i < XLEN; i++) {
      int j = (i - b) & (XLEN-1);
      ret |= ((a >> i) & 1) << j;
    }
    return ret;
  }
  uint_xlen_t grev(uint_xlen_t a, uint_xlen_t b)
  {
    uint_xlen_t ret = 0;
    for (int i = 0; i < XLEN; i++) {
      int j = (i ^ b) & (XLEN-1);
      ret |= ((a >> i) & 1) << j;
    }
    return ret;
  }

The following unperm() function calculates the new position of bit i after unshfl by k.

  int unperm(int k, int i)
  {
    return ((k + (i & k & ~(k<<1))) & ~k) |
           (i & ~(k | (k<<1))) | ((i>>1) & k);
  }

This allows us to write shfl and unshfl in the following way.

  uint_xlen_t shfl(uint_xlen_t a, uint_xlen_t b)
  {
    uint_xlen_t ret = 0;
    for (int i = 0; i < XLEN; i++) {
      int j = unperm(b & (XLEN/2-1), i);
      ret |= ((a >> j) & 1) << i;
    }
    return ret;
  }
  uint_xlen_t unshfl(uint_xlen_t a, uint_xlen_t b)
  {
    uint_xlen_t ret = 0;
    for (int i = 0; i < XLEN; i++) {
      int j = unperm(b & (XLEN/2-1), i);
      ret |= ((a >> i) & 1) << j;
    }
    return ret;
  }