2 RISCV Bitmanip Extension
In the proposals provided in this chapter, the C code examples are for illustration purposes only. They are not optimal implementations, but are intended to specify the desired functionality.
The final standard will likely define a range of Zextensions for different bit manipulation instructions, with the “B” extension itself being a mix of instructions from those Zextensions. It is unclear as of yet what this will look like exactly, but it will probably look something like this:
The main open questions are:
Which of the “Zb*” extensions should be included in “B”?
There is some redundancy/overlap with
add[i]wu
,subwu
,addu.w
,subu.w
, andslliu.w
. Which instructions should stay?Which “Zbp” pseudoops should be included in “Zbb”?
These decisions will be informed in big part by evaluations of the cost and added value for the individual instructions.
For the purpose of toolchain development “B” is currently everything.
For extensions that only implement certain pseudoinstructions (such as “Zbb”
implements rev8
and rev
, which are pseudoinstructions for grevi rd, rs1, 8
and grevi rd, rs1, 1
respectively, the same binary encoding is used for those
instructions as are used on a core with full support for the grev[i]
instruction.
Like in the base ISA, we add W
instruction variants on RV64 with the
semantic of the matching RV32 instruction. Those instructions ignore the upper
32 bit of their input and signextend their 32 bit output values. The W
instruction is omitted when the 64bit instruction produces the same result as
the W
instruction would when the 64bit instruction is fed signextended
32 bit values.
2.1 Basic bit manipulation instructions
2.1.1 Count Leading/Trailing Zeros (clz, ctz
)
RV32, RV64: clz rd, rs ctz rd, rs
RV64 only: clzw rd, rs ctzw rd, rs
The clz
operation counts the number of 0 bits at the MSB end of the
argument. That is, the number of 0 bits before the first 1 bit counting from
the most significant bit. If the input is 0, the output is XLEN. If the input
is 1, the output is 0.
The ctz
operation counts the number of 0 bits at the LSB end of the
argument. If the input is 0, the output is XLEN. If the input is 1, the
output is 0.
The expression XLEN1clz(x)
evaluates to the index of the most significant
set bit, also known as integer base2 logarithm, or 1 if x
is zero.
These instructions are commonly used for scanning bitmaps for set bits, for
example in malloc()
, in binary GCD, or in priority queues such as the
sched_find_first_bit()
function used in the Linux kernel realtime
scheduler.
Another common applications include normalization in fixedpoint code and soft float libraries, null suppression in data compression.
2.1.2 Count Bits Set (pcnt
)
RV32, RV64: pcnt rd, rs
RV64 only: pcntw rd, rs
This instruction counts the number of 1 bits in a register. This operations is known as population count, popcount, sideways sum, bit summation, or Hamming weight. [HammingWeight Warren12]
2.1.3 Logicwithnegate (andn
, orn
, xnor
)
RV32, RV64: andn rd, rs1, rs2 orn rd, rs1, rs2 xnor rd, rs1, rs2
This instructions implement AND, OR, and XOR with the 2nd arument inverted.
This can use the existing inverter on rs2 in the ALU that’s already there to implement subtract.
Among other things, those instructions allow implementing the “trailing bit
manipulation” code patterns in two instructions each. For example, (x  1) & ~x
produces a mask from trailing zero bits in x
.
2.1.4 Pack two words in one register (pack, packu, packh
)
RV32, RV64: pack rd, rs1, rs2 packu rd, rs1, rs2 packh rd, rs1, rs2
RV64 only: packw rd, rs1, rs2 packuw rd, rs1, rs2
The pack
instruction packs the XLEN/2bit lower halves of rs1 and rs2 into
rd, with rs1 in the lower half and rs2 in the upper half.
The packu
instruction packs the upper halves of rs1 and rs2 into rd.
And the packh
instruction packs the LSB bytes of rs1 and rs2 into the 16 LSB bits
of rd, zero extending the rest of rd.
Applications include XLEN/2bit funnel shifts, zeroextend XLEN/2 bit values, duplicate the lower XLEN/2 bits (e.g. for mask creation), loading unsigned 32 constants on RV64, and packing C structs that fit in a register and are therefore passed in a register according to the RISCV calling convention.
; Constructing a 32bit int from four bytes (RV32)
packh a0, a0, a1
packh a1, a2, a3
pack a0, a0, a1
; Load 0xffff0000ffff0000 on RV64
lui rd, 0xffff0
pack rd, rd, rd
; Same as FSLW on RV64
pack rd, rs1, rs3
rol rd, rd, rs2
addiw rd, rd, 0
; Clear the upper half of rd
pack rd, rd, zero
Paired with shfli/unshfli
and the other bit permutation instructions,
pack can interleave arbitrary poweroftwo chunks of rs1
and rs2
. For
example, interleaving the bytes in the lower halves of rs1
and rs2
:
pack rd, rs1, rs2
zip8 rd, rd
pack
is most commonly used to zeroextend words <XLEN.
For this purpose we define the following assembler pseudoops:
RV32:
zext.b rd, rs > andi rd, rs, 255
zext.h rd, rs > pack rd, rs, zero
RV64:
zext.b rd, rs > andi rd, rs, 255
zext.h rd, rs > packw rd, rs, zero
zext.w rd, rs > pack rd, rs, zero
RV128:.
zext.b rd, rs > andi rd, rs, 255
zext.h rd, rs > packw rd, rs, zero
zext.w rd, rs > packd rd, rs, zero
zext.d rd, rs > pack rd, rs, zero
2.1.5 Min/max instructions (min, max, minu, maxu
)
RV32, RV64: min rd, rs1, rs2 max rd, rs1, rs2 minu rd, rs1, rs2 maxu rd, rs1, rs2
We define 4 Rtype instructions min, max, minu, maxu
with the
following semantics:
Code that performs saturated arithmetic on a word size < XLEN
needs to perform
min/max operations frequently. A simple way of performing those operations without branching
can benefit those programs.
Some applications spend a lot of time on calculating the absolute values of
signed integers. One example of that would be SAT solvers, due to the way CNF
literals are commonly encoded [BiereComm]. With max
(or
minu
) this is a twoinstruction operation:
neg a1, a0
max a0, a0, a1
2.1.6 Signextend instructions (sext.b, sext.h
)
RV32, RV64: sext.b rd, rs sext.h rd, rs
For signextending a byte or halfword we define two unary instructions:
Additionally, we define pseudoinstructions for zero extending bytes or halfwords and for zero and signextending words:
RV32:
zext.b rd, rs > andi rd, rs, 255
zext.h rd, rs > pack rd, rs, zero
RV64:
zext.b rd, rs > andi rd, rs, 255
zext.h rd, rs > packw rd, rs, zero
zext.w rd, rs > pack rd, rs, zero
sext.w rd, rs > addiw rd, rs, 0
Sign extending 8bit and 16bit values is needed when calling a function that accepts an 8bit or 16bit signed argument, because the RISCV calling conventions dictates that such an argument must be passed in signextended form.
2.1.7 Singlebit instructions (sbset, sbclr, sbinv, sbext
)
RV32, RV64: sbset rd, rs1, rs2 sbclr rd, rs1, rs2 sbinv rd, rs1, rs2 sbext rd, rs1, rs2 sbseti rd, rs1, imm sbclri rd, rs1, imm sbinvi rd, rs1, imm sbexti rd, rs1, imm
RV64: sbsetw rd, rs1, rs2 sbclrw rd, rs1, rs2 sbinvw rd, rs1, rs2 sbextw rd, rs1, rs2 sbsetiw rd, rs1, imm sbclriw rd, rs1, imm sbinviw rd, rs1, imm
We define 4 singlebit instructions sbset
(set), sbclr
(clear),
sbinv
(invert), and sbext
(extract), and their immediatevariants,
with the following semantics:
2.1.8 Shift Ones (Left/Right) (slo, sloi, sro, sroi
)
RV32, RV64: slo rd, rs1, rs2 sro rd, rs1, rs2 sloi rd, rs1, imm sroi rd, rs1, imm
RV64 only: slow rd, rs1, rs2 srow rd, rs1, rs2 sloiw rd, rs1, imm sroiw rd, rs1, imm
These instructions are similar to shiftlogical operations from the base spec, except instead of shifting in zeros, they shift in ones.
ISAs with flag registers often have a "Shift in Carry" or "Rotate through Carry" instruction. Arguably a "Shift Ones" is an equivalent on an ISA like RISCV that avoids such flag registers.
The main application for the Shift Ones instruction is mask generation.
When implementing this circuit, the only change in the ALU over a standard logical shift is that the value shifted in is not zero, but is a 1bit register value that has been forwarded from the high bit of the instruction decode. This creates the desired behavior on both logical zeroshifts and logical onesshifts.
2.2 Bit permutation instructions
The following sections describe 3 types of bit permutation instructions: Rotate shift, generalized reverse, and generalized shuffle.
A bit permutation essentially is applying an invertible function to the bit addresses. (Bit addresses are 5 bit values on RV32 and 6 bit values on RV64.)
Rotate shift by k is simply addition (rol
) or subtraction (ror
) modulo XLEN.
i′_{rot} := i ± kmod XLEN
Generalized reverse with control argument k is simply XORing the bit addresses with k:
i′_{grev} := i ⊕ k
And generalized shuffle is performing a bit permutation on the bits of the bit addresses:
i′_{shfl} := perm_{k}(i)
With the caveat that a single shfl
/unshfl
instruction can only
perferm a certain subset of bit address permutations, but a sequence of 4 shfl
/unshfl
instructions can perform any of the 120 such permutations on
RV32, and a sequence of 5 shfl
/unshfl
instructions can perform any
of the 720 such permutations on RV64.
Combining those three types of operations makes a vast number of bit permutations accessible within only a few instructions [Wolf19A] (see Table [numperms]).
N  ROTonly  GREVonly  SHFLonly  ROT+GREV  ROT+GREV+SHFL 

0  1  1  1  1  1 
1  32  32  24  62  85 
2  —  —  86  864  3 030 
3  —  —  119  4 640  78 659 
4  —  —  120  23 312  2 002 167 
5  —  —  —  92 192  50 106 844 
6  —  —  —  294 992  1 234 579 963 
7  —  —  —  703 744  est. 30 000 000 000 
8  —  —  —  1 012 856  est. 700 000 000 000 
9  —  —  —  1 046 224  est. 15 000 000 000 000 
10  —  —  —  1 048 576  ⋯ ⋯ ⋯ ⋯ ⋯ 
11  —  —  —  —  ⋯ ⋯ ⋯ ⋯ ⋯ 
Sequences of ror
, grev
, and [un]shfl
instructions can
generate any arbitrary bit permutation. Often in surprising ways. For example,
the following sequence swaps the two LSB bits of a0
:
Instruction  State (XLEN=8)  BitIndex Op 

initial value  7 6 5 4 3 2 1 0  — 
rori a0, a0, 2 
1 0 7 6 5 4 3 2  i′ := i − 2 
unshfli a0, a0, 1 
1 7 5 3 0 6 4 2  i′ := ror(i) 
roli a0, a0, 1 
7 5 3 0 6 4 2 1  i′ := i + 1 
shfli a0, a0, 1 
7 6 5 4 3 2 0 1  i′ := rol(i) 
rori a0, a0, 2
unshfli a0, a0, 1
roli a0, a0, 1
shfli a0, a0, 1
The mechanics of this sequence is closely related to the fact that rol(ror(x2)+1)
is a function that maps 1 to 0 and 0 to 1 and every other
number to itself. (With rol
and ror
denoting 1bit rotate left and
right shifts respectively.) See Table [permdemo] for details.
The numbers in the right column of Table [numperms] might appear large,
but they are tiny in comparison to the total number of 32bit permutations
(2.63 ⋅ 10^{35}). Fortunately the [un]shfl
instruction “explores”
permutations that involve fields that have a size that’s a poweroftwo,
and/or moves that are a poweroftwo first, which means we get shorter sequences
for permutations we more often care about in realworld applications.
For example, there are 24 ways of arranging the four bytes in a 32bit word.
ror
, grev
, and [un]shfl
can perform any of those permutations
in at most 3 instructions. See Table [permbytes] for a list of those 24
sequences.
There are 40320 ways of arranging the eight bytes in a 64bit word or the eight
nibbles in a 32bit word. ror
, grev
, and [un]shfl
can perform
any of those permutations in at most 9 instructions. [Wolf19A]
Besides the moreorless arbitrary permutations we get from combining ror
, grev
, and [un]shfl
in long sequences, there are of course
many cases where just one of these instructions does exactly what we need.
For grev
and [un]shfl
we define rev*
and [un]zip*
pseudo instructions for the most common usecases.
2.2.1 Rotate (Left/Right) (rol, ror, rori
)
RV32, RV64: ror rd, rs1, rs2 rol rd, rs1, rs2 rori rd, rs1, imm
RV64 only: rorw rd, rs1, rs2 rolw rd, rs1, rs2 roriw rd, rs1, imm
These instructions are similar to shiftlogical operations from the base spec, except they shift in the values from the opposite side of the register, in order. This is also called ‘circular shift’.
2.2.2 Generalized Reverse (grev
, grevi
, rev
)
RV32, RV64: grev rd, rs1, rs2 grevi rd, rs1, imm
RV64 only: grevw rd, rs1, rs2 greviw rd, rs1, imm
This instruction provides a single hardware instruction that can implement all of byteorder swap, bitwise reversal, shortorderswap, wordorderswap (RV64), nibbleorder swap, bitwise reversal in a byte, etc, all from a single hardware instruction.
The Generalized Reverse (GREV) operation iteratively checks each bit i in the 2nd argument from i = 0 to log_{2}(XLEN) − 1, and if the corresponding bit is set, swaps each adjacent pair of 2^{i} bits.
The above pattern should be intuitive to understand in order to extend this definition in an obvious manner for RV128.
The grev
operation can easily be implemented using a permutation
network with log_{2}(XLEN) stages. Figure 1.1
shows the permutation network for ror
for reference.
Figure 1.2 shows the permutation network for grev
.
Pseudoinstructions are provided for the most common GREVI usecases. Their names consist of a prefix and and optional suffix. Each prefix and suffix corresponds to a bit mask (see Table [grevinames]). The GREVI control word is obtained by ANDing the two masks together.
Prefix  Mask  Suffix  Mask  

rev 
111111  —  111111  
rev2 
111110  .w 
011111  (w = word) 

rev4 
111100  .h 
001111  (h = half word) 

rev8 
111000  .b 
000111  (b = byte) 

rev16 
110000  .n 
000011  (n = nibble) 

rev32 
100000  .p 
000001  (p = pair) 
In other words, the prefix controls the number of zero bits at the LSB end of the control word, and the suffix controls the number of zeros at the MSB end of the control word.
rev8
reverses the order of bytes in a word, thus performs endianness conversion. This
is equivalent to the ARM REV
instructions or BSWAP
on x86. ARM also has instructions
for swapping the bytes in 16bit and 32bit words, and reversing the bit order (see table [grevicmp]).
RISCV  ARM  X86 

rev 
RBIT 
— 
rev8.h 
REV16 
— 
rev8.w 
REV32 
— 
rev8 
REV 
BSWAP 
2.2.3 Generalized Shuffle (shfl
, unshfl
, shfli
, unshfli
, zip
, unzip
)
RV32, RV64: shfl rd, rs1, rs2 unshfl rd, rs1, rs2 shfli rd, rs1, imm unshfli rd, rs1, imm
RV64 only: shflw rd, rs1, rs2 unshflw rd, rs1, rs2
Shuffle is the third bit permutation instruction in the RISCV Bitmanip extension, after rotary shift and generalized reverse. It implements a generalization of the operation commonly known as perfect outer shuffle and its inverse (shuffle/unshuffle), also known as zip/unzip or interlace/uninterlace.
Bit permutations can be understood as reversible functions on bit indices (i.e. 5 bit functions on RV32 and 6 bit functions on RV64).
A generalized (un)shuffle operation has log_{2}(XLEN) − 1 control bits,
one for each pair of neighbouring bits in a bit index. When the bit is set,
generalized shuffle will swap the two index bits. The shfl
operation
performs this swaps in MSBtoLSB order (performing a rotate left shift on
contiguous regions of set control bits), and the unshfl
operation performs
the swaps in LSBtoMSB order (performing a rotate right shift on contiguous
regions of set control bits). Combining up to log_{2}(XLEN) of those
shfl
/unshfl
operations can implement any bitpermutation on the
bit indices.
The most common type of shuffle/unshuffle operation is one on an immediate
control value that only contains one contiguous region of set bits. We call
those operations zip/unzip and provide pseudoinstructions for them. The naming
scheme for those pseudoinstructions is similar to the naming scheme for the
grevi
pseudoinstructions (see Tables 1.3 and
[grevinames]), except that the LSB bit of the masks in Table [grevinames]
is not used for zip/unzip.
Shuffle/unshuffle operations that only have individual bits set (not a contiguous region of two or more bits) are their own inverse.
Like GREV and rotate shift, the (un)shuffle instruction can be implemented using a short sequence of elementary permutations, that are enabled or disabled by the shamt bits. But (un)shuffle has one stage fewer than GREV. Thus shfli+unshfli together require the same amount of encoding space as grevi.
Or for RV64:
The above pattern should be intuitive to understand in order to extend this definition in an obvious manner for RV128.
Alternatively (un)shuffle can be implemented in a single network with one more stage than GREV, with the additional first and last stage executing a permutation that effectively reverses the order of the inner stages. However, since the inner stages only mux half of the bits in the word each, a hardware implementation using this additional “flip” stages might actually be more expensive than simply creating two networks.
Figure 1.7 shows the (un)shuffle permutation network with “flip” stages and Figure 1.6 shows the (un)shuffle permutation network without “flip” stages.
The zip
instruction with the upper half of its input cleared performs
the commonly needed “fanout” operation. (Equivalent to bdep
with a
0x55555555 mask.) The zip
instruction applied twice fans out the bits
in the lower quarter of the input word by a spacing of 4 bits.
For example, the following code calculates the bitwise prefix sum of the bits in the lower byte of a 32 bit word on RV32:
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
slli a1, a0, 4
c.add a0, a1
slli a1, a0, 8
c.add a0, a1
slli a1, a0, 16
c.add a0, a1
The final prefix sum is stored in the 8 nibbles of the a0
output word.
Similarly, the following code stores the indices of the set bits in the LSB nibbles of the output word (with the LSB bit having index 1), with the unused MSB nibbles in the output set to zero:
andi a0, a0, 0xff
zip a0, a0
zip a0, a0
orc.n a0, a0
li a1, 0x87654321
and a1, a0, a1
bext a0, a1, a0
Other zip
modes can be used to “fanout” in blocks of 2, 4, 8, or 16 bit.
zip
can be combined with grevi
to perform inner shuffles. For example
on RV64:
li a0, 0x0000000012345678
zip4 t0, a0 ; < 0x0102030405060708
rev4.b t1, t0 ; < 0x1020304050607080
zip8 t2, a0 ; < 0x0012003400560078
rev8.h t3, t2 ; < 0x1200340056007800
zip16 t4, a0 ; < 0x0000123400005678
rev16.w t5, t4 ; < 0x1234000056780000
Another application for the zip instruction is generating Morton code [MortonCode].
The x86 PUNPCK[LH]*
MMX/SSE/AVX instructions perform similar operations
as zip8
and zip16
.
2.2.4 Crossbar Permutation Instructions (xperm.[nbhw]
)
RV32, RV64: xperm.n rd, rs1, rs2 xperm.b rd, rs1, rs2 xperm.h rd, rs1, rs2
RV64 only: xperm.w rd, rs1, rs2
These instructions operate on nibbles/bytes/halfwords/words. rs1
is
a vector of data words and rs2
is a vector of indices into rs1
.
The result of the instruction is the vector rs2
with each element replaced
by the corresponding data word from rs1
, or zero then the index in rs2
is out of bounds.
The xperm.[nbhw]
instructions can be implemented with an XLEN/4lane
nibblewide crossbar switch.
The xperm.n
instruction can be used to implement an arbitrary 64bit
bit permutation in 15 instructions, using 8 control words and 3 constant masks [Wolf20A]:
uint64_t perm64_bitmanip_cmix(perm64t &p, uint64_t v)
{
uint64_t v0 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[0]) & p.mask[0], 3); // 3 insns
uint64_t v1 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[1]) & p.mask[1], 3); // 6 insns
uint64_t v2 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[2]) & p.mask[2], 3); // 9 insns
uint64_t v3 = _rv64_gorc(_rv64_xperm_n(v, p.ctrl[3]) & p.mask[3], 3); // 12 insns
v0 = _rv_cmix(0x1111111111111111LL, v0, v1); // 13 insns
v2 = _rv_cmix(0x4444444444444444LL, v2, v3); // 14 insns
return _rv_cmix(0x3333333333333333LL, v0, v2); // 15 insns
}
2.3 Generalized ORCombine (gorc, gorci
)
RV32, RV64: gorc rd, rs1, rs2 gorci rd, rs1, imm
RV64 only: gorcw rd, rs1, rs2 gorciw rd, rs1, imm
The GORC operation is similar to GREV, except that instead of swapping pairs of bits, GORC ORs them together, and writes the new value in both positions.
GORC can be usefull for copying naturally aligned fields in a word, and testing such fields for being equal zero.
gorci
pseudoinstructions follow the same naming scheme as grevi
pseudoinstructions (see Tables 1.3 and [grevinames]),
except the prefix orc
is used instead of rev
. See
Table 1.8 for a full list of gorci
pseudoinstructions.
An important usecase is strlen()
and strcpy()
, which can utilize
orc.b
for testing for zero bytes, and counting trailing nonzero bytes
in a word.
2.4 BitField Place (bfp
)
RV32, RV64: bfp rd, rs1, rs2
RV64 only: bfpw rd, rs1, rs2
The bit field place (bfp
) instruction places up to XLEN/2 LSB bits from rs2
into
the value in rs1
. The upper bits of rs2
control the length of the bit field
and target position. The layout of rs2
is chosen in a way that makes it possible to construct
rs2
easily using pack[h]
instructions and/or andi
/lui
.
The layout of the control word in rs2 is as follows for RV32. LEN=0
encodes for LEN=16
.
 3 2 1 
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0

  LEN   OFF  DATA 

And on RV64 (with LEN=0
encoding for LEN=32
):
 6 5 4 3 
3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 .... 2 1 0
  
SEL  LEN   OFF   LEN'   OFF'  DATA 
  
When SEL=10
then LEN
and OFF
are used, otherwise LEN’
and OFF’
are used.
Placing bits from a0
in a1
, with results in t0
on RV32:
addi t0, zero, {length[3:0], offset[7:0]}
pack t0, a0, t0
bfp t0, a1, t0
And on RV64:
lui t0, zero, {3'b 100, length[4:0], offset[7:0], 4'b 0000}
pack t0, a0, t0
bfp t0, a1, t0
Or with a2=length
and a3=offset
:
packh t0, a3, a2
pack t0, a0, t0
bfp t0, a1, t0
Placing up to 16 constant bits in any contiguous region:
lui t0, ...
addi t0, t0, ...
bfp[w] t0, a1, t0
Note that above sequences only modify one register (t0
), which makes them
fuseable sequences.
2.5 Bit Extract/Deposit (bext, bdep
)
RV32, RV64: bext rd, rs1, rs2 bdep rd, rs1, rs2
RV64 only: bextw rd, rs1, rs2 bdepw rd, rs1, rs2
This instructions implement the generic bit extract and bit deposit functions. This operation is also referred to as bit gather/scatter, bit pack/unpack, parallel extract/deposit, compress/expand, or right_compress/right_expand.
bext
collects LSB justified bits to rd from rs1 using extract mask in rs2.
bdep
writes LSB justified bits from rs1 to rd using deposit mask in rs2.
Implementations may choose to use smaller multicycle implementations of
bext
and bdep
, or even emulate the instructions in software.
Even though multicycle bext
and bdep
often are not fast
enough to outperform algorithms that use sequences of shifts and bit masks,
dedicated instructions for those operations can still be of great advantage in
cases where the mask argument is not constant.
For example, the following code efficiently calculates the index of the tenth
set bit in a0
using bdep
:
li a1, 0x00000200
bdep a0, a1, a0
ctz a0, a0
For cases with a constant mask an optimizing compiler would decide when to use
bext
or bdep
based on the optimization profile for the
concrete processor it is optimizing for. This is similar to the decision
whether to use MUL or DIV with a constant, or to perform the same operation
using a longer sequence of much simpler operations.
The bext
and bdep
instructions are equivalent to the x86 BMI2
instructions PEXT and PDEP. But there is much older prior art. For example, the
soviet BESM6 mainframe computer, designed and built in the 1960s, had APX/AUX
instructions with almost the same semantics. [BESM6] (The BESM6 APX/AUX
instructions packed/unpacked at the MSB end instead of the LSB end. Otherwise
it is the same instruction.)
Efficient hardware implementations of bext
and bdep
are described
in [Hilewitz06] and demonstrated in [Wolf17B].
2.6 CarryLess Multiply (clmul, clmulh, clmulr
)
RV32, RV64: clmul rd, rs1, rs2 clmulh rd, rs1, rs2 clmulr rd, rs1, rs2
RV64 only: clmulw rd, rs1, rs2 clmulhw rd, rs1, rs2 clmulrw rd, rs1, rs2
Calculate the carryless product [CarryLessProduct] of the two arguments. clmul
produces the lower half of the carryless product and clmulh
produces the upper half
of the 2⋅XLEN carryless product.
Carryless multiplication is equivalent to multiplication in the polynomial ring over GF(2).
clmulr
produces bits 2⋅XLEN − 2:XLEN1 of the 2⋅XLEN carryless product.
That means clmulh
is equivalent to clmulr
followed by a 1bit right shift.
(The MSB of a clmulh
result is always zero.) Another equivalent definition of
clmulr
is clmulr(a,b) := rev(clmul(rev(a), rev(b)))
. (The “r”
in clmulr
means reversed.)
Unlike mulh[[s]u]
, we add a *W variant of clmulh
. This is because we expect
some code to use 32bit clmul intrisics, even on 64bit architectures. For example in cases
where data is processed in 32bit chunks.
The classic applications for clmul
are Cyclic Redundancy Check (CRC) [FastCRC Wolf18A]
and Galois/Counter Mode (GCM), but more applications exist, including the following examples.
There are obvious applications in hashing and pseudo random number generations. For example, it has been reported that hashes based on carryless multiplications can outperform Google’s CityHash [CLHASH].
clmul
of a number with itself inserts zeroes between each input bit. This can
be useful for generating Morton code [MortonCode].
clmul
of a number with 1 calculates the prefix XOR operation. This can
be useful for decoding gray codes.
Another application of XOR prefix sums calculated with clmul
is
branchless tracking of quoted strings in highperformance parsers. [ParseJSON]
Carryless multiply can also be used to implement Erasure code efficiently. [ClmulErasureCode]
SPARC introduced similar instructions (XMULX, XMULXHI) in SPARC T3 in 2010. [sparct3]
TI C6000 introduced a similar instruction (XORMPY) in C64x+. [c64xp]
2.7 CRC Instructions (crc32.[bhwd], crc32c.[bhwd]
)
RV32, RV64: crc32.b rd, rs crc32.h rd, rs crc32.w rd, rs crc32c.b rd, rs crc32c.h rd, rs crc32c.w rd, rs
RV64 only: crc32.d rd, rs crc32c.d rd, rs
Unary Cyclic Redundancy Check (CRC) instructions that interpret the bits of rs1 as a CRC32/CRC32C state and perform a polynomial reduction of that state shifted left by 8, 16, 32, or 64 bits.
The instructions return the new CRC32/CRC32C state.
The crc32.w
/crc32c.w
instructions are equivalent to executing
crc32.h
/crc32c.h
twice, and crc32.h
/crc32c.h
instructions are equivalent to executing crc32.b
/crc32c.b
twice.
All 8 CRC instructions operate on bitreflected data.
Payload data must be XOR’ed into the LSB end of the state before executing the
CRC instruction. The following code demonstrates the use of crc32.b
:
uint32_t crc32_demo(const uint8_t *p, int len)
{
uint32_t x = 0xffffffff;
for (int i = 0; i < len; i++) {
x = x ^ p[i];
x = crc32_b(x);
}
return ~x;
}
In terms of binary polynomial arithmetic those instructions perform the operation
$$\texttt{rd}'(x) = (\texttt{rs1}'(x) \cdot x^N) \; \textrm{mod} \; \{\texttt{1}, P'\}(x)\textrm,$$
with N ∈ {8, 16, 32, 64},
P = 0xEDB8_8320
for CRC32 and P = 0x82F6_3B78
for CRC32C,
a′ denoting the XLEN bit reversal of a,
and {a, b} denoting bit concatenation.
Note that for example for CRC32 {1
, P′} = 0x1_04C1_1DB7
on RV32 and {1
, P′} = 0x1_04C1_1DB7_0000_0000
on RV64.
These dedicated CRC instructions are meant for RISCV implementations without fast multiplier
and therefore without fast clmul[h]
. For implementations with fast clmul[h]
it is recommended to use the methods described in [FastCRC] and demonstrated in [Wolf18A]
that can process XLEN input bits using just one carryless multiply for arbitrary CRC polynomials.
In applications where those methods are not applicable it is possible to emulate the dedicated CRC
instructions using two carryless multiplies that implement a Barrett reduction. The following example
implements a replacement for crc32.w
(RV32).
crc32_w:
li t0, 0xF7011641
li t1, 0xEDB88320
clmul a0, a0, t0
clmulr a0, a0, t1
ret
2.8 BitMatrix Instructions (bmatxor, bmator, bmatflip
, RV64 only)
RV64 only: bmator rd, rs1, rs2 bmatxor rd, rs1, rs2 bmatflip rd, rs
These are 64bitonly instruction that are not available on RV32. On RV128 they ignore the upper half of operands and sign extend the results.
This instructions interpret a 64bit value as 8x8 binary matrix.
bmatxor
performs a matrixmatrix multiply with boolean AND as multiply
operator and boolean XOR as addition operator.
bmator
performs a matrixmatrix multiply with boolean AND as multiply
operator and boolean OR as addition operator.
bmatflip
is a unary operator that transposes the source matrix. It is
equivalent to zip; zip; zip
on RV64.
Among other things, bmatxor
/bmator
can be used to perform
arbitrary permutations of bits within each byte (permutation matrix as 2nd
operand) or perform arbitrary permutations of bytes within a 64bit word
(permutation matrix as 1st operand).
There are similar instructions in Cray XMT [CrayXMT]. The Cray X1
architecture even has a full 64x64 bit matrix multiply unit [CrayX1].
(See Section [bmat64] for how to implement 64x64 bit matix operations
with bmat[x]or
.)
The MMIX architecture has MOR and MXOR instructions with the same semantic. [Knuth4A]
The x86 EVEX/VEX/SSE instruction GF2P8AFFINEQB is equivalent to bmatxor
.
The bmm.8
instruction proposed in [Hilewitz08] is also equivalent to bmatxor
.
2.9 Ternary BitManipulation Instructions
2.9.1 Conditional Mix (cmix
)
RV32, RV64: cmix rd, rs2, rs1, rs3
(Note that the assembler syntax of cmix
has the rs2
argument first
to make assembler code more readable. But the reference C code code below uses
the “architecturally correct” argument order rs1, rs2, rs3
.)
The cmix rd, rs2, rs1, rs3
instruction selects bits from rs1
and rs3
based
on the bits in the control word rs2
.
It replaces sequences like the following.
and rd, rs1, rs2
andn t0, rs3, rs2
or rd, rd, t0
Using cmix
a single butterfly stage can be implemented in only two
instructions. Thus, arbitrary bitpermutations can be implemented using only
18 instruction (32 bit) or 22 instructions (64 bits).
2.9.2 Conditional Move (cmov
)
RV32, RV64: cmov rd, rs2, rs1, rs3
(Note that the assembler syntax of cmov
has the rs2
argument first
to make assembler code more readable. But the reference C code code below uses
the “architecturally correct” argument order rs1, rs2, rs3
.)
The cmov rd, rs2, rs1, rs3
instruction selects rs1
if the control
word rs2
is nonzero, and rs3
if the control word is zero.
The cmov
instruction helps avoiding branches, which can lead to better
performance, and helps with constanttime code as used in some cryptography
applications.
2.9.3 Funnel Shift (fsl
, fsr
, fsri
)
RV32, RV64: fsl rd, rs1, rs3, rs2 fsr rd, rs1, rs3, rs2 fsri rd, rs1, rs3, imm
RV64 only: fslw rd, rs1, rs3, rs2 fsrw rd, rs1, rs3, rs2 fsriw rd, rs1, rs3, imm
(Note that the assembler syntax for funnel shifts has the rs2
argument
last to make assembler code more readable. But the reference C code code below
uses the “architecturally correct” argument order rs1, rs2, rs3
.)
The fsl rd, rs1, rs3, rs2
instruction creates a 2 ⋅ XLEN word
by concatenating rs1 and rs3 (with rs1 in the MSB half), rotateleftshifts that
word by the amount indicated in the log_{2}(XLEN) + 1 LSB bits in rs2, and
then writes the MSB half of the result to rd.
The fsr rd, rs1, rs3, rs2
instruction creates a 2 ⋅ XLEN word
by concatenating rs1 and rs3 (with rs1 in the LSB half), rotaterightshifts that
word by the amount indicated in the log_{2}(XLEN) + 1 LSB bits in rs2, and
then writes the LSB half of the result to rd.
A shift unit capable of either fsl
or fsr
is capable of performing all
the other shift functions, including the other funnel shift, with only minimal additional
logic.
For any values of A
, B
, and C
:
fsl(A, B, C) = fsr(A, B, C)
And for any values x
and 0 ≤ shamt
< XLEN
:
sll(x, shamt) == fsl(x, shamt, 0)
srl(x, shamt) == fsr(x, shamt, 0)
sra(x, shamt) == fsr(x, shamt, sext_x)
slo(x, shamt) == fsl(x, shamt, ~0)
sro(x, shamt) == fsr(x, shamt, ~0)
ror(x, shamt) == fsr(x, shamt, x)
rol(x, shamt) == fsl(x, shamt, x)
Furthermore an RV64 implementation of either fsl
or fsr
is capable
of performing the *W versions of all shift operations with only a few gates
of additional control logic.
On RV128 there is no fsri
instruction. But there is fsriw
and fsrid
.
2.10 Address calculation instructions
RV32, RV64: sh1add rd, rs1, rs2 sh2add rd, rs1, rs2 sh3add rd, rs1, rs2
RV64 only: sh1addu.w rd, rs1, rs2 sh2addu.w rd, rs1, rs2 sh3addu.w rd, rs1, rs2
These instructions shift rs1
left by 1, 2, or 3 bits, then add the result
to rs2
. The sh?addu.w
instructions are identical to sh?add
, except
that bits XLEN1:32 of the rs1
argument are cleared before the shift.
An opcode for sh4add
/sh4addu.w
for RV128 and/or RVQ is reserved.
2.11 Mixed uint32/int64 arithmetic instructions
Consider RV64 C code that’s using unsigned 32bit ints as array indices. For example:
char u32i64demo(char *p, unsigned int i) {
return p[i1];
}
In such cases the expression within p[…]
must overflow according to
32bit arithmetic, then be zeroextended, and then this zeroextended result
must be used in the address calculation.
The instructions below make sure that no explicit zext.w
instruction
is needed in such cases, to make sure there is no systematic performance
penalty for code like shown above on RV64 compared to RV32.
2.11.1 Add/sub with postfix zeroextend (addwu
, subwu
, addiwu
)
RV64: addwu rd, rs1, rs2 subwu rd, rs1, rs2 addiwu rd, rs1, imm
These instructions are identical to addw
, subw
, addiw
,
except that bits XLEN1:32 of the result are cleared after the addition. I.e.
these instructions zeroextend instead of signextend the 32bit result.
The 12bit immediate to addiwu
is signextended, exactly like the immediate
to addiw
.
2.11.2 Add/sub/shift with prefix zeroextend (addu.w
, subu.w
, slliu.w
)
RV64: addu.w rd, rs1, rs2 subu.w rd, rs1, rs2 slliu.w rd, rs1, imm
slliu.w
is identical to slli
, except that bits XLEN1:32 of the
rs1
argument are cleared before the shift.
addu.w
and subu.w
are identical to add
and sub
, except
that bits XLEN1:32 of the rs2
argument are cleared before the add/subtract.
[bextcrefslliuw] [bextcrefadduw]
2.12 Opcode Encodings
This chapter contains proposed encodings for most of the instructions described in this document. DO NOT IMPLEMENT THESE OPCODES YET. We are trying to get official opcodes assigned and will update this chapter soon with the official opcodes.
The andn
, orn
, and xnor
instruction are encoded the same way
as and
, or
, and xor
, but with op[30]
set, mirroring the
encoding scheme used for add
and sub
.
All shift instructions use funct3=001
for left shifts and funct3=101
for right shifts. Just like in the RISCV integer base ISA, the shiftimmediate
instructions have a 5 bit immediate on RV32, and a 6 bit immediate on RV64, and we
reserve encoding space for a 7 bit immediate for RV128. The same sizes apply
to sbseti
, sbclri
, sbinvi
, and sbexti
.
The immediate for shfli
/unshufli
is one bit smaller than the immediate
for shift instructions, that is 4 bits on RV32, 5 bits on RV64, and we reserve 6
bits for RV128.
op[26]=1
selects funnel shifts. For funnel shifts op[30:29]
is part
if the 3rd operand and therefore unused for encoding the operation. For all other
shift operations op[26]=0
.
fsri
is also encoded with op[26]=1
, leaving a 6 bit immediate. The 7th
bit, that is necessary to perform a 128 bit funnel shift on RV64, can be
emulated by swapping rs1 and rs3.
There is no shfliw
instruction. The slliu.w
instruction occupies
the encoding slot that would be occupied by shfliw
.
On RV128 op[26]
contains the MSB of the immediate for the shift instructions.
Therefore there is no FSRI instruction on RV128. (But there is FSRIW/FSRID.)
 SLL SRL SRA  SLO SRO  ROL ROR  FSL FSR
op[30]  0 0 1  0 0  1 1   
op[29]  0 0 0  1 1  1 1   
op[26]  0 0 0  0 0  0 0  1 1
funct3  001 101 101  001 101  001 101  001 101
Only an encoding for RORI exists, as ROLI can be implemented with RORI by negating the immediate. Unary functions are encoded in the spot that would correspond to ROLI, with the function encoded in the 5 LSB bits of the immediate.
The CRC instructions are encoded as unary instructions with op[24]
set. The
polynomial is selected via op[23]
, with op[23]=0
for CRC32 and
op[23]=1
for CRC32C. The width is selected with op[22:20]
, using
the same encoding as is used in funct3
for load/store operations.
cmix
and cmov
are encoded using the two remaining ternary operator
encodings in funct3=001
and funct3=101
. (There are two ternary
operator encodings per minor opcode using the op[26]=1
scheme for
marking ternary OPs.)
The singlebit instructions are also encoded within the shift opcodes, with
op[27]
set, and using op[30]
and op[29]
to select the operation:
 SBCLR SBSET SBINV  SBEXT GORC GREV
op[30]  1 0 1  1 0 1
op[29]  0 1 1  0 1 1
op[27]  1 1 1  1 1 1
funct3  001 001 001  101 101 101
There is no sbextiw
instruction as it can be emulated trivially using
sbexti
. However, there is sbsetiw
, sbclriw
, and sbinviw
as changing bit 31 would change the sign extend. There are nonimmediate *W
instructions of all singlebit instructions, including sbextw
, because
the number of used bits in rs2 is different in sbext
and sbextw
.
GORC and GREV are encoded in the two remaining slots in the singlebit instruction encoding space.
The remaining instructions are encoded within funct7=0000100
and
funct7=0000101
.
The funct7=0000101
block contains clmul[hr]
,
min[u]
, and max[u]
.
The encoding of clmul, clmulr, clmulh
is identical to the encoding of
mulh, mulhsu, mulhu
, except that op[27]=1
.
The encoding of min[u]
/max[u]
uses funct3=100..111
. The
funct3
encoding matches op[31:29]
of the AMO min/max functions.
The remaining instructions are encoded within funct7=0000100
. The
shiftlike shfl
/unshfl
instructions uses the same funct3
values as the shift operations. bdep
and bext
are encoded in a
way so that funct3[2]
selects the “direction”, similar to shift
operations.
bmat[x]or
use funct3=011
and funct3=111
in funct7=0000100
.
pack
occupies funct3=100
in funct7=0000100
.
addwu
and subwu
are encoded like addw
and subw
, except
that op[25]=1
and op[27]=1
.
addu.w
and subu.w
are encoded like addw
and subw
, except
that op[27]=1
.
addiwu
is encoded using funct3=100
(XOR) instead of funct3=000
in OP32.
Finally, RV64 has W
instructions for all bitmanip instructions, with the
following exceptions:
andn
, cmix
, cmov
, min[u]
, max[u]
have no W
variants because they already behave in the way a W
instruction would
when presented with signexteded 32bit arguments.
bmatflip
, bmatxor
, bmator
have no W
variants because
they are 64bit only instructions.
crc32.[bhwd]
, crc32c.[bhwd]
have no W
variants because crc32[c].w
is deemed sufficient.
There is no [un]shfliw
, as a perfect outer shuffle always preserves the
MSB bit, thus [un]shfli
preserves proper sign extension when the
upper bit in the control word is set. There’s still [un]shflw
that
masks that upper control bit and signextends the output.
Relevant instruction encodings from the base ISA are included in the table below
and are marked with a .
 3 2 1 
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0

 funct7  rs2  rs1  f3  rd  opcode  Rtype
 rs3  f2 rs2  rs1  f3  rd  opcode  R4type
 imm  rs1  f3  rd  opcode  Itype
===============================================================
 0000000  rs2  rs1  111  rd  0110011  AND*
 0000000  rs2  rs1  110  rd  0110011  OR*
 0000000  rs2  rs1  100  rd  0110011  XOR*
 0100000  rs2  rs1  111  rd  0110011  ANDN
 0100000  rs2  rs1  110  rd  0110011  ORN
 0100000  rs2  rs1  100  rd  0110011  XNOR

 0000000  rs2  rs1  001  rd  0110011  SLL*
 0000000  rs2  rs1  101  rd  0110011  SRL*
 0100000  rs2  rs1  101  rd  0110011  SRA*
 0010000  rs2  rs1  001  rd  0110011  SLO
 0010000  rs2  rs1  101  rd  0110011  SRO
 0110000  rs2  rs1  001  rd  0110011  ROL
 0110000  rs2  rs1  101  rd  0110011  ROR

 0010000  rs2  rs1  010  rd  0110011  SH1ADD
 0010000  rs2  rs1  100  rd  0110011  SH2ADD
 0010000  rs2  rs1  110  rd  0110011  SH3ADD

 0100100  rs2  rs1  001  rd  0110011  SBCLR
 0010100  rs2  rs1  001  rd  0110011  SBSET
 0110100  rs2  rs1  001  rd  0110011  SBINV
 0100100  rs2  rs1  101  rd  0110011  SBEXT
 0010100  rs2  rs1  101  rd  0110011  GORC
 0110100  rs2  rs1  101  rd  0110011  GREV

 00000  imm  rs1  001  rd  0010011  SLLI*
 00000  imm  rs1  101  rd  0010011  SRLI*
 01000  imm  rs1  101  rd  0010011  SRAI*
 00100  imm  rs1  001  rd  0010011  SLOI
 00100  imm  rs1  101  rd  0010011  SROI
 01100  imm  rs1  101  rd  0010011  RORI

 01001  imm  rs1  001  rd  0010011  SBCLRI
 00101  imm  rs1  001  rd  0010011  SBSETI
 01101  imm  rs1  001  rd  0010011  SBINVI
 01001  imm  rs1  101  rd  0010011  SBEXTI
 00101  imm  rs1  101  rd  0010011  GORCI
 01101  imm  rs1  101  rd  0010011  GREVI

 rs3  11 rs2  rs1  001  rd  0110011  CMIX
 rs3  11 rs2  rs1  101  rd  0110011  CMOV
 rs3  10 rs2  rs1  001  rd  0110011  FSL
 rs3  10 rs2  rs1  101  rd  0110011  FSR
 rs3 1 imm  rs1  101  rd  0010011  FSRI

 3 2 1 
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
===============================================================
 0110000  00000  rs1  001  rd  0010011  CLZ
 0110000  00001  rs1  001  rd  0010011  CTZ
 0110000  00010  rs1  001  rd  0010011  PCNT
 0110000  00011  rs1  001  rd  0010011  BMATFLIP
 0110000  00100  rs1  001  rd  0010011  SEXT.B
 0110000  00101  rs1  001  rd  0010011  SEXT.H

 0110000  10000  rs1  001  rd  0010011  CRC32.B
 0110000  10001  rs1  001  rd  0010011  CRC32.H
 0110000  10010  rs1  001  rd  0010011  CRC32.W
 0110000  10011  rs1  001  rd  0010011  CRC32.D
 0110000  11000  rs1  001  rd  0010011  CRC32C.B
 0110000  11001  rs1  001  rd  0010011  CRC32C.H
 0110000  11010  rs1  001  rd  0010011  CRC32C.W
 0110000  11011  rs1  001  rd  0010011  CRC32C.D

 0000101  rs2  rs1  001  rd  0110011  CLMUL
 0000101  rs2  rs1  010  rd  0110011  CLMULR
 0000101  rs2  rs1  011  rd  0110011  CLMULH
 0000101  rs2  rs1  100  rd  0110011  MIN
 0000101  rs2  rs1  101  rd  0110011  MAX
 0000101  rs2  rs1  110  rd  0110011  MINU
 0000101  rs2  rs1  111  rd  0110011  MAXU

 0000100  rs2  rs1  001  rd  0110011  SHFL
 0000100  rs2  rs1  101  rd  0110011  UNSHFL
 0100100  rs2  rs1  110  rd  0110011  BDEP
 0000100  rs2  rs1  110  rd  0110011  BEXT
 0000100  rs2  rs1  100  rd  0110011  PACK
 0100100  rs2  rs1  100  rd  0110011  PACKU
 0000100  rs2  rs1  011  rd  0110011  BMATOR
 0100100  rs2  rs1  011  rd  0110011  BMATXOR
 0000100  rs2  rs1  111  rd  0110011  PACKH
 0100100  rs2  rs1  111  rd  0110011  BFP

 000010  imm  rs1  001  rd  0010011  SHFLI
 000010  imm  rs1  101  rd  0010011  UNSHFLI
===============================================================
 immediate  rs1  000  rd  0011011  ADDIW*
 immediate  rs1  100  rd  0011011  ADDIWU
 00001  imm  rs1  001  rd  0011011  SLLIU.W

 0000000  rs2  rs1  000  rd  0111011  ADDW*
 0100000  rs2  rs1  000  rd  0111011  SUBW*
 0000101  rs2  rs1  000  rd  0111011  ADDWU
 0100101  rs2  rs1  000  rd  0111011  SUBWU
 0000100  rs2  rs1  000  rd  0111011  ADDU.W
 0100100  rs2  rs1  000  rd  0111011  SUBU.W

 3 2 1 
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
===============================================================
 0010000  rs2  rs1  001  rd  0111011  SLOW
 0010000  rs2  rs1  101  rd  0111011  SROW
 0110000  rs2  rs1  001  rd  0111011  ROLW
 0110000  rs2  rs1  101  rd  0111011  RORW

 0010000  rs2  rs1  010  rd  0111011  SH1ADDU.W
 0010000  rs2  rs1  100  rd  0111011  SH2ADDU.W
 0010000  rs2  rs1  110  rd  0111011  SH3ADDU.W

 0100100  rs2  rs1  001  rd  0111011  SBCLRW
 0010100  rs2  rs1  001  rd  0111011  SBSETW
 0110100  rs2  rs1  001  rd  0111011  SBINVW
 0100100  rs2  rs1  101  rd  0111011  SBEXTW
 0010100  rs2  rs1  101  rd  0111011  GORCW
 0110100  rs2  rs1  101  rd  0111011  GREVW

 0010000  imm  rs1  001  rd  0011011  SLOIW
 0010000  imm  rs1  101  rd  0011011  SROIW
 0110000  imm  rs1  101  rd  0011011  RORIW

 0100100  imm  rs1  001  rd  0011011  SBCLRIW
 0010100  imm  rs1  001  rd  0011011  SBSETIW
 0110100  imm  rs1  001  rd  0011011  SBINVIW
 0010100  imm  rs1  101  rd  0011011  GORCIW
 0110100  imm  rs1  101  rd  0011011  GREVIW

 rs3  10 rs2  rs1  001  rd  0111011  FSLW
 rs3  10 rs2  rs1  101  rd  0111011  FSRW
 rs3  10 imm  rs1  101  rd  0011011  FSRIW

 0110000  00000  rs1  001  rd  0011011  CLZW
 0110000  00001  rs1  001  rd  0011011  CTZW
 0110000  00010  rs1  001  rd  0011011  PCNTW

 0000101  rs2  rs1  001  rd  0111011  CLMULW
 0000101  rs2  rs1  010  rd  0111011  CLMULRW
 0000101  rs2  rs1  011  rd  0111011  CLMULHW

 0000100  rs2  rs1  001  rd  0111011  SHFLW
 0000100  rs2  rs1  101  rd  0111011  UNSHFLW
 0100100  rs2  rs1  110  rd  0111011  BDEPW
 0000100  rs2  rs1  110  rd  0111011  BEXTW
 0000100  rs2  rs1  100  rd  0111011  PACKW
 0100100  rs2  rs1  100  rd  0111011  PACKUW
 0100100  rs2  rs1  111  rd  0111011  BFPW

Changes in v0.93 of the RISCV Bitmanip Spec (+
for addition,

for removal):
 3 2 1 
1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0 9 8 7 6 5 4 3 2 1 0
===============================================================
 0010000  rs2  rs1  010  rd  0110011  + SH1ADD
 0010000  rs2  rs1  100  rd  0110011  + SH2ADD
 0010000  rs2  rs1  110  rd  0110011  + SH3ADD

 0010000  rs2  rs1  010  rd  0111011  + SH1ADDU.W
 0010000  rs2  rs1  100  rd  0111011  + SH2ADDU.W
 0010000  rs2  rs1  110  rd  0111011  + SH3ADDU.W

2.13 Future compressed instructions
The RISCV ISA has no dedicated instructions for bitwise inverse (not
).
Instead not
is implemented as xori rd, rs, 1
and
neg
is implemented as sub rd, x0, rs
.
In bitmanipulation code not
is a very common operation. But there is
no compressed encoding for those operation because there is no c.xori
instruction.
On RV64 (and RV128) zext.w
and zext.d
(pack
and packw
)
are commonly used to zeroextend unsigned values <XLEN.
It presumably would make sense for a future revision of the “C” extension to include compressed opcodes for those instructions.
An encoding with the constraint rd = rs
would fit nicely in the
reserved space in c.addi16sp/c.lui
.
The entire RVC encoding space is 15.585 bits wide, the remaining reserved encoding space in RVC is 11.155 bits wide, not including space that is only reserved on RV32/RV64. This means that above encoding would use 0.0065% of the RVC encoding space, or 1.4% of the remaining reserved RVC encoding space. Preliminary experiments have shown that NOT instructions alone make up approximately 1% of bitmanipulation code size. [Wolf17A]
2.14 Macroop fusion patterns
Some bitmanip operations have been left out of this spec because of lack of a (sensible) way to encode them in the 32bit RISCV encoding space. Instead we present fuseable sequences of up to three instructions for those operations, so that highend processors can implement them in a single fused macroop, should they decide to do so.
For this document we only consider sequences fuseable if they read at most two registers, only write one register, and contain no branch instructions.
2.14.1 Fused bfp
sequences
The bfp
instruction is most commonly used in sequences of one the the following forms.
For 32bit (RV32 or *W instructions on RV64):
addi rd, zero, ...
pack[w] rd, rs2, rd
bfp[w] rd, rs1, rd
lui rd, ...
addi rd, rd, ...
bfp[w] rd, rs1, rd
And for 64bit:
lui rd, ...
pack rd, rs2, rd
bfp rd, rs1, rd
2.14.2 Loadimmediate
For 32bit code (RV32 or *W instructions on RV64) we recommend to fuse the lui+addi
pattern for loading a 32bit constants:
lui rd, imm
addi[w] rd, rd, imm
For RV64 we also recommend to fuse the followings patterns for loading a 32bit constant zeroextended into a 64bit register:
lui rd, imm
addiwu rd, rd, imm
addiw rd, zero, v
pack rd, rd, zero
Further, for loading 64bit consants in two macroops:
lui rd, imm
addiw rd, rd, imm
pack rd, rd, rs
2.14.3 Fused shift sequences
Pairs of left and right shifts are common operations for extracting a bit field.
To extract the contiguous bit field starting at pos
with length len
from rs
(with pos
> 0, len
> 0, and
pos
+ len
≤ XLEN):
slli rd, rs, (XLENlenpos)
srli rd, rd, (XLENlen)
Using srai
instead of srli
will signextend the extracted bitfield.
Similarly, placing a bit field with length len
at the position pos
:
slli rd, rs, (XLENlen)
srli rd, rd, (XLENlenpos)
Note that the postfix right shift instruction can use a compressed encoding, yielding a 48bit fused instruction if the left shift is a 32bit instruction.
More generally, a processor might fuse all destructive shift operations following any other ALU operation.
We define the following assembler pseudoops for sr[la]i
postfix fusion:
bfext rd, rs, len, pos > slli rd, rs, (XLENlenpos); srai rd, rd, (XLENlen)
bfextu rd, rs, len, pos > slli rd, rs, (XLENlenpos); srli rd, rd, (XLENlen)
bfmak rd, len, pos > sroi rd, zero, len; srli rd, rd, (XLENlenpos)
(The names bfext
, bfextu
, and bfmak
are borrowed from m88k,
that had dedicated instructions of those names (without bf
prefix) with
equivalent semantics. [m88k])
2.15 Other microarchitectural considerations
In addition to macroop fusion, we issue the following recommendations for cores that aim at better performance for bitmanipulation tasks.
2.15.1 Unaligned memory access
The base ISA spec requires load/store operations that are not naturally aligned to succeed in Umode, but explicitly states that execution may be slow.
For many bitmanipulation tasks it can be of great advantage to be able to perform load and store operations with arbitrary alignments quickly. Thus we recommend that cores optimized for bitmanipulation tasks provide fast hardware support for load/store with arbitrary alignment.
There should be a getauxval()
based mechanism as part of the RISCV Linux
ABI that can be used to query information on support for unaligned memory
access. [FastLdSt]
2.15.2 Fast multiply
A lot of bitmanipulation tricks rely on multiplication with “magic numbers” and similar tricks involving multiply/divide instructions. Thus, cores optimized for bitmanipulation tasks should provide reasonably fast implementations of the “M”extension multiply/divide instructions.
2.16 C intrinsics via <rvintrin.h>
A C header file <rvintrin.h>
is provided that contains assembler
templates for directly creating assembler instructions from C code.
The header defines _rv_(...)
functions that operate on the long
data type, _rv32_(...)
functions that operate on the int32_t
data type, and _rv64_(...)
functions that operate on the int64_t
data type. The _rv64_(...)
functions are only available on
RV64. See table 1.11 for a complete list of intrinsics defined in
<rvintrin.h>
.
Usage example:
#include <rvintrin.h>
int find_nth_set_bit(unsigned int value, int cnt) {
return _rv32_ctz(_rv32_bdep(1 << cnt, value));
}
Defining RVINTRIN_EMULATE
before including <rvintrin.h>
will
define plain C functions that emulate the behavior of the RISCV instructions.
This is useful for testing software on nonRISCV platforms.