21 “V” Standard Extension for Vector Operations, Version 0.4-DRAFT

This version is out-of-date with respect to the current working group draft, which is now hosted on https://github.com/riscv/riscv-v-spec.

This chapter presents a proposal for the RISC-V base vector instruction-set extension. The base vector extension is intended to provide general support for data-parallel execution within the 32-bit instruction encoding space, with later vector extensions supporting richer functionality for certain domains.

The vector extension is based on the style of vector register architecture introduced by Seymour Cray in the 1970s, as opposed to the earlier packed SIMD approach, introduced with the Lincoln Labs TX-2 in 1957 and now adopted by most other commercial instruction sets.

The base vector extension defines the components that must be included when the “V” bit is set in the misa register, and consequently those that will be assumed to exist by software written for an ABI specifying V.

This draft version of the chapter includes additional specifications of proposed extensions to the base vector extension to explain some of the encoding choices made for the base.

The vector extension supports a configurable vector unit, to enable implementations to trade off the number of active architectural vector registers and supported element widths against available maximum vector length. The vector extension is designed to allow the same binary code to work efficiently across a variety of hardware implementations varying in physical vector storage capacity and datapath spatial and/or temporal parallelism.

The vector instruction set contains many features developed in earlier research projects, including the Berkeley T0 and VIRAM [VIRAM] vector microprocessors, the MIT Scale vector-thread processor, and the Berkeley Maven and Hwacha projects.

21.1 Vector Unit State

The additional vector unit architectural state includes 32 vector registers (v0–v31), and an XLEN-bit WARL vector length CSR, vl. Each vector register vn has an associated 16-bit configuration field vtypen described below. A 6-bit global maximum element width register vmaxew defines the maximum number of bits of storage in every element of every active vector register.

Future vector extensions using wider instruction encodings can support more architectural vector registers, for example, 256 architectural vector registers in a 64-bit instruction encoding.

Future 2D shape extensions add two more vector length registers, vm and vn.

There is also a 3-bit fixed-point rounding mode CSR vxrm, and a single-bit fixed-point saturation status CSR vxsat. The vcs CSR alias provides combined access to the vl, vxrm, vxsat fields to reduce context switch time. The vcs register also includes a configuration mode field to support future extended configuration modes.

CUSTOMTAGBEGINDISCUSSION

The components of vcs might not need separate CSR addresses, depending on how they’re accessed via other non-CSR instructions.

CUSTOMTAGENDDISCUSSION

21.2 Vector Unit Type Configuration Register (`vtype`n)

The vector unit must be configured before use. Each architectural vector register, vn, is configured via 16 bits of vector type configuration state vtypen, which can be accessed via vector configuration (vcfg) CSRs and other rapid vector configuration instructions as described below. The vector register type configuration encodes the overall organization, or shape, of the elements in each vector register (e.g., scalar versus 1-D vector), as well as the bitwidth and numeric representation of each element. As shown in Figure 1.1, the 16-bit vtypen encoding is divided into a 5-bit current shape field vshapen, a 5-bit representation field verepn, and a 6-bit element bit-width field vewn held in the vcfgx CSRs. The combination of an element numeric representation and an element bitwidth is called an element format. Each vector register can also be disabled to free physical vector storage for other architectural vector registers.

Location of subfields within a single vtypen field.
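As an illustration of how the three subfields share the 16 bits of a vtypen value, the following Python sketch packs and unpacks them. The bit positions are an assumption made only for this example (vshape in bits 15 to 11, verep in bits 10 to 6, vew in bits 5 to 0); the authoritative positions are those in the figure. The helper names pack_vtype and unpack_vtype are hypothetical.

```python
# Sketch only: the bit positions below are ASSUMED for illustration and are
# not taken from the specification's figure.
VSHAPE_SHIFT, VSHAPE_BITS = 11, 5   # assumed: bits 15..11
VEREP_SHIFT, VEREP_BITS = 6, 5      # assumed: bits 10..6
VEW_SHIFT, VEW_BITS = 0, 6          # assumed: bits 5..0


def _check(name, value, bits):
    """Reject values that do not fit in the given field width."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{name}={value} does not fit in {bits} bits")


def pack_vtype(vshape, verep, vew):
    """Combine the three field encodings into one 16-bit vtype value."""
    _check("vshape", vshape, VSHAPE_BITS)
    _check("verep", verep, VEREP_BITS)
    _check("vew", vew, VEW_BITS)
    return (vshape << VSHAPE_SHIFT) | (verep << VEREP_SHIFT) | (vew << VEW_SHIFT)


def unpack_vtype(vtype):
    """Split a 16-bit vtype value back into (vshape, verep, vew)."""
    _check("vtype", vtype, 16)
    vshape = (vtype >> VSHAPE_SHIFT) & ((1 << VSHAPE_BITS) - 1)
    verep = (vtype >> VEREP_SHIFT) & ((1 << VEREP_BITS) - 1)
    vew = (vtype >> VEW_SHIFT) & ((1 << VEW_BITS) - 1)
    return vshape, verep, vew
```

Whatever the real bit positions are, the property that matters is that 5 + 5 + 6 = 16 bits round-trip without loss.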

It was also common in earlier vector machines to support multiple precisions within the vector datapath. In particular, the CDC STAR-100 [cdcstar100] supported single-precision and double-precision floating-point operations and also bit, byte, and nibble operations in the vector unit; TI ASC [tiasc] designs supported dividing 64-bit vector lanes into two 32-bit lanes for double throughput.

21.3 Shape Encoding

The 5-bit shape field describes the structure of the elements within the vector register. In the base vector extension, the shape can be set to either scalar or vector.

Base vector encoding of vshapen field.

For the base vector ISA, only a single bit is required in each vshape field to select between scalar and 1-D vector elements with the other bits hardwired to zero.

`vshape`	Shape
00000	scalar
00001	Reserved
0001x	Reserved
00100	1-D vector `vl`
01000	1-D vector `vm`
01100	1-D vector `vn`
00101	2-D matrix `vl` x `vl`
00110	2-D matrix `vl` x `vm`
00111	2-D matrix `vl` x `vn`
01001	2-D matrix `vm` x `vl`
01010	2-D matrix `vm` x `vm`
01011	2-D matrix `vm` x `vn`
01101	2-D matrix `vn` x `vl`
01110	2-D matrix `vn` x `vm`
01111	2-D matrix `vn` x `vn`
1xxxx	Reserved/Custom

A sketch of the proposed encodings for the 2D shape extension is shown in the table above.

21.4 Representation Encoding

The 5-bit verepn register sets the numeric representation of each element of the vector register. In the base vector extension, the representation can be set to unsigned integer, two’s-complement signed integer, or floating-point. The floating-point representations follow the IEEE 754 standards.

Base vector representation encoding.

`verep`	Representation
00000	Unsigned integer
00001	Two’s-complement signed integer
00010	Reserved (unsigned floating-point)
00011	IEEE-754 floating-point
001x0	Reserved
00101	Complex signed integer
00111	Complex floating-point
01000	Prime Galois field - integer representation
01001	Prime Galois field - Montgomery representation
01100	Binary extension Galois field - polynomial basis
01101	Binary extension Galois field - normal basis
01010	UNORM
01011	SNORM
01110	Reserved
01111	Reserved (complex SNORM?)
10xxx	Custom representations
11xxx	Reserved

The complex representations split the element width given in vewn into two equal-sized real and imaginary fields, so an element width of 64 bits can hold a single complex value with a 32-bit real and a 32-bit imaginary component.

21.5 Element Bitwidth

Each vector register, vn, has a 6-bit element width register, vewn, to specify the number of bits for each element of the current type in the vector register.

The largest element width supported is termed ELEN, and is defined to be the larger of the supported integer and floating-point type widths:
$$ELEN = \max(XLEN, FLEN)$$
For the base vector ISA, the bit width can be set at any power of two between 8 and ELEN.
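The ELEN definition and the power-of-two rule can be sketched in Python as follows. The function names are hypothetical, and treating FLEN as 0 when no floating-point extension is present is an assumption made for this example.

```python
def elen(xlen, flen=0):
    """Largest supported element width: the larger of XLEN and FLEN.

    flen=0 is used here (as an assumption) to mean "no floating-point
    extension present".
    """
    return max(xlen, flen)


def base_element_widths(xlen, flen=0):
    """List every power-of-two bit width from 8 up to ELEN inclusive."""
    widths = []
    width = 8
    while width <= elen(xlen, flen):
        widths.append(width)
        width *= 2
    return widths
```

For example, an RV32 system without floating point yields widths of 8, 16 and 32 bits, while adding a 64-bit floating-point extension raises ELEN to 64.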

Base vector ISA encoding of vector element width (vewn) register fields.

Proposed extended encoding of vector element width (vewn) register fields. Every bit width between 1 and 16 can be supported. Between 16 and 32, bit widths are supported in steps of 2 (i.e., 16, 18, 20, ...); between 32 and 64, in steps of 4 (i.e., 32, 36, 40, ...); and between 64 and 128, in steps of 8 (i.e., 64, 72, 80, ...). For bit widths greater than 128, all powers of two up to 16384 and all widths 1.5× greater are supported (128, 384, 512, 768, ...).

The extended bit-width encoding is designed to minimize the number of state bits required to support useful subsets of widths. For example, an RV32 system only needs two bits of state per vewn field to represent disabled, 8, 16, and 32. An RV32 system with 3 bits of state can represent disabled, 4, 8, 12, 16, 24, 32, and 48. An RV64 system with 4 bits of state can represent disabled, 4, 8, 12, 16, 24, 32, 48, 64, 96, 128, 256, 512, 1024.

21.6 Base Vector Extension Supported Types

The types supported by the base V extension depend upon the base scalar ISA and supported extensions. When the base V extension is added to a base scalar ISA, it must support the vector data element types implied by the supported scalar types as defined by Table 1.6.

Supported data element formats depending on base integer ISA and supported floating-point extensions. Ix indicates a signed integer of x bits, Ux indicates an unsigned integer of x bits, and Fx indicates an IEEE floating-point number of x bits.

Future vector extensions might expand the set of supported datatypes, including custom application-specific datatypes.

21.7 Maximum Vector Element Width (`vmaxew`)

The global vmaxew field is used to support more complex vector runtime environments where the types to be held in each register of a single configuration may vary dynamically, and may not even be known at compile time due to separate compilation.

The global maximum element width register vmaxew defines the maximum number of bits of storage in every element of every active architectural register, or if zero, defers to the per-vector-register width field.

The VIRAM processor had a virtual processor width register similar to vmaxew [VIRAM].

If vmaxew is zero, then the per-element vector element widths vewn determine the minimum storage required for each element of the associated vector register vn.

If vmaxew is non-zero, it sets the largest element width that can be supported in any vector register element in the current configuration.

21.8 Vector Configuration Registers (`vcfg0`–`vcfg15`)

The vector type configuration requires 512 bits of state (32 vector registers each with 16-bit vtypen field) that can be accessed via the vcfg CSRs.

RV128 uses four vector configuration CSRs: vcfg0 holds configuration data for v0–v7 with bits 16n to 16n + 15 holding vtypen, while vcfg4, vcfg8 and vcfg12 similarly hold configuration data for v8–v15, v16–v23, and v24–v31 respectively.

In RV64, the vcfg2 CSR provides access to the upper 64 bits of vcfg0 and vcfg6 provides access to the upper 64 bits of vcfg4. In RV32, the vcfg1, vcfg3, vcfg5 and vcfg7 CSRs provide access to the upper bits of vcfg0, vcfg2, vcfg4 and vcfg6 respectively.

Any CSR write to a vcfgx register zeros all vcfgy registers, for y > x. As a result, configuration data should be written from the vcfg0 CSR upwards.
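The CSR numbering and the zero-on-write rule can be modeled with the short Python sketch below. It assumes the pattern given for RV128, RV64 and RV32 generalizes so that CSR numbers advance by one per 32 bits of configuration state, and it models the 512 bits of state as sixteen 32-bit words. The names vcfg_location and write_vcfg are hypothetical.

```python
VTYPE_BITS = 16
NUM_VREGS = 32


def vcfg_location(n, xlen):
    """Return (vcfg CSR number, bit offset) holding vtype for register vn."""
    if xlen not in (32, 64, 128):
        raise ValueError("xlen must be 32, 64 or 128")
    if not 0 <= n < NUM_VREGS:
        raise ValueError("vector register number out of range")
    fields_per_csr = xlen // VTYPE_BITS   # vtype fields visible in one CSR
    stride = xlen // 32                   # CSR numbers spanned by one access
    csr = (n // fields_per_csr) * stride
    offset = (n % fields_per_csr) * VTYPE_BITS
    return csr, offset


def write_vcfg(words, x, value, xlen):
    """Model a CSR write to vcfgx on a list of sixteen 32-bit words.

    The written CSR takes the new value and every higher-numbered vcfg
    word is zeroed, as described in the text.
    """
    stride = xlen // 32
    if x % stride != 0 or not 0 <= x < len(words):
        raise ValueError("vcfg CSR number not valid for this XLEN")
    for i in range(stride):
        words[x + i] = (value >> (32 * i)) & 0xFFFFFFFF
    for y in range(x + stride, len(words)):
        words[y] = 0
```

In this model, writing vcfg2 on RV64 leaves the words for vcfg0 untouched and clears everything above, which is why a configuration is built from vcfg0 upwards.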

Zeroing higher-numbered vcfgy registers allows more rapid reconfiguration of the vector register file via CSR writes, and provides backward-compatibility for extensions that increase the number of possible architectural vector registers. This choice does prevent the use of CSRRW instructions to swap the configuration context; an entire old configuration must be read out before a new configuration is written in.

Additional instructions are provided to support more rapid changes to the vector unit configuration as described below.

21.9 Legal Vector Unit Configurations

To simplify hardware configuration calculations and to reduce software context-switch complexity, vector unit configurations are constrained to have non-disabled architectural vector registers numbered contiguously starting at v0. An exception will be raised if an instruction tries to change vtypen in a way that violates this constraint.

During a software vector-context save, the software handler can stop searching for active architectural registers after encountering the first disabled vector register. Hardware to calculate physical register allocation is also simplified with this constraint.
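The contiguity constraint and the early-exit scan it enables can be sketched as follows. The sketch represents a configuration as a list of per-register element widths and assumes, for illustration only, that a width of 0 marks a disabled register. Both function names are hypothetical.

```python
def is_legal_configuration(vew):
    """True if all enabled registers are contiguous starting at v0.

    vew is a list of element widths indexed by register number; 0 is
    ASSUMED here to mean "disabled".
    """
    seen_disabled = False
    for width in vew:
        if width == 0:
            seen_disabled = True
        elif seen_disabled:
            # An enabled register follows a disabled one.
            return False
    return True


def active_register_count(vew):
    """Count registers a context-save handler must visit.

    Valid only for legal configurations: the scan stops at the first
    disabled register.
    """
    for n, width in enumerate(vew):
        if width == 0:
            return n
    return len(vew)
```

With v0 to v2 enabled and the rest disabled, the handler saves three registers and stops.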

21.10 Vector Unit CSRs

Vector extension CSRs.

21.11 Maximum Vector Length (MVL)

The implementation determines an available maximum vector length (MVL) dependent on the current vector type configuration held in vcfgx and vmaxew. The available MVL depends on the configuration setting and on the implementation’s microarchitecture, but MVL must always have the same value for the same configuration parameters on a given hart.

Several earlier vector machines had the ability to configure physical vector register storage into a larger number of short vectors or a smaller number of long vectors. In particular the Fujitsu VP series [vp200] supported combining power-of-2 base vector registers into longer vector registers.

The Scale, Maven, and Hwacha processors also support configuration-dependent MVL.

Previously, the specification imposed a minimum vector length (4) on all configurations to allow stripmining code to be removed for short vector lengths. With the expanded scope of the vector unit types, this would be too onerous to support, and so the requirement is removed.

CUSTOMTAGBEGINDISCUSSION

A separate mechanism for supporting fixed vector lengths should be designed, possibly as part of an optional extension.

CUSTOMTAGENDDISCUSSION

Any change to the vector configuration that might change MVL causes the entire vector unit state to be zeroed. Any write to the global vmaxew causes the entire vector unit state to be zeroed, even if the value in vmaxew is unchanged.

If vmaxew is non-zero, any write to an individual vewn register that would set the width greater than vmaxew raises an illegal instruction exception and leaves the vector unit state unchanged.

If vmaxew is non-zero, any write to an individual vewn field with a value less than or equal to the value in vmaxew only zeros the associated vector register vn and leaves other vector unit state unchanged. The vector register data is zeroed even if vewn would be unchanged by the write.

If vmaxew is zero, then any write to an individual vewn register zeros the associated vn vector register. In addition, any write that changes the value in vewn zeros the entire vector unit state.
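The three vewn write rules above can be summarized as a small decision function. This is a sketch rather than a normative model: widths are given directly in bits instead of as field encodings, and the function name, the exception class and the return strings are all hypothetical.

```python
class IllegalInstruction(Exception):
    """Stands in for the illegal instruction exception (hypothetical name)."""


def vew_write_effect(vmaxew, old_vew, new_vew):
    """Return which state a write of new_vew to one vewn field zeros.

    Returns "register" if only the associated vector register is zeroed,
    or "all" if the entire vector unit state is zeroed. Widths are in bits.
    """
    if vmaxew != 0:
        if new_vew > vmaxew:
            # State is left unchanged and an exception is raised.
            raise IllegalInstruction("element width exceeds vmaxew")
        return "register"
    # vmaxew == 0: per-register widths determine storage, so a changed
    # width can change MVL and the whole unit is zeroed.
    if new_vew != old_vew:
        return "all"
    return "register"
```

Note that no case leaves the written register's data intact: even a write that repeats the current width zeros that register.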

The state is zeroed to hide implementation-dependent bit mappings and to provide additional security when context swapping. Zero is also a convenient initial value for some loops.

In-order implementations will probably use a flag bit per register to mux in 0 instead of garbage values on each source until it is overwritten. For in-order machines, vector lengths less than MVL complicate this zeroing, but these cases can be handled by adding a zero bit per element or element group. Machines with vector register renaming can just initialize the rename table to point entries at a physical zero register.

Each vector register can be reconfigured dynamically to hold different formats without zeroing the entire vector unit state provided that: if vmaxew is zero, the bit-width of the new format is the same as the current vew; or if vmaxew is non-zero, the format does not require more than vmaxew bits. Any change to a vector register’s format zeros the affected vector register.

If a vector register is disabled, then any vector instruction that attempts to access that vector register will raise an illegal instruction exception. Attempting to write any vmaxewn with an unsupported value will raise an illegal instruction exception.

Vector registers have both a maximum element width and a current element data type to allow the same vector register to be changed to different types during execution provided the maximum width is not exceeded. This reduces register pressure and helps support vector function calls, where the caller does not know the types needed by the callee, as described below.

The set of supported types might be greatly increased with future extensions. For example (and not limited to), new scalar types in new number systems, a complex type with real and imaginary components, a key-value type, or an application-specific structure type with multiple constituent fields. Auxiliary type configuration state might be required in these cases.

Attempting to write an unsupported type or a type that requires more than the current vmaxew width to a vetype field will raise an illegal instruction exception.

Implementations must still raise an exception for a vetypen setting that is greater than the architectural vmaxewn width, even if they internally implement a larger physical vmaxewn that could accommodate the vetypen request.

CUSTOMTAGBEGINDISCUSSION

We can have implementations 1) raise exceptions whenever illegal values are written to vmaxew and vetype fields (current design), 2) raise exceptions at use if the configuration holds illegal values, or 3) make the fields WARL so that they silently reduce to supported types with no exceptions. Option 2 could complicate vector unit context switch code by having more cases to check, while Option 3 could make debugging more difficult by allowing code to run with reduced precision or incorrect types.

CUSTOMTAGENDDISCUSSION

Three broad classes of implementation can be distinguished by how they handle vmaxew settings.

The simplest is max-width-per-implementation (MWPI), where the vector unit is organized in fixed ELEN-width physical lanes, and changes to vmaxew settings simply cause portions of the physical registers and datapath to be disabled for operations narrower than ELEN bits.

The next most complex implementation, max-width-per-configuration (MWPC), uses the maximum width across all vmaxew settings in a dynamic configuration to divide the physical register storage and datapaths. For example, an MWPC machine with ELEN=64 might subdivide physical lanes into 32-bit datapaths if no vmaxew setting is greater than 32. Operations on sub-32-bit quantities would disable appropriate portions of the physical registers and functional units in each 32-bit lane. Several early vector supercomputers, including the CDC STAR-100 [cdcstar100], provided a similar facility to divide 64-bit physical vector lanes into narrower 32-bit lanes.

The most complex implementations are max-width-per-register (MWPR), which reduce wasted space in the physical register files by packing elements in each vector register according to the individual vmaxew settings and which within one configuration can execute instructions with narrower datatypes at higher rates than for wider datatypes. The Berkeley Hwacha vector engine [hwachatr mixedprecision] is an example microarchitecture with this property.

The following sections are out of date.

21.12 Vector Instruction Formats

The instruction encoding is a work in progress.

An important design goal was that the base vector extension fit within a few major opcodes of the 32-bit encoding. It is envisioned that future vector extensions will use 48-bit or 64-bit encodings to increase both the opcode space and the set of architectural registers. The 64-bit vector encoding would support 256 architectural vector registers and orthogonal specification of a predicate register in each instruction.

Vector arithmetic and vector memory instructions are encoded in new variants of the R-format, shown in Figure 1.8. Both new formats use one bit to hold a vp field, which usually controls the predicate register in use, either vp0 or vp1. The VR4 form is used for fused multiply-add instructions. The existing RISC-V instruction formats are used for other vector-related instructions, such as the vector configuration instructions.

New V extension instruction formats.

Most vector instructions are available in both vector-vector and vector-scalar variants. Vector-vector instructions take the first operand from the vector register specified by rs1 and the second operand from the vector register specified by rs2.

For vector-scalar operations, the rs1 field specifies the scalar register to be accessed. For most vector-scalar instructions, the type of the vector operand specified by rs2 indicates whether the integer or floating-point scalar register file is accessed using the rs1 register specifier.

Some non-commutative vector-scalar instructions (such as sub) are provided in two forms, with the scalar value used as the second operand.

The rs1 field is used to provide the scalar operand because in the base encoding, whenever an instruction has a single scalar source operand, it is encoded in the rs1 field.

21.13 Polymorphic Vector Instructions

The vector extension uses a polymorphic instruction encoding where the opcode is combined with the types of the source and destination registers to determine the operation to be performed. For example, an ADD opcode will perform a 32-bit integer vector-vector add if both vector source operands and the vector destination register are 32-bit integers, but will perform a 16-bit floating-point vector-vector operation if both vector source operands and the vector destination are 16-bit floats.

The polymorphic encoding also naturally supports operations with mixed precisions on the input and output, and also supports extending the instruction set with new types without necessarily increasing the opcode space.

Not all combinations of source and destination argument types need be supported. The base vector extension mandates only that implementations provide a subset of combinations of types on inputs and outputs. Table 1.9 shows the general rules for integer and floating-point instructions, but the detailed instruction listing should be consulted for accurate information.

General rules for supported types per instruction in base vector extension. X represents the number of bits in an integer type and F represents the number of bits in a floating-point type. Individual instruction types will provide more detailed listings. Note that the type of a scalar floating-point operand can never be different from that of the vector in Src2, hence the Src1=2F case is missing from vector-scalar operations.

A general rule in the base vector instruction set is that the destination precision is never less than any source operand, except for explicit type-conversion instructions. Another general rule is that the input operands can only be the same width or half the width of the destination operand except for the scalar operand in integer vector-scalar instructions, which is always XLEN wide. Also, src2 is never larger than src1 or src3.
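The general width rules can be expressed as a simple predicate, sketched below in Python. It encodes only the rules stated in the preceding paragraph and ignores explicit type-conversion instructions and any per-instruction exceptions in the detailed listings. The function name and the scalar_int_src1 flag are hypothetical.

```python
def widths_allowed(dst, src1, src2, src3=None, scalar_int_src1=False):
    """Check operand widths (in bits) against the general rules.

    scalar_int_src1=True models an integer vector-scalar instruction,
    whose scalar operand in src1 is always XLEN wide and is therefore
    exempt from the same-or-half-width rule.
    """
    vector_sources = [src2] if scalar_int_src1 else [src1, src2]
    if src3 is not None:
        vector_sources.append(src3)
    # Each source must be the destination width or exactly half of it,
    # which also ensures the destination is never narrower than a source.
    for width in vector_sources:
        if width != dst and 2 * width != dst:
            return False
    # src2 is never larger than src1 or src3.
    if src2 > src1:
        return False
    if src3 is not None and src2 > src3:
        return False
    return True
```

For instance, a 32-bit destination with a 32-bit src1 and a 16-bit src2 passes, while swapping those two source widths fails the last rule.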

Integer computations on mixed-precision values always align values by their LSB, and sign- or zero-extend any smaller value according to its type. The result is truncated to fit in the destination type. Note that a scalar integer value is already XLEN bits wide, and so is at least as wide as any possible integer vector value.
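A minimal Python sketch of this extend-then-truncate behavior for a single element follows, using addition as the operation. Values are passed as raw bit patterns, and the function names are hypothetical.

```python
def extend(value, width, signed):
    """Interpret the low `width` bits of value per its type.

    Unsigned values are zero-extended; signed values are sign-extended
    (returned as a Python int, which has unbounded width).
    """
    value &= (1 << width) - 1
    if signed and (value >> (width - 1)) & 1:
        value -= 1 << width
    return value


def mixed_int_add(a, a_width, a_signed, b, b_width, b_signed, dst_width):
    """Add two elements of possibly different widths.

    Operands are aligned at their LSB, extended according to their own
    type, and the sum is truncated to dst_width bits.
    """
    total = extend(a, a_width, a_signed) + extend(b, b_width, b_signed)
    return total & ((1 << dst_width) - 1)
```

The same 8-bit pattern 0xFF added to a 16-bit 1 gives 0 when the narrow operand is signed (it represents -1) and 0x100 when it is unsigned (it represents 255).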

Floating-point computations on mixed-precision values act as if the calculations are performed exactly and then rounded once to the destination format.

21.14 Rapid Configuration Instructions

It can take several CSR instructions to set up the vcfg and vnp CSRs for a given configuration. Specialized configuration instructions are provided to quickly set up common configurations in the vcfg and vnp CSRs.

The vsetdcfg instruction takes a scalar register value encoded as shown in Figure 1.10, and returns the corresponding MVL in the destination register. The vsetdcfg and vsetdcfgi instructions also clear the vnp register, so no predicate registers are allocated.

CUSTOMTAGBEGINDISCUSSION

For now, only a 32-bit value supporting up to three different vector data types is supported by the vsetdcfg instruction. RV64 and RV128 could support a larger number of types, though it’s not clear if the hardware cost (area, latency) to support a larger number of different types is justified.

CUSTOMTAGENDDISCUSSION

Format of the vsetdcfg value. The value contains three pairs of a 5-bit type and a 5-bit number of registers to create of that type. A value of 0 for the number of a type indicates that 32 registers should be allocated. A value of 0 for the type indicates this pair should be skipped. The types must be of monotonically increasing size from type0 to type2.

The vsetdcfg value specifies how many vector registers of each datatype are allocated, and is divided into a 2-bit mode field and pairs of 5-bit fields for each data type in the configuration.

The 2-bit mode field indicates the configuration mode of the vector unit and is zero for the base vector extension.

The standard vector extension operating mode configures the vector unit into some number of vector registers, each with some number of elements of types supported by the scalar unit.

At least one alternative mode is planned, where the vector unit is configured as some number of registers each holding a single large element, e.g., 256 bits. This would be the base for cryptographic operations, or other coprocessors that operate on large structures.

Other modes can be used to reconfigure the vector unit register file and functional units for other domain-specific purposes.

Each datatype pair contains a 5-bit typex value encoded as a vetypen value, and a 5-bit ntypex for the number of registers to allocate for that type. If the type0 field is non-zero, the vsetdcfg instruction will configure the first ntype0 vector data registers to have vetypen values of type0 with vmaxewn values set accordingly as shown in Table [tab:vetype]. If the type0 value is 0, the datatype pair is skipped. If the type1 field is non-zero, then the next ntype1 vector registers are configured to be of the type given in type1. Similarly for the type2 pair.

A value of zero in a typex field indicates this datatype pair should be ignored. A value of zero in a ntypex field indicates 32 registers should be allocated for the corresponding type.
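The way the datatype pairs expand into a register allocation can be sketched in Python as follows. To avoid guessing bit positions, the sketch takes the three (typex, ntypex) pairs as already-extracted integers. Rejecting a request for more than 32 registers is an assumption about what counts as an illegal configuration, and the required ordering of types by size is not checked because it depends on the type encoding table. The function name is hypothetical.

```python
MAX_VREGS = 32


def expand_dcfg(pairs):
    """Expand (type, ntype) pairs into a per-register list of type codes.

    pairs is a sequence of (typex, ntypex) 5-bit values for type0..type2.
    The result lists the type code of v0, v1, ... for every allocated
    register; registers beyond the end of the list stay disabled.
    """
    registers = []
    for type_code, count_field in pairs:
        if type_code == 0:
            continue                      # a zero type skips this pair
        count = MAX_VREGS if count_field == 0 else count_field
        registers.extend([type_code] * count)
    if len(registers) > MAX_VREGS:
        # ASSUMPTION: over-allocation is treated as an illegal configuration.
        raise ValueError("more than 32 vector registers requested")
    return registers
```

Because allocation proceeds from v0 upwards without gaps, any result of this expansion satisfies the contiguity constraint on legal configurations.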

Zero values are skipped to simplify setting a configuration with two different data types, where a single LUI instruction can set the upper 20 bits leaving the low bits zero.

A single 12-bit immediate value is sufficient to create a configuration with some number of vector registers with a single given datatype.

A compressed C.LI with a zero-extended 5-bit immediate can create a configuration with 32 vector registers of a given datatype.

A corresponding vsetdcfgi instruction takes a 12-bit immediate value to set the configuration instead of a scalar value, but otherwise is identical to the vsetdcfg instruction.

CUSTOMTAGBEGINDISCUSSION

It is not clear how many immediate bits will be made available for the vsetdcfgi instruction. If encoding space is available for both 12 immediate bits and a source register specifier, then vsetdcfgi can be defined to read the source register, OR in the bits in the immediate, then create a configuration. In this case, there is no need for a separate vsetdcfg instruction.

CUSTOMTAGENDDISCUSSION

The configuration value given must result in a legal configuration or else an illegal instruction exception will be raised.

If a zero argument is given to vsetdcfg, the vector unit will be disabled and the value 0 will be returned for MVL. This instruction (vsetdcfg x0, x0) is given the assembly pseudoinstruction name vdisable.

Separate vsetpcfg and vsetpcfgi instructions are provided that write the source value to the vnp register and return the new MVL. These writes also clear the vector data registers, set all bits in the allocated predicate registers, and set vl=MVL. A vsetpcfg or vsetpcfgi instruction can be used after a vsetdcfg to complete a reconfiguration of the vector unit.

CUSTOMTAGBEGINDISCUSSION

If vnp is made accessible as a separate CSR, the vsetpcfg and vsetpcfgi instructions are less useful. The only advantage over a CSR instruction is that they return MVL, which is rarely needed, and which can be obtained via the setvl instruction.

CUSTOMTAGENDDISCUSSION

21.15 Vector-Type-Change Instructions

To quickly change the type of an individual vector register, vetyperw and vetyperwi instructions are provided to change the type of the specified vector data register to the given scalar register value or 5-bit immediate value respectively, while returning the previous type in the destination scalar register.

A vector convert instruction, described below, can simultaneously convert a source vector register into a new type, and set that type in the destination vector register.

21.16 Vector Length

The active vector length is held in the XLEN-bit WARL vector length CSR vl, which can only hold values between 0 and MVL inclusive. Any writes to the configuration registers (vcfgx or vnp) cause vl to be initialized with MVL. Changes to vetypen via vector-type-change instructions do not affect vl.

The active vector length is usually set via the setvl instruction. The source argument to setvl is the requested application vector length (AVL) as an unsigned XLEN-bit integer. The setvl instruction calculates the value to assign to vl according to Table [tab:vlcalc]. The result of this calculation is also returned as the result of the setvl instruction.

Earlier drafts encoded setvl using a modified CSRRW instruction whereas it is now encoded as a separate new instruction.

AVL Value	`vl` setting
AVL ≥ 2 MVL	MVL
2 MVL > AVL > MVL	⌈AVL/2⌉
MVL ≥ AVL	AVL

The rules for setting the vl register help keep vector pipelines full over the last two iterations of a stripmined loop. This version of the rules guarantees monotonically decreasing vector lengths. Similar rules were previously used in Cray-designed machines [crayx1asm].
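The vl calculation in the table, and its effect on a stripmined loop, can be sketched in Python as follows. The function names are hypothetical, and the stripmine helper assumes a non-zero MVL.

```python
def setvl(avl, mvl):
    """Compute the vl setting for a requested application vector length."""
    if avl >= 2 * mvl:
        return mvl
    if avl > mvl:
        return (avl + 1) // 2        # ceil(AVL / 2)
    return avl


def stripmine_lengths(n, mvl):
    """Return the vl used by each iteration of a stripmined loop over n
    elements, calling setvl on the remaining count each time."""
    if mvl <= 0:
        raise ValueError("mvl must be positive")
    lengths = []
    while n > 0:
        vl = setvl(n, mvl)
        lengths.append(vl)
        n -= vl
    return lengths
```

With MVL = 8 and 20 elements, the iterations process 8, 6 and 6 elements instead of 8, 8 and 4, so the final two iterations are balanced and the lengths never increase.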

CUSTOMTAGBEGINDISCUSSION

There are multiple possible rules for setting VL, and we could give implementations freedom to use different VL setting rules.

CUSTOMTAGENDDISCUSSION

The idea of having implementation-defined vector length dates back to at least the IBM 3090 Vector Facility [ibm370varch], which used a special “Load Vector Count and Update” (VLVCU) instruction to control stripmine loops. The setvl instruction included here is based on the simpler setvlr instruction introduced by Asanović [krstephd].

The setvl instruction is typically used at the start of every iteration of a stripmined loop to set the number of vector elements to operate on in the following loop iteration. The current MVL can be obtained from a vector configuration instruction, or by performing a setvl with a source argument that has all bits set (largest unsigned integer).

When vl is less than MVL, vector instructions will set all elements in the range [vl:MVL-1] in the destination vector data register or destination vector predicate register to zero.

Requiring zeroing of elements past the current active vector length simplifies the design of units with renamed vector data registers. If the specification left destination elements unchanged, renaming implementations would have to copy the tail of the old destination register to the newly allocated destination register. Alternatively, specifying the tail to be undefined will expose implementation differences and possibly cause a security hole.

Implementations that do not support renaming will have to zero the tail of a vector, but this can reuse the mechanism that is already required to initialize all vector data registers to zero on reconfiguration, for example, by having a zero bit on each element or element group.

No element operations are performed for any vector instruction when vl=0.

Two possible choices are to 1) require destination registers to be completely zeroed when vl=0, or 2) make no changes to the destination registers. Option 2 is currently chosen as this prevents unnecessary work in some implementations, and option 1 does not provide a clear advantage beyond seeming more consistent with the vl>0 case.
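The tail-zeroing rule and the vl=0 case can be illustrated with a small Python model of a vector-vector add on a single destination register. Registers are modeled as lists whose length is MVL; the function name is hypothetical.

```python
def vadd(dest, src1, src2, vl):
    """Model a vector add writing `dest` in place and returning it.

    Elements [0, vl) receive the sums, elements [vl, MVL) are set to
    zero, and when vl == 0 the destination is left unchanged.
    """
    mvl = len(dest)
    if not 0 <= vl <= mvl:
        raise ValueError("vl must be between 0 and MVL inclusive")
    if vl == 0:
        return dest                  # no element operations are performed
    for i in range(mvl):
        dest[i] = src1[i] + src2[i] if i < vl else 0
    return dest
```

In this model the old contents of the destination never survive an instruction with vl greater than zero, which is what lets a renaming implementation allocate a fresh physical register without copying the old tail.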

Example vector-vector add loop.

21.17 Predicated Execution

The 32-bit base encoding does not leave room for a fully orthogonal predicate register specifier. A single bit is dedicated to the predicate register specification, and is used to select between two active predicate registers, vp0 or vp1. An alternative scheme would have used the bit to select between vp0 and unpredicated (all elements active). However, given the ease of setting all predicate bits in a vector predicate register with a single predicate instruction, the current scheme provides more flexibility.

When there are no vector predicate registers enabled, vp0 returns all set bits when read. So, the assembler convention is to assume vp0 as the predicate register when no predicate register is explicitly given. The assembler can support a strict operands option to require that the vector predicate register be explicitly specified.

At element positions where the selected predicate register bit is zero, the corresponding vector element operation has no effect (does not change architectural state or generate exceptions), except to write a zero to the element position in the destination vector register.
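A minimal sketch of this rule, assuming vl>0 and modelling registers as Python lists, is given below. The name `predicated_op` is hypothetical. Note that the element operation is never evaluated at masked-off positions, so it cannot raise an exception there.

```python
def predicated_op(op, a, b, pred, vl):
    """Sketch of a predicated element-wise operation with zeroing (assumes vl > 0)."""
    out = []
    for i in range(len(a)):
        if i < vl and pred[i]:
            out.append(op(a[i], b[i]))  # active element: perform the operation
        else:
            out.append(0)  # predicate bit clear, or past vl: write zero, no side effects
    return out
```

For example, an integer divide with a zero divisor at a masked-off position completes without a fault in this model, and that position of the result is zero.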

CUSTOMTAGBEGINDISCUSSION

The previous proposal (undisturb) left the destination vector unchanged at element positions where the predicate bit is false, whereas the current plan-of-record (zero) writes zero to the destination where the predicate bit is false.

The advantage of the undisturb option is that it can require fewer instructions and fewer architectural registers for many common code sequences. For in-order machines without register renaming, the undisturb operation simply disables writes to the destination elements, except for vector registers that have not been written since configuration time. Typically an extra zero bit per vector register or element group will be added to represent a zeroed register instead of actually zeroing state at configuration time. For predicated undisturb writes to these uninitialized registers, the predicated false elements must be explicitly written with zeros on each element group and the zero bit is then cleared down. However, in a machine with vector register renaming, undisturb does imply an additional read of the original destination register to write the value into the new physical destination register when the predicate is false. This additional read port will often be cheaper than in a scalar machine as vector machines often time-multiplex read ports, and the additional read can be skipped when the predicate registers are disabled (vnp=0) or when the source is known to be zero after configuration, but still adds complexity to a design.

The advantage of the zero option is that a machine with vector register renaming does not need to read the original destination vector register and so a read port is saved. The disadvantage of the zero option is that more instructions and architectural registers are required for common code sequences, and simpler microarchitectures without register renaming are penalized by requiring longer code sequences and greater register pressure. In particular, vector merge instructions are required to collect results from two divergent control paths, and each vector merge has to read two vector values and write a vector result. Whether the zero option saves total register file traffic in a register-renamed microarchitecture depends on the ratio of a) internal temporary writes, to b) writes creating values that are live out of each basic block, and also on the frequency of control flow merges.

Overall, the zero option removes significant complexity from the renamed machines while reducing efficiency somewhat for the non-renamed machines, and is the current plan-of-record.
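The instruction-count difference between the two options can be sketched for a simple if/else, here z[i] = a[i]+1 where the predicate is true and b[i]*2 otherwise. This is an illustrative model only; both function names are hypothetical and each list comprehension stands for one vector instruction.

```python
def if_else_zero(pred, a, b):
    """Zero option: two temporaries plus an explicit vector merge."""
    t_then = [x + 1 if p else 0 for p, x in zip(pred, a)]   # write under pred
    t_else = [0 if p else y * 2 for p, y in zip(pred, b)]   # write under !pred
    # The merge reads two vector values and writes a third.
    return [t if p else e for p, t, e in zip(pred, t_then, t_else)]


def if_else_undisturb(pred, a, b, dest):
    """Undisturb option: both paths write the same destination, no merge."""
    dest = [x + 1 if p else d for p, x, d in zip(pred, a, dest)]   # write under pred
    dest = [d if p else y * 2 for p, y, d in zip(pred, b, dest)]   # write under !pred
    return dest
```

Both produce the same result, but the zero variant uses three vector writes and two temporary registers where the undisturb variant uses two writes and none.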

CUSTOMTAGENDDISCUSSION

21.18 Vector Load/Store Instructions

Three vector load/store addressing modes are supported: unit-stride, constant-stride, and indexed (scatter/gather). Each addressing mode has a 7-bit unsigned immediate offset that is scaled by the element size.

The unit-stride address mode takes a scalar base byte address, adds the scaled immediate, then generates a contiguous set of element addresses for loads or stores.

The primary use of immediates in unit-stride loads is to generate overlapping unit-stride loads for convolution operations.
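The address generation, and the way immediates yield overlapping loads, can be sketched as follows. The function name is hypothetical and the model only computes byte addresses.

```python
def unit_stride_addresses(base, imm, elem_bytes, vl):
    """Sketch of unit-stride element address generation."""
    assert 0 <= imm < 128, "immediate is a 7-bit unsigned value"
    start = base + imm * elem_bytes  # immediate is scaled by the element size
    return [start + i * elem_bytes for i in range(vl)]


# Three overlapping loads for a 3-tap convolution share one base register
# and differ only in the immediate (0, 1, 2).
taps = [unit_stride_addresses(0x1000, k, 4, 4) for k in range(3)]
```

Each successive load in `taps` starts one element later than the previous one, so no extra scalar address arithmetic is needed between the loads.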

The constant-stride address mode takes a scalar base byte address, a stride value encoded in bytes, and adds a scaled immediate value.

The stride value is in bytes to allow a single stride register to be used to support operations on arrays-of-structures, where not all elements in each structure have the same size. The immediate value is still scaled by element size to increase reach, given that element types will be naturally aligned.
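As a sketch of why a byte stride helps, consider a hypothetical 8-byte structure holding one 4-byte field at offset 0 and two 2-byte fields at offsets 4 and 6. The function name and the structure layout are assumptions made for illustration.

```python
def strided_addresses(base, stride_bytes, imm, elem_bytes, vl):
    """Sketch of constant-stride element address generation."""
    assert 0 <= imm < 128, "immediate is a 7-bit unsigned value"
    start = base + imm * elem_bytes  # immediate scaled by element size
    return [start + i * stride_bytes for i in range(vl)]  # stride is in bytes


STRUCT_BYTES = 8  # the same stride value serves every field of the structure
field0 = strided_addresses(0x2000, STRUCT_BYTES, 0, 4, 3)  # 4-byte field, offset 0
field2 = strided_addresses(0x2000, STRUCT_BYTES, 3, 2, 3)  # 2-byte field, offset 6
```

Had the stride been scaled by element size, the 4-byte and 2-byte fields would have needed two different stride registers.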

The indexed address mode takes a scalar base byte address and a vector of byte offsets. The scalar base address and the scaled immediate value are added to each element of the offset vector to give a vector of addresses used in a scatter/gather.
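A corresponding sketch for the indexed mode follows, again with a hypothetical function name and computing only byte addresses.

```python
def indexed_addresses(base, offsets, imm, elem_bytes, vl):
    """Sketch of indexed (scatter/gather) element address generation."""
    assert 0 <= imm < 128, "immediate is a 7-bit unsigned value"
    # Offsets are byte offsets and are not scaled; only the immediate is.
    return [base + imm * elem_bytes + offsets[i] for i in range(vl)]
```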

Indexed stores are provided in three types: unordered, ordered, and reverse-ordered. The unordered indexed stores might update the same memory location from two different elements in an unspecified order. The ordered stores always update memory locations in increasing vector element order. The reverse-ordered stores always update memory locations in decreasing vector element order.

The reverse-ordered stores support vectorization of software memory disambiguation techniques. A reverse-ordered store of element id into a hash table indexed by a hash on a store access address, followed by a read of the hash table using a load access address and a comparison against the original element id, will indicate if there’s a potential RAW hazard with an earlier loop iteration.
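The disambiguation technique can be sketched as below. The model assumes each loop iteration i loads from `load_addr[i]` and stores to `store_addr[i]`; the hash function, table size, sentinel value, and function names are all hypothetical choices for illustration.

```python
def indexed_store(mem, index, values, order="ordered"):
    """Sketch of an ordered or reverse-ordered indexed store into a list."""
    elems = range(len(index))
    if order == "reverse":
        elems = reversed(elems)  # highest element first, so the lowest wins
    for i in elems:
        mem[index[i]] = values[i]


def raw_hazard_mask(store_addr, load_addr, table_size=8):
    """Flag iterations whose load may depend on an earlier iteration's store."""
    n = len(store_addr)
    no_entry = n  # assumed sentinel: larger than any element id

    def bucket(addr):
        return (addr >> 2) % table_size  # hypothetical hash on word address

    table = [no_entry] * table_size
    # After a reverse-ordered store, each bucket holds the LOWEST storing element id.
    indexed_store(table, [bucket(a) for a in store_addr], list(range(n)), "reverse")
    seen = [table[bucket(a)] for a in load_addr]
    # A lower id than our own means an earlier iteration may write our load address.
    return [seen[i] < i for i in range(n)]
```

The check is conservative: hash collisions can flag a hazard that does not exist, but a true hazard with an earlier iteration is never missed in this model.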

CUSTOMTAGBEGINDISCUSSION

It is not clear whether supporting unordered stores offers sufficient realizable improvement over ordered stores.

CUSTOMTAGENDDISCUSSION

Vector loads/stores have a simple memory model, where each vector load/store is observed to complete sequentially in program order only on the local hart, i.e., a vector load on a hart will observe all earlier vector stores on the same hart, and no later vector stores.

Vector loads are available in a length-speculative form that writes predicate register vp1 in addition to the destination vector data register. These instructions raise an illegal instruction exception if vp1 is not configured. For elements that do not generate a permissions fault, the length-speculative vector loads operate normally, except that they also clear the corresponding bit in vp1. If an element encounters a permissions fault, a zero is written to the destination vector register element and the vp1 bit is set to 1. Implementations may treat elements past the first faulting element as also causing a fault even if they might not cause a permissions fault when accessed alone.

Once software determines the active vector length, it should check whether any loads within the active vector length caused a fault and, if so, issue a non-length-speculative load to trigger reporting of the error.

Length-speculative vector loads are required to vectorize while loops with data-dependent exits (e.g., strlen).
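A strlen-style loop built on such loads might look like the following sketch. Memory is modelled as a Python `bytes` object, any address at or beyond its end is assumed to cause a permissions fault, and `MemoryError` stands in for the trap a non-speculative load would raise. All names are hypothetical.

```python
def speculative_load(mem, base, vl):
    """Sketch of a length-speculative byte load returning (data, vp1)."""
    data, vp1 = [], []
    for i in range(vl):
        addr = base + i
        if addr < len(mem):
            data.append(mem[addr])
            vp1.append(0)  # no fault: vp1 bit cleared
        else:
            data.append(0)
            vp1.append(1)  # permissions fault: zero written, vp1 bit set
    return data, vp1


def strlen_model(mem, start, maxvl=4):
    """Sketch of a vectorized strlen using length-speculative loads."""
    i = start
    while True:
        data, vp1 = speculative_load(mem, i, maxvl)
        for j in range(maxvl):
            if vp1[j]:
                # Fault lies inside the active length: report it for real.
                raise MemoryError("fault at address %d" % (i + j))
            if data[j] == 0:
                return i + j - start  # terminator found before any fault
        i += maxvl
```

Faults at positions after the terminator are ignored, which is what allows the load to run past the end of the string safely.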

The only faults ignored by the length-speculative vector loads are ones that would have resulted in a permissions violation. Page faults and other virtualization-related faults should be handled invisibly to the user thread by the execution environment.

A malicious program can use length-speculative vector loads to probe accessible address space without fear of a fatal fault.

21.19 Vector Register Gather

A vector register gather produces a new result data vector by gathering elements from one source data vector at the element locations specified by a second source index vector. Data source and destination vector types must agree. The index vector can have any integer type. Legal element indices can range from 0 to current MAXVL. Indices out of this range raise an illegal instruction exception.

  # vindices holds values from 0..MAXVL
  vrgather  vdest, vsrc, vindices
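The element-level behaviour can be sketched in Python as below. The text leaves open whether the upper index bound is inclusive; this sketch assumes indices 0 to MAXVL-1 are legal, and uses `ValueError` as a stand-in for the illegal instruction exception. The function name is hypothetical.

```python
def vrgather_model(vsrc, vindices):
    """Sketch of vector register gather: vdest[i] = vsrc[vindices[i]]."""
    maxvl = len(vsrc)
    vdest = []
    for idx in vindices:
        if not 0 <= idx < maxvl:  # assumed legal range: 0 .. MAXVL-1
            raise ValueError("illegal instruction: index %d out of range" % idx)
        vdest.append(vsrc[idx])
    return vdest
```

A reversed index vector, for example, reverses the source vector, and a constant index vector broadcasts one element.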

21.20 Vector Slide

Reductions (and convolutions) are supported via a vector slide instruction that takes elements starting from the middle of one vector and places these at the beginning of a second vector register. This supports a recursive-halving reduction approach for any binary associative operator.

A similar vector register extract instruction was added to the Cray C90 after memory latency grew too large for the memory-memory reductions used in earlier Crays.

The vector unit microarchitecture can be optimized for the power-of-2 sized element offsets used for reductions.
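The recursive-halving approach can be sketched as follows, assuming a power-of-2 number of elements and a slide that fills vacated positions with zero. Both function names are hypothetical.

```python
def vslide_down(v, offset):
    """Sketch of a slide: element i of the result is v[i + offset]."""
    return v[offset:] + [0] * offset


def slide_reduce(values, op):
    """Sketch of a recursive-halving reduction with a binary associative op."""
    v = list(values)
    n = len(v)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-2 length"
    while n > 1:
        n //= 2  # halve the active vector length each step
        upper = vslide_down(v, n)  # bring the upper half down to position 0
        # Combine the halves with vl = n; the tail is zeroed.
        v = [op(v[i], upper[i]) if i < n else 0 for i in range(len(v))]
    return v[0]
```

An 8-element reduction therefore takes three slide and three combine steps, using slide offsets of 4, 2, and 1.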

21.21 Fixed-Point Support

The clip instruction supports scaling, rounding, and clipping to the destination type. Rounding is set by the CSR fixed-point rounding mode (truncate, jam, round-up, round-nearest-even). Clipping is set by the CSR clip mode (wrap, saturate).
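The text names the modes without defining them, so the sketch below rests on assumptions: values are signed, scaling is a right shift, jam ORs any discarded bits into the result LSB, round-up rounds halves upward, and round-nearest-even breaks ties toward an even result. The function name and parameters are hypothetical.

```python
def clip_model(value, shift, out_bits, rounding="truncate", saturate=True):
    """Sketch of scale (right shift), round, and clip to a signed out_bits type."""
    result = value >> shift  # arithmetic shift, rounds toward minus infinity
    if shift > 0:
        discarded = value & ((1 << shift) - 1)  # bits shifted out
        half = 1 << (shift - 1)
        if rounding == "jam":
            if discarded:
                result |= 1  # sticky bit ORed into the LSB (assumed meaning)
        elif rounding == "round-up":
            if discarded >= half:
                result += 1  # ties round upward (assumed meaning)
        elif rounding == "round-nearest-even":
            if discarded > half or (discarded == half and (result & 1)):
                result += 1  # ties go to the even neighbour
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    if saturate:
        return max(lo, min(hi, result))
    wrapped = result & ((1 << out_bits) - 1)  # wrap: keep the low out_bits bits
    return wrapped - (1 << out_bits) if wrapped > hi else wrapped
```

For instance, scaling 10 by a shift of 2 gives 2 under truncate and round-nearest-even, but 3 under jam and round-up.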

Add with average, with rounding set by the fixed-point rounding mode.

Multiply with same size source and destination types, with some result scaling values (+1, 0, -1, -8?) and rounding and clipping according to CSR mode.

Accumulate with carry into predicate register to support larger precise dot-products.

21.22 Optional Transcendental Support

21.23 Instruction-Set Encoding

[ NOTE: This section is out of date. ]

On the next two pages is a proposed instruction-set encoding.

[v-instr-table]

