LLVM support for the draft Bit Manipulation Extension for RISC-V

Introduction

The RISC-V Instruction Set Manual describes the current status of the RISC-V ISA and its extensions. Among these there’s a mention of the ‘B’ extension that is meant to host specific instructions for bit manipulation operations.

A well known proposal for such extension comes from Claire Wolf and can be found here.

We implemented an extension of the RISC-V back end of LLVM based on such proposal that is now available on the Embecosm GitHub repository: https://github.com/embecosm/llvm-project

Our work is being presented as a poster at the LLVM Developers Meeting 2019, at the San Jose Convention Center, San Jose, CA.

Claire Wolf’s proposal

The proposed bit manipulation extension comprises of several sets of assembly instructions, a description of their functionality and potential applications and their proposed encodings.

Such instructions implement several kinds of manipulation of individual bits and bit-pattern permutations.

According to the kind of permutations they perform the instructions are divided into the following sub-extensions:

Zbb (base): the operations of most common use.
Zbc (carry-less): carry-less multiplication.
Zbe (extract/deposit): extract/deposit a mask of multiple bits in a value.
Zbf (bit-field): placement of compact bit fields.
Zbp (permutation): large scale bit permutations (e.g.: rotations, reversals, shuffling… ).
Zbm (matrix): matrix operations.
Zbr (redundancy): cyclic redundancy check operations (crc).
Zbs (single bit): single bit operations: set, clear, invert, extract.
Zbt (ternary): three operand operations.

Our implementation

In order to download our patch you can clone our LLVM trunk with git:

git clone https://github.com/embecosm/llvm-project.git

and then checkout the branch with our patch:

git checkout riscv-bitmanip

This will be integrated upstream once the bit manipulation extension has been finalized.

The MC Layer

In order to create the B extension in the RISC-V back end we first added the corresponding feature ‘b’ to RISCV.td:

def FeatureStdExtB
    : SubtargetFeature<"b", "HasStdExtB", "true",
                       "'B' (Bit Manipulation Instructions)">;
def HasStdExtB : Predicate<"Subtarget->hasStdExtB()">,
                           AssemblerPredicate<"FeatureStdExtB">;

So that we could switch on and off the B extension from the command line through the back end option -mattr=b:

llc -mtriple=riscv32 -mattr=b foo.ll -o foo.s

We then added the necessary attributes and methods to the RISCVSubtarget class in RISCVSubtarget.h:

...
bool HasStdExtD = false;
bool HasStdExtC = false;
bool HasStdExtB = false;
...

bool hasStdExtD() const { return HasStdExtD; }
bool hasStdExtC() const { return HasStdExtC; }
bool hasStdExtB() const { return HasStdExtB; }

Then we created the RISCVInstrInfoB.td file where to add the encodings for the new instructions and we included it in RISCVInstrInfo.td:

...
include "RISCVInstrInfoD.td"
include "RISCVInstrInfoC.td"
include "RISCVInstrInfoB.td"

It was time then to add the new instructions with their encodings.

The RISC-V backend defines several classes for the definition of the instruction formats that RISC-V supports: R (Register), I (Immediate), S (Signed), B (Branch), U (Unsigned) and J (Jump).
And several sub-formats like RVInstR4 (ternary instructions) and RVInstRAtomic.

RISCVInstrFormats.td defines also a class for the generic RV instruction format on which all the other formats and sub-formats are based:

class RVInst<dag outs, dag ins, string opcodestr, string argstr,
             list<dag> pattern, InstFormat format>
    : Instruction {
  field bits<32> Inst;
  // SoftFail is a field the disassembler can use to provide a way for
  // instructions to not match without killing the whole decode process. It is
  // mainly used for ARM, but Tablegen expects this field to exist or it fails
  // to build the decode table.
  field bits<32> SoftFail = 0;
  let Size = 4;
  bits<7> Opcode = 0;

  let Inst{6-0} = Opcode;

  let Namespace = "RISCV";

  dag OutOperandList = outs;
  dag InOperandList = ins;
  let AsmString = opcodestr # "\t" # argstr;
  let Pattern = pattern;

  let TSFlags{4-0} = format.Value;
}

What we did in order to add new assembly instructions was to make instances of such classes in RISCVInstrInfoB.td and to provide them with the new encodings. That is to define the opcode, the operands, the function fields and any remaining fixed encoding fields.

Take for instance the simple case of the instruction clz, an instruction that counts the leading zeros in a word:

uint_xlen_t clz(uint_xlen_t rs1)
{
  for (int count = 0; count < XLEN; count++)
    if ((rs1 << count) >> (XLEN - 1))
      return count;
  return XLEN;
}

With XLEN being the bit length of a word.

We defined it like this:

def CLZ  : RVBInstFunct12<0b011000000000, 0b001, OPC_OP_IMM, "clz">;

With RVBInstFunct12 being a class derived from the RVInst class from above with the difference that the most significant 12 bits of the instruction are used as an encoding field instead of an immediate value:

class RVBInstFunct12<bits<12> funct12, bits<3> funct3, RISCVOpcode opcode,
                       string opcodestr>
    : RVInst<(outs GPR:$rd), (ins GPR:$rs1),
              opcodestr, "$rd, $rs1", [], InstFormatI> {
  bits<5> rs1;
  bits<5> rd;

  let Inst{31-20} = funct12;
  let Inst{19-15} = rs1;
  let Inst{14-12} = funct3;
  let Inst{11-7} = rd;
  let Opcode = opcode.Value;
}

For each instruction we specified the opcode, the inputs, the outputs, the format (register, immediate, jump, etc… ) and the encoding, as described in Claire Wolf’s proposal.
The following step was to divide the newly added instructions into the respective sub-extensions (Zbb, Zbc, Zbe, Zbf, Zbm, Zbp, Zbr, Zbs and Zbt).
In order to do that we needed to specify some includion/exclusion rules in RISCVInstrInfoB.td:

let Predicates = [HasStdExtZbb] in {
  def CLZ : RVBInstFunct12<0b011000000000, 0b001, OPC_OP_IMM, "clz">;
  def CTZ : RVBInstFunct12<0b011000000001, 0b001, OPC_OP_IMM, "ctz">;
  def PCNT : RVBInstFunct12<0b011000000010, 0b001, OPC_OP_IMM, "pcnt">;
  ...
}

let Predicates = [HasStdExtZbp] in {
  def GREV : ALU_rr<0b0110100, 0b101, "grev">;
  def GORC : ALU_rr<0b0010100, 0b101, "gorc">;
  def SHFL : ALU_rr<0b0000100, 0b001, "shfl">;
  def UNSHFL : ALU_rr<0b0000100, 0b101, "unshfl">;
  ...
}

To allow that we added accordingly the features Zbb, Zbbp (for instructions belonging both to the base and the permutation sub-extensions), Zbc, Zbe, Zbf, Zbm, Zbp, Zbr, Zbs and Zbt.
As with the B feature we added the following to RISCV.td:

...
def FeatureExtZbb
    : SubtargetFeature<"bb", "HasStdExtZbb", "true",
                       "'Zbb' (Base Bit Manipulation Instructions)">;
def HasStdExtZbb : Predicate<"Subtarget->hasStdExtZbb()">,
                           AssemblerPredicate<"FeatureExtZbb">;

def FeatureExtZbc
    : SubtargetFeature<"bc", "HasStdExtZbc", "true",
                       "'Zbc' (Carry-Less Bit Manipulation Instructions)">;
def HasStdExtZbc : Predicate<"Subtarget->hasStdExtZbc()">,
                           AssemblerPredicate<"FeatureExtZbc">;
...

We then created the following dependence:

def FeatureStdExtB
    : SubtargetFeature<"b", "HasStdExtB", "true",
                       "'B' (Bit Manipulation Instructions)",
                                [FeatureExtZbb,
                                 FeatureExtZbc,
                                 FeatureExtZbe,
                                 FeatureExtZbf,
                                 FeatureExtZbm,
                                 FeatureExtZbp,
                                 FeatureExtZbr,
                                 FeatureExtZbs,
                                 FeatureExtZbt]>;
def HasStdExtB : Predicate<"Subtarget->hasStdExtB()">,
                         AssemblerPredicate<"FeatureStdExtB">;

That means that by specifying the ‘b’ feature on the command line (-mattr=b) all the sub-features are enabled. This is a temporary solution until the extension is formalized and the boundaries of the sub-extensions are defined definitively.

We then added the following attributes and methods to the RISCVSubtarget class in RISCVSubtarget.h:

...
bool HasStdExtZbb = false;
bool HasStdExtZbc = false;
bool HasStdExtZbe = false;
bool HasStdExtZbf = false;
bool HasStdExtZbm = false;
bool HasStdExtZbp = false;
bool HasStdExtZbr = false;
bool HasStdExtZbs = false;
bool HasStdExtZbt = false;

...

bool hasStdExtZbb() const { return HasStdExtZbb; }
bool hasStdExtZbc() const { return HasStdExtZbc; }
bool hasStdExtZbe() const { return HasStdExtZbe; }
bool hasStdExtZbf() const { return HasStdExtZbf; }
bool hasStdExtZbm() const { return HasStdExtZbm; }
bool hasStdExtZbp() const { return HasStdExtZbp; }
bool hasStdExtZbr() const { return HasStdExtZbr; }
bool hasStdExtZbs() const { return HasStdExtZbs; }
bool hasStdExtZbt() const { return HasStdExtZbt; }
...

Codegen pattern-matching

Once we added the assembly instructions to the RISC-V back-end we were able to base on them some target specific optimizations.
The idea is to instruct LLVM to recognize IR code patterns that perform a specific bit manipulation and translate them with the correspondent assembly instruction from the bit manipulation extension.

In order to do that it is necessary to provide a description of the pattern that we want to optimize. That is done in a back-end by using the syntax of Tablegen:

def : Pat<(and GPR:$rs1, (not GPR:$rs2)), (ANDN GPR:$rs1, GPR:$rs2)>;

This example shows the description of a piece of the target independent SelectionDAG

def : Pat<(and GPR:$rs1, (not GPR:$rs2)), (ANDN GPR:$rs1, GPR:$rs2)>;

and the piece of the RISC-V SelectionDAG it must be subsituted with:

def : Pat<(and GPR:$rs1, (not GPR:$rs2)), (ANDN GPR:$rs1, GPR:$rs2)>;

ANDN being an assembly instruction we defined in RISCVInstrInfoB.td:

def ANDN : ALU_rr<0b0100000, 0b111, "andn">;

The RISCVInstrInfoB.td file contains both the definitions of the assembly instructions as described above and the definitions of these rules to pattern-match fragments of IR code into equivalent bit manipulation assembly instructions.

The patterns used to define the left side of the matching rules were taken by observing the DAG nodes right before the instruction selection, after the SelectionDAG had gone through the target-independent optimizations and legalization.

You can see information about the SelectionDAG by adding the --debug option to the command line of the static compiler llc:

llc --debug -mtriple=riscv32 -mattr=b foo.ll -o foo.s

The ouput shows the optimization and legalization phases that lead to the instruction selection. The SelectionDAG is displayed before the selection starts as follows:

SelectionDAG has 11 nodes:
  t0: ch = EntryToken
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
      t6: i32 = xor t4, Constant:i32<-1>
      t2: i32,ch = CopyFromReg t0, Register:i32 %0
    t7: i32 = and t6, t2
  t9: ch,glue = CopyToReg t0, Register:i32 $x10, t7
  t10: ch = RISCVISD::RET_FLAG t9, Register:i32 $x10, t9:1

===== Instruction selection begins: %bb.0 'entry'

ISEL: Starting selection on root node: t10: ch = RISCVISD::RET_FLAG t9, Register:i32 $x10, t9:1
ISEL: Starting pattern match
  Morphed node: t10: ch = PseudoRET Register:i32 $x10, t9, t9:1
ISEL: Match complete!

ISEL: Starting selection on root node: t9: ch,glue = CopyToReg t0, Register:i32 $x10, t7
ISEL: Starting selection on root node: t7: i32 = and t6, t2
ISEL: Starting pattern match
  Initial Opcode index to 32446
  Skipped scope entry (due to false predicate) at index 32449, continuing at 32914
  Match failed at index 32921
  Continuing at 33083
  Match failed at index 33087
  Continuing at 33249
  Match failed at index 33251
  Continuing at 33410
  Match failed at index 33413
  Continuing at 33469
  OpcodeSwitch from 33472 to 33634
  TypeSwitch[i32] from 33648 to 33651
  Morphed node: t7: i32 = ANDN t2, t4
ISEL: Match complete!

The selection goes through several attempts before recognising the xor node as a negation. It eventually finds the node ANDN as best match for the and node.

This approach though has a limitation: the SelectionDAGs are created one per basic block of code. That means that every branch creates two different DAGs.
Since during instruction selection LLVM looks at one SelectionDAG per time we could not make a single selection for a pattern that spans through multiple basic blocks.
For this reason we could not select instructions that implement operations that normally are coded with branches, loops or other flow control constructs.

An exception was brought by the cmov instruction. It is a ternary operation that according to the condition, that is stored in one parameter it selects one of the other two:

uint_xlen_t cmov(uint_xlen_t rs1, uint_xlen_t rs2, uint_xlen_t rs3)
{
  return rs2 ? rs1 : rs3;
}

We observed that also some simple branches are translated into single instances of cmov thus simplfying the selection process.

Some bit manipulation operations instead require more complex control flow constructs.

The operations clz, ctz and pcnt, that respectively count the number of leading zeroes, of trailing zeroes and of ones in the input value, all require a loop to iterate on all the bits.

There are examples of pattern-matching of loops in some LLVM middle-end passes that pattern match specific implementations of operations like counting leading/trailing zeroes and counting ones, with the corresponding intrinsic functions (llvm.ctlz.*, llvm.cttz.*, llvm.ctpop.*). At that stage though the effectiveness of pattern matching depends on the implementation.

A possible improvement would be to add an extra optimization stage after the instruction selection in the RISC-V back-end. That would look for any emerging pattern that could be substituted with other bit manipulation instructions.

Take for instance an implementation of a bit field placement like the following:

uint32_t _bfp(uint32_t rs1, uint32_t rs2)
{
  int len = (rs2 >> 24) & 15;
  int off = (rs2 >> 16) & 31;
  len = len ? len : 16;
  uint_xlen_t mask = rol(slo(0, len), off);
  uint_xlen_t data = rol(rs2, off);
  return (data & mask) | (rs1 & ~mask);
}

This operation places up to 16 least significant bits from rs2 to rs1. The upper bits of rs2 control the number of bits that are moved and where in rs1 they’re placed.

With simple pattern-matching during the instruction selection phase you can achieve the assembly code on the right (compared to the case without pattern-matching shown on the left):

    _bfp:                       # @_bfp        ||    _bfp:                       # @_bfp
    # %bb.0:                    # %entry       ||    # %bb.0:                    # %entry
            srli    a2, a1, 24                 ||            srli    a2, a1, 24
            andi    a2, a2, 15                 ||            andi    a2, a2, 15
            beqz    a2, .LBB35_2               ||            addi    a3, zero, -1
    # %bb.1:                    # %entry       ||            sll     a3, a3, a2
            addi    a3, zero, -1               ||            not     a3, a3
            sll     a2, a3, a2                 ||            lui     a4, 16
            not     a2, a2                     ||            addi    a4, a4, -1
            j       .LBB35_3                   ||            cmov    a2, a2, a3, a4
    .LBB35_2:                                  ||            cmix    a0, a2, a1, a0
            lui     a2, 16                     ||            ret
            addi    a2, a2, -1                 ||
    .LBB35_3:                   # %entry       ||
            and     a1, a2, a1                 ||
            not     a2, a2                     ||
            and     a0, a2, a0                 ||
            or      a0, a1, a0                 ||
            ret                                ||

With a second optimization on the after-selection code though we could pattern-match more easily the operation to a single instance of the bfp assembly instruction:

    _bfp:                       # @_bfp        ||    _bfp:                       # @_bfp
    # %bb.0:                    # %entry       ||    # %bb.0:                    # %entry
            srli    a2, a1, 24                 ||            bfp    a0, a1, a2
            andi    a2, a2, 15                 ||            ret
            addi    a3, zero, -1               ||
            sll     a3, a3, a2                 ||
            not     a3, a3                     ||
            lui     a4, 16                     ||
            addi    a4, a4, -1                 ||
            cmov    a2, a2, a3, a4             ||
            cmix    a0, a2, a1, a0             ||
            ret                                ||

It is unlikely that a user would implement any of the proposed bit manipulation operations on their own if they’re already available as assembly instructions. Nevertheless some patterns might happen to correspond to some of the more complex bit manipulation instructions described in this extension.

As the proposed extension is updated and new complex instructions and pseudo instructions are born on top of the basic ones, such further optimizations should be considered in order to make sure that the maximum potential of the extension is used.

Clang header file

Included in Clang is a header file, rvintrin.h, that provides C functions that can be used as an interface to access the extension’s instructions from a C program.

These are designed to translate into a single assembly instruction when using the Bitmanip extension.

This contains a function for each instruction in the Bitmanip extension. For example:
_rv_clz(...) corresponds to the clz instruction.

The specified target will determine which functions are available, to prevent instructions being used where they shouldn’t be. For example, the clzw instruction is only available on 64 bit RISC-V, so the corresponding _rv_clzw(...) function is also only present on 64 bit RISC-V; additionally there are aliases for these functions depending on the architecture, such as _rv32_clz(...) and _rv64_clzw.

If the Bitmanip extension is not present then C functions will provide the equivalent behaviour, ensuring these still work.

This can be included like so:
#include <rvintrin.h>

Performance tests with Embench

We tested the efficiency of our implementation against the collection of benchmarks contained in Embench.

In order to run the same tests as we did you’ll need to set up both Embench and our implementation.

You’ll need an installed RISC-V GNU tool chain to build LLVM+CLANG for RISC-V.

You can rely on the Embecosm’s RISC-V GNU tool chain repository to obtain all the necessary tools from upstream, plus gdbserver for RISC-V and the ri5cy and picorv32 verilog models (these are necessary for execution-speed tests).

As also explained in the instructions here you can run these commands to build the necessary tools:

mkdir riscv-toolchain
cd riscv-toolchain
git clone https://github.com/embecosm/riscv-toolchain.git setup
cd setup
./clone-all.sh
./build-targets.sh
./build-tools.sh

Then in another directory you can checkout the latter from Embecosm’s github LLVM repository on the branch riscv-bitmanip:

git clone git@github.com:embecosm/llvm-project.git
cd llvm-project
git checkout riscv-bitmanip

Here are the configuration options we used to build LLVM and CLANG:

cmake -G Ninja -DLLVM_ENABLE_PROJECTS=clang \
               -DBUILD_SHARED_LIBS=on \
               -DDEFAULT_SYSROOT=/path/to/installed/riscv-gnu-toolchain/install/riscv32-unknown-elf/ \
               -DGCC_INSTALL_PREFIX=/path/to/installed/riscv-gnu-toolchain/install/ \
               -DLLVM_DEFAULT_TARGET_TRIPLE=riscv32-unknown-elf \
               ../llvm

You can now download Embench and build the benchmarks for the tests (it is not necessary to install it).
We propose here to run the size test as we’re still working on setting up reliable speed tests based on verilog emulation of the new bit manipulation instructions.

Checkout Embench:

git clone https://github.com/embench/embench-iot.git

Build the size test:

cd embench-iot
export PATH=/path/to/riscv-toolchain/install/bin/:$PATH
./build_all.py --arch riscv32 --chip size-test-llvm --board ri5cyverilator --cc /path/to/llvm-project/build/bin/clang --cflags '-Xclang -target-feature -Xclang +b'

The option -target-feature is directed to the front-end clang-cc1 by the option -Xclang and says that the following string is the label of a feature of the back-end that needs to be enabled.
The feature we’re enabling is the feature ‘b’. The feature we introduced earlier to enable the bit manipulation extension in the RISC-V back-end.

Run the test:

./benchmark_size.py

The default output is with values relative to the baseline of the benchmarks in Embench. Such baseline is described in ./baseline-data/size.json.

If you want absolute sizes (bytes) you just need to run the script like this:

./benchmark_size.py --absolute

Here is a graph showing the relative sizes of the benchmarks with regard to the baseline. The graphs compares the relative sizes produced by the master branch of LLVM (dark blue) and the relative sizes produced by the riscv-bitmanip branch of LLVM (light blue), that is the branch containing our patch.

The results show some improvements on several benchmarks, above all the nettle-sha256 (), a program that makes great use of bit manipulation operations.

Uncommitted changes

Our repository contains some patches that are currently under review but not upstream yet.
There’s a branch that contains our bit manipulation implementation placed on top of such patches: ljr-saverestore-shrinkwrap-machineoutliner-riscv-bitmanip

The same benchmarks run on such branch give the following results.

Share this article