GCC support for the draft Bit Manipulation Extension for RISC-V

Introduction

One of the currently proposed draft ISA extensions for RISC-V is the Bit Manipulation Instructions extension (from henceonwards referred to as the “Bitmanip” or “BMI” extension.) It proposes to provide fast and direct instructions for commonly-used bitwise operations, often found in cryptographic, logarithmic, bit-counting, and logical operations. The GCC toolchain work has been developed by the community with contributions from SiFive and myself at Embecosm, providing bug fixes, regression tests, and benchmarks.

The specification proposes 95 new instructions (including all variants) spread across 9 subclasses, some of them overlapping. These are organised as shown in the following figure.


The GCC backend code generator currently supports approx. 31 of 95 total instructions. This means that it is possible for GCC to emit these instructions when compiling generic C or C++ code, if it is judged to be an improvement. The following figure highlights such instructions.


Other instructions can be emitted via intrinsic functions declared in a header called “rvintrin.h”, that is bundled together with the compiler. This is identical to how “xmmintrin.h” is bundled with the GCC x64/x86 Intel compiler, except with the difference that currently the intrinsincs are defined as inline assembly blocks, rather than functions known by the compiler.

Quick Start

This section will focus on providing the reader with guidance to quickly get started with a Bitmanip compiler. At a minimum, you will need to build a stage-1 GCC and Binutils from source. Optionally, you may also want to build stage-2. The repository for the GCC and Binutils branches may be found here:

GCC: Checkout branch “riscv-bitmanip”: https://github.com/embecosm/riscv-gcc
Binutils: Checkout branch “riscv-bitmanip”: https://github.com/embecosm/riscv-binutils-gdb

Below is an example test program that should give you different results upon enabling Bitmanip
instructions.

#include <stdint.h>

enum E { E0, E1, E2, E3 };

#define SOME_STATUS (1ULL << 50)
#define ANOTHER_STATUS (1ULL << 58)

void
f (enum E e, uint64_t *v)
{
  switch (e)
    {
    case E0:
      *v |= SOME_STATUS;
      break;

    case E1:
    case E2:
    case E3:
      *v |= ANOTHER_STATUS;
      break;
    }
}

Compiled with riscv64-unknown-elf-gcc -march=rv32imafdcb_zbb -O2 -S -o test.s test.c, the assembly should look as follows:

f:						f:
	beq	a0,zero,.L2				beq	a0,zero,.L2
	addiw	a0,a0,-1				addiw	a0,a0,-1
	li	a5,2					li	a5,2
	bgtu	a0,a5,.L7				bgtu	a0,a5,.L7
	ld	a5,0(a1)				ld	a5,0(a1)
	sbseti	a5,a5,58		      |		li	a4,1
					      >		slli	a4,a4,58
					      >		or	a5,a5,a4
	sd	a5,0(a1)				sd	a5,0(a1)
	ret						ret
.L2:						.L2:
	ld	a5,0(a1)				ld	a5,0(a1)
	sbseti	a5,a5,50		      |		li	a4,1
					      >		slli	a4,a4,50
					      >		or	a5,a5,a4
	sd	a5,0(a1)				sd	a5,0(a1)
	ret						ret
.L7:						.L7:
	ret						ret

This particular example demonstrates the usage of the sbset instruction, which is used to set a particular bit index to one.

Code Generation

As previously mentioned, not all of the new Bitmanip instructions are known to the RTL backend. This section will focus on those that are, and demonstrates the typical code cases for which they may be selected.

Direct instruction generation

GCC has naitive support for some bit counting instructions. Native in this context means that GCC intrinsically understands the notion of counting bits, trailing bits and leading bits, such that it has a dedicated named RTL instruction for it. Consequently all that is necessary, is to define an RTL template for it, and GCC will automatically consider it for code generation. From the bitmanip.md backend file:

(define_insn "clzsi2"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (clz:SI (match_operand:SI 1 "register_operand" "r")))]
  "TARGET_ZBB"
  { return TARGET_64BIT ? "clzw\t%0,%1" : "clz\t%0,%1"; }
  [(set_attr "type" "bitmanip")])

...

(define_insn "ctzsi2"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (ctz:SI (match_operand:SI 1 "register_operand" "r")))]
  "TARGET_ZBB"
  { return TARGET_64BIT ? "ctzw\t%0,%1" : "ctz\t%0,%1"; }
  [(set_attr "type" "bitmanip")])

...

(define_insn "popcountsi2"
  [(set (match_operand:SI 0 "register_operand" "=r")
        (popcount:SI (match_operand:SI 1 "register_operand" "r")))]
  "TARGET_ZBB"
  { return TARGET_64BIT ? "pcntw\t%0,%1" : "pcnt\t%0,%1"; }
  [(set_attr "type" "bitmanip")])

Indirect instruction generation

Some instructions are used in context of other operations. One such example is the “grev” instruction (Generalized reverse). A special case of this instruction is to reverse the 16-bit quantities within a register word, which is a direct implementation of one of the GCC generic intrinsic functions, __bswap16. This provides a single instruction implementation to what would otherwise take over 10 instructions.

Another such example is the zero extend operation. Armed with the Bitmanip extension, the GCC backend now has an additional strategy for performing zero-extend operations, using the “pack” instruction. This instruction has the semantics of taking the lower 32-bits of two source register operands, concatenating them together, and storing the result in the destination register. If one were to make the 2nd source register r0, the upper 32 bits of the destination register will be forced to zero.

Finally, the riscv.c generic backend code has also learned some new tricks with the help of Bitmanip. One such example is the loading of constants that happen to be a power of two. Normally, this operation is performed by the “li” psuedo-op, which itself expands to 3 instructions. With Bitmanip, we can simply use “sbseti”, with the index of the bit as an immediate argument.

Benchmarks

In order to quantify the effectiveness of the Bitmanip support implementation in GCC, we used the Embench benchmarking suite for both code size, and code performance measurements.

Code size

One of the primary motivations for the Bitmanip ISA development is to increase code density.


There is a very strong positive result for nettle-sha256. This is due to heavy use of rotate instructions. Other benchmarks do not see such a significant decrease in code size. crc32 in this case is something of an outlier since the decrease is only 4 bytes, and also due to the fact that the draft defines a CRC32 instruction that the code generator does not know about.

Another example of code size decrease can be found in the GCC runtime support library, libgcc. In particular, the generic soft-float implementation experiences code size decrease mainly thanks to the presence of bitcounting instructions clz, ctz, pcnt, thus permitting the compiler to select fast paths within the implementation.

Runtime performance

To measure runtime performance, we connected the PicoRV32 implementation together with the reference implementation of a Bitmanip ALU, and then ran the Embench testsuite through an instance of GDB, connected to the Embecosm Debug Server, which was itself connected to a verilated model of the above system.


The results were quite similar to that of code size. Again, nettle-sha256 benefited the most from the additional instructions, with a 40% decrease in runtime. The rest stayed within 5%, and as usual there is an outlier, in this case it being the qrduino benchmark. We have not yet investigated why it has regressed. Nonetheless, this shows a promising start for what is an experimental compiler backend.

Instruction subset selection

In the introduction, it was mentioned that the BMI extension defines 95 instructions spread across 9 classes. The draft specification goes into further detail as to the motivation behind this, which we will omit here. Instead, this section focuses on the user-facing configurations available.

You may select one or multiple subsets for code generation by simply passing them in alphabetical order, like so:

-march=rv32imc_zbc_zbe
-march=rv32imc_zbc_zbf_zbm

You may also select the “default” Bitmanip group which includes zbb, zbc, zbe, zbf, zbp, and zbs (all except zbm, zbr, zbt) by simply passing ‘b’ in the primary architecture string:

-march=rv32imc

So, in order to have the entire bitmanip ISA available, the shortest way to indicate this is as below:

-march=rv32imcb_zbm_zbr_zbt

Note, that the string you pass in for the -march directive will also be passed onto the assembler, which will use it to decide which assembly instructions are legal.

Conclusion

The Bitmanip ISA has a real future in terms of boosting both code performance, and improving code density. There are still several codegen opportunities in the machine description that could be filled in, but otherwise this ISA extension is just waiting for users.