Non-8-bit char support in Clang and LLVM

Recently I spoke at FOSDEM about support for non-8-bit characters in LLVM. This is something I have been working on as part of bringing up a backend for a new DSP architecture. The architecture is entirely 16-bit word-addressed, so we endeavoured to support this natively in LLVM.

The C language has the concept of CHAR_BIT: the width in bits of the ‘char’ type, defined in limits.h. The result of the sizeof() operator is measured in units of ‘char’, so a type’s size is its width in bits divided by CHAR_BIT. According to the C standard, CHAR_BIT must be >= 8. In general, the only time a programmer will care about the value of CHAR_BIT is when doing low-level bit operations which require knowing the number of bits in a field.
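As a concrete example, sizeof() counts in units of ‘char’, which the following standard C++ makes visible:

    #include <climits>
    #include <cstdio>

    int main() {
      // sizeof() counts chars, so a 32-bit type occupies 32 / CHAR_BIT
      // chars: 4 when CHAR_BIT is 8, but only 2 when CHAR_BIT is 16.
      std::printf("CHAR_BIT     = %d\n", CHAR_BIT);
      std::printf("sizeof(long) = %zu\n", sizeof(long));
      return 0;
    }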

Although the C standard permits CHAR_BIT != 8, on most commonly used architectures it is 8. For efficiency it is best if CHAR_BIT matches the minimum addressable unit of memory, as this means that loads and stores of character values map down to a single instruction. However, even architectures whose memory is addressed in units wider than 8 bits will often specify CHAR_BIT == 8. There are several reasons why this is desirable:

  1. A lot of programs are written assuming CHAR_BIT == 8, because they will never target an architecture where that is not the case. Many programs also take advantage of the uint8_t and int8_t types provided by C99, even though these types are optional and need not exist on every implementation (see the snippet after this list).
  2. The POSIX standard mandates that CHAR_BIT == 8, which maximizes compatibility with networking code, since networking deals almost exclusively in octets.
  3. With a minimum addressable unit wider than 8 bits, a ‘char’ occupying a whole unit wastes space. It is therefore beneficial, where possible, to pack multiple ‘char’ values into each memory location.
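As an aside on point 1, C99 defines the UINT8_MAX macro exactly when uint8_t exists, so portable code can detect the optional type at compile time (the ‘octet’ name below is just illustrative):

    #include <cstdint>

    // uint8_t need not exist on a CHAR_BIT == 16 implementation, but
    // uint_least8_t always does. UINT8_MAX is defined exactly when
    // uint8_t is, which gives a compile-time test.
    #ifdef UINT8_MAX
    typedef std::uint8_t octet;        // exact 8-bit type is available
    #else
    typedef std::uint_least8_t octet;  // at least 8 bits, always present
    #endif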

There are a lot of widely used architectures where the minimum addressable unit is 8 bits; however, there are also a lot of small embedded systems where this is not the case. Many embedded systems are word-addressed with 16-bit words, and some architectures have 10-bit or 24-bit words. In particular, DSPs often have 16 bits as the minimum addressable unit.

Where an architecture uses CHAR_BIT == 8 but the minimum addressable unit is not 8 bits, the compiler will sometimes need to emit extra code whenever the ‘char’ type is used. Whenever a ‘char’ is loaded or stored, the minimum addressable unit is loaded or stored instead, and extra code may be emitted to truncate or sign-extend values to make them word length. If multiple ‘chars’ are packed into a memory location, then code also needs to be emitted to calculate the address and mask off the correct part of the loaded value.
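Conceptually, the emitted code behaves like the sketch below, which packs two 8-bit chars into each 16-bit word. This is only an illustration of the access pattern, not any particular target’s lowering:

    #include <cstdint>

    static std::uint16_t memory[1024];  // each element models one 16-bit word

    std::uint8_t load_char(unsigned index) {
      std::uint16_t word = memory[index / 2];  // one native 16-bit load
      unsigned shift = (index % 2) * 8;        // select low or high half
      return (word >> shift) & 0xff;           // truncate to the char value
    }

    void store_char(unsigned index, std::uint8_t value) {
      unsigned shift = (index % 2) * 8;
      std::uint16_t &word = memory[index / 2];
      // A read-modify-write is needed: a plain store would clobber the
      // other char sharing the word.
      word = (word & ~(0xff << shift)) | (value << shift);
    }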

The above handling gives a performance and code size hit, which is not desirable in embedded systems where both come at a premium. It is possible to sidestep the problem by avoiding the ‘char’ type when programming these systems. However, this essentially puts strings off limits, while introducing a performance gotcha whenever the programmer ports code or uses a ‘char’ type. There are also places where ‘char’ cannot be avoided, such as in library calls: ‘memcpy’ and ‘memset’ in particular are used very frequently, are often generated by the compiler, and really need to be fast.

The above downsides may be acceptable, but if they are not then the compiler needs to support CHAR_BIT != 8. Supporting CHAR_BIT != 8 without any trickery makes it possible to use portable legacy code, tests and benchmarks without concerns about performance or code size. Clang and LLVM are capable of supporting this, although some modifications are required.

Support in LLVM and Clang

At first glance, LLVM and Clang support for non-8-bit chars looks straightforward. The LLVM intermediate representation specifies all types in terms of their size in bits, as does the DataLayout of a target. In Clang, targets can also specify CharWidth to change the value of CHAR_BIT.
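For example, every number in a DataLayout string is a bit count. The layout below is purely hypothetical (it does not correspond to an in-tree target), but it shows how a 16-bit machine’s pointer and integer sizes would be described; the Clang side would pair this with a CharWidth of 16:

    #include "llvm/IR/DataLayout.h"

    // Hypothetical little-endian target: 16-bit pointers and 16-bit
    // alignment throughout. Every field is expressed in bits; nothing
    // here states how many bits one address unit covers, which is the
    // missing piece described below.
    llvm::DataLayout DL("e-p:16:16-i16:16-i32:16-i64:16");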

Unfortunately, on closer inspection it is not quite so straightforward. There is nothing inherent in the design of LLVM which prevents non-8-bit chars from working; in fact, the clean design of LLVM makes this far easier than it could be. However, there are a lot of small changes which need to be made in order to support it:

  1. LLVM has no interface for a target to express the size of the minimal addressable unit.
  2. Hard-coded assumptions of 8-bit chars. There are a lot of places where /8 or *8 are hard-coded into the codebase. These must be replaced with queries for the size of the target's minimum addressable unit (see the sketch after this list). This is primarily a mechanical change, but it also involves changing a number of interfaces.
  3. Many sizes are expressed in units of bytes. Some of these can be changed to bit sizes, while others need to become multiples of the minimum addressable unit.
  4. Numerous calls to getInt8PtrTy. These are sometimes created in optimizations, but also appear because LLVM does not have a void* type. Again this is a mechanical change: such calls can generally be replaced with calls to getIntNPtrTy.
  5. Updating intrinsics with i8* parameters/returns to use a ‘char’ pseudo-type instead. This means that important intrinsics like memset/memcpy will use the correct type.
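To make points 2 and 4 concrete, the change typically has the following shape. getInt8PtrTy and getIntNPtrTy are existing LLVM APIs; getMinAddressableUnit() is hypothetical, standing in for the new target interface of point 1:

    #include "llvm/IR/DataLayout.h"
    #include "llvm/IR/DerivedTypes.h"

    unsigned sizeInChars(const llvm::DataLayout &DL, unsigned SizeInBits) {
      // Before: return SizeInBits / 8;  (hard-coded 8-bit assumption)
      // After: ask the target. getMinAddressableUnit() does not exist
      // upstream; 16 stands in for its result on a 16-bit target.
      unsigned CharBits = /* DL.getMinAddressableUnit() */ 16;
      return SizeInBits / CharBits;
    }

    llvm::PointerType *charPtrTy(llvm::LLVMContext &Ctx, unsigned CharBits) {
      // Before: return llvm::Type::getInt8PtrTy(Ctx);
      return llvm::Type::getIntNPtrTy(Ctx, CharBits);
    }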

None of these changes require any major redesign of any LLVM components. It is mainly just a large number of manual changes to the compiler. For our out-of-tree target we have made most of the necessary changes to the compiler, and we’re working on teasing these changes out and submitting them as small patches.

Because none of these changes require any major redesign of LLVM, it should be possible to add them in an incremental manner. The process will probably look something like the following:

  1. Add an interface for targets to specify the minimum addressable unit.
  2. Incrementally migrate cases where an 8-bit addressable unit is assumed to use this new interface. Replace uses of Int8PtrTy with IntNPtrTy. Update sizes to be in multiples of the minimum addressable unit size.
  3. Update intrinsics to use a special char* type instead of i8*.

One missing piece is a target to act as a test bed for this new behavior. At Embecosm, we have been working on AAP for just this purpose. At the moment AAP has 8-bit byte-addressed memory; however, since the purpose of the architecture is to act as a test case for interesting features, we are creating a 16-bit word-addressed version of it in order to support non-8-bit characters.