Flexible Runtime Testing of LLVM on Embedded Systems

The LLVM Integrated Tester (aka lit) and LLVM Nightly Test infrastructure (aka lnt) systems are the primary means of testing LLVM tools. Lit is the driver system for the regression tests within LLVM and tests the non-target-specific behavior of the compiler. The nightly tests, lnt, comprise complete programs that are used to test the execution of generated code on a particular target. Together lit and lnt form the LLVM testing infrastructure. While the regression tests are useful, the test suite is not well suited to an embedded system lacking an OS and file system, and some changes need to be made to allow the tests to execute remotely in this environment.

In this post, I expand on some of the alterations we have made to run the test suite remotely through the GNU Debugger, GDB, and I will also examine how we can use the GCC regression test suite with Clang as a (very large) additional set of execution tests.

A little clarification

The official documentation of LLVM is sloppy in places, using the generic term “test suite” without making clear whether it means the tests run by lit, the tests run by lnt, or both. To confuse things further, lit and lnt are the names of tools that run tests, but are quite often used to refer to the sets of tests to which they are usually applied. It does not help that both sets of tests can be run in different ways, or that the tools can be used to run other tests.

In this blog post we’ll keep things simple.  Lit and lnt are the tools to run tests.  They respectively run the LLVM regression tests (which involve no execution) and the nightly tests (which do involve execution and therefore relate to the particular target on which they run).

Running lnt using GDB

The LLVM test suite has the capability to run through a remote shell, but for a lightweight embedded system this is not feasible. If a target has GDB support then we can potentially debug remotely through the GNU Remote Serial Protocol (RSP). However, this is not supported out of the box and some changes need to be made in order to make it work. For running the test suite through GDB, the target and GDB server also need to support the File I/O Remote Protocol Extension, which allows the target to use the client file system and terminal to execute system calls. This is necessary as lnt operates by comparing the terminal output of the program against a reference output.

To run on targets where a remote shell is not supported (i.e. most deeply embedded targets) there is a hook in lnt, the RUNUNDER environment variable. This can be used to provide a wrapper which handles the execution of a compiled test, for example by running the program through a simulator on the client machine.

This hook can also be used to execute remotely through GDB. I have created a small GDB wrapper script (available from the flexible-llvm-testing repository on the Embecosm GitHub) which performs the necessary steps:

  • connecting to the remote target;
  • executing the program until it reaches a breakpoint; and
  • extracting the return value.

In order to produce the output, we capture any output printed by the target using File-I/O, and then echo it to stderr when the program is complete.

The wrapper script is very simple, using the pexpect Python library to handle GDB. Lnt should be run as usual for cross-compiling, but the --make-param option should be passed to set RUNUNDER to the path of the script. For example:

lnt runtest nt -j 1 \
 --sandbox /path/to/sandbox \
 --cc compile-command \
 --test-suite /path/to/test-suite \
 --make-param=RUNUNDER=/path/to/gdb-wrapper.py

This will need a GDB server to be set up for the target.  There are also currently a number of parameters hard-coded in the script, such as the GDB server IP address and port, which need to be set as appropriate.
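The three steps performed by the wrapper can be sketched roughly as follows. This is not the script from the repository: it drives GDB in batch mode through the standard subprocess module instead of pexpect, and the server address, the exit symbol and the expression holding the return value are hypothetical placeholders which depend on the target and its ABI.

```python
import re
import subprocess
import sys

# Assumptions for illustration only; none of these values come from the real script.
GDB = "gdb"                  # usually a cross GDB for the target
REMOTE = "localhost:51000"   # hypothetical GDB server address and port
EXIT_SYMBOL = "_exit"        # hypothetical symbol reached when the test finishes
EXIT_CODE_EXPR = "$r0"       # hypothetical location of the first argument on the target


def gdb_command_line(program, remote=REMOTE):
    """Build a batch-mode GDB invocation covering the three wrapper steps."""
    steps = [
        f"target remote {remote}",            # connect to the remote target
        "load",                               # download the program to the target
        f"break {EXIT_SYMBOL}",               # stop when the test has finished
        "continue",                           # run until the breakpoint is reached
        f"print/d (int)({EXIT_CODE_EXPR})",   # extract the return value
    ]
    argv = [GDB, "-batch", "-nx", program]
    for step in steps:
        argv += ["-ex", step]
    return argv


def parse_exit_code(gdb_output):
    """Return the last value printed by GDB's print command, or None if there is none."""
    matches = re.findall(r"^\$\d+ = (-?\d+)\s*$", gdb_output, flags=re.MULTILINE)
    return int(matches[-1]) if matches else None


def main(argv):
    """Run one test program and return its exit code (1 if it could not be read)."""
    result = subprocess.run(gdb_command_line(argv[1]), capture_output=True, text=True)
    # Echo everything once the program is complete. A real wrapper would need to
    # separate the target's File-I/O output from GDB's own messages.
    sys.stderr.write(result.stdout)
    code = parse_exit_code(result.stdout)
    return code if code is not None else 1

# As a script this would end with: sys.exit(main(sys.argv))
```

Taking the server address as a parameter rather than a constant is one way to avoid the hard-coded values mentioned above.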

Running GCC regression with Clang

GCC regression is a very large and comprehensive set of small, portable tests, used to test the GNU Compiler Collection.  At the time of writing it comprises around 75,000 C tests and 50,000 C++ tests.  Importantly it uses a POSIX 1003.3 test harness, DejaGnu, which has out-of-the-box support for running on remote targets through GDB. The downside of this test suite is that it is not entirely compatible with Clang, with some options and optimizations having minor differences in semantics between the compilers.

An important benefit of the tests in the regression suite is that they are all self-testing, and do not require any reference output. This is ideal for embedded systems where I/O is not always available. The suite is incredibly useful for finding errors in a new backend which does not have mature tools or libraries.

Running the GCC regression suite with clang as the compiler is relatively straightforward. The test suite is started in the same way as for standard GCC regression; the only extra step is to make it use clang as the compiler rather than GCC. This can be done in a couple of different ways:

  1. Clang can be specified in the DejaGnu board specification for the target, by adding the following lines to the board definition file:

set_board_info compiler "clang"
set GCC_UNDER_TEST "clang"

where “clang” is the command run to execute the compiler. Additionally, extra flags can be set up here:

set_board_info cflags "-fmessage-length=0 -fno-color-diagnostics -fno-caret-diagnostics"

These flags help clang to produce diagnostics closer to those expected by the test suite.

  2. Alternatively, clang can be specified through the runtest command itself, with the --tool_exec flag:

runtest --tool_exec="clang -fmessage-length=0 -fno-color-diagnostics -fno-caret-diagnostics"

If the tests are run through make check, the options can be passed in using the RUNTESTFLAGS variable:

make check RUNTESTFLAGS='--tool_exec="clang -fmessage-length=0 -fno-color-diagnostics -fno-caret-diagnostics"'

Regression can now be run and will yield a set of test results. We next need to overcome the incompatibility between the GCC test suite and clang.

The first set of problems which will be encountered is ‘test for excess errors’ and ‘test for excess warnings’ failures. These tests fail because the compiler produced output when the test was expecting none. One of the main causes of these failures is warnings about unused arguments, which can be alleviated by adding -Qunused-arguments to the command line. Another cause is extra warnings triggered by Clang, as well as differences in how warnings and errors are handled. Some GCC warnings are errors in Clang, and some errors and warnings are triggered in different circumstances by each.

Warnings and errors can be selectively suppressed (as appropriate) by passing them in through CFLAGS. For some tests this is insufficient as there may be unsupported flags provided as part of the flag combinations handled by the test. There is another mechanism which can be used to do more complicated things with the flags passed to Clang, called QA_OVERRIDE_GCC3_OPTIONS or CCC_OVERRIDE_OPTIONS (depending on the version of Clang). This can be used to override and add/remove flags provided to the command. In general this is neither necessary nor appropriate.

Altering the flags can cause clang to mimic the behaviour of GCC very closely, but there will still be a large number of outstanding tests which need to be omitted, or have their expected result changed for Clang. There are three ways of handling this:

1) The most straightforward way to handle this is to have a script which temporarily moves test files out of the test suite before it is run. This is remarkably effective, but it cannot do anything except blacklist tests, and can miss some tests which are generated by the suite itself.

2) A second approach is to diff the test summary against a reference output. This does work, but depends on the structure of the summary being unchanging. It also requires the whole test suite to be run, including the unsupported tests, which can be time consuming if tests take a while to run or too many tests time out.

3) The approach sanctioned by the test suite itself is to mark the expected results of tests by annotating the source file with dg-options or by adding a .x file to exclude the test. Tests can be marked as XFAIL and UNSUPPORTED based on the target triple, as well as a number of different differentiators. It is relatively straightforward to add clang and llvm as an additional differentiator for the test suite. This is included in our changes discussed later.
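The second approach can be sketched in a similar way. The example below assumes the usual DejaGnu summary layout of one "RESULT: test name" line per test and only recognises the result kinds listed in the pattern; MISSING is a label invented here for tests which appear in only one of the two summaries.

```python
import re

# Result kinds recognised by this sketch; other lines in the summary are ignored.
RESULT_LINE = re.compile(
    r"^(PASS|FAIL|XPASS|XFAIL|UNSUPPORTED|UNRESOLVED|UNTESTED): (.+)$"
)


def parse_summary(text):
    """Map each test name in a summary to its result (the last occurrence wins)."""
    results = {}
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results[match.group(2)] = match.group(1)
    return results


def compare_summaries(reference, current):
    """Return (test, reference result, current result) for every test which differs."""
    ref = parse_summary(reference)
    cur = parse_summary(current)
    changes = []
    for name in sorted(ref.keys() | cur.keys()):
        old = ref.get(name, "MISSING")
        new = cur.get(name, "MISSING")
        if old != new:
            changes.append((name, old, new))
    return changes
```

Comparing by test name rather than by raw text diff makes the check insensitive to the order of the lines, but it still depends on the test names themselves staying stable.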

Long term, the third option is the ideal solution. However, updating each unsupported test is time consuming and any changes to the test suite must be maintained when tests are changed or added if the target is not in GCC trunk.

Our solution to this problem is a patch to DejaGnu which adds a global override. The override takes the form of a file provided on the command line to the runtest command.

Each entry in the override specifies the test file and the new expected result (PASS, XFAIL, UNSUPPORTED). There are also optional fields which specify flags and the test-type (for example compile or execute) which allow the test to be overridden at much finer granularity (for example, only for the execution stage when it is compiled at -O1).
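To make the idea of granularity concrete, here is a small sketch of how such a lookup could behave. It does not reflect the implementation in the patched DejaGnu or the format of the override file: the Override class, the lookup function and the rule that the most specific matching entry wins are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Override:
    test: str                     # test file, e.g. "gcc.dg/pr20216.c"
    result: str                   # PASS, XFAIL or UNSUPPORTED
    flags: Optional[str] = None   # None matches any flags
    stage: Optional[str] = None   # None matches any test-type (compile, execute, ...)


def lookup(overrides, test, flags, stage):
    """Return the overridden result for a test, or "ABSENT" if no entry matches."""
    best = None
    best_score = -1
    for entry in overrides:
        if entry.test != test:
            continue
        if entry.flags is not None and entry.flags != flags:
            continue
        if entry.stage is not None and entry.stage != stage:
            continue
        # Assumed rule: an entry which names more fields is more specific.
        score = (entry.flags is not None) + (entry.stage is not None)
        if score > best_score:
            best, best_score = entry, score
    return best.result if best is not None else "ABSENT"
```

With one entry for the whole file and a second restricted to execution at -O1, a query for the execution stage at -O1 would pick up the second entry, while every other combination would fall back to the first.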

The patched DejaGnu and GCC regression suites are available on the Embecosm GitHub. There is also an example override file which shows the format of the entries in the flexible-llvm-testing repository. In order to run with the override file, the file is provided to the runtest command through the --override_manifest flag.

It is possible to experiment with the override file from the command line. The changes to DejaGnu introduce a new Python script called override_test.py, which is found alongside runtest in the DejaGnu install/ directory and in the lib/ directory of the DejaGnu source. This can be run with the override file as an argument, for example:

override_test.py override_manifest

Once the script prints READY a query can be specified on stdin in the same format as the entries in the override file:

[gcc.dg/pr20216.c] [-O1] [execute]

The script will then print the overridden result (for example XFAIL or UNSUPPORTED), or ABSENT if a matching entry is not found.

Even more testing in the future

The use of GCC regression tests with Clang/LLVM is not new. Apple and Intel used the GCC 4.2.2 tests with their compilers in 2007. However, we need to be able to move forward. The GCC test suite is now much larger and includes tests for the latest 2011 and 2014 C/C++ standards. With the database approach described, it is not hard to roll the tests forward to the latest versions, nor to include other tests such as the GDB regression tests, or the binutils and newlib test suites.

The patch to DejaGnu is entirely generic and we’ll be submitting it upstream for inclusion in the standard tool.  Then any test suite will be able to provide such an external hook.