Services - tools - models - for embedded software development

7.4.  Profiling the Completed Model

The final stage is to look at the finished model for any modules which are dominating the compute time. These are candidates for replacement with equivalent modules optimized for cycle accurate modeling.

Common causes of performance bottlenecks are:

Verilator provides the -profile-cfuncs flag, which adds information to the compiled code identifying the module to which each function belongs. Compiling the model with the GNU C++ compiler's -g and -pg flags instruments the compiled code for profiling. A subsequent run generates a gmon.out file, which can be analyzed using the standard gprof command.

Verilator provides a utility, verilator_profcfunc, for post-processing the output of gprof. This breaks out the processing time by Verilog module name, rather than by the underlying C++ function.

When profiling, no optimization should be used. Although the GNU C++ compiler allows optimized profiling, the results can be confusing when parts of the code are optimized away. Unoptimized models are just as effective in highlighting any performance bottlenecks. With the example design, the following sequence of commands is appropriate:

make verilate COMMAND_FILE=cf-optimized-8.scr \
     VFLAGS="-profile-cfuncs" NUM_RUNS=1000 OPT="-g -pg"
gprof Vorpsoc_fpga_top > gprof.out
verilator_profcfunc gprof.out vprof.out
      

The first part of the output file, vprof.out, identifies where the execution time went:

Overall summary by type:
  % time  type
    4.62  C++
   17.45  Common code under Vorpsoc_fpga_top
   72.74  Verilog Blocks under Vorpsoc_fpga_top
    5.19  Unaccounted for/rounding error
      

The C++ code is code outside the Verilator model. In the example used here, that is the SystemC test bench. The common code under Vorpsoc_fpga_top is the common infrastructure code. The Verilog blocks are the C++ code derived directly from the Verilog. Finally, there is time that was spent outside profiled code. In this example, that will be largely due to the SystemC kernel, but since gprof is based on statistical sampling it also includes a small amount of time which cannot be accounted for.

There is nothing significant in this example. A warning sign to watch for is either the C++ or the unaccounted figure being very high. That could indicate a problem with the SystemC test bench, perhaps one with very wide ports.
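The check described above can be automated. The following is a minimal sketch, not part of Verilator: the function name check_overhead is hypothetical, and the default limit of 25% is an assumed threshold, since the text gives no figure for "very high". It also assumes the "Overall summary by type" section ends at the first blank line, as in the output shown.

```shell
# Warn when time spent outside the Verilator model looks high.
# Usage: check_overhead FILE [LIMIT]
# LIMIT is a hypothetical threshold in percent (default 25).
check_overhead() {
    LC_ALL=C awk -v limit="${2:-25}" '
        /^Overall summary by type:/ { in_section = 1; next }   # section start
        in_section && NF == 0 { exit }                         # blank line ends it
        # Only the C++ and "Unaccounted for/rounding error" rows are checked.
        in_section && ($2 == "C++" || $2 == "Unaccounted") && $1 + 0 > limit {
            printf "warning: %s is %.2f%% of run time\n", $2, $1
            flagged = 1
        }
        END { exit (flagged ? 1 : 0) }                         # non-zero if anything flagged
    ' "$1"
}

# Example: check_overhead vprof.out 30
```

The function prints one line per flagged row and returns a non-zero status if any row exceeds the limit, so it can be used in a script that gates further analysis.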

The next section is a summary of the same information, grouping the common code and Verilog blocks:

Overall summary by design:
  % time  design
    4.62  C++
   90.19  Vorpsoc_fpga_top
    5.19  Unaccounted for/rounding error
      

In both of these summaries, instantiating more than one model would add further entries.

The third section is the most important. It shows how the execution time was broken down by originating Verilog module:

Overall summary by module:
  % time  module
    4.62  C++
   17.45  Vorpsoc_fpga_top common code
    0.11  dbg_crc8_d1
    0.00  dbg_register
    0.17  dbg_registers
    0.76  dbg_sync_clk1_clk2
    ...
      

This is provided in alphabetical order, but it is useful to cut out this section and sort it (using the command sort -n -r):

   17.45  Vorpsoc_fpga_top common code
    7.69  eth_wishbone_4
    5.17  or1200_du
    5.05  uart_regs_2
    4.62  C++
    3.77  tc_top
    3.41  eth_registers
    3.38  eth_crc
    3.07  dbg_top_3
      

The common code can be ignored, since it is beyond the user's control. Look for any small modules that use a disproportionate share of the processing time.
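Rather than cutting the section out by hand, the extraction and sort can be scripted. This is a sketch under stated assumptions: the function name modules_by_time is hypothetical, and the section is assumed to start at the "Overall summary by module:" line and end at the first blank line or the end of the file.

```shell
# Print the "Overall summary by module" rows of a verilator_profcfunc
# report, largest share of execution time first.
# Usage: modules_by_time FILE
modules_by_time() {
    LC_ALL=C awk '
        /^Overall summary by module:/ { in_section = 1; next }  # section start
        in_section && NF == 0 { exit }                          # blank line ends it
        in_section && $1 ~ /^[0-9]+\.[0-9]+$/ { print }         # keep data rows, skip the header
    ' "$1" | LC_ALL=C sort -n -r                                # same sort as in the text
}

# Example: show the ten largest users of execution time.
# modules_by_time vprof.out | head -n 10
```

Setting LC_ALL=C keeps the decimal point interpretation independent of the user's locale.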

Note

The names used are those of the originating file, not the module, with any hyphen ("-") mapped to an underscore ("_"). Thus the first entry here is the module eth_wishbone, which is defined in the file eth_wishbone-4.v.

There are no really big CPU hogs in this example. The largest user, eth_wishbone-4.v, uses over 7% of the execution time, but it is a large block (more than 2,500 lines of Verilog), so this is not unreasonable. The other modules at the top of the list are also all big blocks of code.

It is worth observing that in the current model, the Ethernet is tied off and unused. If there is no intention to develop the model to use the Ethernet, the instantiation could be removed altogether, perhaps improving performance by 20% or so. The same observation applies, to a lesser extent, to the other peripherals that are currently unused.
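To gauge how much time a group of related modules accounts for, their shares can be totalled by name prefix. The sketch below is illustrative only: the function name share_by_prefix is hypothetical, and it assumes, as an example, that all Ethernet modules share the eth_ file name prefix and that the module section ends at the first blank line.

```shell
# Sum the "% time" column for every module whose name starts with PREFIX.
# Usage: share_by_prefix PREFIX FILE
share_by_prefix() {
    LC_ALL=C awk -v prefix="$1" '
        /^Overall summary by module:/ { in_section = 1; next }  # section start
        in_section && NF == 0 { exit }                          # blank line ends it
        # index() == 1 means the module name (field 2) begins with the prefix
        in_section && $1 ~ /^[0-9]+\.[0-9]+$/ && index($2, prefix) == 1 { total += $1 }
        END { printf "%.2f\n", total }                          # prints 0.00 if nothing matched
    ' "$2"
}

# Example: share_by_prefix eth_ vprof.out
```

For the three eth_ rows visible in the sorted excerpt above (7.69, 3.41 and 3.38), the total is 14.48; smaller Ethernet modules further down the full listing would add to that figure.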
