2. User Guide

2.1. Introduction to High-Level Synthesis

High-level synthesis (HLS) refers to the synthesis of a hardware circuit from a software program specified in a high-level language, where the hardware circuit performs the same functionality as the software program. For LegUp, the input is a C-language program, and the output is a circuit specification in the Verilog hardware description language. The LegUp-generated Verilog can be input to RTL synthesis, place and route tools to produce an FPGA or ASIC implementation of the circuit. The underlying motivation for HLS is to raise the level of abstraction for hardware design, by allowing software methodologies to be used to design hardware. This eases the effort required for hardware design, improving productivity and reducing time-to-market.

While a detailed knowledge of HLS is not required for the use of LegUp, it is nevertheless worthwhile to highlight key steps involved in converting software to hardware:

  • Allocation: The allocation step defines the constraints on the generated hardware, including the number of hardware resources of a given type that may be used (e.g. how many divider units may be used, the number of RAM ports, etc.), as well as the target clock period for the hardware, and other user-supplied constraints.
  • Scheduling: Software programs are written without any notion of a clock or finite state machine (FSM). The scheduling step of HLS bridges this gap, by assigning the computations in the software to happen in specific clock cycles in the hardware. Putting this another way, the scheduling step defines the FSM for the hardware and associates the computations of the software with specific states in the FSM. This must be done in a manner that honors the data-dependencies in the software to ensure that, for example, the inputs to a computation are available/ready before the computation executes. Likewise, given a user-supplied clock period constraint, e.g. 10 ns, the schedule step must ensure the computations in a single FSM state do require time exceeding the target period.
  • Binding: While a software program may contain an arbitrary number of operations of a given type (e.g. multiplications), the hardware may contain only a limited number of units capable of performing such a computation. The binding step of HLS is to associate (bind) each computation in the software with a specific unit in the hardware.
  • Verilog generation: After the steps above, an FSM and datapath have been defined. The final step of HLS to generate a description of the circuit in a hardware description language (Verilog).

Executing computations in hardware brings speed and energy advantages over performing the same computations in software running on a processor. The underlying reason for this is that the hardware is dedicated to the computational work being performed, whereas a processor is generic and has the consequent overheads of fetching and decoding instructions, loads/stores to memory, and so on. Drastic speed advantages are possible by using hardware if one is able to exploit hardware parallelism, namely, the execution of computations concurrently that otherwise, on a processor, would need to happen sequentially. With LegUp, one can exploit four styles of hardware parallelism.

2.1.1. Instruction-level Parallelism

Instruction-level parallelism refers to the ability to execute computations concurrently based on an analysis of data dependencies. Computations that do not depend on each other can be executed at the same time. Consider the following C-code snipped which performs three addition operations.

z = a + b;
x = c + d;
q = z + x;
...

Observe that the first and second additions do not depend on one another. These additions can therefore be executed concurrently, as long as there are two adder units available in the hardware. LegUp automatically analyzes the dependencies between computations in the software to discover such opportunities and maximize concurrency in the execution of computations in hardware. In the above example, the third addition operation depends on the results of the first two, and hence, its execution cannot be overlapped with the others. Instruction-level parallelism is referred to as fine-grained parallelism, as the number of computations that are overlapped with one another in time is usually small.

2.1.2. Loop-level Parallelism

Software programs typically spend much of their time executing loops. From a software programmer’s perspective, the execution of loop iterations is sequential and non-overlapping. That is, loop iteration i executes to completion before iteration i + 1 commences. Now, imagine a loop with N iterations and whose loop body takes 3 clock cycles to complete. Under typical sequential software execution, the loop would take 3N clock cycles to execute. With LegUp, however, it is possible to overlap the loop iterations with one another using a technique called loop pipelining. The idea is to execute a portion of a loop iteration i and then commence executing iteration i + 1 even before iteration i is complete. Returning to the previous example, if loop pipelining were able to commence a new loop iteration every clock cycles, then the total number of cycles to execute the loop would be approximately N – a significant reduction relative to 3N.

As will be elaborated upon below, LegUp performs loop pipelining on the basis of user constraints. By default, it is not applied automatically.

2.1.3. Thread-level Parallelism

Today’s CPUs have multiple processor cores contained within and, with multi-threaded software, it is possible to parallelize an application to use multiple cores at once, thereby improving run-time. POSIX threads, Pthreads, is the most popular approach for multi-threaded programming in the C language, where, parallelism is realized at the granularity of entire C functions. As such, thread-level parallelism is referred to as coarse-grained parallelism since a large amount of computational work is overlapped in time. LegUp supports the synthesis of multi-threaded C programs into hardware, where concurrently executing software threads are synthesized into concurrently executing hardware units. This allows a software developer to take advantage of spatial parallelism in hardware using a programming style they are likely already familiar with. Moreover, the design and debugging of a parallel hardware implementation can happen in software, which it is considerably easier to identify and resolve bugs.

There are two challenging aspects to multi-threaded software programming: 1) synchronization among threads, and 2) interactions among threads through shared memory. In the former case, because threads execute asynchronously within an operating system, it is often desirable to ensure multiple threads have reached a common point in the program’s execution before allowing any to advance further. This is typically realized using a barrier. LegUp supports the synthesis of barriers into hardware, where the generated hardware’s execution matches the software behaviour. Regarding interactions through shared memory, it is typical that one wishes to restrict the ability of multiple threads to access a given memory location at the same time. This is done using critical sections, which are code segments wherein at most one thread can execute at a given time. Critical sections are specified in software using mutexes, and LegUp supports the synthesis of mutexes into hardware.

2.1.4. Data Flow (Streaming) Parallelism

The second style of coarse-grained parallelism is referred to as data flow parallelism. This form of parallelism arises frequently in streaming applications, which constitute the majority of video and audio processing, but also applications in machine learning and computational finance. In such applications, there is a stream of input data that is fed into the application at regular intervals. For example, in an audio processing application, a digital audio sample may be input to the circuit every clock cycle. In streaming applications, a succession of computational tasks is executed on the stream of input data, producing a stream of output data. For example, in audio processing, a first task may be to filter the input audio to remove high-frequency components. Subsequently, a second task may receive the filtered audio, and boost the bass low-frequency components. Observe that, in such a scenario, the two tasks may be overlapped with one another. Input samples are received by the first task and second task on a continuous basis.

LegUp provides a unique way for a software developer to specify data flow parallelism, namely, through the use of Pthreads and the use of a special API to interconnect the computational tasks. As will be elaborated upon below, to specify data flow hardware, one creates a thread for each computational task, and then interconnects the tasks using FIFO buffers via a LegUp-provided API. In hardware, each thread is synthesized into a hardware module that receives/produces inputs/outputs at regular intervals. The FIFO buffers between the modules hold results produced by one module, and waiting to be picked up by a downstream module.

2.2. LegUp Overview

LegUp accepts a C software program as input and automatically generates hardware described in Verilog HDL (hardware description language) that can be synthesized onto an FPGA.

LegUp has two different synthesis flows:
  • Hardware Flow: Synthesizes the whole C program (or user-specified C functions) into a hardware circuit.
  • Processor-Accelerator Hybrid Flow: Synthesize the whole C program into a processor-accelerator SoC (System-on-chip).

2.3. Hardware Flow

LegUp’s hardware flow synthesizes a C program into a hardware circuit (without a processor). One can compile the entire C program, or specific functions of the program, into hardware. To specify certain C functions to be compiled to hardware in the hardware flow, you need to specify a top-level function (see Specifying a Custom Top-level Function), then LegUp compiles the top-level function and all of its descendant functions to hardware. By default, the main function is the top-level function, hence the entire program is compiled to hardware.

To run the hardware flow click the following button:

_images/icon_hardware_flow.png

You can also run this flow in command line by running legup hw.

2.4. Processor-Accelerator Hybrid Flow

LegUp can automatically compile a C program into a complete processor-accelerator SoC comprising an ARM processor and one or more hardware accelerators. One can designate C functions to be compiled to hardware accelerators and LegUp will automatically partition the program so that the remaining program segments are executed on the ARM processor. The communication between the processor and hardware accelerators, as well as the generation of the system interconnect, are automatically handled by LegUp.

To accelerate a function, you can specify the function name in the config.tcl file as shown below:

set_accelerator_function "function_name"

Then run legup hybrid in command line, which generates the complete system. The architecture of a processor-accelerator hybrid system is described in Hardware Architecture.

Note

The hybrid flow is currently only available for Intel FPGAs in the fully licensed version of LegUp. It is not available with an evaluation license.

2.5. LLVM IR Input Flow

By default LegUp accepts C/C++ as input. However, some advanced users may wish to input LLVM intermediate representation directly into LegUp. LegUp’s LLVM IR flow synthesizes LLVM intermediate representation code into a hardware circuit. You can compile an LLVM IR into a hardware circuit by specifying an LLVM IR file (either a .ll or .bc file) using the INPUT_BITCODE variable inside the makefile of your project:

INPUT_BITCODE=input.ll

2.6. Verification and Simulation

The output of the circuit generated by LegUp should always match the output of the C program for the same set of inputs. Users should not modify the Verilog generated by LegUp, as this is overwritten when LegUp runs.

For debugging purposes, LegUp converts any C printf statements into Verilog $write statements so that values printed during software execution will also be printed during hardware simulation. This allows easy verification of the correctness of the hardware circuit. Verilog $write statements are unsynthesizable and will not affect the final FPGA hardware.

2.7. SW/HW Co-Simulation

SW/HW co-simulation can be used when the user specifies a custom top-level function (Specifying a Custom Top-level Function). This avoids having to manually write a custom RTL testbench (Specifying a Custom Test Bench).

To use SW/HW co-simulation, the input software program will be composed of two parts,

  • A top-level function (may call other sub-functions) to be synthesized to hardware by LegUp,
  • A C/C++ test bench (typically implemented in main()) that invokes the top-level function with test inputs and verifies outputs.

SW/HW co-simulation consists of the following automated steps:

  1. LegUp runs your software program and saves all the inputs passed to the top-level function.
  2. LegUp automatically creates an RTL testbench that reads in the inputs from step 1 and passes them into the LegUp-generated hardware module.
  3. Modelsim simulates the testbench and saves the LegUp-generated module outputs.
  4. LegUp runs your software program again, but uses the simulation outputs as the output of your top-level function.

We assume that if the return value of main() is 0 then all tests pass, so you should set a non-zero return value from main() when the top-level function produces incorrect outputs. In step 1, we verify that the program returns 0. In step 4, we run the program using the outputs from simulation and if the LegUp-generated circuit matches the C program then main() should still return 0.

If the C program matches the RTL simulation then you should see: C/RTL co-simulation: PASS

Note that you must pass in as arguments into the top-level function any values that are shared between software and hardware. For example, if there are arrays that are initialized in the C testbench that are used as inputs to the hardware function, they need to be pass in as arguments into the top-level function, even if the arrays are declared are global variables. In essence, by passing in an argument into the top-level function, you are creating an interface for the hardware core generated by LegUp. Arguments into the top-level function can be constants, pointers, arrays, and FIFO data types. For FIFO arguments, they use our C++ FIFO library (legup::FIFO<int>), which also supports FIFOs of structs (legup::FIFO<struct>). The top-level function can also have a return value. Please see the provided example, C++ Canny Edge Detection, as a reference.

If a top-level argument is a dynamically allocated array (with malloc), the number of elements of the array must be specified with the Set top-level argument number of elements constraint. Please see the set_top_level_argument_num_elements page for more details. Statically allocated arrays do not need to be specified.

Limitations:

  • For FIFOs of structs, there cannot be nested structs. A struct cannot contain another struct inside.

Possible reasons for SW/HW co-simulation failure:

  • Assigning a value to an arbitrary bitwidth that overflows (i.e. int2 = 40)

2.8. LegUp Constraints

The input language to LegUp is C. LegUp does not require tool-specific pragmas or special keywords. Instead, LegUp accepts user-specified constraints that guide the generated hardware. Each project specifies its constraints in the config.tcl file in the project directory. This file is automatically generated by the LegUp IDE. To modify the constraints, click the HLS Constraints button:

_images/icon_constraints.png

The following window will open:

_images/empty_constraint_setting_window.png

You can add, edit, or remove constraints from this window. Select a constraint type from the first dropdown menu. If you want more information about a constraint, click the Help button, which will open the corresponding Constraints Manual page.

An important constraint is the target clock period (shown as Set target clock period in the dropdown menu). With this constraint, LegUp schedules the operations of a program to meet the specified clock period. When this constraint is not given, LegUp uses the default clock period for each device, as shown below.

FPGA Vendor Device Default Clock Frequency (MHz) Default Clock Period (ns)
Intel Arria 10 200 5
Intel Arria V 200 5
Intel Stratix V 200 5
Intel Stratix IV 167 6
Intel Cyclone V 100 10
Intel Cyclone IV 67 15
Xilinx Virtex UltraScale 9+ 200 5
Xilinx Virtex 7 200 5
Xilinx Virtex 6 167 6
Lattice ECP5 100 10
Microsemi PolarFire 100 10
Microsemi Fusion 100 10
Achronix Speedster 200 5

Details of all LegUp constraints are given in the Constraints Manual.

2.9. Loop Pipelining

Loop pipelining is an optimization that can automatically extract loop-level parallelism to create an efficient hardware pipeline. It allows executing multiple loop iterations concurrently on the same pipelined hardware.

To use loop pipelining, a label needs to be added the loop:

my_loop_label:
for (i = 1; i < N; i++) {
    a[i] = a[i-1] + 2
}

Then add a loop pipeline constraint in the HLS Constraints window:

_images/pipeline_loop_constraint.png

An important concept in loop pipelining is the initiation interval (II), which is the cycle interval between starting successive iterations of the loop. The best performance and hardware utilization is achieved when II=1, which means that successive iterations of the loop can begin every clock cycle. A pipelined loop with an II=2 means that successive iterations of the loop can begin every two clock cycles, corresponding to half of the throughput of an II=1 loop.

By default, LegUp always attempts to create a pipeline with an II=1. However, this is not possible in some cases due to resource constraints or cross-iteration dependencies. This is described in more detail in Optimization Guide. When II=1 cannot be met, LegUp’s pipeline scheduling algorithm will try to find the smallest possible II that satisfies the constraints and dependencies.

2.9.1. A Simple Loop Pipelining Example

We will use a simple example to demonstrate loop pipelining. First import the provided example project, loop_pipelining_simple, contained within the LegUp installation directory (please refer to Using Example Projects for importing this example project into the LegUp IDE). Let’s first run the C program in software by clicking the Compile Software and Run Software buttons.

#include <stdio.h>

#define N 4
// For the purpose of this example, we mark input arrays as volatile to
// prevent optimization based on input values. In a real program, the volatile keyword
// should only be used when absolutely necessary, as it prevents many optimizations.
volatile int a[N] = {1, 2, 3, 4};
volatile int b[N] = {5, 6, 7, 8};
int c[N] = {0};

// Simple loop with an array
int main() {
    int i = 0, sum = 0;

#pragma unroll 1 // Prevents the loop below from being unrolled.
// The loop label is used for setting the pipelining constraint.
my_loop_label:
    for (i = 0; i < N; i++) {
        printf("Loop body\n");
        printf("a[%d] = %d\n", i, a[i]);
        printf("b[%d] = %d\n", i, b[i]);
        c[i] = a[i] * b[i];
        printf("c[%d] = %d\n", i, c[i]);
        sum += c[i];
    }

    if (sum == 5 + 12 + 21 + 32)
        printf("PASS\n");
    else
        printf("FAIL\n");

    return sum;
}

Note

LegUp automatically unrolls loops with small trip counts. However, for a loop to be pipelined, it cannot be unrolled. Hence, for this example, we have added “#pragma unroll 1” to prevent the loop from being unrolled.

When the program is executed, you can see in the console window that the array elements of a, b, and c are printed in order. Now click Compile Software to Hardware and Simulate Hardware. You can see that the simulation output matches with the output from software execution, and the reported cycle latency is 24. Note that we have not yet pipelined the loop and it is still executed in sequential order.

To pipeline the loop, open up the HLS Constraints window and add the Pipeline loop constraint with the value set to the loop label, “my_loop_label”. Re-run Compile Software to Hardware and you should see that LegUp reports “Pipeline Initiation Interval (II) = 1”. When you click on Simulate Hardware, you should see that the cycle latency is now reduced to 12 clock cycles, and the arrays are no longer accessed in order. For instance a[1] is printed out before c[0]. This is because the second iteration (that prints a[1]) now runs in parallel with the first iteration (that prints c[0]). The second iteration’s load operations for a[1] and b[1] are now happening at the same time as the first iteration’s multiply that computes c[0].

# Loop body
# Loop body
# a[          0] =           1
# b[          0] =           5
# Loop body
# a[          1] =           2
# b[          1] =           6
# c[          0] =           5
# Loop body
# a[          2] =           3
# b[          2] =           7
# c[          1] =          12
# a[          3] =           4
# b[          3] =           8
# c[          2] =          21
# c[          3] =          32
# PASS
# At t=              290000 clk=1 finish=1 return_val=        70
# Cycles:                   12

To get more information about the schedule of the pipelined loop body, click Launch Schedule Viewer to open up LegUp’s schedule viewer:

_images/open_scheduleviewer.png

You will first see the call graph of the program — the main function calling the printf function. Double-click the main function’s circle to see the control-flow graph of the function. As shown below, you should find a circle named BB_1 with a back edge (an arrow pointing to itself). This means that the block is a loop body, and in this case it corresponds to the pipelined loop.

_images/loop_pipelining_simple_cfg.png

Double-clicking on BB_1 will take you to the Pipeline Viewer, which illustrates the pipeline schedule:

_images/loop_pipelining_simple_pipeline_viewer.png

The top two rows show the time steps (in terms of clock cycles) and the number of pipeline stages. For instance, the loop in the example has 5 pipeline stages and each iteration takes 5 cycles to complete. Each cell in the table shows all operations that are scheduled for each time step of a loop iteration. When looking at table horizontally, it shows all operations that are to occur (over the span of 5 clock cycles) for each loop iteration. When looking at table vertically, it shows all concurrent operations for each time step. For example, the first iteration’s multiply operation (%10 = mul nsw i32 %9, %8 [1], shown in Cycle 1 column of Iteration: 0 row) is scheduled in cycle 1. Meanwhile, also in cycle 1, the second iteration’s load operations are also scheduled. In cycle 4 (circled by a bold box), the pipeline reaches steady-state. This is when pipelined hardware reaches maximum utilization and is concurrently executing 5 loop iterations (since the loop is divided into 5 time steps, the pipelined hardware can execute 5 different loop iterations at the same time).

[1]Use mouse to hover over the cell to see the complete statement of an operation.

2.10. Multi-threading with Pthreads

In an FPGA hardware system, the same module can be instantiated multiple times to exploit spatial parallelism, where all module instances execute in parallel to achieve higher throughput. LegUp allows easily inferring such parallelism with the use of POSIX Threads (Pthreads), a standard multi-threaded programming paradigm that is commonly used in software. Parallelism described in software with Pthreads is automatically compiled to parallel hardware with LegUp. Each thread in software becomes an independent module that concurrently executes in hardware.

For example, the code snippet below creates N threads running the Foo function in software. LegUp will correspondingly create N hardware instances all implementing the Foo function, and parallelize their executions. LegUp also supports mutex and barrier APIs (from Pthreads library) so that synchronization between threads can be specified using locks and barriers.

void* Foo (int* arg);

for (i = 0; i < N; i++) {
    pthread_create(&threads[i], NULL, Foo, &args[i]);
}

To see a complete multi-threading example, please refer to the example project, multi_threading_simple, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE). The example also demonstrates the use of a mutex lock to protect a critical region.

LegUp supports the most commonly used Pthread APIs, which are listed in Supported Pthread/OpenMP APIs.

Note that for a Pthread kernel, LegUp will automatically inline any of its descendant functions. The inlining cannot be overridden with the noinline_function constraint.

2.11. Loop Multi-threading with OpenMP

LegUp also supports the use of OpenMP, which allows parallelizing loops in a multi-threaded fashion (as opposed to loop pipelining which parallelizes loop iterations in a pipelined fashion). OpenMP provides a simple and a high-level approach for parallelization. With a single pragma, a user is able to parallelize loops without performing complicated code changes. For example, in the code snippet below, the loop performs a dot product of two arrays, A_array and B_array. To parallelize this loop using OpenMP, one simply puts an OpenMP pragma before the loop.

#pragma omp parallel for num_threads(2) private(i)
for (i = 0; i < SIZE; i++) {
    output[i] = A_array[i] * B_array[i];
}

The OpenMP pragma, #pragma omp parallel for, is used to parallelize a for loop. The pragma uses a number of clauses. The num_threads clause sets the number of threads to use in the parallel execution of the for loop. The private clause declares the variables in its list to be private to each thread. In the above example, two threads will execute the loop in parallel, with one thread handling the first half of the arrays, and the other handling the second half of the arrays. LegUp will synthesize the two parallel threads into two concurrently running accelerators, each working on half of the arrays. Note that the parallel pragma in OpenMP is blocking – all threads executing the parallel section need to finish before the program execution continues.

To see a complete OpenMP example, please refer to the example project, openmp_reduction, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE). In this example, you will see how one can simply use OpenMP’s reduction clause to sum up all elements in an array with parallel threads.

LegUp supports a subset of OpenMP that is used for parallelizing loops. The supported OpenMP pragmas and OpenMP functions are listed in Supported Pthread/OpenMP APIs.

2.12. Supported Pthread/OpenMP APIs

LegUp currently supports the following Pthread and OpenMP functions/pragmas:

Pthread Functions OpenMP Pragmas OpenMP Functions
pthread_create omp parallel omp_get_num_threads
pthread_join omp parallel for omp_get_thread_num
pthread_exit omp master  
pthread_mutex_lock omp critical  
pthread_mutex_unlock omp atomic  
pthread_barrier_init reduction(operation: var)  
pthread_barrier_wait    

2.13. Data Flow Parallelism with Pthreads

Data flow parallelism is another commonly used technique to improve hardware throughput, where a succession of computational tasks that process continuous streams of data can execute in parallel. The concurrent execution of computational tasks can also be accurately described in software using Pthread APIs. In addition, the continuous streams of data flowing through the tasks can be inferred using LegUp’s built-in FIFO data structure (see Streaming Library).

Let’s take a look at the example project, “Fir Filter (Loop Pipelining with Pthreads)”, provided in LegUp IDE (please refer to Quick Start Tutorial for instructions on how to create a project from provided example). In the example, the main function contains the following code snippet:

// Create input and output FIFOs
legup::FIFO<int> input_fifo;
legup::FIFO<int> output_fifo;

// Build struct of FIFOs for the FIR thread.
struct thread_data data;
data.input = &input_fifo;
data.output = &output_fifo;

// Launch pthread kernels.
pthread_t thread_var_fir, thread_var_injector, thread_var_checker;
pthread_create(&thread_var_fir, NULL, FIRFilterStreaming, (void *)&data);
pthread_create(&thread_var_injector, NULL, test_input_injector, input_fifo);
pthread_create(&thread_var_checker, NULL, test_output_checker, output_fifo);

// Join threads.
pthread_join(thread_var_injector, NULL);
pthread_join(thread_var_checker, NULL);

The corresponding hardware is illustrated in the figure below.

_images/FIR_Pthreads_schematic.pdf

The two legup::FIFO<int>s in the C++ code corresponds to the creation of the two FIFOs, where the bit-width is set according to the type shown in the constructor argument <int>. The three pthread_create calls initiate and parallelize the executions of three computational tasks, where each task is passed in a FIFO (or a pointer to a struct containing more than one FIFO pointers) as its argument.

The FIFO connections and data flow directions are implied by the uses of FIFO read() and write() APIs. For example, the test_input_injector function has a write() call writing data into the input_fifo, and the FIRFilterStreaming function uses a read() call to read data out from the input_fifo. This means that the data flows through the input_fifo from test_input_injector to FIRFilterStreaming.

The pthread_join API is called to wait for the completion of test_input_injector and test_output_checker. We do not “join” the FIRFilterStreaming thread since it contains an infinite loop (see code below) that is always active and processes incoming data from input_fifo whenever the fifo is not empty. This closely matches the always running behaviour of streaming hardware.

Now let’s take a look at the implementation of the main computational task (i.e., the FIRFilterStreaming threading function).

void *FIRFilterStreaming(void *threadArg) {
    struct thread_data *arg = (struct thread_data *)threadArg;
    legup::FIFO<int> *input_fifo = arg->input, *output_fifo = arg->output;

loop_fir:
    // This loop is pipelined and will be "always running", just like how a
    // streaming module runs whenever a new input is available.
    while (1) {
        // Read from input FIFO.
        int in = input_fifo->read();

        static int previous[TAPS] = {0}; // Need to store the last TAPS - 1 samples.
        const int coefficients[TAPS] = {0, 1, 2,  3,  4,  5,  6,  7,
                                        8, 9, 10, 11, 12, 13, 14, 15};

        int j = 0, temp = 0;

        for (j = (TAPS - 1); j >= 1; j -= 1)
            previous[j] = previous[j - 1];

        previous[0] = in;

        for (j = 0; j < TAPS; j++)
            temp += previous[TAPS - j - 1] * coefficients[j];

        int output = (previous[TAPS - 1] == 0) ? 0 : temp;

        // Write to output FIFO.
        output_fifo->write(output);
    }
    pthread_exit(NULL);
}

In the code shown in the example project, you will notice that all three threading functions contain a loop, which repeatedly reads and/or writes data from/to FIFOs to perform processing. In LegUp, this is how one can specify that functions are continuously processing data streams that are flowing through FIFOs.

2.13.1. Further Throughput Enhancement with Loop Pipelining

In this example, the throughput of the streaming circuit will be limited by how frequently the functions can start processing new data (i.e., how frequently the new loop iterations can be started). For instance, if the slowest function among the three functions can only start a new loop iteration every 4 cycles, then the throughput of the entire streaming circuit will be limited to processing one piece of data every 4 cycles. Therefore, as you may have guessed, we can further improve the circuit throughput by pipelining the loops in the three functions. If you run LegUp synthesis for the example (Compile Software to Hardware), you should see in the report file, summary.legup.rpt, that all loops can be pipelined with an initiation interval of 1. That means all functions can start a new iteration every clock cycle, and hence the entire streaming circuit can process one piece of data every clock cycle. Now run the simulation (Simulate Hardware) to confirm our expected throughput. The reported cycle latency should be just slightly more than the number of data samples to be processed (INPUTSIZE is set to 128; the extra cycles are spent on activating the parallel accelerators, flushing out the pipelines, and verifying the results).

2.14. Function Pipelining

You have just seen how an efficient streaming circuit can be described in software by using loop pipelining with Pthreads. An alternative way to describe such a streaming circuit is to use Function Pipelining. When a function is marked to be pipelined (by using the Pipeline Function constraint), LegUp will implement the function as a pipelined circuit that can start a new invocation every II cycles. That is, the circuit can execute again while its previous invocation is still executing, allowing it to continuously process incoming data in a pipelined fashion. This essentially has the same circuit behaviour as what was described in the previous example (loop pipelining with Pthreads) in the Data Flow Parallelism with Pthreads section. This feature also allows multiple functions that are added to the Pipeline function constraint to execute in parallel, achieving the same hardware behaviour as the previous loop pipelining with Pthreads example.

When using this feature, LegUp can only generate a hardware IP for the pipelined streaming circuit. That is, users will need to specify a custom top-level function by adding a Set top-level function constraint (LegUp Constraints). The top-level function has to be a function specified with the Pipeline function constraint, or a wrapper function that simply calls sub-functions that are specified with Pipeline function constraint.

The example introduced in the Quick Start Tutorial demonstrates the synthesis of a single pipelined function. You should see that the FIRFilterStreaming function is specified with Pipeline function and Set top-level function constraints. To test the generated IP, the example project comes with a custom test bench (in streaming_tb.v), which interfaces with the generated top module’s FIFO ports to inject inputs and receive/verify outputs.

When synthesizing a top-level function with multiple pipelined sub-functions, LegUp will automatically parallelize the execution of all sub-functions that are called in the top-level function, forming a streaming circuit with data flow parallelism. Consider the following code snippet from the example project, FIR_function_pipelining_wrapper, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE).

void pipeline_wrapper(legup::FIFO<bool> &done_signal_fifo) {
    legup::FIFO<int> input_fifo(/*depth*/ 2);
    legup::FIFO<int> output_fifo(/*depth*/ 2);

    test_input_injector(input_fifo);
    FIRFilterStreaming(input_fifo, output_fifo);
    test_output_checker(output_fifo, done_signal_fifo);
}

In this example, the wrapper function pipeline_wrapper is set as the top-level function. The three sub-functions, test_input_injector, FIRFilterStreaming, and test_output_checker are specified with the Pipeline function constraint in LegUp. The input_fifo and output_fifo are arguments into the three sub-functions. test_input_injector writes to the input_fifo, which is read from FIRFilterStreaming. FIRFilterStreaming writes to the output_fifo, which is read from test_output_checker. A function pipeline executes as soon as its inputs are ready. In this case FIRFilterStreaming executes as soon as there is data in the input_fifo, and test_output_checker starts running as soon as there is data in the output_fifo. In other words, a function pipeline does not wait for its previous function pipeline to completely finish running before it starts to execute, but rather, it starts running as early as possible. test_input_injector function also starts working on the next data to write to the input_fifo while the previous data is being processed. If the initiation interval (II) is 1, a function pipeline starts processing new data every clock cycle. Once the function pipelines reach steady-state, all function pipelines execute concurrently.

This example showcases the synthesis of a streaming circuit that consists of a succession of concurrently executing pipelined functions. A custom test bench is also provided in the example project. This time the test_output_checker function is also part of the generated circuit, so that the testbench does not need to verify the FIR filter’s output, but simply waits for a done signal from the done_signal_fifo and terminates the simulation.

Note

The top-level wrapper function should not have any additional logic other than calling the functions that have the Pipeline function constraint.

Note

The start input port of the generated circuit (module top) serves as an enable signal to the circuit. The circuit stops running when the start signal is de-asserted. To have the circuit running continuously, the start input port should be kept high, as you can see in the provided custom test bench.

2.15. LegUp C/C++ Library

LegUp includes a number of C/C++ libraries that allow creation of efficient hardware.

2.15.1. Streaming Library

The streaming library includes the FIFO (first-in first-out) data structure along with its associated API functions. The library can be compiled in software to run on the user’s development machine (e.g., x86). Each FIFO instance in software is implemented as a First Word Fall Through (FWFT) FIFO in hardware.

The streaming library comes in two versions, one uses the plain C language and the other uses C++ template class. When using the C version library, the data being written to or read from a FIFO must be an integer with a bit width of up to 64 bits.

In the C++ template class version, the FIFO data type can be flexibly defined and specified as a template argument of the FIFO object. For example, the FIFO data type could be defined as a struct containing multiple integers:

struct AxisWord { uint64 data; uint8 keep; uint1 last; };

legup::FIFO<AxisWord> my_axi_stream_interface_fifo;

2.15.1.1. Streaming Library - C++

The C++ version of the library provides more flexibility to define the FIFO data type. The FIFO data type will be defined as a template argument of the FIFO template class. A valid data type could be any of the C/C++ primitive integer types or a struct containing primitive integer types.

You can use the C++ streaming library by including the header file:

#include "legup/streaming.hpp"

Note

Users should use the APIs below to create and access FIFOs. Any other uses of FIFOs are not supported in LegUp.

Class Method Description
FIFO<T> () Create a new FIFO.
FIFO<T> (unsigned depth) Create a new FIFO with the specified depth.
FIFO<T> (unsigned depth, legup::FIFOType type) Create a new FIFO with the specified depth and implementation type (only for Xilinx).
void write(T data) Write data to the FIFO.
T read() Read an element from the FIFO.
bool empty() Returns 1 if the FIFO is empty.
bool full() Returns 1 if the FIFO is full.
unsigned get_usedw() Returns the number of elements in the FIFO.
void setDepth(unsigned depth) Set the FIFO’s depth.
void setType(legup::FIFOType type) Set the FIFO’s implementation type (only for Xilinx).

An example code for using the streaming library is shown below.

// declare a 32-bit wide fifo
legup::FIFO<unsigned> my_fifo;
// set the fifo's depth to 10
my_fifo.setDepth(10);
// set the fifo to be implemented in LUTRAMs (only for Xilinx)
my_fifo.setDepth(legup::LUTRAM);

// write to the fifo
my_fifo.write(data);
// read from the fifo
MyStructT data = my_fifo.read();
// check if fifo is empty
bool is_empty = my_fifo.empty();
// check if the fifo is full
bool is_full = my_fifo.full();
// get the number of words stored in the fifo
unsigned numWords = my_fifo.get_usedw();

// declare a 32-bit wide fifo with a depth of 10
legup::FIFO<unsigned> my_fifo_depth_10(10);
// declare a 32-bit wide fifo with a depth of 10
// which is to be implemented in LUTRAMs (only for Xilinx)
legup::FIFO<unsigned> my_lutram_fifo_depth_10(10, legup::LUTRAM);

As shown above, there are three ways of creating a FIFO. The width of the FIFO is determined based on the templated data type of the FIFO. For example, FIFO<unsigned> my_fifo creates a FIFO that is 32 bits wide. The FIFO’s data type can be any primitive type or arbitrary bitwidth types (ap_int/ap_uint), or a struct of primitive/arbitrary bitwidth types (or nested structs of those types) but cannot be a pointer or an array (or a struct with a pointer/array). An array of FIFOs is supported.

The depth of the FIFO can be privided by the user as a contructor argument when the FIFO is declared, or it can also be set afterwards with the setDepth(unsigned depth) function. If the depth is not provided by the user, LegUp uses a default FIFO depth of 2. The depth of a FIFO can also be set to 0, in which case LegUp will create direct ready/valid/data wire connections (without a FIFO) between the source and the sink.

FIFOs are typically implemented with block RAMs on an FPGA, where block RAMs are important resources for an FPGA design. Hence, user may want to explicitly specify which type of FPGA resource is to be used for implementing a FIFO. Specifying the implementation type of a FIFO is only currently supported when targetting a Xilinx device, which can be done by specifying the type as a constructor argument (FIFO<unsigned> my_fifo(10, legup::LUTRAM), or by using the setType function (my_fifo.setType(legup::LUTRAM)). The following FIFO implementation types are supported.

Implementation Type Description
legup::REG Implement the FIFO with registers.
legup::LUTRAM Implement the FIFO with LUTRAMs.
legup::RAM Implement the FIFO with block RAMs.

If the implementation type is not specified by the user when targetting a Xilinx device, LegUp will automatically determine the implmentation type based on the depth of the FIFO. For non-Xilinx devices, we implement a generic FIFO module in Verilog and allow the vendor synthesis tool to decide which FPGA resource to use.

2.15.1.2. Streaming Library - Blocking Behaviour

Note that the fifo read() and write() calls are blocking. Hence if a module attempts to read from a FIFO that is empty, it will be stalled. Similarly, if it attempts to write to a FIFO that is full, it will be stalled. If you want non-blocking behaviour, you can check if the FIFO is empty (with empty()) before calling read(), and likewise, check if the FIFO is full (with full()) before calling write().

With the blocking behaviour, if the depths of FIFOs are not sized properly, it can cause a deadlock. LegUp prints out messages to alert the user that a FIFO is causing stalls.

In hardware simulation, the following messages are shown.

Warning: fifo_write() has been stalled for     1000000 cycles due to FIFO being full.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.
Warning: fifo_write() has been stalled for     1000000 cycles due to FIFO being full.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.

If you continue to see these messages, you can suspect that there may be a deadlock, and we recommend increasing the depth of the FIFOs.

Note

We recommend the minimum depth of a FIFO to be 2, as a depth of 1 FIFO can cause excessive stalls.

2.15.1.3. Streaming Library - Non-Blocking Behaviour

As mentioned above, non-blocking FIFO behaviour can be created with the use of empty() and full() functions. Non-blocking FIFO read and write can be achieved as shown below.

if (!fifo_a.empty())
    unsigned data_in = fifo_a.read();

if (!fifo_b.full())
    fifo_b.write(data_out);

Note

A deadlock may occur if a fifo with a depth of 0 uses non-blocking write on its source and non-block read on its sink.

2.15.2. Arbitrary Bit-width Data Type Library

For efficient hardware generation, LegUp allows specifying data types of any bit-widths from 1 to 64. You can use LegUp’s arbitrary bit-width data type library by including the following header file:

#include "legup/types.h"

This library defines data types for unsigned integers from 1 bit (uint1) to 64 bits (uint64), as well as for signed integers from 1 bit (int1) to 64 bits (int64).

In software, standard data types (i.e., char, int, long long) are used even though their entire bit-widths are not required in many cases. Hardware implemented on an FPGA operates at the bit-level, hence a circuit can be optimized to the exact bit-width that is required. Using arbitrary bit-width data types provides information to LegUp which is used to produce hardware with the exact specified bit-widths. This can help to reduce circuit area.

An example using arbitrary bit-width data types are shown below:

#include "legup/types.h"
#include <stdio.h>
#define SIZE 8

// marked as volatile, as for a simple program such as this,
// LegUp optimizes away the arrays with constant propagation
volatile uint3 a[SIZE] = {0, 1, 2, 3, 4, 5, 6, 7};
volatile uint4 b[SIZE] = {8, 9, 10, 11, 12, 13, 14, 15};

int main() {
  volatile uint7 result = 0;
  volatile uint4 i;

  #pragma unroll 1
  for (i=0; i<SIZE; i++) {
      result += a[i] + b[i];
  }

  printf("result = %d\n", result);
}

In the example, we have reduced all variables and arrays to their minimum bit-widths. This is translated to reduced widths in hardware. When LegUp runs, it prints out the following to inform the user that it has detected the use of arbitrary bit-width data types:

Note

We have used the volatile keyword to prevent the variables from being optimized away for the purpose of this example. We do not recommend using this keyword unless absolutely necessary, as it can lead to higher runtime and area.

Info: Setting the width of memory 'a' to the user-specified custom width of
      3 bit(s)
Info: Setting the width of memory 'b' to the user-specified custom width of
      4 bit(s)
Info: Setting the width of memory 'main_0_result' to the user-specified custom
      width of 7 bit(s)
Info: Setting the width of memory 'main_0_i' to the user-specified custom
      width of 4 bit(s)

Note

Note that in an actual program (where the volatile keyword is not used), some variables may not show up in the LegUp printout. This occurs when LegUp optimizes the program so that the variables are no longer required.

2.15.3. Bit-level Operation Library

LegUp provides a library to efficiently perform bit-level operations in hardware. You can use LegUp’s bit-level operation library by including the following header file:

#include "legup/bit_level_operations.h"

This library defines a number of C functions that perform common bit-level operations. The software implementation of the library matches to that of the generated hardware. Users can simulate the low-level bit operations accurately in software prior to generating its hardware.

Note

Note that for the arguments and return values of the API functions shown below, you may use either arbitrary bit-width data types (Arbitrary Bit-width Data Type Library) or standard data types. We use arbitrary bit-width data types to show the maximum allowable bit-widths of the values of the arguments and return values. For instance, for the legup_bit_select function, the msb_index argument can be of int type, but its maximum value is 63 (since it is selecting between bit 63 and bit 0 of value v).

Note

The index and width arguments must be constant integers and must be within the bit range of the selecting or updating variable.

2.15.3.1. Selecting A Range of Bits

uint64 legup_bit_select(uint64 v, uint6 msb_index, uint6 lsb_index);

This function selects a range of bits, from the msb_index bit down to the lsb_index bit (where the bit index starts from 0), from the input variable, v, and returns it. The lower (msb_index - lsb_index + 1) bits of the return value are set to the specified range of v, and the rest of the upper bits of the return value are set to 0.

The equivalent Verilog statement will be:

return_val[63:0] = { (64 - (msb_index - lsb_index + 1)){1'b0}, // Upper bits set to 0.
                     v[msb_index : lsb_index]                  // Selected bits of v.
                   };

2.15.3.2. Updating A Range of Bits

uint64 legup_bit_update(uint64 v, uint6 msb_index, uint6 lsb_index, uint64 value);

This function updates a range of bits, from the msb_index bit down to the lsb_index bit (where the bit index starts from 0), of the input variable, v, with the given value, value, and returns the updated value.

The equivalent Verilog statement will be:

return_val[63:0] = v[63:0];
return_val[msb_index : lsb_index]       // Update selected range of bits.
    = value[msb_index - lsb_index : 0]; // The lower (msb_index - lsb_index) bits of value.

2.15.3.3. Concatenation

uint64 legup_bit_concat_2(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1);

This function returns the bit-level concatenation of the two input variables, v_0, and v_1. The lower width_0 bits of v_0 are concatenated with the lower width_1 bits of v_1, with the v_0 bits being the upper bits of the v_1 bits. The concatenated values are stored in the lower (width_0 + width_1) bits of the return value, hence if the bit-width of the return value is bigger than (width_0 + width_1), the rest of the upper bits of the return value are set to 0.

The equivalent Verilog statement will be:

return_val[63:0] = { (64 - width_0 - width_1){1'b0},  // Upper bits set to 0.
                     v_0[width_0 - 1 : 0],            // Lower width_0 bits of v_0.
                     v_1[width_1 - 1 : 0]             // Lower width_1 bits of v_1.
                   };

Similarly, the following functions concatenate three, four, five, up to eight variables respectively.

uint64 legup_bit_concat_3(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2);

uint64 legup_bit_concat_4(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3);

uint64 legup_bit_concat_5(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4);

uint64 legup_bit_concat_6(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5);

uint64 legup_bit_concat_7(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5,
                          uint64 v_6, uint6 width_6);

uint64 legup_bit_concat_8(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5,
                          uint64 v_6, uint6 width_6, uint64 v_7, uint6 width_7);

2.15.3.4. Bit Reduction Operations

The functions shown below perform bitwise reduction operations on the input variable, v. All of these functions return a one bit value.

Bit Reduction AND:
uint1 legup_bit_reduce_and(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the AND operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 0 if any bit in the specified range of v is 0, returns 1 otherwise.
The equivalent Verilog statement is: &(v[msb_index : lsb_index]).
Bit Reduction OR:
uint1 legup_bit_reduce_or(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the OR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if any bit in the specified range of v is 1, returns 0 otherwise.
The equivalent Verilog statement is: |(v[msb_index : lsb_index]).
Bit Reduction XOR:
uint1 legup_bit_reduce_xor(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the XOR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if there is an odd number of bits being 1 in the specified range of v, returns 0 otherwise.
The equivalent Verilog statement is: ^(v[msb_index : lsb_index]).
Bit Reduction NAND:
uint1 legup_bit_reduce_nand(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the NAND operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if any bit in the specified range of v is 0, returns 0 otherwise.
Equivalent to !legup_bit_reduce_and(v, msb_index, lsb_index)
The equivalent Verilog statement is: ~&(v[msb_index : lsb_index]).
Bit Reduction NOR:
uint1 legup_bit_reduce_nor(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the NOR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 0 if any bit in the specified range of v is 1, returns 1 otherwise.
Equivalent to !legup_bit_reduce_or(v, msb_index, lsb_index)
The equivalent Verilog statement is: ~|(v[msb_index : lsb_index]).
Bit Reduction XNOR:
uint1 legup_bit_reduce_xnor(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the XNOR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if there is an even number of bits being 1 in the specified range of v, returns 0 otherwise.
Equivalent to !legup_bit_reduce_xor(v, msb_index, lsb_index)
The equivalent Verilog statement is: ~^(v[msb_index : lsb_index]).

2.15.4. C++ Arbitrary Precision Data Types Library

The C++ arbitrary precision data types library provides numeric types ap_[u]int and ap_[u]fixpt which can take on arbitrary widths, and be efficiently translated to hardware. It also provides bit selection and concatenation utilities for bit-level access to data.

2.15.5. C++ Arbitrary Precision Integer Library

The C++ ap_[u]int type allows specifying signed and unsigned data types of any bitwidth. They can be used for arithmetic, concatenation, and bit level operations. You can use the ap_[u]int type by including the following header file.

#include "legup/ap_int.hpp"

The desired width of the ap_[u]int can be specified as a template parameter, ap_[u]int<W>, allowing for wider types than the existing C arbitrary bit-width library.

An example using the C++ library is shown below.

#include "legup/ap_int.hpp"
#include <iostream>

using namespace legup;

int main() {
    ap_uint<128> data("0123456789ABCDEF0123456789ABCDEF");
    ap_int<4> res(0);

    for (ap_uint<8> i = 0; i < data.length(); i += 4) {
        // If this four bit range of data is <= 7
        if (data(i + 3, i) <= 7) {
            res -= 1;
        } else {
            res += 1;
        }
    }
    // iostream doesn't synthesize to hardware, so only include this
    // line in software compilation
    #ifdef LEGUP_SW
    std::cout << res << std::endl;
    #endif
}

In the above code we iterate through a 128 bit unsigned integer in four bit segments, and track the difference between how many segments are above and below 7. All variables have been reduced to their minimum widths.

2.15.5.1. C++ Arbitrary Precision Integer Arithmetic

The C++ arbitrary precision integer library supports all standard arithmetic (except for division and modulo), logical bitwise, shifts, and comparison operations. Note for shifting that >> and << are logical, and the .ashr(x) function implements arithmetic right shift. The output types of an operation are wider than their operands as necessary to hold the result. Operands of ap_int, and ap_uint type, as well as operands of different widths can be mixed freely. By default ap_int will be sign extended to the appropriate width for an operation, while ap_uint will be zero extended. When mixing ap_int and ap_uint in an arithmetic operation the resulting type will always be ap_int. Some of this behaviour is demonstrated in the example below.

#include "legup/ap_int.hpp"

using namespace legup;

...

ap_int<8> a = 7;
ap_int<12> b = 100;
ap_uint<7> c = 3;

// Multiply expands to the sum of a and b's width
ap_int<20> d = a * b;

// Add result in max of widths + 1
ap_int<13> e = a + b;

// Logical bitwise ops result in max of widths
ap_int<12> f = a & b;

// Mixing ap_int and ap_uint results in ap_int
ap_int<9> g = a + c;

// ap_(u)int types can be mixed freely with integral types
ap_int<33> h = -1 - a;

2.15.6. C++ Arbitrary Precision Bit-level Operations

The arbitrary precision library provides utilities to select, and update ranges of arbitrary precision data, as well as perform concatenation.

Bit selection and updating is defined for all C++ arbitrary precision numeric types. Concatenation is defined on all C++ Arbitrary Precision Library constructs including arbitrary precision numeric types, as well as bit selections, and other concatenations.

2.15.6.1. Selecting and Assigning to a Range of Bits

#include "legup/ap_int.hpp"

using namespace legup;

...

ap_uint<8> A(0xBC);
ap_int<4> B = A(7, 4); // B initialized as 0xB
ap_int<4> C = A[2];    // C initialized as 0x1
                       // A[2] is zero extended to match widths
A(3,0) = 0xA;          // A becomes 0xBA

On C++ arbitrary precision types num(a, b) will select and create a reference to the underlying arbitrary precision value. The operator num[a] selects and creates a reference to a single bit. This reference can be assigned to, and used to access the underlying data.

2.15.6.2. Bit Concatenation

#include "legup/ap_int.hpp"

using namespace legup;

...

ap_uint<4>  A(0xA);
ap_uint<8>  B(0xCB);
ap_uint<8>  AB( (A, B(3,0)) );                   // AB initialized as 0xAB
ap_uint<12> ABC( (A, ap_uint<4>(0xB), B(7,4)) ); // ABC initialized as 0xABC

Putting any C++ arbitrary precision types in a comma separated list will generate a concatenation. The concatenation can currently be cast to, and used to initialize ap_int and ap_uint types, but can not be assigned to.

2.15.7. C++ Arbitrary Precision Fixed Point Library

The C++ ap_[u]fixpt types allow specifying signed and unsigned fixed point numbers of arbitrary width. Currently the decimal point is restricted to be within the range represented by the ap_[u]fixpt. They can be used for arithmetic, concatenation, and bit level operations. You can use the ap_[u]fixpt type by including the following header file.

#include "legup/ap_fixpt.hpp"

The ap_[u]fixpt template allows specifying the width of the type, the number of bits used to represent the integer part, as well as several quantization and overflow modes.

Quantization and Overflow handling is triggered during assignment and construction. The policies used for Quantization and Overflow are based on the Quantization and Overflow modes of the left hand side of an assignment, or the value being constructed.

The template ap_[u]fixpt<W, I_W, Q_M, O_M> is described in the following table. The last two template parameters are optional.

Parameter Description
W The width in bits.
I_W

The number of bits used to represent the integer portion

i.e. the number of bits before the decimal.

Q_M

The Quantization (rounding) mode used when a result has precision below the least significant bit.

Defaults to AP_TRN.

AP_TRN Truncate bits below the LSB bringing the result closer to -∞.
AP_TRN_ZERO Truncate bits below the LSB bringing the result closer to zero.
AP_RND Round to the nearest representable value with the midpoint going towards +∞.
AP_RND_INF Round to the nearest representable value with the midpoint going towards -∞ for negative numbers, and +∞ for positive numbers.
AP_RND_MIN_INF Round to the nearest representable value with the midpoint going towards -∞.
AP_RND_CONV Round to the nearest representable value with the midpoint going towards the nearest even multiple of the quantum. (This helps to remove bias in rounding).
O_M

The Overflow mode used when a result exceeds the maximum or minimum representable value.

Defaults to AP_WRAP.

AP_WRAP Wraparound between the minimum and maximum representable values in the range.
AP_SAT On positive and negative overflow saturate the result to the maximum or minimum value in the range respectively.
AP_SAT_ZERO On any overflow set the result to zero.
AP_SAT_SYM

On positive and negative overflow saturate the result to the maximum or minimum value in the range symmetrically about zero.

For ap_ufixpt this is the same as AP_SAT.

An ap_[u]fixpt is a W bit wide integer, in 2’s complement for the signed case, which has some fixed position relative to the decimal. This means that arithmetic is efficiently implemented as integer operations with some shifting to line up decimals. Generally a fixed point number can be though of as a signed or unsigned integer word multiplied by 2^(I_W - W). The range of values that an ap_[u]fixpt can take on, as well as the quantum that separates those values is determined by the W, and I_W template parameters. The AP_SAT_SYM overflow mode forces the range to be symmetrical about zero for signed fixed point types. This information is described in the following table. Q here represents the quantum.

Type Quantum Range AP_SAT_SYM Range
ap_ufixpt 2^(I_W - W)

0

to

2^(I_W) - Q

0

to

2^(I_W) - Q

ap_fixpt 2^(I_W - W)

-2^(I_W - 1)

to

2^(I_W - 1) - Q

-2^(I_W - 1) + Q

to

2^(I_W - 1) - Q

An example using ap_fixpt is show below.

#include "legup/ap_fixpt.hpp"
#include "legup/streaming.hpp"

#define TAPS 8

// A signed fixed point type with 10 integer bits and 6 fractional bits
// It employs convergent rounding for quantization, and saturation for overflow.
typedef legup::ap_fixpt<16, 10, legup::AP_RND_CONV, legup::AP_SAT> fixpt_t;

// A signed fixed point type with 3 integer bits and 1 fractional bit
// It uses the default truncation, and wrapping modes.
typedef legup::ap_fixpt<4, 3> fixpt_s_t;

// This function is marked function_pipeline in the config
void fir(legup::FIFO<fixpt_t> &input_fifo,
         legup::FIFO<fixpt_t> &output_fifo) {
    fixpt_t in = input_fifo.read();

    static fixpt_t previous[TAPS] = {0};
    const fixpt_s_t coefficients[TAPS] = {-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2};

    for (unsigned i = (TAPS - 1); i > 0; --i) {
        previous[i] = previous[i - 1];
    }

    previous[0] = in;

    fixpt_t accumulate[TAPS];
    for (unsigned i = 0; i < TAPS; ++i) {
        accumulate[i] = previous[i] * coefficients[i];
    }

    // Accumulate results, doing adds and saturation in
    // a binary tree to reduce the number of serial saturation
    // checks. This significantly improves pipelining results
    // over serially adding results together when saturation
    // is required.
    for (unsigned i = TAPS >> 1; i > 0; i >>= 1) {
        for (unsigned j = 0; j < i; ++j) {
            accumulate[j] += accumulate[j + i];
        }
    }

    output_fifo.write(accumulate[0]);
}

This example implements a streaming FIR filter with 8 taps. Using the minimum width ap_fixpt to represent the constant coefficients allows the multiply to happen at a smaller width than if they were the same (wider) type as the inputs. This example ensures that no overflows occur by always assigning to an ap_fixpt that uses the AP_SAT overflow mode. This does incur a performance penalty, but this is minimized here by accumulating the results in a binary fashion, such that there are only log(TAPS) = 3 saturating operations that depend on each other. If the results were accumulated in a single variable in one loop then there would be TAPS = 8 saturating operations depending on each other. Having more saturating operations in a row is slower because at each step overflow needs to be checked before the next operation can occur.

2.15.7.1. Working With ap_[u]fixpt Types

The Arbitrary Precision Fixed Point library provides fast bit accurate software simulation, and efficient equivalent hardware generation.

The ap_[u]fixpt types can be constructed and assigned from other fixed points, the ap_[u]int types, C++ integer and floating point types, as well as concatenations and bit selections. They can also be initialized from a hexadecimal string describing the exact bits. Note that construction and assignment will always trigger the quantization and overflow handling of the left hand side, except when copying from the exact same type, or initializing from a hexadecimal string. They can also be freely mixed with numeric types for arithmetic, logical bitwise operations, and comparisons with some caveats for floating point types.

Note

For arithmetic and logical bitwise operations floating point types must be explicitly cast to an ap_[u]fixpt type before being used, because of the wide range of possible values the floating point type could represent. It is also a good idea, but not required, to use ap_[u]int types in place of C++ integers when less width is required.

For convenience floating point types can be used directly in fixed point comparisons, however floating points are truncated and wrapped as if they were assigned to a signed ap_fixpt just big enough to hold all values of the ap_[u]fixpt type being compared against, with the AP_TRN and AP_WRAP modes on.

There are also some utilities for printing ap_[u]fixpt types in software demonstrated below. The to_fixpt_string function takes an optional base argument which is one of 2 or 16, and defaults to 16. The to_double function can be useful for printing, but it can lose precision over a wide fixed point, and is very expensive when used in hardware.

#include "legup/ap_fixpt.hpp"
#include <stdio.h>
#include <iostream>

using namespace legup;
using namespace std;

...

ap_ufixpt<8, 4> fixed = 12.75;
ap_fixpt<8, 4> s_fixed("CC");

// prints: CC * 2^-4
// Read signed CC * 2^-4 = -52 * 0.0625
// = -3.25
cout << s_fixed << endl;

// prints: 11001100 * 2^-4
// Read unsigned 11001100 * 2^-4 = 204 * 0.0625
// = 12.75
printf("%s\n", fixed.to_fixpt_string(2).c_str());

// prints: -3.25
printf("%.2f\n", s_fixed.to_double());

2.15.7.2. Arithmetic With ap_[u]fixpt Types

The Arbitrary Precision Fixed Point library supports all standard arithmetic (except for division and modulo), logical bitwise, shifts, and comparison operations. During arithmetic intermediate results are kept in a wide enough type to hold all of the possible resulting values. Operands are shifted to line up decimal points, and sign or zero extended to match widths before an operation is performed. For fixed point arithmetic, whenever the result of a calculation can be negative the intermediate type is an ap_fixpt instead of ap_ufixpt regardless of whether any of the operands were ap_fixpt. Overflow and quantization handling only happen when the result is assigned to a fixed point type.

Note

Overflow and quantization handling is not performed for any shifting operations (<<, >>, .ashr(x), <<=, >>=) on ap_[u]fixpt types, and shifts do not change the width or type of the fixed point they are applied to. This means that bits can be shifted out of range.

An example demonstrating some of this behaviour is show below.

#include "legup/ap_fixpt.hpp"

using namespace legup;

...

ap_ufixpt<65, 14> a = 32.5714285713620483875274658203125;
ap_ufixpt<15, 15> b = 7;
ap_fixpt<8, 4> c = -3.125;

// the resulting type is wide enough to hold all
// 51 fractional bits of a, and 15 integer bits of bits
// the width and integer widths is increased by 1 to hold
// all possible results of the addition
ap_ufixpt<67, 16> d = a + b; // 39.5714285713620483875274658203125

// the resulting type is a signed fixed point
// with widths and integer widths that are the sum
// of the two operands
ap_fixpt<23, 19> e = b * d; // -21.875

// Assignment triggers the AP_TRN_ZERO quantization mode
ap_fixpt<8, 7, AP_TRN_ZERO> f = e; // -21.5

// Mask out bits above the decimal
f &= 0xFF; // -22

// Assignment triggers the AP_SAT overflow mode,
// and saturates the negative result to 0
ap_ufixpt<8, 4, AP_TRN, AP_SAT> g = b * d; // 0

2.15.7.3. Explicit Conversions of ap_[u]fixpt

There are several functions to explicitly convert ap_[u]fixpt types into other types, besides value based assignments. The to_logical_uint function produces a uint of the same width as the ap_[u]fixpt with the same raw data, and to_double returns a double representing the value of the ap_[u]fixpt. Note that for wide enough ap_[u]fixpt to_double can lose precision, and can be inefficient in hardware. These are demonstrated in the following code snippet.

#include "legup/ap_fixpt.hpp"

using namespace legup;

...

ap_fixpt<12, 5> fixed("898");

ap_uint<12> logical_fixed = fixed.to_logical_uint();
logical_fixed == 0x898; // true

double double_fixed = fixed.to_double();
double_fixed == -14.8125; // true

2.16. AXI Slave Interface (Beta)

LegUp allows generating an AXI slave interface for the top-level function in hardware. The slave interface is implemented by using FIFOs in the streaming library. To specify a slave interface, you will need to

  1. define a struct for the slave memory in a header file.
  2. instantiate a global variable of the struct above.
  3. specify a top-level function in the Tcl configuration file.
  4. add set_argument_interface_type <global_varirable_name> axi_s in the Tcl configuration file.

The AXI slave interface support is a beta feature with several limitations:

  • Only one AXI slave interface can be generated for each LegUp project.
  • The interface only supports the AXI-lite protocol with additional support for bursting.
  • The interface always uses 32-bit address and 64-bit data width.
  • SW/HW Co-Simulation is not supported for LegUp project with AXI slave interface specified.

We will improve and stablize this feature in the later releases. Please contact support@legupcomputing.com if you require more details about this feature.

2.17. AXI Master Interface (Beta)

Similar to the AXI Slave Interface, LegUp allows generating AXI master interfaces, which are implemented as groups of FIFOs. You can use the AXI master interface by including the header file:

#include "legup/axi_interface.hpp"

To add a AXI master interface, you will need to

  1. create an instance of the AxiInterface class and specify the address width, data width and wstrb width through template parameters.
  2. pass the created instance by reference to the top-level function.
  3. use the utility functions (APIs) defined in the header to control the AXI master interface.

Below are the list of API functions to access the AXI master interface.

template <class T_ADDR, class T_DATA, class T_WSTRB>
void axi_m_read_req(AxiInterface<T_ADDR, T_DATA, T_WSTRB> &m, T_ADDR byte_addr, uint9 burst_len);

template <class T_ADDR, class T_DATA, class T_WSTRB>
T_DATA axi_m_read_data(AxiInterface<T_ADDR, T_DATA, T_WSTRB> &m);

template <class T_ADDR, class T_DATA, class T_WSTRB>
void axi_m_write_req(AxiInterface<T_ADDR, T_DATA, T_WSTRB> &m, T_ADDR byte_addr, uint9 burst_len);

template <class T_ADDR, class T_DATA, class T_WSTRB>
void axi_m_write_data(AxiInterface<T_ADDR, T_DATA, T_WSTRB> &m, T_DATA val, T_WSTRB strb, bool last);

template <class T_ADDR, class T_DATA, class T_WSTRB>
uint2 axi_m_write_resp(AxiInterface<T_ADDR, T_DATA, T_WSTRB> &m);

Please note that the read and write operations are completely separated due to the nature of the AXI interface, so they can be executed in parallel in the same cycle.

Same as the AXI slave interface, this interface only supports the AXI-lite protocol with additional support for bursting. And SW/HW Co-Simulation is not supported for this feature at the moment.

We will improve and stablize this feature in the later releases. Please contact support@legupcomputing.com if you require more details about this feature.

2.18. Software Profiler

The LegUp IDE comes integrated with a profiler (gprof) that allows you to profile your software program to determine the runtime hot spots. This information can be used to decide which program segment should be accelerated in hardware.

To use the profiler, one simply has to click the Profile Software button on an opened LegUp project:

_images/profiler_button.png

Alternatively, you could also click LegUp -> Profile Software on the top menu. This executes the software program, and when it finishes running, a gprof tab should open up at the bottom window (where the Console window is).

_images/profiler.png

The gprof tab shows a hierarchy of the percentage of runtime spent on executing the program. In the figure shown above, you can see that most of the time is spent on executing the FIRFilterStreaming function. It also shows which parts of the function contribute to that percentage of runtime. Clicking on one of the lines (e.g., FIRFilterStreaming (fir.c:42) which contributes the highest percentage of runtime) takes you to that line of code in the IDE.

Using the software profiler is a useful way to determine which parts of your program is taking up the most runtime. You can use this information to decide whether you want to accelerate that portion of code with LegUp, or re-write the code in software to be more efficient.

Note

Note that grof profiles a program based on samples that it takes, and its minimum sampling period is 10 ms. Hence if your program’s runtime is very short, it will not give any meaningful result. If this is the case, try to increase the runtime by providing more inputs to the program.

Note

The figure shown above for the profiled result is from LegUp running on Linux. On Windows, it shows a text-based result.

2.19. Using Pre-existing Hardware Modules

It is possible to connect existing hardware modules to the hardware generated by LegUp. This can be useful when there are optimized IP blocks that one wants to use as part of a larger circuit. This is called the custom Verilog flow in LegUp.

To use this, a wrapper C function needs to be created, where its prototype matches the module declaration of the custom Verilog module. The name of the C function needs to match the name of the Verilog module, and the C function must have the same number of arguments as the number of input ports of the Verilog module. The bit-widths of the function arguments also need to match the bit-widths of the input ports. In addition, the bit-width of the return type in C must match the width of the output port corresponding to the return value.

To use the custom Verilog flow, one can specify the following in the config.tcl file:

set_custom_verilog_function "function_name" noMemory
set_custom_verilog_file "verilog_file_name"

Note

Currently, a custom Ver module cannot access any memories. All arguments need to be passed in by value. It cannot be in a pipelined section (within a pipelined loop/function). In addition, a custom Ver module cannot invoke other modules (functions or other custom Ver modules).

There can be cases where a custom Ver module needs to access I/Os. However, accessing I/Os cannot be easily described in C. This can be done in LegUp by simply specifying the I/Os in the config.tcl file as the following:

set_custom_verilog_function "function_name" noMemory \
  input/output high_bit:low_bit IO_signal_name
set_custom_verilog_file "verilog_file_name"

This connects the specified I/O signals of the custom Verilog module directly to the top-level module generated by LegUp. For example, if one specifies the following:

set_custom_verilog_function "assignSwitchesToLEDs" noMemory \
  output 5:0 LEDR \
  input 5:0 SW \
  output 5:0 KEY
set_custom_verilog_file "assignSwitchesToLEDs.v"

This indicates to LegUp that it needs to connect LEDR, SW, and KEY ports directly to the ports of the top-level module generated by LegUp.

2.19.1. Custom Verilog Module Interface

A custom Verilog module needs to conform to the interface expected by LegUp in order for it to be connected properly. The following interface is required by LegUp for the custom Verilog module:

  • input clk: Clock signal for the custom Verilog module
  • input reset: Reset signal for the custom Verilog module
  • input start: Signal set to start the custom Verilog module (set high for 1 clock cycle)
  • output finish: Set by the custom Verilog module to indicate the to rest of the LegUp-generated hardware that it has finished running.
  • return_val: Only necessary if the wrapper C function has a return value. Its width must equal to the width of the return type in the C function.
  • ``arg_<argument_name> : Only necessary if the wrapper C function has an argument. The <argument_name> must the same as the name of the argument in

the C wrapper function prototype, and its width must be equal to the bit-width of the argument type.

In addition, a custom Verilog module may have ports that directly to I/Os, as described above. However, as these ports are not specified in C, they do not have to follow any C conventions. LegUp will create ports at the top-level module with the user-specified names and widths, and connect them directly to the custom Verilog module.

2.20. Specifying a Custom Top-level Function

In the hardware flow, if you only want to compile certain C functions to hardware, rather than the entire program, you can specify a custom top-level function. This allows LegUp to compile only the specified top-level function and all of its descendant functions to hardware. If there are multiple functions to be specified, you can create a wrapper function in C which calls all of the desired functions. The custom top-level function can be specified via the HLS Constraints window:

_images/set_custom_top_level_module.png

It can also be specified directly in the config.tcl file as the following:

set_custom_top_level_module "topLevelModuleName"

This constraint is also described in set_custom_top_level_module.

2.21. Specifying a Custom Test Bench

LegUp allows one to use a custom test bench to simulate the hardware generated by LegUp. When a custom top-level function is specified by the user, there are two options for simulation:

A custom test bench can be specified to LegUp via the HLS Constraints window:

_images/set_custom_test_bench.png

One must specify both the name of custom Verilog module as well as the name of the custom test bench file. It can also be specified directly in the config.tcl file as the following:

set_custom_test_bench_module "testBenchModuleName"
set_custom_test_bench_file "testBenchFileName.v"

This constraint is also described in set_custom_test_bench_module and set_custom_test_bench_file.

2.22. LegUp Command Line Interface

LegUp can run in command line with the following command:

legup [-h] <cmd>
Where <cmd> can be one of:
  • hw (default): Run the hardware-only flow.
  • sw : Compile and run the program in software on host machine.
  • sim: Simulate the LegUp-generated circuit in Modelsim (only for hardware flow).
  • cosim: Verify the LegUp-generated circuit using C test bench (only for hardware flow).
  • fpga: Fully synthesize design for target FPGA.
  • scheduleviewer: Show the scheduler viewer.
  • clean: Delete files generated by LegUp.
Commands for the hybrid flow:
  • hybrid: Run the hybrid flow.
  • hybrid_compile: Synthesize the generated SoC to FPGA bitstream with Intel Quartus (legup hybrid must have been run previously)
  • legup program_board: Program the FPGA with the generated bitstream (.sof) file (legup hybrid_compile must have been run previously)
  • legup run_on_board: Download and run the program on FPGA (legup program_board must have been run previously)

Intel (Altera) does not provide a simulation model for the ARM processor, hence it is not possible to simulate the ARM hybrid system.