2. User Guide

2.1. Introduction to High-Level Synthesis

High-level synthesis (HLS) refers to the synthesis of a hardware circuit from a software program specified in a high-level language, where the hardware circuit performs the same functionality as the software program. For LegUp, the input is a C-language program, and the output is a circuit specification in the Verilog hardware description language. The LegUp-generated Verilog can be input to RTL synthesis, place and route tools to produce an FPGA or ASIC implementation of the circuit. The underlying motivation for HLS is to raise the level of abstraction for hardware design, by allowing software methodologies to be used to design hardware. This eases the effort required for hardware design, improving productivity and reducing time-to-market.

While a detailed knowledge of HLS is not required for the use of LegUp, it is nevertheless worthwhile to highlight key steps involved in converting software to hardware:

  • Allocation: The allocation step defines the constraints on the generated hardware, including the number of hardware resources of a given type that may be used (e.g. how many divider units may be used, the number of RAM ports, etc.), as well as the target clock period for the hardware, and other user-supplied constraints.
  • Scheduling: Software programs are written without any notion of a clock or finite state machine (FSM). The scheduling step of HLS bridges this gap, by assigning the computations in the software to happen in specific clock cycles in the hardware. Putting this another way, the scheduling step defines the FSM for the hardware and associates the computations of the software with specific states in the FSM. This must be done in a manner that honors the data-dependencies in the software to ensure that, for example, the inputs to a computation are available/ready before the computation executes. Likewise, given a user-supplied clock period constraint, e.g. 10 ns, the schedule step must ensure the computations in a single FSM state do require time exceeding the target period.
  • Binding: While a software program may contain an arbitrary number of operations of a given type (e.g. multiplications), the hardware may contain only a limited number of units capable of performing such a computation. The binding step of HLS is to associate (bind) each computation in the software with a specific unit in the hardware.
  • Verilog generation: After the steps above, an FSM and datapath have been defined. The final step of HLS to generate a description of the circuit in a hardware description language (Verilog).

Executing computations in hardware brings speed and energy advantages over performing the same computations in software running on a processor. The underlying reason for this is that the hardware is dedicated to the computational work being performed, whereas a processor is generic and has the consequent overheads of fetching and decoding instructions, loads/stores to memory, and so on. Drastic speed advantages are possible by using hardware if one is able to exploit hardware parallelism, namely, the execution of computations concurrently that otherwise, on a processor, would need to happen sequentially. With LegUp, one can exploit four styles of hardware parallelism.

2.1.1. Instruction-level Parallelism

Instruction-level parallelism refers to the ability to execute computations concurrently based on an analysis of data dependencies. Computations that do not depend on each other can be executed at the same time. Consider the following C-code snipped which performs three addition operations.

z = a + b;
x = c + d;
q = z + x;
...

Observe that the first and second additions do not depend on one another. These additions can therefore be executed concurrently, as long as there are two adder units available in the hardware. LegUp automatically analyzes the dependencies between computations in the software to discover such opportunities and maximize concurrency in the execution of computations in hardware. In the above example, the third addition operation depends on the results of the first two, and hence, its execution cannot be overlapped with the others. Instruction-level parallelism is referred to as fine-grained parallelism, as the number of computations that are overlapped with one another in time is usually small.

2.1.2. Loop-level Parallelism

Software programs typically spend much of their time executing loops. From a software programmer’s perspective, the execution of loop iterations is sequential and non-overlapping. That is, loop iteration i executes to completion before iteration i + 1 commences. Now, imagine a loop with N iterations and whose loop body takes 3 clock cycles to complete. Under typical sequential software execution, the loop would take 3N clock cycles to execute. With LegUp, however, it is possible to overlap the loop iterations with one another using a technique called loop pipelining. The idea is to execute a portion of a loop iteration i and then commence executing iteration i + 1 even before iteration i is complete. Returning to the previous example, if loop pipelining were able to commence a new loop iteration every clock cycles, then the total number of cycles to execute the loop would be approximately N – a significant reduction relative to 3N.

As will be elaborated upon below, LegUp performs loop pipelining on the basis of user constraints. By default, it is not applied automatically.

2.1.3. Thread-level Parallelism

Today’s CPUs have multiple processor cores contained within and, with multi-threaded software, it is possible to parallelize an application to use multiple cores at once, thereby improving run-time. POSIX threads, Pthreads, is the most popular approach for multi-threaded programming in the C language, where, parallelism is realized at the granularity of entire C functions. As such, thread-level parallelism is referred to as coarse-grained parallelism since a large amount of computational work is overlapped in time. LegUp supports the synthesis of multi-threaded C programs into hardware, where concurrently executing software threads are synthesized into concurrently executing hardware units. This allows a software developer to take advantage of spatial parallelism in hardware using a programming style they are likely already familiar with. Moreover, the design and debugging of a parallel hardware implementation can happen in software, which it is considerably easier to identify and resolve bugs.

There are two challenging aspects to multi-threaded software programming: 1) synchronization among threads, and 2) interactions among threads through shared memory. In the former case, because threads execute asynchronously within an operating system, it is often desirable to ensure multiple threads have reached a common point in the program’s execution before allowing any to advance further. This is typically realized using a barrier. LegUp supports the synthesis of barriers into hardware, where the generated hardware’s execution matches the software behaviour. Regarding interactions through shared memory, it is typical that one wishes to restrict the ability of multiple threads to access a given memory location at the same time. This is done using critical sections, which are code segments wherein at most one thread can execute at a given time. Critical sections are specified in software using mutexes, and LegUp supports the synthesis of mutexes into hardware.

2.1.4. Data Flow (Streaming) Parallelism

The second style of coarse-grained parallelism is referred to as data flow parallelism. This form of parallelism arises frequently in streaming applications, which constitute the majority of video and audio processing, but also applications in machine learning and computational finance. In such applications, there is a stream of input data that is fed into the application at regular intervals. For example, in an audio processing application, a digital audio sample may be input to the circuit every clock cycle. In streaming applications, a succession of computational tasks is executed on the stream of input data, producing a stream of output data. For example, in audio processing, a first task may be to filter the input audio to remove high-frequency components. Subsequently, a second task may receive the filtered audio, and boost the bass low-frequency components. Observe that, in such a scenario, the two tasks may be overlapped with one another. Input samples are received by the first task and second task on a continuous basis.

LegUp provides a unique way for a software developer to specify data flow parallelism, namely, through the use of Pthreads and the use of a special API to interconnect the computational tasks. As will be elaborated upon below, to specify data flow hardware, one creates a thread for each computational task, and then interconnects the tasks using FIFO buffers via a LegUp-provided API. In hardware, each thread is synthesized into a hardware module that receives/produces inputs/outputs at regular intervals. The FIFO buffers between the modules hold results produced by one module, and waiting to be picked up by a downstream module.

2.2. LegUp Overview

LegUp accepts a C software program as input and automatically generates hardware described in Verilog HDL (hardware description language) that can be synthesized onto an FPGA.

LegUp has two different synthesis flows:
  • Hardware Flow: Synthesizes the whole C program (or user-specified C functions) into a hardware circuit.
  • Processor-Accelerator Hybrid Flow: Synthesize the whole C program into a processor-accelerator SoC (System-on-chip).

2.3. Hardware Flow

LegUp’s hardware flow synthesizes a C program into a hardware circuit (without a processor). One can compile the entire C program, or specific functions of the program, into hardware. To specify certain C functions to be compiled to hardware in the hardware flow, you need to specify a top-level function (see Specifying a Custom Top-level Function), then LegUp compiles the top-level function and all of its descendant functions to hardware. By default, the main function is the top-level function, hence the entire program is compiled to hardware.

To run the hardware flow click the following button:

_images/icon_hardware_flow.png

You can also run this flow in command line by running legup hw.

2.4. Processor-Accelerator Hybrid Flow

LegUp can automatically compile a C program into a complete processor-accelerator SoC comprising an ARM processor and one or more hardware accelerators. One can designate C functions to be compiled to hardware accelerators and LegUp will automatically partition the program so that the remaining program segments are executed on the ARM processor. The communication between the processor and hardware accelerators, as well as the generation of the system interconnect, are automatically handled by LegUp.

To accelerate a function, you can specify the function name in the config.tcl file as shown below:

set_accelerator_function "function_name"

Then run legup hybrid in command line, which generates the complete system. The architecture of a processor-accelerator hybrid system is described in Hardware Architecture.

Note

The hybrid flow is currently only available for Intel FPGAs in the fully licensed version of LegUp. It is not available with an evaluation license.

2.5. Verification and Simulation

The output of the circuit generated by LegUp should always match the output of the C program for the same set of inputs. Users should not modify the Verilog generated by LegUp, as this is overwritten when LegUp runs.

For debugging purposes, LegUp converts any C printf statements into Verilog $write statements so that values printed during software execution will also be printed during hardware simulation. This allows easy verification of the correctness of the hardware circuit. Verilog $write statements are unsynthesizable and will not affect the final FPGA hardware.

2.6. LegUp Constraints

The input language to LegUp is C. LegUp does not require tool-specific pragmas or special keywords. Instead, LegUp accepts user-specified constraints that guide the generated hardware. Each project specifies its constraints in the config.tcl file in the project directory. This file is automatically generated by the LegUp IDE. To modify the constraints, click the HLS Constraints button:

_images/icon_constraints.png

The following window will open:

_images/empty_constraint_setting_window.png

You can add, edit, or remove constraints from this window. Select a constraint type from the first dropdown menu. If you want more information about a constraint, click the Help button, which will open the corresponding Constraints Manual page.

An important constraint is the target clock period (shown as Set target clock period in the dropdown menu). With this constraint, LegUp schedules the operations of a program to meet the specified clock period. When this constraint is not given, LegUp uses the default clock period for each device, as shown below.

FPGA Vendor Device Default Clock Frequency (MHz) Default Clock Period (ns)
Intel Arria V 200 5
Intel Stratix V 200 5
Intel Stratix IV 167 6
Intel Cyclone V 100 10
Intel Cyclone IV 67 15
Xilinx Virtex 6 167 6
Xilinx Virtex 7 200 5
Lattice ECP5 100 10
Microsemi Fusion 100 10
Achronix Speedster 200 5

Details of all LegUp constraints are given in the Constraints Manual.

2.7. Loop Pipelining

Loop pipelining is an optimization that can automatically extract loop-level parallelism to create an efficient hardware pipeline. It allows executing multiple loop iterations concurrently on the same pipelined hardware.

To use loop pipelining, a label needs to be added the loop:

my_loop_label:
for (i = 1; i < N; i++) {
    a[i] = a[i-1] + 2
}

Then add a loop pipeline constraint in the HLS Constraints window:

_images/pipeline_loop_constraint.png

An important concept in loop pipelining is the initiation interval (II), which is the cycle interval between starting successive iterations of the loop. The best performance and hardware utilization is achieved when II=1, which means that successive iterations of the loop can begin every clock cycle. A pipelined loop with an II=2 means that successive iterations of the loop can begin every two clock cycles, corresponding to half of the throughput of an II=1 loop.

By default, LegUp always attempts to create a pipeline with an II=1. However, this is not possible in some cases due to resource constraints or cross-iteration dependencies. This is described in more detail in Optimization Guide. When II=1 cannot be met, LegUp’s pipeline scheduling algorithm will try to find the smallest possible II that satisfies the constraints and dependencies.

2.7.1. A Simple Loop Pipelining Example

We will use a simple example to demonstrate loop pipelining. First import the provided example project, loop_pipelining_simple, contained within the LegUp installation directory (please refer to Using Example Projects for importing this example project into the LegUp IDE). Let’s first run the C program in software by clicking the Compile Software and Run Software buttons.

#include <stdio.h>

#define N 4
// For the purpose of this example, we mark input arrays as volatile to
// prevent optimization based on input values. In a real program, the volatile keyword
// should only be used when absolutely necessary, as it prevents many optimizations.
volatile int a[N] = {1, 2, 3, 4};
volatile int b[N] = {5, 6, 7, 8};
int c[N] = {0};

// Simple loop with an array
int main() {
    int i = 0, sum = 0;

#pragma unroll 1 // Prevents the loop below from being unrolled.
// The loop label is used for setting the pipelining constraint.
my_loop_label:
    for (i = 0; i < N; i++) {
        printf("Loop body\n");
        printf("a[%d] = %d\n", i, a[i]);
        printf("b[%d] = %d\n", i, b[i]);
        c[i] = a[i] * b[i];
        printf("c[%d] = %d\n", i, c[i]);
        sum += c[i];
    }

    if (sum == 5 + 12 + 21 + 32)
        printf("PASS\n");
    else
        printf("FAIL\n");

    return sum;
}

Note

LegUp automatically unrolls loops with small trip counts. However, for a loop to be pipelined, it cannot be unrolled. Hence, for this example, we have added “#pragma unroll 1” to prevent the loop from being unrolled.

When the program is executed, you can see in the console window that the array elements of a, b, and c are printed in order. Now click Compile Software to Hardware and Simulate Hardware. You can see that the simulation output matches with the output from software execution, and the reported cycle latency is 24. Note that we have not yet pipelined the loop and it is still executed in sequential order.

To pipeline the loop, open up the HLS Constraints window and add the Pipeline loop constraint with the value set to the loop label, “my_loop_label”. Re-run Compile Software to Hardware and you should see that LegUp reports “Pipeline Initiation Interval (II) = 1”. When you click on Simulate Hardware, you should see that the cycle latency is now reduced to 12 clock cycles, and the arrays are no longer accessed in order. For instance a[1] is printed out before c[0]. This is because the second iteration (that prints a[1]) now runs in parallel with the first iteration (that prints c[0]). The second iteration’s load operations for a[1] and b[1] are now happening at the same time as the first iteration’s multiply that computes c[0].

# Loop body
# Loop body
# a[          0] =           1
# b[          0] =           5
# Loop body
# a[          1] =           2
# b[          1] =           6
# c[          0] =           5
# Loop body
# a[          2] =           3
# b[          2] =           7
# c[          1] =          12
# a[          3] =           4
# b[          3] =           8
# c[          2] =          21
# c[          3] =          32
# PASS
# At t=              290000 clk=1 finish=1 return_val=        70
# Cycles:                   12

To get more information about the schedule of the pipelined loop body, click Launch Schedule Viewer to open up LegUp’s schedule viewer:

_images/open_scheduleviewer.png

You will first see the call graph of the program — the main function calling the printf function. Double-click the main function’s circle to see the control-flow graph of the function. As shown below, you should find a circle named BB_1 with a back edge (an arrow pointing to itself). This means that the block is a loop body, and in this case it corresponds to the pipelined loop.

_images/loop_pipelining_simple_cfg.png

Double-clicking on BB_1 will take you to the Pipeline Viewer, which illustrates the pipeline schedule:

_images/loop_pipelining_simple_pipeline_viewer.png

The top two rows show the time steps (in terms of clock cycles) and the number of pipeline stages. For instance, the loop in the example has 5 pipeline stages and each iteration takes 5 cycles to complete. Each cell in the table shows all operations that are scheduled for each time step of a loop iteration. When looking at table horizontally, it shows all operations that are to occur (over the span of 5 clock cycles) for each loop iteration. When looking at table vertically, it shows all concurrent operations for each time step. For example, the first iteration’s multiply operation (%10 = mul nsw i32 %9, %8 [1], shown in Cycle 1 column of Iteration: 0 row) is scheduled in cycle 1. Meanwhile, also in cycle 1, the second iteration’s load operations are also scheduled. In cycle 4 (circled by a bold box), the pipeline reaches steady-state. This is when pipelined hardware reaches maximum utilization and is concurrently executing 5 loop iterations (since the loop is divided into 5 time steps, the pipelined hardware can execute 5 different loop iterations at the same time).

[1]Use mouse to hover over the cell to see the complete statement of an operation.

2.8. Multi-threading with Pthreads

In an FPGA hardware system, the same module can be instantiated multiple times to exploit spatial parallelism, where all module instances execute in parallel to achieve higher throughput. LegUp allows easily inferring such parallelism with the use of POSIX Threads (Pthreads), a standard multi-threaded programming paradigm that is commonly used in software. Parallelism described in software with Pthreads is automatically compiled to parallel hardware with LegUp. Each thread in software becomes an independent module that concurrently executes in hardware.

For example, the code snippet below creates N threads running the Foo function in software. LegUp will correspondingly create N hardware instances all implementing the Foo function, and parallelize their executions. LegUp also supports mutex and barrier APIs (from Pthreads library) so that synchronization between threads can be specified using locks and barriers.

void* Foo (int* arg);

for (i = 0; i < N; i++) {
    pthread_create(&threads[i], NULL, Foo, &args[i]);
}

To see a complete multi-threading example, please refer to the example project, multi_threading_simple, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE). The example also demonstrates the use of a mutex lock to protect a critical region.

LegUp supports the most commonly used Pthread APIs, which are listed in Supported Pthread/OpenMP APIs.

2.9. Loop Multi-threading with OpenMP

LegUp also supports the use of OpenMP, which allows parallelizing loops in a multi-threaded fashion (as opposed to loop pipelining which parallelizes loop iterations in a pipelined fashion). OpenMP provides a simple and a high-level approach for parallelization. With a single pragma, a user is able to parallelize loops without performing complicated code changes. For example, in the code snippet below, the loop performs a dot product of two arrays, A_array and B_array. To parallelize this loop using OpenMP, one simply puts an OpenMP pragma before the loop.

#pragma omp parallel for num_threads(2) private(i)
for (i = 0; i < SIZE; i++) {
    output[i] = A_array[i] * B_array[i];
}

The OpenMP pragma, #pragma omp parallel for, is used to parallelize a for loop. The pragma uses a number of clauses. The num_threads clause sets the number of threads to use in the parallel execution of the for loop. The private clause declares the variables in its list to be private to each thread. In the above example, two threads will execute the loop in parallel, with one thread handling the first half of the arrays, and the other handling the second half of the arrays. LegUp will synthesize the two parallel threads into two concurrently running accelerators, each working on half of the arrays. Note that the parallel pragma in OpenMP is blocking – all threads executing the parallel section need to finish before the program execution continues.

To see a complete OpenMP example, please refer to the example project, openmp_reduction, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE). In this example, you will see how one can simply use OpenMP’s reduction clause to sum up all elements in an array with parallel threads.

LegUp supports a subset of OpenMP that is used for parallelizing loops. The supported OpenMP pragmas and OpenMP functions are listed in Supported Pthread/OpenMP APIs.

2.10. Supported Pthread/OpenMP APIs

LegUp currently supports the following Pthread and OpenMP functions/pragmas:

Pthread Functions OpenMP Pragmas OpenMP Functions
pthread_create omp parallel omp_get_num_threads
pthread_join omp parallel for omp_get_thread_num
pthread_exit omp master  
pthread_mutex_lock omp critical  
pthread_mutex_unlock omp atomic  
pthread_barrier_init reduction(operation: var)  
pthread_barrier_wait    

2.11. Data Flow Parallelism with Pthreads

Data flow parallelism is another commonly used technique to improve hardware throughput, where a succession of computational tasks that process continuous streams of data can execute in parallel. The concurrent execution of computational tasks can also be accurately described in software using Pthread APIs. In addition, the continuous streams of data flowing through the tasks can be inferred using LegUp’s built-in FIFO data structure (see Streaming Library).

Let’s take a look at the example project, “Fir Filter (Loop Pipelining with Pthreads)”, provided in LegUp IDE (please refer to Quick Start Tutorial for instructions on how to create a project from provided example). In the example, the main function contains the following code snippet:

// Create input and output FIFOs
FIFO *input_fifo = fifo_malloc(/*width*/ 32, /*depth*/ 2);
FIFO *output_fifo = fifo_malloc(/*width*/ 32, /*depth*/ 2);

// Build struct of FIFOs for the FIR thread.
struct thread_data data;
data.input = input_fifo;
data.output = output_fifo;

// Launch pthread kernels.
pthread_t thread_var_fir, thread_var_injector, thread_var_checker;
pthread_create(&thread_var_fir, NULL, FIRFilterStreaming, (void *)&data);
pthread_create(&thread_var_injector, NULL, test_input_injector, input_fifo);
pthread_create(&thread_var_checker, NULL, test_output_checker, output_fifo);

// Join threads.
pthread_join(thread_var_injector, NULL);
pthread_join(thread_var_checker, NULL);

The corresponding hardware is illustrated in the figure below.

_images/FIR_Pthreads_schematic.pdf

The two fifo_malloc calls in the C code corresponds to the creation of the two FIFOs, where the bit-width and depth of the FIFOs are set according to the arguments in the fifo_malloc call. The three pthread_create calls initiate and parallelize the executions of three computational tasks, where each task is passed in a FIFO (or a pointer to a struct containing more than one FIFO pointers) as its argument.

The FIFO connections and data flow directions are implied by the uses of fifo_read and fifo_write APIs. For example, the test_input_injector function has a fifo_write call writing data into the input_fifo, and the FIRFilterStreaming function uses a fifo_read call to read data out from the input_fifo. This means that the data flows through the input_fifo from test_input_injector to FIRFilterStreaming.

The pthread_join API is called to wait for the completion of test_input_injector and test_output_checker. We do not “join” the FIRFilterStreaming thread since it contains an infinite loop (see code below) that is always active and processes incoming data from input_fifo whenever the fifo is not empty. This closely matches the always running behaviour of streaming hardware.

Now let’s take a look at the implementation of the main computational task (i.e., the FIRFilterStreaming threading function).

void *FIRFilterStreaming(void *threadArg) {
    struct thread_data *arg = (struct thread_data *)threadArg;
    FIFO *input_fifo = arg->input, *output_fifo = arg->output;

loop_fir:
    // This loop is pipelined and will be "always running", just like how a
    // streaming module runs whenever a new input is available.
    while (1) {
        // Read from input FIFO.
        int in = fifo_read(input_fifo);

        static int previous[TAPS] = {0}; // Need to store the last TAPS - 1 samples.
        const int coefficients[TAPS] = {0, 1, 2,  3,  4,  5,  6,  7,
                                        8, 9, 10, 11, 12, 13, 14, 15};

        int j = 0, temp = 0;

        for (j = (TAPS - 1); j >= 1; j -= 1)
            previous[j] = previous[j - 1];

        previous[0] = in;

        for (j = 0; j < TAPS; j++)
            temp += previous[TAPS - j - 1] * coefficients[j];

        int output = (previous[TAPS - 1] == 0) ? 0 : temp;

        // Write to output FIFO.
        fifo_write(output_fifo, output);
    }
    pthread_exit(NULL);
}

In the code shown in the example project, you will notice that all three threading functions contain a loop, which repeatedly reads and/or writes data from/to FIFOs to perform processing. In LegUp, this is how one can specify that functions are continuously processing data streams that are flowing through FIFOs.

2.11.1. Further Throughput Enhancement with Loop Pipelining

In this example, the throughput of the streaming circuit will be limited by how frequently the functions can start processing new data (i.e., how frequently the new loop iterations can be started). For instance, if the slowest function among the three functions can only start a new loop iteration every 4 cycles, then the throughput of the entire streaming circuit will be limited to processing one piece of data every 4 cycles. Therefore, as you may have guessed, we can further improve the circuit throughput by pipelining the loops in the three functions. If you run LegUp synthesis for the example (Compile Software to Hardware), you should see in the report file, summary.legup.rpt, that all loops can be pipelined with an initiation interval of 1. That means all functions can start a new iteration every clock cycle, and hence the entire streaming circuit can process one piece of data every clock cycle. Now run the simulation (Simulate Hardware) to confirm our expected throughput. The reported cycle latency should be just slightly more than the number of data samples to be processed (INPUTSIZE is set to 128; the extra cycles are spent on activating the parallel accelerators, flushing out the pipelines, and verifying the results).

2.12. Function Pipelining

You have just seen how an efficient streaming circuit can be described in software by using loop pipelining with Pthreads. An alternative way to describe such a streaming circuit is to use Function Pipelining. When a function is marked to be pipelined (by using the Pipeline Function constraint), LegUp will implement the function as a pipelined circuit that can start a new invocation every II cycles. That is, the circuit can execute again while its previous invocation is still executing, allowing it to continuously process incoming data in a pipelined fashion. This essentially has the same circuit behaviour as what was described in the previous example (loop pipelining with Pthreads) in the Data Flow Parallelism with Pthreads section. This feature also allows multiple functions that are added to the Pipeline function constraint to execute in parallel, achieving the same hardware behaviour as the previous loop pipelining with Pthreads example.

When using this feature, LegUp can only generate a hardware IP for the pipelined streaming circuit. That is, users will need to specify a custom top-level function by adding a Set top-level function constraint (LegUp Constraints). The top-level function has to be a function specified with the Pipeline function constraint, or a wrapper function that simply calls sub-functions that are specified with Pipeline function constraint.

The example introduced in the Quick Start Tutorial demonstrates the synthesis of a single pipelined function. You should see that the FIRFilterStreaming function is specified with Pipeline function and Set top-level function constraints. To test the generated IP, the example project comes with a custom test bench (in streaming_tb.v), which interfaces with the generated top module’s FIFO ports to inject inputs and receive/verify outputs.

When synthesizing a top-level function with multiple pipelined sub-functions, LegUp will automatically parallelize the execution of all sub-functions that are called in the top-level function, forming a streaming circuit with data flow parallelism. Consider the following code snippet from the example project, FIR_function_pipelining_wrapper, contained within the LegUp installation directory (see Using Example Projects for importing this example project into the LegUp IDE).

void pipeline_wrapper(FIFO* done_signal_fifo) {
    FIFO *input_fifo = fifo_malloc(/*width*/ 32, /*depth*/ 2);
    FIFO *output_fifo = fifo_malloc(/*width*/ 32, /*depth*/ 2);

    test_input_injector(input_fifo);
    FIRFilterStreaming(input_fifo, output_fifo);
    test_output_checker(output_fifo, done_signal_fifo);

    fifo_free(input_fifo);
    fifo_free(output_fifo);
}

In this example, the wrapper function pipeline_wrapper is set as the top-level function. The three sub-functions, test_input_injector, FIRFilterStreaming, and test_output_checker are specified with the Pipeline function constraint in LegUp. The input_fifo and output_fifo are arguments into the three sub-functions. test_input_injector writes to the input_fifo, which is read from FIRFilterStreaming. FIRFilterStreaming writes to the output_fifo, which is read from test_output_checker. A function pipeline executes as soon as its inputs are ready. In this case FIRFilterStreaming executes as soon as there is data in the input_fifo, and test_output_checker starts running as soon as there is data in the output_fifo. In other words, a function pipeline does not wait for its previous function pipeline to completely finish running before it starts to execute, but rather, it starts running as early as possible. test_input_injector function also starts working on the next data to write to the input_fifo while the previous data is being processed. If the initiation interval (II) is 1, a function pipeline starts processing new data every clock cycle. Once the function pipelines reach steady-state, all function pipelines execute concurrently.

This example showcases the synthesis of a streaming circuit that consists of a succession of concurrently executing pipelined functions. A custom test bench is also provided in the example project. This time the test_output_checker function is also part of the generated circuit, so that the testbench does not need to verify the FIR filter’s output, but simply waits for a done signal from the done_signal_fifo and terminates the simulation.

Note

The top-level wrapper function should not have any additional logic other than calling the functions that have the Pipeline function constraint.

Note

The start input port of the generated circuit (module top) serves as an enable signal to the circuit. The circuit stops running when the start signal is de-asserted. To have the circuit running continuously, the start input port should be kept high, as you can see in the provided custom test bench.

2.13. LegUp C Library

LegUp includes a number of C libraries that allow creation of efficient hardware.

2.13.1. Streaming Library

The streaming library includes the FIFO (first-in first-out) data structure along with its associated API functions. The library can be compiled in software to run on the user’s development machine (e.g., x86). The library is thread-safe — a mutex is used to arbitrate concurrent accesses from parallel threads. Each FIFO created with fifo_malloc in software is implemented as a First Word Fall Through (FWFT) FIFO in hardware.

You can use the streaming library by including its header file:

#include "legup/streaming.h"

Note

Users should always use the provided APIs below to create and access FIFOs. Any other uses of FIFOs are not supported in LegUp.

Function Description
FIFO* fifo_malloc(int width, int depth) Creates a FIFO with its bit-width set to width and depth set to depth.
void fifo_write(FIFO *fifo, long long data) Writes data to fifo.
long long fifo_read(FIFO *fifo) Reads an element from fifo.
int fifo_empty(FIFO *fifo) Returns 1 if fifo is empty. Returns 0 otherwise.
int fifo_full(FIFO *fifo) Returns 1 if fifo is full. Returns 0 otherwise.

Note that fifo_read and fifo_write calls are blocking. Hence if a module attempts to read from a FIFO that is empty, it will be stalled. Similarly, if it attempts to write to a FIFO that is full, it will be stalled. If you do not want the blocking behaviour, you can check if the FIFO is empty (with fifo_empty) before calling fifo_read, and likewise, check if the FIFO is full (with fifo_full) before calling fifo_write.

With the blocking behaviour, if the depths of FIFOs are not sized properly, it can cause a deadlock. LegUp prints out messages to alert the user that a FIFO is causing stalls, in both the software model and in hardware simulation.

In software, the following messages are shown.

Warning: fifo_write() has been stalled for 5 seconds due to FIFO being full.
Warning: fifo_read() has been stalled for 5 seconds due to FIFO being empty.
Warning: fifo_read() has been stalled for 5 seconds due to FIFO being empty.
Warning: fifo_write() has been stalled for 5 seconds due to FIFO being full.
Warning: fifo_read() has been stalled for 5 seconds due to FIFO being empty.
Warning: fifo_read() has been stalled for 5 seconds due to FIFO being empty.

In hardware simulation, the following messages are shown.

Warning: fifo_write() has been stalled for     1000000 cycles due to FIFO being full.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.
Warning: fifo_write() has been stalled for     1000000 cycles due to FIFO being full.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.
Warning: fifo_read() has been stalled for     1000000 cycles due to FIFO being empty.

If you continue to see these messages, you can suspect that there may be a deadlock, and we recommend increasing the depth of the FIFOs.

Note

We recommend the minimum depth of a FIFO to be 2, as a depth of 1 FIFO can cause excessive stalls.

2.13.2. Arbitrary Bit-width Data Type Library

For efficient hardware generation, LegUp allows specifying data types of any bit-widths from 1 to 64. You can use LegUp’s arbitrary bit-width data type library by including the following header file:

#include "legup/types.h"

This library defines data types for unsigned integers from 1 bit (uint1) to 64 bits (uint64), as well as for signed integers from 1 bit (int1) to 64 bits (int64).

In software, standard data types (i.e., char, int, long long) are used even though their entire bit-widths are not required in many cases. Hardware implemented on an FPGA operates at the bit-level, hence a circuit can be optimized to the exact bit-width that is required. Using arbitrary bit-width data types provides information to LegUp which is used to produce hardware with the exact specified bit-widths. This can help to reduce circuit area.

An example using arbitrary bit-width data types are shown below:

#include "legup/types.h"
#include <stdio.h>
#define SIZE 8

// marked as volatile, as for a simple program such as this,
// LegUp optimizes away the arrays with constant propagation
volatile uint3 a[SIZE] = {0, 1, 2, 3, 4, 5, 6, 7};
volatile uint4 b[SIZE] = {8, 9, 10, 11, 12, 13, 14, 15};

int main() {
  volatile uint7 result = 0;
  volatile uint4 i;

  #pragma unroll 1
  for (i=0; i<SIZE; i++) {
      result += a[i] + b[i];
  }

  printf("result = %d\n", result);
}

In the example, we have reduced all variables and arrays to their minimum bit-widths. This is translated to reduced widths in hardware. When LegUp runs, it prints out the following to inform the user that it has detected the use of arbitrary bit-width data types:

Note

We have used the volatile keyword to prevent the variables from being optimized away for the purpose of this example. We do not recommend using this keyword unless absolutely necessary, as it can lead to higher runtime and area.

Info: Setting the width of memory 'a' to the user-specified custom width of
      3 bit(s)
Info: Setting the width of memory 'b' to the user-specified custom width of
      4 bit(s)
Info: Setting the width of memory 'main_0_result' to the user-specified custom
      width of 7 bit(s)
Info: Setting the width of memory 'main_0_i' to the user-specified custom
      width of 4 bit(s)

Note

Note that in an actual program (where the volatile keyword is not used), some variables may not show up in the LegUp printout. This occurs when LegUp optimizes the program so that the variables are no longer required.

2.13.3. Bit-level Operation Library

LegUp provides a library to efficiently perform bit-level operations in hardware. You can use LegUp’s bit-level operation library by including the following header file:

#include "legup/bit_level_operations.h"

This library defines a number of C functions that perform common bit-level operations. The software implementation of the library matches to that of the generated hardware. Users can simulate the low-level bit operations accurately in software prior to generating its hardware.

Note

Note that for the arguments and return values of the API functions shown below, you may use either arbitrary bit-width data types (Arbitrary Bit-width Data Type Library) or standard data types. We use arbitrary bit-width data types to show the maximum allowable bit-widths of the values of the arguments and return values. For instance, for the legup_bit_select function, the msb_index argument can be of int type, but its maximum value is 63 (since it is selecting between bit 63 and bit 0 of value v).

Note

The index and width arguments must be constant integers and must be within the bit range of the selecting or updating variable.

2.13.3.1. Selecting A Range of Bits

uint64 legup_bit_select(uint64 v, uint6 msb_index, uint6 lsb_index);

This function selects a range of bits, from the msb_index bit down to the lsb_index bit (where the bit index starts from 0), from the input variable, v, and returns it. The lower (msb_index - lsb_index + 1) bits of the return value are set to the specified range of v, and the rest of the upper bits of the return value are set to 0.

The equivalent Verilog statement will be:

return_val[63:0] = { (64 - (msb_index - lsb_index + 1)){1'b0}, // Upper bits set to 0.
                     v[msb_index : lsb_index]                  // Selected bits of v.
                   };

2.13.3.2. Updating A Range of Bits

uint64 legup_bit_update(uint64 v, uint6 msb_index, uint6 lsb_index, uint64 value);

This function updates a range of bits, from the msb_index bit down to the lsb_index bit (where the bit index starts from 0), of the input variable, v, with the given value, value, and returns the updated value.

The equivalent Verilog statement will be:

return_val[63:0] = v[63:0];
return_val[msb_index : lsb_index]       // Update selected range of bits.
    = value[msb_index - lsb_index : 0]; // The lower (msb_index - lsb_index) bits of value.

2.13.3.3. Concatenation

uint64 legup_bit_concat_2(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1);

This function returns the bit-level concatenation of the two input variables, v_0, and v_1. The lower width_0 bits of v_0 are concatenated with the lower width_1 bits of v_1, with the v_0 bits being the upper bits of the v_1 bits. The concatenated values are stored in the lower (width_0 + width_1) bits of the return value, hence if the bit-width of the return value is bigger than (width_0 + width_1), the rest of the upper bits of the return value are set to 0.

The equivalent Verilog statement will be:

return_val[63:0] = { (64 - width_0 - width_1){1'b0},  // Upper bits set to 0.
                     v_0[width_0 - 1 : 0],            // Lower width_0 bits of v_0.
                     v_1[width_1 - 1 : 0]             // Lower width_1 bits of v_1.
                   };

Similarly, the following functions concatenate three, four, five, up to eight variables respectively.

uint64 legup_bit_concat_3(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2);

uint64 legup_bit_concat_4(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3);

uint64 legup_bit_concat_5(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4);

uint64 legup_bit_concat_6(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5);

uint64 legup_bit_concat_7(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5,
                          uint64 v_6, uint6 width_6);

uint64 legup_bit_concat_8(uint64 v_0, uint6 width_0, uint64 v_1, uint6 width_1,
                          uint64 v_2, uint6 width_2, uint64 v_3, uint6 width_3,
                          uint64 v_4, uint6 width_4, uint64 v_5, uint6 width_5,
                          uint64 v_6, uint6 width_6, uint64 v_7, uint6 width_7);

2.13.3.4. Bit Reduction Operations

The functions shown below perform bitwise reduction operations on the input variable, v. All of these functions return a one bit value.

Bit Reduction AND:
uint1 legup_bit_reduce_and(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the AND operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 0 if any bit in the specified range of v is 0, returns 1 otherwise.
The equivalent Verilog statement is: &(v[msb_index : lsb_index]).
Bit Reduction OR:
uint1 legup_bit_reduce_or(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the OR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if any bit in the specified range of v is 1, returns 0 otherwise.
The equivalent Verilog statement is: |(v[msb_index : lsb_index]).
Bit Reduction XOR:
uint1 legup_bit_reduce_xor(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the XOR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if there is an odd number of bits being 1 in the specified range of v, returns 0 otherwise.
The equivalent Verilog statement is: ^(v[msb_index : lsb_index]).
Bit Reduction NAND:
uint1 legup_bit_reduce_nand(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the NAND operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if any bit in the specified range of v is 0, returns 0 otherwise.
Equivalent to !legup_bit_reduce_and(v, msb_index, lsb_index)
The equivalent Verilog statement is: ~&(v[msb_index : lsb_index]).
Bit Reduction NOR:
uint1 legup_bit_reduce_nor(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the NOR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 0 if any bit in the specified range of v is 1, returns 1 otherwise.
Equivalent to !legup_bit_reduce_or(v, msb_index, lsb_index)
The equivalent Verilog statement is: ~|(v[msb_index : lsb_index]).
Bit Reduction XNOR:
uint1 legup_bit_reduce_xnor(uint64 v, uint6 msb_index, uint6 lsb_index);
Applies the XNOR operation on the range of bits of v, starting from the upper bit index, msb_index (index starts from 0), to the lower bit index, lsb_index.
Returns 1 if there is an even number of bits being 1 in the specified range of v, returns 0 otherwise.
Equivalent to !legup_bit_reduce_xor(v, msb_index, lsb_index)
The equivalent Verilog statement is: ~^(v[msb_index : lsb_index]).

2.14. Software Profiler

The LegUp IDE comes integrated with a profiler (gprof) that allows you to profile your software program to determine the runtime hot spots. This information can be used to decide which program segment should be accelerated in hardware.

To use the profiler, one simply has to click the Profile Software button on an opened LegUp project:

_images/profiler_button.png

Alternatively, you could also click LegUp -> Profile Software on the top menu. This executes the software program, and when it finishes running, a gprof tab should open up at the bottom window (where the Console window is).

_images/profiler.png

The gprof tab shows a hierarchy of the percentage of runtime spent on executing the program. In the figure shown above, you can see that most of the time is spent on executing the FIRFilterStreaming function. It also shows which parts of the function contribute to that percentage of runtime. Clicking on one of the lines (e.g., FIRFilterStreaming (fir.c:42) which contributes the highest percentage of runtime) takes you to that line of code in the IDE.

Using the software profiler is a useful way to determine which parts of your program is taking up the most runtime. You can use this information to decide whether you want to accelerate that portion of code with LegUp, or re-write the code in software to be more efficient.

Note

Note that grof profiles a program based on samples that it takes, and its minimum sampling period is 10 ms. Hence if your program’s runtime is very short, it will not give any meaningful result. If this is the case, try to increase the runtime by providing more inputs to the program.

Note

The figure shown above for the profiled result is from LegUp running on Linux. On Windows, it shows a text-based result.

2.15. Using Pre-existing Hardware Modules

It is possible to connect existing hardware modules to the hardware generated by LegUp. This can be useful when there are optimized IP blocks that one wants to use as part of a larger circuit. This is called the custom Verilog flow in LegUp.

To use this, a wrapper C function needs to be created, where its prototype matches the module declaration of the custom Verilog module. The name of the C function needs to match the name of the Verilog module, and the C function must have the same number of arguments as the number of input ports of the Verilog module. The bit-widths of the function arguments also need to match the bit-widths of the input ports. In addition, the bit-width of the return type in C must match the width of the output port corresponding to the return value.

To use the custom Verilog flow, one can specify the following in the config.tcl file:

set_custom_verilog_function "function_name" noMemory
set_custom_verilog_file "verilog_file_name"

Note

Currently, a custom Ver module cannot access any memories. All arguments need to be passed in by value. It cannot be in a pipelined section (within a pipelined loop/function). In addition, a custom Ver module cannot invoke other modules (functions or other custom Ver modules).

There can be cases where a custom Ver module needs to access I/Os. However, accessing I/Os cannot be easily described in C. This can be done in LegUp by simply specifying the I/Os in the config.tcl file as the following:

set_custom_verilog_function "function_name" noMemory \
  input/output high_bit:low_bit IO_signal_name
set_custom_verilog_file "verilog_file_name"

This connects the specified I/O signals of the custom Verilog module directly to the top-level module generated by LegUp. For example, if one specifies the following:

set_custom_verilog_function "assignSwitchesToLEDs" noMemory \
  output 5:0 LEDR \
  input 5:0 SW \
  output 5:0 KEY
set_custom_verilog_file "assignSwitchesToLEDs.v"

This indicates to LegUp that it needs to connect LEDR, SW, and KEY ports directly to the ports of the top-level module generated by LegUp.

2.15.1. Custom Verilog Module Interface

A custom Verilog module needs to conform to the interface expected by LegUp in order for it to be connected properly. The following interface is required by LegUp for the custom Verilog module:

  • input clk: Clock signal for the custom Verilog module
  • input reset: Reset signal for the custom Verilog module
  • input start: Signal set to start the custom Verilog module (set high for 1 clock cycle)
  • output finish: Set by the custom Verilog module to indicate the to rest of the LegUp-generated hardware that it has finished running.
  • return_val: Only necessary if the wrapper C function has a return value. Its width must equal to the width of the return type in the C function.
  • ``arg_<argument_name> : Only necessary if the wrapper C function has an argument. The <argument_name> must the same as the name of the argument in

the C wrapper function prototype, and its width must be equal to the bit-width of the argument type.

In addition, a custom Verilog module may have ports that directly to I/Os, as described above. However, as these ports are not specified in C, they do not have to follow any C conventions. LegUp will create ports at the top-level module with the user-specified names and widths, and connect them directly to the custom Verilog module.

2.16. Specifying a Custom Top-level Function

In the hardware flow, if you only want to compile certain C functions to hardware, rather than the entire program, you can specify a custom top-level function. This allows LegUp to compile only the specified top-level function and all of its descendant functions to hardware. If there are multiple functions to be specified, you can create a wrapper function in C which calls all of the desired functions. The custom top-level function can be specified via the HLS Constraints window:

_images/set_custom_top_level_module.png

It can also be specified directly in the config.tcl file as the following:

set_custom_top_level_module "topLevelModuleName"

This constraint is also described in set_custom_top_level_module.

2.17. Specifying a Custom Test Bench

LegUp allows one to use a custom test bench to simulate the hardware generated by LegUp. By default, LegUp automatically generates a test bench, however, when a custom top-level function is specified by the user, a custom test bench must also be provided by the user. A custom test bench can be specified to LegUp via the HLS Constraints window:

_images/set_custom_test_bench.png

One must specify both the name of custom Verilog module as well as the name of the custom test bench file. It can also be specified directly in the config.tcl file as the following:

set_custom_test_bench_module "testBenchModuleName"
set_custom_test_bench_file "testBenchFileName.v"

This constraint is also described in set_custom_test_bench_module and set_custom_test_bench_file.

2.18. LegUp Command Line Interface

LegUp can run in command line with the following command:

legup [-h] <cmd>
Where <cmd> can be one of:
  • hw (default): Run the hardware-only flow.
  • sw : Compile and run the program in software on host machine.
  • sim: Simulate the LegUp-generated circuit in Modelsim (only for hardware flow).
  • fpga: Fully synthesize design for target FPGA.
  • scheduleviewer: Show the scheduler viewer.
  • clean: Delete files generated by LegUp.
Commands for the hybrid flow:
  • hybrid: Run the hybrid flow.
  • hybrid_compile: Synthesize the generated SoC to FPGA bitstream with Intel Quartus (legup hybrid must have been run previously)
  • legup program_board: Program the FPGA with the generated bitstream (.sof) file (legup hybrid_compile must have been run previously)
  • legup run_on_board: Download and run the program on FPGA (legup program_board must have been run previously)

Intel (Altera) does not provide a simulation model for the ARM processor, hence it is not possible to simulate the ARM hybrid system.