3. Optimization Guide

This chapter describes how to optimize the generated hardware through software code changes, pragmas, and LegUp constraints.

3.1. Loop Pipelining

Loop pipelining is a performance optimization in high-level synthesis (HLS) that extracts loop-level parallelism by executing multiple loop iterations concurrently on the same hardware. The key performance metric for a pipelined loop is the time interval between starting successive loop iterations, called the initiation interval (II). Ideally, the initiation interval should be one, meaning that a new loop iteration starts every clock cycle. This is the highest throughput that a pipelined loop can achieve without unrolling. Although LegUp will always try to achieve an II of 1, this is sometimes not possible due to resource constraints or cross-iteration dependencies (recurrences) in the loop.

Consider the following example:

int sum = 0;
for (i = 0; i < N; i++) {
loop:
    for (j = 0; j < N; j++) {
        sum += A[i][j] * B[i][j];
    }
}

This example shows a nested loop, which performs an element-wise multiplication of two 2-dimensional arrays and accumulates the sum. The inner loop is labeled with a loop label, loop. We can tell LegUp to pipeline this loop by applying the loop_pipeline constraint to the label.
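
As a sketch, assuming the Tcl-based constraint file syntax, the constraint might look like:

loop_pipeline "loop"

With this constraint in place, LegUp shows the following messages when synthesizing the program: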

Info: Pipelining the loop on line 61 of loop.c with label "loop".
Info: Done pipelining the loop on line 61 of loop.c with label "loop".
    Pipeline Initiation Interval (II) = 1.

These info messages let us know that LegUp successfully pipelined the inner loop with an II of 1. Even though an II of 1 has been achieved, the hardware may not meet our desired performance requirements. In this case, we can choose to pipeline the outer loop by moving the loop label to the outer loop:

int sum = 0;
loop: for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        sum += A[i][j] * B[i][j];
    }
}

When an outer loop is specified to be pipelined, LegUp will automatically unroll all of its inner loops. This can provide higher performance at the expense of higher circuit area. In this example, N is 25, so when the inner loop is unrolled, LegUp creates 25 multipliers and 25 adders working in parallel. However, this does not mean that performance improves by 25x, due to the resource constraints on memories A and B.

When LegUp runs, we will see the following messages:

Info: Unrolling the entire loop nest on line 61 of loop.c.
    This loop nest is inside a parent loop labelled 'loop', which is specified to be
    pipelined.
Info: Pipelining the loop on line 60 of loop.c with label "loop".
Info: Resource constraint limits initiation interval to 13
      Resource 'A_local_memory_port' has 25 uses per cycle but only 2 ports available.
      +--------------------------------+-------------------+-----------+
      | Operation                      | Location          | # of Uses |
      +--------------------------------+-------------------+-----------+
      | 'load' operation for array 'A' | line 60 of loop.c | 25        |
      +--------------------------------+-------------------+-----------+
      |                                | Total # of Uses   | 25        |
      +--------------------------------+-------------------+-----------+

Info: Resource constraint limits initiation interval to 13
      Resource 'B_local_memory_port' has 25 uses per cycle but only 2 ports available.
      +--------------------------------+-------------------+-----------+
      | Operation                      | Location          | # of Uses |
      +--------------------------------+-------------------+-----------+
      | 'load' operation for array 'B' | line 60 of loop.c | 25        |
      +--------------------------------+-------------------+-----------+
      |                                | Total # of Uses   | 25        |
      +--------------------------------+-------------------+-----------+

Info: Done pipelining the loop on line 60 of loop.c with label "loop".
    Pipeline Initiation Interval (II) = 13.

The first info message indicates that the inner loop is being unrolled, since the outer loop is specified to be pipelined. The subsequent messages tell us that, for an II of 1, 25 load operations would need to occur to each of memories A and B every clock cycle, but only two ports (allowing 2 loads per cycle) are available on each memory. 'local_memory_port' indicates that this resource is a memory port of a local memory, which is described in Hardware Architecture. Due to the limited memory ports, LegUp must increase the loop pipeline II until the 25 load operations to each memory can be accommodated. With an II of 13, meaning that a new loop iteration starts every 13 cycles, there is enough time for 26 load operations, so the constraint is met: each memory has 2 ports by default, allowing 2 accesses per cycle, and in 13 cycles we can perform 26 accesses in total.
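
In general, when the II is limited by a memory resource, the minimum achievable II is the number of accesses per iteration divided by the number of available ports, rounded up:

II = ceil(# accesses per iteration / # memory ports) = ceil(25 / 2) = 13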

For this particular example, pipelining the outer loop gives about 2x higher performance than pipelining the inner loop, but the area also increases by about 25x, due to the 25 multipliers and adders. Therefore, we must take care when pipelining outer loops, because of the unrolling. In general, we recommend pipelining the innermost loop first, and if the performance requirement is not met, then trying the outer loops.

Note

If the loop specified to be pipelined contains any function calls (in the loop or in any of its inner loops), the function calls will be inlined into the loop. Any descendants of the called functions will also be inlined, and all of their loops will be unrolled automatically. If there are many descendant functions and loops, this can increase the area significantly (also described in Function Pipelining). We recommend examining the program for such cases before pipelining a loop.

Let's look at an image filtering example:

for (y = 1; y < HEIGHT-1; y++) {
loop:
    for (x = 1; x < WIDTH-1; x++) {
        out[y][x] =  in[y-1][x-1]*k[0][0] + in[y-1][x]*k[0][1] + in[y-1][x+1]*k[0][2]
                   + in[y  ][x-1]*k[1][0] + in[y  ][x]*k[1][1] + in[y  ][x+1]*k[1][2]
                   + in[y+1][x-1]*k[2][0] + in[y+1][x]*k[2][1] + in[y+1][x+1]*k[2][2];
    }
}

This example applies a 3x3 image kernel filter, array k, to an input image, array in, producing an output image, array out. When we turn on loop pipelining and run LegUp, we see the following messages:

Info: Pipelining the loop on line 22 of kernel.c with label "loop".
Info: Assigning new label to the loop on line 22 of kernel.c with label "loop"
Info: Resource constraint limits initiation interval to 5.
      Resource 'in_local_memory_port' has 9 uses per cycle but only 2 units available.
      +---------------------------------+---------------------+-----------+
      | Operation                       | Location            | # of Uses |
      +---------------------------------+---------------------+-----------+
      | 'load' operation for array 'in' | line 23 of kernel.c | 9         |
      +---------------------------------+---------------------+-----------+
      |                                 | Total # of Uses     | 9         |
      +---------------------------------+---------------------+-----------+

Info: Done pipelining the loop on line 22 of kernel.c with label "loop".
      Pipeline Initiation Interval (II) = 5.

The pipeline initiation interval is limited by the memory accesses to the input image (array in). There are 9 loads but only two memory ports, which forces the loop II to be 5, allowing up to 10 loads per iteration from array in. For loops where the II is constrained by memory accesses to an array, you can improve the II by manually splitting the C array into several smaller C arrays. Each C array can be accessed independently, which reduces resource contention. In this case, we can split the image into rows of pixels, where each row is stored in a separate array (in_0, in_1, and in_2).

for (y = 1; y < HEIGHT-1; y++) {
loop:
    for (x = 1; x < WIDTH-1; x++) {
        out[y][x] =  in_0[x-1]*k[0][0] + in_0[x]*k[0][1] + in_0[x+1]*k[0][2]
                   + in_1[x-1]*k[1][0] + in_1[x]*k[1][1] + in_1[x+1]*k[1][2]
                   + in_2[x-1]*k[2][0] + in_2[x]*k[2][1] + in_2[x+1]*k[2][2];
    }
}

Now when we run LegUp we will see:

Info: Pipelining the loop on line 22 of kernel.c with label "loop".
Info: Resource constraint limits initiation interval to 2.
      Resource 'in_0_local_memory_port' has 3 uses per cycle but only 2 units available.
      +-----------------------------------+---------------------+-----------+
      | Operation                         | Location            | # of Uses |
      +-----------------------------------+---------------------+-----------+
      | 'load' operation for array 'in_0' | line 33 of kernel.c | 3         |
      +-----------------------------------+---------------------+-----------+
      |                                   | Total # of Uses     | 3         |
      +-----------------------------------+---------------------+-----------+

Info: Resource constraint limits initiation interval to 2.
      Resource 'in_1_local_memory_port' has 3 uses per cycle but only 2 units available.
      +-----------------------------------+---------------------+-----------+
      | Operation                         | Location            | # of Uses |
      +-----------------------------------+---------------------+-----------+
      | 'load' operation for array 'in_1' | line 33 of kernel.c | 3         |
      +-----------------------------------+---------------------+-----------+
      |                                   | Total # of Uses     | 3         |
      +-----------------------------------+---------------------+-----------+

Info: Resource constraint limits initiation interval to 2.
      Resource 'in_2_local_memory_port' has 3 uses per cycle but only 2 units available.
      +-----------------------------------+---------------------+-----------+
      | Operation                         | Location            | # of Uses |
      +-----------------------------------+---------------------+-----------+
      | 'load' operation for array 'in_2' | line 33 of kernel.c | 3         |
      +-----------------------------------+---------------------+-----------+
      |                                   | Total # of Uses     | 3         |
      +-----------------------------------+---------------------+-----------+
Info: Done pipelining the loop on line 22 of kernel.c with label "loop".
      Pipeline Initiation Interval (II) = 2.

Now the initiation interval has improved from 5 to 2, which is more than a 2x performance improvement, simply from manually partitioning the C arrays.

Consider another example below:

int A = 1;
loop: for (i = 0; i < N; i++) {
    A = A * B[i];
}

We have a loop where the value of A in the current iteration depends on the previous iteration. This is called a cross-iteration dependency, or loop recurrence. In order to achieve an II of 1, the value of A is required every clock cycle, meaning that the multiplication of A and B[i] has to complete every clock cycle. Now, let's consider a case where we would like to pipeline the multiplier more deeply in order to get a higher Fmax. This can be done by changing the multiplier latency to 2 (using the set_operation_latency constraint in LegUp) from the default latency of 1.
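
As a sketch, assuming the Tcl-based constraint syntax, this might be expressed as follows (the exact operation name here is an assumption; consult the constraints documentation for the correct name):

set_operation_latency multiply 2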

When LegUp runs, we will see the following messages:

Info: Pipelining the loop on line 10 of loop_recurrence.c with label "loop".
Info: Cross-iteration dependency does not allow initiation interval of 1.
    Dependency (distance = 1) from 'mul' operation (at line 11 of loop_recurrence.c) to
    'phi' operation (at line 11 of loop_recurrence.c)
    Recurrence path:
    +-----------------+------------------------------+---------------+
    | Operation       | Location                     | Cycle Latency |
    +-----------------+------------------------------+---------------+
    | 'phi' operation | line 11 of loop_recurrence.c | 0             |
    | 'mul' operation | line 11 of loop_recurrence.c | 2             |
    +-----------------+------------------------------+---------------+
    |                 | Total Required Latency       | 2             |
    +-----------------+------------------------------+---------------+

    Total required latency = 2. Maximum allowed latency = distance x II = 1 x 1 = 1.
    Total required latency > Maximum allowed latency, we must increase II
Info: Done pipelining the loop on line 10 of loop_recurrence.c with label "loop".
    Pipeline Initiation Interval (II) = 2.

The messages tell us that the II cannot be 1 due to a cross-iteration dependency from the multiply operation of the current loop iteration to the phi operation of the next iteration. You can think of a phi as a select operation, which is needed to represent the program's intermediate representation in static single assignment (SSA) form. In this particular case, the phi selects between the initial value of A, 1 (in the first iteration of the loop), and the value of A computed within the loop (in the iterations after the first). The dependency distance of 1 means that the multiply result is used by the phi operation one loop iteration later; alternatively, the phi operation depends on the multiply result from one loop iteration ago. The Recurrence path table shows that the phi operation takes 0 cycles and the multiply operation takes 2 cycles, hence the total required latency for the path is 2. However, the maximum allowed latency, if the II were 1, is 1 (distance x II = 1 x 1 = 1): the next loop iteration would start after 1 clock cycle, but the multiply result needed by that iteration's phi operation would not yet be finished. Since the total required latency is greater than the maximum allowed latency, the II has to be increased to 2. With an II of 2, the maximum allowed latency becomes 2, which satisfies the total required latency: the first iteration starts, we wait two clock cycles for the multiply to finish (II = 2), and then the next loop iteration starts.
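
Conceptually, the loop body in SSA form looks something like the following (an illustrative sketch, not LegUp's actual intermediate representation):

A_cur  = phi(1, A_next)   // 1 on the first iteration; A_next from the previous iteration
A_next = A_cur * B[i]     // the 'mul' operation: 2 cycles with the increased latency

// The phi in iteration i+1 needs A_next from iteration i (distance = 1).
// With II = 1, only 1 cycle separates the iterations, but the multiply
// takes 2 cycles, so the II must be increased to 2.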

In general, for pipelining, minimizing the II should be the first priority for performance, even at the expense of a slightly lower Fmax. For example, reducing the II from 2 to 1 halves the clock cycles taken by the loop, but adding one more pipeline stage (which is what changes the II from 1 to 2 in this case) is unlikely to double the Fmax.

If we use the set_operation_latency constraint to reduce the multiplier latency from 2 back to 1 and run LegUp, we will see the following messages:

Info: Pipelining the loop on line 10 of loop_recurrence.c with label "loop".
Info: Done pipelining the loop on line 10 of loop_recurrence.c with label "loop".
    Pipeline Initiation Interval (II) = 1.

We have achieved an II of 1. Note that by default, LegUp sets the multiplier latency to 1, hence this particular case will not occur unless the multiplier latency is changed. However, this example demonstrates how you may vary the latency of operations to achieve a lower II when there is a loop recurrence.

The above example illustrated a case of II being increased due to the latency of operations in the presence of a loop recurrence. The II can also be increased due to the delay of operations in a loop with cross-iteration dependencies.

Consider the following example:

loop: for (iter = 0; iter < MAX_ITER; iter++) {
    long long squared = x * x + y * y;
    xtmp = x * x - y * y + x_0;
    y = 2 * x * y + y_0;
    x = xtmp;

    filter += squared <= 4;
}

The code shows the main computations for the Mandelbrot set; the algorithm details are not important here. When we run LegUp on this code, we see the following messages:

Info: Pipelining the loop on line 39 of mandelbrot.c with label "loop".
Info: Cross-iteration dependency does not allow initiation interval (II) of 1.
    Dependency (distance = 1) from 'trunc' operation (at line 42 of mandelbrot.c) to
    'phi' operation (at line 42 of mandelbrot.c)
    Recurrence path:
    +-------------------+-------------------------+------------+
    | Operation         | Location                | Delay [ns] |
    +-------------------+-------------------------+------------+
    | 'phi' operation   | line 42 of mandelbrot.c | 0.00       |
    | 'sext' operation  | line 40 of mandelbrot.c | 0.00       |
    | 'mul' operation   | line 42 of mandelbrot.c | 8.00       |
    | 'ashr' operation  | line 42 of mandelbrot.c | 0.00       |
    | 'shl' operation   | line 42 of mandelbrot.c | 0.00       |
    | 'add' operation   | line 42 of mandelbrot.c | 6.40       |
    | 'trunc' operation | line 42 of mandelbrot.c | 0.00       |
    +-------------------+-------------------------+------------+
    |                   | Total Required Delay    | 14.40      |
    +-------------------+-------------------------+------------+

    Total required delay = 14.40 ns.
    Maximum allowed latency = distance x II = 1.
    Maximum allowed delay = Maximum allowed latency x clock period
                          = 1 x 8.00 ns = 8.00 ns
    Total required delay > Maximum allowed delay, we must increase II.
    Tip: Increase the clock period to be greater than the total required delay
         to improve II.

Info: Done pipelining the loop on line 39 of mandelbrot.c with label "loop".
    Pipeline Initiation Interval (II) = 2.

The messages indicate that there is a cross-iteration dependency from the truncate operation to the phi operation, where the total required delay for the path is 14.40 ns. On the other hand, the maximum allowed latency, if the II were 1, is 1, and the maximum allowed delay, based on the given clock period constraint (8 ns) and the maximum allowed latency (1), is 8 ns. Since the required delay of 14.4 ns for the path is greater than the maximum allowed delay of 8 ns, LegUp must increase the II to 2 to satisfy the required delay. With an II of 2, the maximum allowed latency (distance x II = 1 x 2) becomes 2, hence the maximum allowed delay becomes 16 ns (maximum allowed latency x clock period = 2 x 8 ns), and the required delay can be met.
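
Equivalently, the required II can be computed directly from the recurrence delay and the clock period:

II = ceil(total required delay / clock period) = ceil(14.40 ns / 8.00 ns) = 2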

As mentioned above, keeping the II low (ideally 1) should generally be the priority for achieving maximum performance. Based on the equations above, and the Tip in the log, another way to meet the required delay is to increase the clock period rather than the II. With an II of 1 (and hence a maximum allowed latency of 1), a clock period greater than 14.4 ns makes the maximum allowed delay exceed the total required delay.

Let’s set the clock period to 15 ns with the CLOCK_PERIOD constraint and re-run LegUp.
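
As a sketch, assuming the Tcl-based set_parameter syntax, the constraint might look like:

set_parameter CLOCK_PERIOD 15

Re-running LegUp, we now see: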

Info: Generating pipeline for loop on line 39 of mandelbrot.c with label "loop".
    Pipeline initiation interval = 1.

You can see that LegUp was now able to generate a circuit with an II of 1.

Loop pipelining is a powerful technique for achieving high performance. To get the maximum performance from it, be mindful of the circuit's resource constraints and of the recurrences that exist in the loop.

3.2. Loop Unrolling

LegUp allows the user to specify a loop to be unrolled through the use of a pragma, #pragma unroll.

#pragma unroll
for (i = 0; i < N; i++) {
    ...
}

This unrolls the loop completely. Unrolling a loop can improve performance as the hardware units for the loop body are replicated, but it also increases area. You may also specify a loop to be partially unrolled, to prevent the area from increasing too much.

#pragma unroll 2
for (i = 0; i < N; i++) {
    ...
}

This unrolls the loop two times; the conceptual effect is sketched at the end of this section. You may also prevent a loop from being unrolled. LegUp automatically unrolls small loops, but you may want to keep a loop rolled due to area constraints, or in order to pipeline it: if a loop is completely unrolled, the loop disappears, hence it cannot be pipelined.

#pragma unroll 1
for (i = 0; i < N; i++) {
    ...
}

This prevents the loop from being unrolled.
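
For intuition, partially unrolling a loop by a factor of 2, as shown earlier with #pragma unroll 2, is conceptually equivalent to the following rewrite (an illustrative sketch, assuming N is even):

for (i = 0; i < N; i += 2) {
    ...  // loop body with index i
    ...  // loop body with index i + 1
}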

3.3. Function Pipelining

Similar to loop pipelining, when a function is specified to be pipelined, LegUp will automatically inline all of its descendant functions and unroll all loops (in the specified function as well as in all of its descendant functions). This is done to create high-performance pipelined hardware. Consider the following call graph:

[Figure: example call graph of functions main, a, b, c, and d (callgraph_new)]

where function c contains a loop. If function a is specified to be pipelined, functions c and d will be automatically inlined.

When LegUp runs, it will print out the following:

Info: Adding no_inline attribute to the user-specified function: a
Info: Inlining function 'c' into its parent function 'a' for pipelining.
Info: Inlining function 'd' into its parent function 'a' for pipelining.
Info: Unrolling the entire loop nest on line 22 of function_pipeline.c.
      This loop nest is inside function 'a', which is specified to be pipelined.
Info: Pipelining function 'a' on line 15 of function_pipeline.c.

It shows that LegUp first adds the no_inline attribute to function a to prevent it from being inlined. Then it inlines its descendant functions and unrolls their loops.

Care must be taken, though: if the function designated to be pipelined has many descendant functions, which in turn contain many loops, the hardware area can increase significantly (as described above for Loop Pipelining). For instance, in the call graph shown above, if main is specified to be pipelined, functions a, b, c, and d will be automatically inlined. There will be two copies of c, as the function is called from two different places. As there is also a loop in c that will be completely unrolled (in each copy of c), this can increase the area significantly. Hence, for function pipelining, one should examine the program before pipelining a function that has many descendant functions or loops.

3.4. Structs

LegUp attempts to automatically split up all structs into their individual elements. This improves performance, as the elements can then be accessed in parallel. However, there are cases when LegUp cannot split up a struct. This includes cases where the struct holds other structs or arrays. The struct can, however, hold a pointer which points to an array. When LegUp cannot split up a struct, it will print out the following:

Warning: The struct, "struct1", on line 168 of struct.c can result in inefficient
memory accesses. We recommend splitting it up into individual elements.

If a struct cannot be automatically split up, LegUp still generates proper hardware when targeting Intel FPGAs. We still recommend splitting up the remaining structs manually in C for better performance.

For other FPGA vendors (Xilinx, Microsemi, Lattice, Achronix), when LegUp cannot automatically split up all structs, it will print out the following error:

Error: LegUp HLS does not support structs when targeting XILINX. The struct, "struct1",
on line 168 of struct.c must be split up into individual elements.

In this case, such structs must be split up manually in the C code.
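
For example, a struct that cannot be split automatically because it contains an array can be split manually along the following lines (a sketch using a hypothetical struct):

// Before: LegUp cannot split this struct automatically, as it holds an array.
struct point { int coords[3]; int id; };
struct point p;

// After: manually split into individual elements.
int p_coords[3];
int p_id;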

3.5. Inferring a Shift Register

A shift register is an efficient hardware unit composed of a chain of registers connected one to the next, allowing data to be continuously shifted from each register to its adjacent register. It is useful in applications where data continuously arrives in a streaming fashion and some amount of past data has to be kept around for processing. It differs from a memory in that all elements stored in a shift register can be accessed at the same time.

For example, in a FIR filter, time-shifted versions of the input data, commonly referred to as taps, are needed to compute the output. A FIR filter can be expressed with the following equation:

y[n] = b0*x[n] + b1*x[n-1] + ... + bN*x[n-N]

where y[n] is the output, x[n] is the input, N indicates the filter order, and b0 to bN are filter coefficients. As you can see in the equation, once an input is received, it is needed for N+1 computations of the output. This is the perfect application for a shift register.

Let’s see how one can infer a shift register from C using LegUp.

int FIRFilter(int input) {

  static int shift_register[TAPS] = {0};
  int j;

  #pragma unroll
  for (j = (TAPS - 1); j >= 1; j -= 1) {
      shift_register[j] = shift_register[j - 1];
  }
  shift_register[0] = input;

  ...

  return out;
}

We show the part of the FIRFilter function which pertains to the shift register (x[n] in the equation above). Each time the FIRFilter function is called, it receives one input and produces one output. First, the shift_register array is declared as static. This is needed since the data stored in the shift register (time shifted versions of the input data) needs to be kept around on the next invocation of the function. The loop shows each element of the array being stored to an array index of 1 greater than its current index, starting from the highest array index (TAPS - 1) all the way down to 1. This is effectively moving each element of the array up by one array index. Then the newest input is stored in the lowest array index (0). It is important to note the unroll pragma, which allows the loop to be unrolled. Unrolling the loop splits up the array into individual elements, where each element is stored in a register, hence creating the shift register. Without the unroll pragma, the shift_register array is stored in a memory (RAM), which only allows up to 2 memory accesses per cycle.
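
For completeness, the elided part of FIRFilter is typically a multiply-accumulate over the taps, along the lines of the following sketch (the coefficient array b and its values are assumptions):

static const int b[TAPS] = {0};  // filter coefficients (placeholder values)
int out = 0;

#pragma unroll
for (j = 0; j < TAPS; j++) {
    out += shift_register[j] * b[j];  // accumulate tap * coefficient
}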

Note that if the FIRFilter function is specified to be pipelined, or if the shift register loop is contained within another loop that is specified to be pipelined, the shift register loop will automatically be unrolled and the unroll pragma is not required.

3.6. Inferring a Line Buffer

A line buffer is used to buffer a line of pixels of an image or a video frame, in order to keep data around and reduce the overall required memory bandwidth. It is useful for image/video processing applications, where an image/video pixel is continuously streamed in and processed, and is often used in conjunction with the shift register described above. A good example of such an application is the Sobel filter, which is used as one of the key steps of edge detection – a widely used transformation that identifies the edges in an input image and produces an output image showing just those edges.

At a high level, Sobel filtering involves applying a pair of 3x3 convolutional kernels (or filters), typically called Gx and Gy, to a 3x3 pixel stencil window. The stencil window slides over the entire image from left to right, and top to bottom, as shown below. The two kernels detect the edges in the image in the horizontal and vertical directions. At each position in the input image, the filters are applied separately and the computed values are combined to produce a pixel value in the output image.

[Figure: 3x3 stencil window sliding over the input image (stencil)]

At every position of the stencil, we calculate the edge value of the middle pixel e, using the adjacent pixels labeled from a to i, each of which is multiplied by the value at its corresponding position of Gx and Gy, and then summed. An example C code for the Sobel filter is shown below.

#define HEIGHT 512
#define WIDTH 512

for (y = 0; y < HEIGHT; y++) {
  for (x = 0; x < WIDTH; x++) {
    if (notinbounds(x,y)) continue;
    xdir = 0; ydir = 0;
    for (xOffset = -1; xOffset <= 1; xOffset++) {
      for (yOffset = -1; yOffset <= 1; yOffset++) {
        pixel = input_image[y+yOffset][x+xOffset];
        xdir += pixel * Gx[1+xOffset][1+yOffset];
        ydir += pixel * Gy[1+xOffset][1+yOffset];
      }
    }
    edgeweight = bound(xdir) + bound(ydir);
    output_image[y][x] = 255 - edgeweight;
  }
}

The outer two loops ensure that we visit every pixel in the image, while ignoring the image borders. The stencil gradient calculation is performed in the two inner loops. The x- and y-direction gradients are bounded to between 0 and 255, and the final edge value is stored to the output image.
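
The helper functions notinbounds and bound are not shown. As an illustration, bound clamps a gradient value to the 0 to 255 pixel range and might look like the following (an assumption, not the exact code):

int bound(int value) {
    if (value > 255) return 255;
    if (value < 0)   return 0;
    return value;
}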

Consider a case where each pixel in a 512x512 image is received every clock cycle. One approach to implementing this in hardware is to store the entire image in memory, then perform filtering on the image by loading it from memory. While this approach is certainly possible, it suffers from several weaknesses. First, if the input image is 512×512 pixels, with each pixel received every clock cycle, it would take 262,144 cycles to store the entire image. This represents a significant wait time before seeing any output. Second, we would need to store the entire input image in memory. Assuming 8-bit pixel values, this would require 262KB of memory. If the image is stored in off-chip memory, it would take a considerable amount of time to access each pixel, and the performance would suffer significantly.

An alternative widely used approach is to use line buffers.

[Figure: two line buffers, each holding one 512-pixel row, feeding the 3x3 stencil window (sobel)]

The figure shows two buffers, each holding 512 pixels. Rather than storing the entire input image, we only need to store the previous two rows of the input image (as the 3x3 stencil window covers 3 rows), along with a few pixels from the current row being received (the bottom row of pixels in the figure). As new pixels are received, they are stored into the line buffers. Once the first two lines of the image (and the first three pixels of the third row) have been received, we can start computing the edges. From this point onwards, the stencil moves with every new pixel received. When the stencil moves to the next row, the previous two rows are always available in the line buffers.

With the line buffers, we can start computing the edges much earlier, as we do not have to wait for the entire image to be stored. This also drastically reduces the amount of memory required, to just two rows of the input image. By storing the line buffers in on-chip memory, their data can be accessed very quickly (with a 1-cycle latency). Techniques such as this allow efficient real-time video processing on FPGAs.

Below, we show how one can create the line buffers and the 3x3 stencil window (shown in the figure above) in C using LegUp.

void sf_window_3x3_and_line_buffer(unsigned char input_pixel,
                                   unsigned char window[3][3]) {

    // line buffers
    static unsigned prev_row_index = 0;
    static unsigned char prev_row1[WIDTH] = {0};
    static unsigned char prev_row2[WIDTH] = {0};

    // window buffer:
    //      window[0][0], window[0][1], window[0][2]
    //      window[1][0], window[1][1], window[1][2]
    //      window[2][0], window[2][1], window[2][2]

    // shift existing window to the left by one
    window[0][0] = window[0][1];
    window[0][1] = window[0][2];
    window[1][0] = window[1][1];
    window[1][1] = window[1][2];
    window[2][0] = window[2][1];
    window[2][1] = window[2][2];

    int prev_row_elem1 = prev_row1[prev_row_index];
    int prev_row_elem2 = prev_row2[prev_row_index];

    // grab next column (the rightmost column of the sliding window)
    window[0][2] = prev_row_elem2;
    window[1][2] = prev_row_elem1;
    window[2][2] = input_pixel;

    prev_row1[prev_row_index] = input_pixel;
    prev_row2[prev_row_index] = prev_row_elem1;

    prev_row_index = (prev_row_index == WIDTH - 1) ? 0 : prev_row_index + 1;
}

This function creates the two line buffers; the input pixel and the 3x3 stencil window are passed in as arguments. The line buffers, prev_row1 and prev_row2, are declared as static arrays (as in Inferring a Shift Register), since the data in the line buffers needs to be kept around across successive invocations of the function. Another static variable, prev_row_index, keeps track of the position in the line buffers where data needs to be shifted out and new data shifted in. We first shift each element of the 3x3 window to the left by one. The elements at the current index of the line buffers are then read out and stored into the rightmost column of the 3x3 window, along with the new input pixel. The new input pixel is also stored into the first line buffer, and the element read from the first line buffer is stored into the second line buffer. Finally, prev_row_index is incremented by one, unless it has reached the end of the row, in which case it is reset to zero (indicating that we are moving on to a new row).
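
For illustration, the function might be driven from a streaming loop along these lines (a hypothetical sketch; receive_pixel and the filtering step are assumptions):

unsigned char window[3][3] = {{0}};
int n;
for (n = 0; n < HEIGHT * WIDTH; n++) {
    unsigned char pixel = receive_pixel();  // hypothetical streaming input
    sf_window_3x3_and_line_buffer(pixel, window);
    // Once the first two rows (plus three pixels) have been received,
    // 'window' holds a valid 3x3 stencil and Gx/Gy can be applied.
}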

3.7. Inferring Streaming Hardware via Producer-Consumer Pattern with Pthreads

The producer-consumer pattern is a well-known concurrent programming paradigm. It typically comprises a finite-size buffer and two classes of threads, a producer and a consumer. The producer stores data into the buffer and the consumer takes data from the buffer to process. The producer must wait until the buffer has space before it can store new data, and the consumer must wait until the buffer is not empty before it can take data. The waiting is usually realized with the use of a semaphore.

The pseudocode for a producer-consumer pattern with two threads is shown below.

producer_thread {
  while (1) {
    // produce data
    item = produce();
    // wait for an empty space in the buffer
    sem_wait(numEmpty);
    // store item to buffer
    lock(mutex);
    write_to_buffer;
    unlock(mutex);
    // increment number of full spots in the buffer
    sem_post(numFull);
  }
}
consumer_thread {
  while (1) {
    // wait until buffer is not empty
    sem_wait(numFull);
    // get data from buffer
    lock(mutex);
    read_from_buffer;
    unlock(mutex);
    // increment number of empty spots in the buffer
    sem_post(numEmpty);
    // process data
    consume(item);
  }
}

In a producer-consumer pattern, the independent producer and consumer threads are continuously running, thus they contain infinite loops. Semaphores are used to keep track of the number of spots available in the buffer and the number of items stored in the buffer. A mutex is also used to enforce mutual exclusion on accesses to the buffer.

The producer-consumer pattern is an ideal software approach to describe streaming hardware. Streaming hardware is always running, just as the producer-consumer threads shown above. Different streaming hardware modules execute concurrently and independently, as with the producer-consumer threads. LegUp supports the use of Pthreads, hence the producer-consumer pattern expressed with Pthreads can be directly synthesized to streaming hardware. Our easy-to-use FIFO library provides the required buffer between a producer and a consumer, without the user having to specify the low-level intricacies of using semaphores and mutexes.

Example pseudocode with three kernels is shown below, where function A is a producer to function B (B is a consumer to A), and function B is a producer to C (C is a consumer to B).

// Although a pthread function takes a void* argument,
// the arguments are expanded below for readability.
void *A(FIFO *FIFO0) {
  ...
  loop_A: while (1) {
    // do some work
    ...
    // write to output FIFO
    fifo_write(FIFO0, out);
  }
}
void *B(FIFO *FIFO0, FIFO *FIFO1, FIFO *FIFO2) {
  ...
  loop_B: while (1) {
    // read from input FIFO
    int a = fifo_read(FIFO0);
    // do some work
    ...
    // write to output FIFOs
    fifo_write(FIFO1, out1);
    fifo_write(FIFO2, out2);
  }
}
void *C(FIFO *FIFO1, FIFO *FIFO2) {
  ...
  loop_C: while (1) {
    // read from input FIFOs
    int a = fifo_read(FIFO1);
    int b = fifo_read(FIFO2);
    // do some work
    ...
  }
}
...
void top() {
  FIFO *FIFO0 = fifo_malloc(...);
  FIFO *FIFO1 = fifo_malloc(...);
  FIFO *FIFO2 = fifo_malloc(...);
  // Multiple arguments for a pthread function must be passed in as a struct,
  // but the arguments are expanded below for readability.
  pthread_create(A, FIFO0);
  pthread_create(B, FIFO0, FIFO1, FIFO2);
  pthread_create(C, FIFO1, FIFO2);
}

Each kernel contains an infinite loop, which keeps the loop body continuously running. We pipeline this loop (e.g., by applying the loop_pipeline constraint to the labels loop_A, loop_B, and loop_C) to create a streaming circuit. The advantage of using loop pipelining, versus pipelining the entire function (with function pipelining), is that there can also be parts of the function that are not streaming (executed only once), such as initializations. The top function forks a separate thread for each of the kernel functions. The user does not have to specify the number of times the functions are executed; the threads automatically start executing when there is data in their input FIFOs. This matches the always-running behaviour of streaming hardware. The multi-threaded code above can be compiled, concurrently executed, and debugged using standard software tools (e.g., gcc, gdb). When compiled to hardware with LegUp, the following hardware architecture is created:

[Figure: streaming hardware architecture with kernels A, B, and C connected by FIFOs (multiple_kernels)]

Another advantage of using Pthreads in LegUp is that one can also easily replicate streaming hardware. In LegUp, each thread is mapped to a hardware instance, hence forking multiple threads of the same function creates replicated hardware instances that execute concurrently. For instance, if the application shown above is completely parallelizable (i.e., data-parallel), one can exploit spatial hardware parallelism by forking two threads for each function, to create the architecture shown below. This methodology therefore allows exploiting both spatial (replication) and pipeline hardware parallelism all from software.

[Figure: replicated streaming kernels, with two instances each of A, B, and C connected by FIFOs (multiple_replicated_kernels)]

For replication, some HLS tools may require the hardware designer to manually instantiate a synthesized core multiple times and also make the necessary connections in HDL. This is cumbersome for a hardware engineer and infeasible for a software engineer. Other HLS tools provide system generators, which allow users to connect hardware modules via a schematic-like block design entry methodology. This, also, is a foreign concept in the software domain. Our methodology uses purely software concepts to automatically create and connect multiple concurrently executing streaming modules.

To create the replicated architecture shown above, one simply changes the top function as follows:

#define NUM_THREADS 2

void top() {

  int i;
  FIFO *FIFO0[NUM_THREADS], *FIFO1[NUM_THREADS], *FIFO2[NUM_THREADS];
  for (i=0; i<NUM_THREADS; i++) {
    FIFO0[i] = fifo_malloc(...);
    FIFO1[i] = fifo_malloc(...);
    FIFO2[i] = fifo_malloc(...);
  }

  // Multiple arguments for a pthread function must be passed in as a struct,
  // but the arguments are expanded below for readability.
  for (i=0; i<NUM_THREADS; i++) {
    pthread_create(A, FIFO0[i]);
    pthread_create(B, FIFO0[i], FIFO1[i], FIFO2[i]);
    pthread_create(C, FIFO1[i], FIFO2[i]);
  }
}

This simple yet powerful technique allows creating multiple replicated streaming hardware modules directly from standard software. As this is a purely standard software solution, without requiring any tool-specific pragmas, the concurrent execution behaviour of the replicated kernels can be modeled in software.