4. Hardware Architecture

This section describes the hardware architecture produced by LegUp.

4.1. Circuit Topology

Each C function corresponds to a hardware module in Verilog. For instance, if we have a software program with the following call graph:

[Figure: call graph of the example program (callgraph_new.pdf)]

where main calls a and b, which in turn call c and d. Notice that function c is called by both a and b. One way to create this system in hardware is to instantiate one module within another module, in a nested hierarchy, following how the functions are called in software:

[Figure: nested circuit architecture (circuit_arch_nested.pdf)]

This architecture is employed by some other HLS tools, but it can create unnecessary replication of hardware. Notice how module c has to be created twice, since function c is called from two different parent functions. In LegUp, we instantiate all modules at the same level of hierarchy and automatically create the necessary interconnect to connect them together.

[Figure: flat circuit architecture (circuit_arch_flat.pdf)]

This prevents modules from being unnecessarily replicated, saving area. The hardware system may also use a functional unit (denoted as FU in the figure), such as a floating-point unit, which is also created at the same level. This architecture also allows such units, which typically consume a lot of area, to be shared between different modules.
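For reference, a C program with a call structure like the one in the call graph above might look like the following minimal sketch (one possible arrangement of the calls to c and d; the function bodies are placeholders):

int c(int x) { return x + 1; }          // c is called by both a and b
int d(int x) { return x * 2; }

int a(int x) { return c(x); }           // a calls c
int b(int x) { return c(x) + d(x); }    // b calls c and d

int main() {
    return a(1) + b(2);                 // main calls a and b
}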

Note

For a small function, or for a function that is only called once in software, LegUp may decide to inline the function to improve performance. Thus you may not find all the software functions in the generated hardware.

4.1.1. Threaded Hardware Modules

When Pthreads/OpenMP are used in software, LegUp automatically compiles them to concurrently executing modules. This is analogous to how multiple threads are compiled to execute on multiple processor cores in software. By default, each thread in software becomes an independent hardware module. For example, forking three threads of function a in software creates three concurrently executing instances of module a in hardware. In a processor-accelerator hybrid system, each thread is compiled to a concurrently executing hardware accelerator.
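For example, a sketch along the following lines would create three concurrently executing instances of module a in hardware (the thread body is a placeholder; the fork/join structure is what matters):

#include <pthread.h>

void *a(void *arg) {
    /* work performed by each thread instance */
    return NULL;
}

int main() {
    pthread_t threads[3];
    for (int i = 0; i < 3; i++)
        pthread_create(&threads[i], NULL, a, NULL);   // each forked thread becomes a module instance
    for (int i = 0; i < 3; i++)
        pthread_join(threads[i], NULL);               // wait for all instances to finish
    return 0;
}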

4.2. Memory Architecture

LegUp stores arrays (local or global) and global variables in memories. We describe below what types of memories are used, as well as where they are stored.

In LegUp, there are four levels of memory: 1) local memory, 2) shared-local memory, 3) aliased memory, and 4) processor memory. Local, shared-local, and aliased memories exist in hardware (in a hardware-only system or within a hardware accelerator in a hybrid system). In a hardware-only system, only the first three levels of memory exist. In a processor-accelerator hybrid system, all four levels exist. Any data that is only accessed by an accelerator is stored on the accelerator side. Any data that is only accessed by the processor, is shared between multiple accelerators, or is shared between an accelerator and the processor, is stored in processor memory. The processor memory consists of an off-chip memory and an on-chip cache. The processor and all hardware accelerators have access to the on-chip cache. For the processor-accelerator hybrid architecture, see the ARM Processor-Accelerator Hybrid Architecture section below.

4.2.1. Local Memory

LegUp uses points-to analysis to determine which memories are used by which functions. If a memory is determined to be used by a single function, and that function is to be compiled to hardware (in a hardware-only system or as part of a hardware accelerator in a hybrid system), the memory is implemented as a local memory. A local memory is created and connected directly inside the module that accesses it.
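For example, in the sketch below, the array buf is only accessed inside filter, so it would be implemented as a local memory inside the filter module:

int filter(int x) {
    static int buf[64];      // accessed only by filter, so it becomes a local memory
    buf[x & 63] = x;
    return buf[0];
}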

Local memories have a latency of 1 clock cycle.

4.2.2. Shared-Local Memory

If a memory is accessed by multiple functions, where all the functions that access it are to be compiled to hardware, the memory is designated as a shared-local memory. A shared-local memory is instantiated outside the modules (at the same level of hierarchy as the modules, as shown above), and memory ports are created for the modules to connect to the memory. Arbitration is automatically created to handle contention for a shared-local memory. LegUp also optimizes the arbitration logic: if the modules that access the memory execute sequentially, a simple OR gate is used for arbitration (consuming only a small amount of area); if the accessing modules execute concurrently (with Pthreads or OpenMP), a round-robin arbiter is created to handle contention.
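As a sketch, a global array accessed by two hardware functions would be designated a shared-local memory, with memory ports and arbitration generated automatically:

int table[128];                        // accessed by both producer and consumer: shared-local memory

void producer(int n) {
    for (int i = 0; i < n && i < 128; i++)
        table[i] = i;
}

int consumer(int n) {
    int sum = 0;
    for (int i = 0; i < n && i < 128; i++)
        sum += table[i];
    return sum;
}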

Shared-local memories have a latency of 1 clock cycle.

4.2.3. Aliased Memory

There can be cases where a pointer can point to multiple arrays, causing pointer aliasing. These pointers need to be resolved at runtime. We designate the memories that such a pointer can refer to as aliased memories, which are stored in a memory controller (described below). A memory controller contains all memories that can alias to each other, and allows memory accesses to be steered to the correct memory at runtime. There can be multiple memory controllers in a system, each containing a set of memories that alias to each other.
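For example, in the sketch below, the pointer p may point to either A or B depending on a runtime value, so both arrays would be treated as aliased memories and placed in the same memory controller:

int A[32];
int B[32];

int pick(int sel, int i) {
    int *p = sel ? A : B;    // p aliases both A and B; the access is steered to the correct RAM at runtime
    return p[i & 31];
}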

Aliased memories have a latency of 2 clock cycles.

4.2.3.1. Memory Controller

The purpose of the memory controller is to automatically resolve pointer ambiguity at runtime. The memory controller is only created if there are aliased memories. The architecture of the memory controller is shown below:

[Figure: memory controller architecture (memory_controller.pdf)]

For clarity, some of the signals are combined together in the figure. Even though the figure depicts a single-ported memory, all memories are dual-ported by default. The memory controller steers memory accesses to the correct RAM, by using a tag, which is assigned to each aliased memory by LegUp. At runtime, the tag is used to determine which memory block to enable, with all other memory blocks disabled. The same tag is used to select the correct output data between all memory blocks.

4.2.4. Processor Memory

The processor memory consists of the on-chip cache and off-chip memory in the processor-accelerator hybrid flow. Any memories which are accessed by functions running on the processor are stored in off-chip memory and can be brought into the on-chip cache during program execution. Memories that are shared between multiple accelerators are also stored in this shared memory space. In the case of aliased memories, if any memory that a pointer can alias to is accessed by the processor, then all of the memories that the pointer can alias to are stored in this shared memory space.

4.2.5. Memory Storage

By default, each local, shared-local, and aliased memory is stored in a separate dual-ported on-chip RAM, where each RAM allows two accesses per clock cycle. All local memories can be accessed in parallel. All shared-local memories can also be accessed concurrently, as long as there are no accesses to the same memory in the same clock cycle. If there are concurrent accesses to the same RAM, our arbitration logic handles the contention automatically and stalls the appropriate modules. All independent memory controllers can also be accessed in parallel, but aliased memories that belong to the same memory controller must be accessed sequentially.

4.2.5.1. Memory Optimizations

LegUp automatically stores each single-element (non-array) global variable in a register, rather than a RAM, to reduce memory usage and improve performance. A RAM has a minimum read latency of 1 clock cycle, whereas a register can be read in the same clock cycle (0-cycle latency). For small arrays, LegUp may decide to split them up and store the individual elements in separate registers. This allows all elements to be accessed at the same time. If an array is accessed in a loop, and the loop is unrolled, LegUp may also decide to split up the array.
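For instance, in a sketch like the following, the loop over the small coeff array is a candidate for unrolling, in which case LegUp may split coeff into individual registers so that all elements can be read in the same cycle:

int dot4(int x) {
    int coeff[4] = {1, 2, 3, 4};   // small array: may be split into separate registers
    int sum = 0;
    for (int i = 0; i < 4; i++)    // if this loop is unrolled, all elements are accessed at once
        sum += coeff[i] * x;
    return sum;
}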

For constant arrays, which are stored in read-only memories (ROM), the user can choose to replicate them for each function that accesses them. This can be beneficial when the constant array is accessed frequently by multiple threads (in Pthreads/OpenMP). By creating a dedicated memory for each thread, it localizes the memory to each thread, making the constant array a local memory of each threaded module. This can improve performance by reducing stalls due to contention, and also save the resource of arbitration logic. This feature can be enabled with the Replicate ROM to each accessing module constraint in LegUp.
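As an illustrative sketch, a constant array read by several Pthreads threads is the kind of memory this constraint targets (the replication itself is enabled through the LegUp constraint, not in the source code):

const int LUT[16] = { 0, 1, 4, 9, 16, 25, 36, 49,
                      64, 81, 100, 121, 144, 169, 196, 225 };   // constant array stored in a ROM

void *worker(void *out_ptr) {          // forked multiple times with pthread_create (not shown)
    int sum = 0;
    for (int i = 0; i < 16; i++)
        sum += LUT[i];                 // with replication enabled, each thread reads its own copy
    *(int *)out_ptr = sum;
    return 0;
}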

The figure below shows an example architecture with Pthread modules and all of the memories that we have described, for a hardware-only system. The main function forks two threads of function d, hence two instances of d are created. main, d0, and d1 all execute in parallel, and they share the memory controller, shared-local memories 0 and 1, a register module, as well as a hardware lock module. The hardware lock module is automatically created when a mutex is used in software. All of the arbitration logic shown uses round-robin arbiters, as all of the modules execute in parallel. A local memory that is used only by main is instantiated inside the main module, and d0 and d1 have replicated constant memories instantiated inside them.

[Figure: example architecture with Pthread modules and all memory types (architecture_with_all_memories.pdf)]

4.3. Interfaces

LegUp generates a number of different interfaces for functions and memories. For instance, if we have the following C prototypes:

int FunctionA(int a, int *mem, FIFO *fifo);
void FunctionB(FIFO *fifo);

where FunctionA calls FunctionB, with a FIFO being written to in FunctionA and read from in FunctionB, LegUp can generate the following module interfaces.

module FunctionA (
    // default interface
    input clk,
    input reset,
    input start,
    output reg finish,
    input memory_controller_waitrequest,

    // for return value
    output reg [31:0] return_val,

    // for argument
    input [31:0] arg_a,

    // for calling FunctionB
    output FunctionB_start,
    input FunctionB_finish,

    // for accessing mem pointer
    output reg [7:0] mem_address_a,
    output reg  mem_enable_a,
    output reg  mem_write_enable_a,
    output reg [31:0] mem_in_a,
    input [31:0] mem_out_a,
    output reg [7:0] mem_address_b,
    output reg  mem_enable_b,
    output reg  mem_write_enable_b,
    output reg [31:0] mem_in_b,
    input [31:0] mem_out_b,

    // for writing to fifo
    input fifo_ready_from_sink,
    output fifo_valid_to_sink,
    output [31:0] fifo_data_to_sink
);

module FunctionB (
    // default interface
    input clk,
    input reset,
    input start,
    output reg finish,
    input memory_controller_waitrequest,

    // for reading from fifo
    output fifo_ready_to_source,
    input fifo_valid_from_source,
    input [31:0] fifo_data_from_source
);

As seen above, each module contains a number of default interface signals: clk, reset, start, and finish. The start and reset signals are used by the first state of the state machine:

[Figure: first state of the state machine (first_state.jpg)]

The finish signal is kept low until the last state of the state machine, where finish is set to 1. The memory_controller_waitrequest signal is the stall signal for the module, which when asserted, stalls the entire module.

Since FunctionA has a return type of int, it has a 32-bit return_val output.

Each scalar argument becomes an input to the module. The int a argument creates the 32-bit arg_a input.

In the C program, FunctionA calls FunctionB, hence LegUp creates handshaking signals between the two modules. When the output signal FunctionB_start is asserted, FunctionB starts executing. FunctionA then waits until the input signal FunctionB_finish is asserted. This is in line with the sequential execution behavior of software. However, when Pthreads is used, the caller module continues to execute after forking its threaded modules.

In FunctionA, memory signals for the mem pointer argument are also shown. In this case, mem is designated as a shared-local memory and created outside the module, hence memory ports are created to access the memory. Two ports are created for the memory to allow two accesses per clock cycle. Note that if a pointer is passed into a function, but LegUp determines that the array the pointer refers to is only accessed by the function, the memory will be designated as a local memory and be instantiated within the module (removing the need for memory ports).

The following memory signals are shown:

Memory Signal Type   Description
address              Address of the memory
enable               Asserted (1) when reading or writing in this clock cycle
write_enable         1 for a write, 0 for a read
in                   Data being written to the memory
out                  Data being read from the memory

The width of the address bus is determined by the depth of the memory.

The fifo argument is given to both FunctionA and FunctionB, where FunctionA writes to the FIFO and FunctionB reads from the FIFO. LegUp currently does not allow the same function to both write to and read from the same FIFO. In this case, the sink (reader) is FunctionB and the source (writer) is FunctionA.

FIFO Signal Type   Description
ready              Indicates whether the sink is ready to receive data
valid              Indicates whether the data being sent is valid
data               Data being sent
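As a rough C sketch of this producer/consumer pattern, using the prototypes shown earlier (the fifo_write and fifo_read helpers are assumed names used for illustration, not necessarily LegUp's exact FIFO API):

int FunctionA(int a, int *mem, FIFO *fifo) {
    for (int i = 0; i < a; i++)
        fifo_write(fifo, mem[i]);      // FunctionA is the source: it writes to the FIFO
    FunctionB(fifo);                   // FunctionA calls FunctionB (start/finish handshake in hardware)
    return a;
}

void FunctionB(FIFO *fifo) {
    int value = fifo_read(fifo);       // FunctionB is the sink: it reads from the FIFO
    (void)value;
}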

In the hybrid flow, hardware accelerators can communicate with the processor and also access the on-chip cache. Currently, the hybrid flow is only supported for Intel FPGAs and is available in the fully licensed version of LegUp. We use Altera's Avalon Interface and Altera's System Integration Tool, Qsys.

Each hardware accelerator contains a default slave interface for clock and reset, a slave interface for communicating with the processor, and a master interface to access the on-chip cache.

Default slave interface for clock and reset:

Avalon Signal          Description
csi_clockreset_clk     Hardware accelerator clock
csi_clockreset_reset   Hardware accelerator reset

Avalon slave signals (prefixed with avs_s1) are used by the processor to communicate with the hardware accelerator.

Avalon Signal      Description
avs_s1_address     Address sent from the processor to the hardware accelerator
avs_s1_read        Set high by the processor to read the return value from the hardware accelerator
avs_s1_write       Set high by the processor to write an argument or to start the hardware accelerator
avs_s1_readdata    Data returned from the accelerator to the processor
avs_s1_writedata   Data written from the processor to the accelerator

Avalon master signals (prefixed with avm) are used by the accelerator to communicate with the on-chip cache.

Avalon Signal            Description
avm_ACCEL_address        Address sent to the cache
avm_ACCEL_read           Set high when the accelerator is reading from the cache
avm_ACCEL_write          Set high when the accelerator is writing to the cache
avm_ACCEL_readdata       Data returned from the cache
avm_ACCEL_writedata      Data to write to the cache
avm_ACCEL_waitrequest    Asserted until the read data is returned from the cache

The on-chip data cache is a write-through cache, hence when an accelerator or the processor writes to the cache, the cache controller also writes the data to off-chip memory.

If a memory read results in a cache miss, the cache controller will access off-chip memory to fetch a cache line and write it to the cache, then the appropriate data will also be returned to the accelerator.

4.4. ARM Processor-Accelerator Hybrid Architecture

LegUp can automatically compile a software program into a processor-accelerator SoC, comprising an ARM processor and hardware accelerators. Currently, this is available for the Intel (Altera) DE1-SoC Board and the Arria V SoC Development Board. Both boards contain an ARM HPS (Hard Processor System) SoC FPGA. The ARM HPS consists of (among other things) a dual-core ARM Cortex-A9 MPCore, on-chip instruction and data caches, and an SDRAM controller that connects to on-board DDR3 memory. A number of different interfaces are provided to communicate between the HPS and the FPGA, including the HPS-to-FPGA interface (denoted as H2F in the figure below) and the FPGA-to-HPS interface (denoted as F2H in the figure). The H2F is an AXI (Advanced eXtensible Interface) interface that allows the ARM processor to communicate with the hardware accelerators in a hybrid system. To access the processor memory space, hardware accelerators use the F2H interface, which is connected to the Accelerator Coherency Port (ACP); the ACP connects to the on-chip caches in the HPS to provide cache-coherent data. A hardware accelerator may also have any of the hardware memories described previously.

[Figure: ARM processor-accelerator hybrid architecture (hybridARM_architecture.pdf)]