LegUp-powered Intelligent Network Processing on Cloud FPGAs

By Ruolong Lian



As network bandwidths continue to increase from 10 Gbps to 25 Gbps and beyond, cloud users need to process high-bandwidth network traffic intelligently, either to perform real-time computation or to gain insight into network traffic for detecting security or service issues.

We live in an exciting time, with FPGA cloud instances now available from Amazon and Alibaba. Traditionally, FPGAs (field programmable gate arrays) have been used in network switches or for high-frequency financial trading because of their superior ability to perform high-bandwidth network processing. We believe many cloud applications that require high-speed network and data stream processing can achieve 10X better latency and throughput on a cloud FPGA compared to a standard commodity server.

LegUp Cloud FPGA Platform

We have developed a cloud FPGA platform that makes it easier for software developers to program high-speed data processing on a cloud FPGA. Behind the scenes, the LegUp platform hides all the low-level details of getting network traffic to and from the FPGA and of handling the network layer. We currently support AWS F1 FPGA instances.

The LegUp cloud platform bypasses the Linux kernel and passes network packets directly from the network interface card to the FPGA over PCIe. On the FPGA side, we implemented a high-bandwidth network stack, including support for IP and UDP (TCP support coming soon). We have designed the network layer to be directly compatible with hardware accelerators generated from C/C++ using LegUp high-level synthesis. Using our cloud platform, you can focus solely on developing the core data stream processing algorithm in C/C++ and get your network accelerator up and running quickly on AWS F1 instances.



Case Study: Intrusion Prevention System

We will now show an example of using the LegUp Cloud Platform to implement a simple intrusion prevention system (IPS) on a cloud FPGA. Intrusion prevention and detection systems inspect network packets for signatures that indicate malicious behaviour, such as a distributed denial of service (DDoS) attack. The most popular open-source IPS is Snort, which runs on a commodity server and checks network traffic against a set of configurable rules that identify malicious traffic. Traffic that matches the rules can be dropped or logged to a file or database.

The figure below shows our desired setup on the AWS cloud. The server on the left sends network traffic to the FPGA cloud server, which filters the packets according to a set of filter rules. If a packet matches the rules, the FPGA redirects the packet to a logging server; otherwise, the packet is forwarded to the server on the right.



For our experimental tests, we generate UDP network traffic and filter this against string matching rules on the FPGA. We use a single machine with two ports to simplify latency measurements, as shown below:



UDP Application Interface

Using the LegUp platform, we implement the packet processing accelerator in C/C++. In this example, the packet processing accelerator receives and sends packets with UDP. We must ensure that the C function prototype matches the standard LegUp network application template, as shown below:

// The UdpInterface struct encapsulates all necessary interfaces to
// open/close UDP ports, and receive/send UDP packets.
typedef struct {
    // RX packet and metadata.
    FIFO *rx_packet_data, *rx_packet_keep, *rx_packet_last, *rx_packet_length;
    FIFO *rx_src_ip, *rx_src_port, *rx_dst_ip, *rx_dst_port;

    // TX packet and metadata.
    FIFO *tx_packet_data, *tx_packet_keep, *tx_packet_last, *tx_packet_length;
    FIFO *tx_src_ip, *tx_src_port, *tx_dst_ip, *tx_dst_port;

    // Port open request and reply.
    FIFO *port_open_request, *port_open_reply, *port_release_request;
} UdpInterface;

void UdpAppFunc(UdpInterface* udp_interface) {
    // User logic here...
}




The user application function will receive and send UDP network packets through LegUp’s “FIFO” data type arguments. The FIFOs can be read and written using LegUp’s built-in functions, ‘fifo_read()’ and ‘fifo_write()’.

On the receiving side (shown as “RX Interface” in the above figure), the ‘rx_packet_data/keep/last’ FIFOs are always aligned (you can think of them as an AXI4-Stream interface). Each read from the ‘rx_packet_data’ FIFO returns 8 bytes of data, while each read from ‘rx_packet_keep’ returns an 8-bit word indicating which of those 8 bytes are valid. Reading a ‘1’ from the ‘rx_packet_last’ FIFO indicates that the current data word is the last one of the current UDP packet.
Each UDP packet also carries a length and metadata. Users can read ‘rx_packet_length’ to get the length of the current packet, and read the ‘rx_src/dst_ip/port’ FIFOs to get the source and destination IPs and ports of the current UDP packet.
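
To make the keep semantics concrete, here is a minimal sketch in plain C (not LegUp-specific code). It assumes that keep bit i marks byte i of the data word as valid and that byte i occupies data bits 8*i+7 down to 8*i, which is the usual AXI4-Stream convention; the exact bit ordering used by the LegUp platform is an assumption here, and both helper names are hypothetical.

```c
#include <stdint.h>

// Hypothetical helper: count how many bytes of the current word are valid.
// Assumption: keep bit i corresponds to byte i of the 8-byte data word.
unsigned valid_byte_count(uint8_t keep) {
    unsigned count = 0;
    for (int i = 0; i < 8; i++) {
        count += (keep >> i) & 1u;
    }
    return count;
}

// Hypothetical helper: zero out the bytes of `data` that `keep` marks invalid.
// Assumption: byte i occupies data bits [8*i+7 : 8*i].
uint64_t mask_valid_bytes(uint64_t data, uint8_t keep) {
    uint64_t mask = 0;
    for (int i = 0; i < 8; i++) {
        if ((keep >> i) & 1u) {
            mask |= (uint64_t)0xFF << (8 * i);
        }
    }
    return data & mask;
}
```

Under these assumptions, a final word with a keep value of 0x07 carries three valid bytes, so a 19-byte payload would arrive as two full words followed by one word with keep set to 0x07 and last set to 1.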

The transmitting side (“TX Interface”) has the exact same arguments, but with the opposite direction. The UDP application core will write to these TX FIFOs to send out UDP packets.

On the right-hand side of the above figure is the “Port Control Interface”. The UDP application core can write to the ‘port_open_request’ FIFO to listen on a port, read from the ‘port_open_reply’ FIFO to confirm that the listening port has been opened, and write to the ‘port_release_request’ FIFO to close a port.

The example code below implements a simple UDP echo server that listens on port 9876 and sends back all UDP packets received to the sender’s (client’s) IP and port.

void UdpAppFunc(UdpInterface* ui) {
    // Send port open request.
    fifo_write(ui->port_open_request, 9876);
    // Wait for port open response, assume it will always succeed.
    bool port_open_succeeded = fifo_read(ui->port_open_reply);

LoopUdpApp:
    while (1) {  // Continuously running..
        // Read one 8-byte data at a time.
        uint64 data = fifo_read(ui->rx_packet_data);
        uint8  keep = fifo_read(ui->rx_packet_keep);
        uint1  last = fifo_read(ui->rx_packet_last);

        // Pass through the payload.                                             
        fifo_write(ui->tx_packet_data, data);
        fifo_write(ui->tx_packet_keep, keep);
        fifo_write(ui->tx_packet_last, last);

        if (last) {
            // Reached the end of the packet; now send the packet length and metadata.
            fifo_write(ui->tx_packet_length, fifo_read(ui->rx_packet_length));
            // Loopback: swap the source and destination IPs and ports.
            fifo_write(ui->tx_src_ip, fifo_read(ui->rx_dst_ip));
            fifo_write(ui->tx_dst_ip, fifo_read(ui->rx_src_ip));
            fifo_write(ui->tx_src_port, fifo_read(ui->rx_dst_port));
            fifo_write(ui->tx_dst_port, fifo_read(ui->rx_src_port));
        }
    }
}

The LegUp high-level synthesis compiler will synthesize the above code, and pipeline the ‘LoopUdpApp’ loop such that a new loop iteration can be launched every clock cycle. This means that the LegUp-generated hardware can receive, process, and send 8 bytes of data every clock cycle.
If we set the clock frequency to 250 MHz on the AWS F1 instance, the generated hardware echo server can reach a bandwidth of 2 GB/s, or 16 Gbps.
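
The bandwidth figure above is simple arithmetic: bytes per cycle times clock frequency times 8 bits per byte. The sketch below spells it out in plain C. The function names are hypothetical, and the calculation gives a peak datapath rate only; it ignores Ethernet, IP, and UDP header overhead and says nothing about what the network itself can deliver.

```c
#include <assert.h>
#include <stdint.h>

// Peak datapath bandwidth in bits per second for a pipeline that moves
// `bytes_per_cycle` bytes on every clock cycle at `clock_hz`.
uint64_t peak_bandwidth_bps(uint64_t clock_hz, uint64_t bytes_per_cycle) {
    return clock_hz * bytes_per_cycle * 8u;
}

// Minimum clock frequency (rounded up, in Hz) needed for the same datapath
// to keep up with a given line rate.
uint64_t min_clock_hz(uint64_t line_rate_bps, uint64_t bytes_per_cycle) {
    assert(bytes_per_cycle > 0);
    uint64_t bits_per_cycle = bytes_per_cycle * 8u;
    return (line_rate_bps + bits_per_cycle - 1) / bits_per_cycle;
}
```

With 8 bytes per cycle at 250 MHz, `peak_bandwidth_bps` returns 16,000,000,000, matching the 16 Gbps quoted above. By the same arithmetic, an 8-byte datapath would need 156.25 MHz to keep up with a 10 Gbps line rate.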


Experimental Results

Using the framework described above, we designed an intrusion prevention accelerator as an example application.
The code of the intrusion prevention accelerator follows a structure similar to that of the echo server above. The accelerator receives ASCII data packets from a network client over UDP, searches each packet for keywords, and, if a keyword is detected, redirects the packet back to the client with a different destination port. We created two configurations of the accelerator: one detects 4 keywords and the other detects 64 keywords.
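
One detail that streaming keyword detection has to handle is a keyword that straddles two 8-byte words. The sketch below is a plain C software model of one plausible approach, a sliding window over the valid bytes of each word; it is not the accelerator's actual code. `MatchState`, `match_reset`, `match_feed_word`, `pack_word`, and `MAX_KEYWORD_LEN` are all hypothetical names, and the byte layout (keep bit i marks byte i, stored in data bits 8*i+7 down to 8*i) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYWORD_LEN 16  // hypothetical upper bound on keyword length

// Per-packet matcher state: a sliding window holding the most recent bytes.
typedef struct {
    char   window[MAX_KEYWORD_LEN];
    size_t filled;   // number of valid bytes currently in the window
    bool   matched;  // sticky flag, set once the keyword has been seen
} MatchState;

// Clear the state at the start of each packet.
void match_reset(MatchState *s) {
    s->filled = 0;
    s->matched = false;
}

// Feed one 8-byte word. Assumes keep bit i marks byte i as valid and that
// byte i occupies data bits [8*i+7 : 8*i].
void match_feed_word(MatchState *s, uint64_t data, uint8_t keep,
                     const char *keyword) {
    size_t klen = strlen(keyword);
    assert(klen > 0 && klen <= MAX_KEYWORD_LEN);
    for (int i = 0; i < 8; i++) {
        if (((keep >> i) & 1u) == 0) {
            continue;  // skip bytes that are not part of the packet
        }
        if (s->filled == klen) {
            // Window is full: drop the oldest byte to make room.
            memmove(s->window, s->window + 1, klen - 1);
            s->filled--;
        }
        s->window[s->filled++] = (char)((data >> (8 * i)) & 0xFF);
        if (s->filled == klen && memcmp(s->window, keyword, klen) == 0) {
            s->matched = true;
        }
    }
}

// Pack up to 8 bytes into a word using the same assumed byte layout.
uint64_t pack_word(const char *bytes, size_t n) {
    uint64_t word = 0;
    for (size_t i = 0; i < n && i < 8; i++) {
        word |= (uint64_t)(unsigned char)bytes[i] << (8 * i);
    }
    return word;
}
```

In this model, the application would call `match_reset` at the start of a packet, feed every word as it arrives, and check `matched` once the last word has been seen to decide which destination port to write. A hardware implementation would more likely compare all keywords in parallel within the pipeline, which is consistent with throughput not depending on the number of keywords.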

CPU vs FPGA Throughput

Our experiment showed that the FPGA intrusion prevention server can consistently reach 6.67 Gbps of throughput, regardless of the number of patterns to be detected by the accelerator. Such consistency in throughput comes from the pipeline architecture of our FPGA accelerator. This throughput is currently limited by the network, as the FPGA accelerator can perform intrusion detection at line-rate and 6.67 Gbps is the maximum network bandwidth we are currently able to achieve on AWS F1. We are currently investigating this limitation. The theoretical maximum network bandwidth on AWS is advertised to be 10 Gbps for the f1.2xlarge instance (contains one FPGA), which we are using for this example, and 25 Gbps for the f1.16xlarge instance (contains eight FPGAs).

For comparison, we implemented a C software version of the intrusion prevention server with the same functionality and ran it on the CPU of the same AWS F1 instance. The CPU throughput starts at 1.9 Gbps when the server is detecting 4 patterns, but it degrades significantly as the number of patterns increases. With 64 patterns, the CPU throughput drops to 0.285 Gbps, so at 64 keywords the FPGA throughput is 23.4X that of the CPU. This shows that as more computation is added to the application, the FPGA’s speedup over the CPU can increase significantly. Note that this experiment is a proof of concept, hence the C code of the CPU version is very simple. For each packet, the software goes through each keyword and calls “strstr”, a standard library function that detects whether one string occurs within another. This C code could be improved to achieve higher throughput. Nevertheless, we think that this simple example is sufficient as a preliminary test case.
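
The CPU baseline described above can be sketched roughly as follows. This is an illustration of the per-keyword “strstr” loop rather than the code used in the experiment; the function name `payload_matches_any` is hypothetical, and the sketch assumes the UDP payload has already been copied into a NUL-terminated buffer, since “strstr” requires NUL-terminated strings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Return true if any keyword occurs in the NUL-terminated payload.
// Each keyword triggers its own strstr() scan over the payload, so the
// work per packet grows with the number of keywords.
bool payload_matches_any(const char *payload,
                         const char *const keywords[], size_t num_keywords) {
    for (size_t i = 0; i < num_keywords; i++) {
        if (strstr(payload, keywords[i]) != NULL) {
            return true;
        }
    }
    return false;
}
```

For a packet that matches no keyword, this loop scans the payload once per keyword, which is in line with the CPU throughput falling as the keyword count goes from 4 to 64. A multi-pattern algorithm such as Aho-Corasick would be one way to improve the software version.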


Edited on March 21st: Since this blog post was first published, the Experimental Results section has been updated to reflect the results of the improved network stack on the FPGA.

Early Access

If you are interested in getting early access to the new LegUp cloud platform, please email us at info@legupcomputing.com.
