We were invited to publish our AWS F1 Memcached acceleration work at the 2018 International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART). Our paper describes in detail the Memcached accelerator on AWS F1 as well as many of the interesting things (good and bad) that we learned about the F1 infrastructure along the way.
J. Choi, R. Lian, Z. Li, A. Canis, J. Anderson, “Accelerating Memcached on AWS Cloud FPGAs”, HEART 2018 (PDF)
We are excited to release a live demo of our Memcached server accelerator on AWS F1.
In the demo, we will automatically spin up two AWS EC2 instances for you (for free) so that you can easily try out our FPGA Memcached server accelerator. An F1 instance is programmed as the Memcached server and an M4 CPU instance is used as the client to run the memtier_benchmark. Note that starting the two instances can take a few minutes, so please be patient if the demo page takes some time to load.
As shown below, the demo page shows two terminal windows, one for the client and another for the server. In the client window, you can choose the number of connections and the number of requests to send to the Memcached server, then start the benchmark. While a test is running, the server window shows the packets-per-second (PPS) received and sent by the F1 instance. As described in our previous blog post, 700K is the maximum PPS on F1. When the test finishes, the client window shows the measured throughput and latency to the Memcached server. For the demo, we use 100-byte data values, a 1:1 set/get ratio, and a Memcached pipelining depth of 16.
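For readers who want to generate a similar load outside the demo page, the settings above map onto memtier_benchmark options roughly as in the sketch below. The server address, port, single client thread, and the choice of the text protocol are assumptions for illustration, not the demo's actual configuration. Treating the "number of connections" as --clients on one thread is also an assumption.

```python
import shlex


def memtier_command(server_host, port=11211, connections=50, requests=10000):
    """Build a memtier_benchmark command line using the demo's stated settings.

    server_host, port, connections and requests are hypothetical inputs;
    only the data size, set/get ratio and pipelining depth come from the post.
    """
    return [
        "memtier_benchmark",
        f"--server={server_host}",
        f"--port={port}",
        "--protocol=memcache_text",   # assumption: text protocol
        "--threads=1",                # assumption: one client thread
        f"--clients={connections}",   # connections opened by that thread
        f"--requests={requests}",     # requests issued per connection
        "--data-size=100",            # 100-byte values
        "--ratio=1:1",                # 1:1 set/get ratio
        "--pipeline=16",              # 16 requests in flight per connection
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(memtier_command("10.0.0.10")))
```

The function only assembles the argument list. Running it requires memtier_benchmark to be installed on the client and a reachable Memcached server.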
We attended RedisConf last week to learn more about Redis and to meet people who use Redis and Memcached. Many people were interested in our 11M+ Memcached ops/sec result achieved with a single AWS EC2 F1 instance, but the 1-page summary document was not comprehensive enough to answer all the questions. Here we explain the hardware architecture and our experimental setup in more detail.
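A rough way to see how a figure of this size can arise is to multiply a packet rate by the number of requests carried per packet. The sketch below uses the 700K PPS limit and the pipelining depth of 16 mentioned for the demo, and assumes every packet carries a full batch of 16 requests. That is a simplifying assumption for illustration, not a description of the measured experiment.

```python
# Back-of-envelope ceiling on Memcached operations per second.
# Assumption: each network packet carries a full batch of pipelined requests.

MAX_PPS = 700_000      # packet rate limit on F1 cited in the demo description
PIPELINE_DEPTH = 16    # pipelining depth used in the demo


def ops_ceiling(packets_per_second, requests_per_packet):
    """Upper bound on ops/sec if every packet is a full batch."""
    return packets_per_second * requests_per_packet


if __name__ == "__main__":
    print(ops_ceiling(MAX_PPS, PIPELINE_DEPTH))  # 11200000
```

Under these assumptions the ceiling is about 11.2M ops/sec. Real traffic with partially filled packets would land below it.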
We had described the architecture in a previous blog post, but we briefly review it here. As shown in the figure below, the entire TCP/IP (and UDP) network stack, as well as the Memcached server logic, is implemented in FPGA hardware. We deeply pipeline the FPGA hardware (if this phrase isn’t clear, please see “What is FPGA Pipelining?” at the end of this post) to process many network packets and Memcached requests in flight. This cannot be done on a CPU, hence we are able to achieve a big speedup vs. CPU Memcached servers. On the F1, the FPGA is not directly connected to the network, so we use the CPU to transfer incoming network packets directly from the NIC to the FPGA and also outgoing network packets from the FPGA back to the NIC. Note that the Memcached accelerator is a prototype and currently only supports get and set commands.
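To make the benefit of deep pipelining concrete, here is a toy cycle-count model. It assumes every pipeline stage takes exactly one clock cycle and that the pipeline never stalls. The stage and request counts are hypothetical and do not describe the actual accelerator.

```python
def cycles_unpipelined(num_requests, num_stages):
    """Each request passes through all stages before the next one starts."""
    return num_requests * num_stages


def cycles_pipelined(num_requests, num_stages):
    """A new request enters every cycle once the pipeline is running.

    The first result appears after num_stages cycles (the fill time);
    every later request completes one cycle after the previous one.
    """
    if num_requests == 0:
        return 0
    return num_stages + num_requests - 1


if __name__ == "__main__":
    requests, stages = 1000, 10  # hypothetical workload and pipeline depth
    print(cycles_unpipelined(requests, stages))  # 10000
    print(cycles_pipelined(requests, stages))    # 1009
```

In this model the latency of a single request is unchanged, but the throughput approaches one request per cycle as the number of requests grows, which is why keeping many requests in flight pays off.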
As network bandwidths continue to increase from 10 Gbps to 25 Gbps and beyond, cloud users need to be able to process high-bandwidth network traffic intelligently, whether to perform real-time computation or to gain insight into network traffic for detecting security or service issues.
We live in an exciting time, with FPGA cloud instances now available from Amazon and Alibaba. Traditionally, FPGAs (field programmable gate arrays) have been used in network switches or for high-frequency financial trading because of their superior ability to perform high-bandwidth network processing. We believe many cloud applications that require high-speed network and data stream processing can achieve 10X better latency and throughput on a cloud FPGA compared to a standard commodity server.
LegUp Cloud FPGA Platform
We have developed a cloud FPGA platform that makes it easier for software developers to program high-speed data processing on a cloud FPGA. Behind the scenes, the LegUp platform hides all the low-level details of getting network traffic to and from the FPGA and of handling the network layer. We currently support AWS F1 FPGA instances.
The quality and price of image sensors have improved enormously over the past decade, and we are now seeing increased adoption of cameras in the automotive sector. One new application is a driver-facing camera that can monitor the driver for signs of drowsiness. If the driver is about to fall asleep, we can trigger an alarm. Implementing a system like this requires a camera and an embedded processor to analyze the video stream, looking for the driver’s face and performing facial landmark detection to determine the location of their eyelids.
We recently worked with Eyeris, a company that specializes in these facial analytics software algorithms. However, they had a problem: the software algorithms ran too slowly on an embedded processor. They came to Efinix, a company that specializes in programmable hardware acceleration platforms, who contacted us to help convert the facial analysis software into hardware that can run on an FPGA.
The video below shows three versions of the facial analytics demo: first running on the embedded processor (~5 frames per second), then running on an FPGA (~13 FPS), and finally running on a smaller video canvas (~15 FPS). You can see that the responsiveness improves tremendously when the FPGA is used to accelerate this application. Eyeris showed this demo to some of their customers during CES this year:
The CL_DRAM_DMA example demonstrates many of the Shell/CL interfaces and much of their functionality. This blog post will walk through the custom logic (CL) portion of the example. You may have found that this example has more than 6000 lines of SystemVerilog code but very few comments. To help you quickly understand the example at a high level, we created some block diagrams that give an overview of the CL’s hierarchy, interfaces, connectivity, and functionality. We will also dive into some of the major modules and go through their implementations.
The “Hello World” example exercises the OCL Shell-to-CL AXI-Lite interface, the Virtual LED outputs, and the Virtual DIP switch inputs. This blog post will walk through the custom logic RTL (Verilog), explain the AXI-Lite slave logic, and highlight the PCIe APIs that can be used for accessing the registers behind the AXI-Lite interface.
Amazon has recently announced the availability of their FPGA cloud, Amazon EC2 F1. We think that this is very exciting news, as it is the first time that FPGAs in the cloud are being made available to the general public on a massive scale. This is the first step towards making FPGAs easier to use: with the EC2 F1, a user no longer has to buy an FPGA and install it on site. FPGAs are typically much more expensive and cumbersome to buy than CPUs or GPUs, so making them available in the cloud, where they can be used from anywhere in the world, makes FPGAs much more accessible.
When the F1 instances first became available, we were excited to try them out, but we found them pretty difficult to use at first. Documentation was lacking (and incorrect in some cases), and although Amazon provides a few examples with their AWS EC2 FPGA Hardware and Software Development Kit, the instructions for running the examples are scattered across different places and are missing some steps. This is understandable, since the service was only released publicly in April 2017 and documentation may not have been the highest priority. We therefore decided to write a unified guide that provides step-by-step instructions on how to run the two main examples provided by Amazon, cl_hello_world and cl_dram_dma, on the Amazon EC2 F1. This guide combines the instructions from the AWS EC2 FPGA Hardware and Software Development Kit with notes we wrote while trying out the examples ourselves. At the time of writing, we could not find such a step-by-step guide, and we ran into issues here and there, so we think that this guide will let you try out the F1 instances without getting stuck on a setup issue. So let’s dive right into it!