We are pleased to present the world’s fastest cloud-hosted Memcached on AWS using EC2 F1 (FPGA) instances. With a single F1 instance, LegUp’s Memcached server prototype achieves over 11M ops/sec, a 9X improvement over ElastiCache, at <300 μs latency. It offers 10X better throughput/$ and up to 9X lower latency compared to ElastiCache. Please refer to our 1-page handout for more details.
In our last blog post, we wrote about using LegUp to perform network processing on AWS cloud FPGAs (F1). In this post, we describe accelerating Memcached on AWS F1.
Memcached is a high-performance distributed in-memory key-value store, widely deployed by companies such as Flickr, Wikipedia and WordPress. Memcached is typically used to speed up dynamic web applications by caching chunks of data (strings, objects) in RAM, which alleviates the load on the back-end database.
The figure below shows a typical deployment where Memcached is used as a caching layer to provide fast access to data for front-end web servers. When the data is found on a Memcached server, trips to disk storage are avoided (i.e. the disk stores attached to the back-end databases). Memcached is also used by Facebook, Twitter, Reddit, and YouTube.
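The caching-layer pattern described above can be sketched in a few lines of C. This is a minimal single-process illustration under stated assumptions, not Memcached's implementation or its client API: the direct-mapped table and the names cache_get, cache_set, backend_read, lookup and backend_read_count are all hypothetical, and the "database" is a stub that only counts how often it is queried.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SLOTS   64   /* hypothetical table size */
#define KEY_MAX 32   /* keys are assumed to be shorter than this */
#define VAL_MAX 64

typedef struct {
    int  used;
    char key[KEY_MAX];
    char val[VAL_MAX];
} slot_t;

static slot_t cache[SLOTS];     /* zero-initialized: every slot starts empty */
static int backend_reads = 0;   /* counts simulated trips to the database */

/* FNV-1a hash of the key, reduced to a slot index. */
static unsigned slot_for(const char *key) {
    uint32_t h = 2166136261u;
    for (; *key; key++) { h ^= (unsigned char)*key; h *= 16777619u; }
    return h % SLOTS;
}

/* Returns the cached value, or NULL on a miss. */
const char *cache_get(const char *key) {
    slot_t *s = &cache[slot_for(key)];
    return (s->used && strcmp(s->key, key) == 0) ? s->val : NULL;
}

/* Stores a value, overwriting whatever occupied the slot (direct-mapped). */
void cache_set(const char *key, const char *val) {
    assert(strlen(key) < KEY_MAX);
    slot_t *s = &cache[slot_for(key)];
    s->used = 1;
    snprintf(s->key, KEY_MAX, "%s", key);
    snprintf(s->val, VAL_MAX, "%s", val);
}

/* Stand-in for the slow query against the back-end database. */
static void backend_read(const char *key, char *out, size_t n) {
    backend_reads++;
    snprintf(out, n, "row-for-%s", key);
}

/* Cache-aside lookup: try the cache first; on a miss, query the
 * database and populate the cache so the next request is a hit. */
const char *lookup(const char *key) {
    const char *hit = cache_get(key);
    if (hit) return hit;
    char buf[VAL_MAX];
    backend_read(key, buf, sizeof buf);
    cache_set(key, buf);
    return cache_get(key);
}

int backend_read_count(void) { return backend_reads; }
```

Calling lookup twice with the same key reaches the stub database only once; the second call is served from memory, which is the saving the caching layer provides.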
As network bandwidths continue to increase from 10 Gbps to 25 Gbps and beyond, cloud users need to be able to process high-bandwidth network traffic intelligently, to perform real-time computation or to gain insight into network traffic for detecting security or service issues.
We live in an exciting time, with FPGA cloud instances now available from Amazon and Alibaba. Traditionally, FPGAs (field programmable gate arrays) have been used in network switches or for high-frequency financial trading because of their superior ability to perform high-bandwidth network processing. We believe many cloud applications that require high-speed network and data stream processing can achieve 10X better latency and throughput on a cloud FPGA compared to a standard commodity server.
LegUp Cloud FPGA Platform
We have developed a cloud FPGA platform that makes it easier for software developers to program high-speed data processing on a cloud FPGA. Behind the scenes, the LegUp platform hides all the low-level details of getting network traffic to and from the FPGA and handling the network layer. We currently support AWS F1 FPGA instances.
TORONTO, February 22, 2018 — LegUp Computing, Inc. announced today that it closed a seed funding round led by Intel Capital. LegUp offers a cloud platform that enables software developers to program, deploy, scale, and manage FPGA devices for accelerating high performance applications without requiring hardware expertise. The technology enables the next generation of low-latency and high-throughput computing on the vast amount of real-time data processed in the cloud. LegUp Computing, Inc., was spawned from years of research in the Dept. of Electrical and Computer Engineering at the University of Toronto to commercialize the award-winning open-source LegUp high-level synthesis tool.
LegUp Computing team from left to right: Omar Ragheb, Zhi Li, Dr. Andrew Canis, Ruolong Lian, Dr. Jongsok Choi, and University of Toronto Professor Jason Anderson (Photo: Jessica MacInnis)
The quality and price of image sensors have improved hugely over the past decade, and we are now seeing increased adoption of cameras in the automotive sector. One new application is a driver-facing camera that can monitor the driver for signs of drowsiness. If the driver is about to fall asleep, we can trigger an alarm. Implementing a system like this requires a camera and an embedded processor to analyze the video stream, looking for the driver’s face and performing facial landmark detection to determine the location of their eyelids.
We have recently worked with Eyeris, a company that specializes in facial analytics software algorithms. However, they had a problem: the software algorithms ran too slowly on an embedded processor. They came to Efinix, a company that specializes in programmable hardware acceleration platforms, who contacted us to help convert the facial analysis software into hardware that can run on an FPGA.
The video below shows three versions of the facial analytics demo: first running on the embedded processor (~5 frames per second), then running on an FPGA (~13 FPS), and finally running on the FPGA with a smaller video canvas (~15 FPS). You can see that the responsiveness improves tremendously by using the FPGA to accelerate this application. Eyeris showed this demo to some of their customers during CES this year:
In this post we’re going to show you a video filtering demonstration on an Amazon cloud FPGA using LegUp. The specific video filter we will showcase is Canny edge detection as we described in a previous post. This same approach could be used to implement any other image filter that uses convolution (blur, sharpen, emboss, etc.).
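To make the convolution idea concrete, here is a minimal software sketch of a 3x3 filter over a grayscale image. It is not the code from the demo: the function convolve3x3, the helper clamp_u8, and the BOX_BLUR and SHARPEN kernels are illustrative names, and border pixels are simply copied through to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp an intermediate sum to the 8-bit pixel range. */
static uint8_t clamp_u8(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Apply a 3x3 integer kernel to a w x h grayscale image.
 * The weighted sum is divided by `divisor` (9 for a box blur, 1 for
 * sharpen). The kernel is applied without flipping, which is identical
 * to true convolution for the symmetric kernels used here.
 * Border pixels are copied through unchanged.
 */
void convolve3x3(const uint8_t *in, uint8_t *out, int w, int h,
                 const int k[3][3], int divisor) {
    assert(w >= 1 && h >= 1 && divisor != 0);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                out[y * w + x] = in[y * w + x];
                continue;
            }
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += k[dy + 1][dx + 1] * in[(y + dy) * w + (x + dx)];
            out[y * w + x] = clamp_u8(sum / divisor);
        }
    }
}

/* Two example kernels; swapping the kernel changes the filter. */
const int BOX_BLUR[3][3] = {{1, 1, 1}, {1, 1, 1}, {1, 1, 1}};     /* divisor 9 */
const int SHARPEN[3][3]  = {{0, -1, 0}, {-1, 5, -1}, {0, -1, 0}}; /* divisor 1 */
```

Only the kernel and divisor differ between filters, which is why the same structure can serve blur, sharpen, emboss and similar effects.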
We’re from Toronto and fans of the Blue Jays, so we’ll use a slow motion video of Josh Donaldson hitting a home run to demonstrate our filter:
A paper describing the use of LegUp HLS to synthesize hardware cores for floating-point computations from C-language software will appear at the 2018 ACM/IEEE Design, Automation and Test in Europe (DATE) conference, to be held in Dresden, Germany, in March 2018. The floating-point cores are fully IEEE 754 compliant, yet through software changes alone they can be tailored to application needs, for example by reducing precision or eliminating exception checking, which saves area and raises performance in the resulting non-compliant variants. The IEEE-compliant cores synthesized by LegUp HLS compare favourably to custom RTL cores from FloPoCo and Altera/Intel, despite the LegUp-generated cores being synthesized entirely from software. An advance copy of the paper, jointly authored by the University of Toronto and Altera/Intel, is available: PDF.
In this post we will explain how to implement a Canny edge detector on an FPGA. We will describe the entire algorithm in C code using LegUp, allowing us to avoid the difficulty of implementing this design in Verilog or VHDL.
First, watch this quick video of the finished edge detector, running on an Altera DE1-SoC board, which costs $249. We also have this demo working on a Lattice HDR-60 board, which costs $425 and includes a built-in 720p camera.
The first thing you’ll notice is the output is black and white, with each pixel representing whether there is an edge at that particular region of the image. The brightness of the pixel represents how distinct the edge is at that location.
The example below shows Canny edge detection performed on the Lenna test image:
You may be asking, why do edge detection on an FPGA instead of a standard microprocessor? The main reason is that Canny edge detection requires significant computation. Typically this is not a problem when working with static images, but for a video application the processor must keep up with the incoming video frame rate, otherwise we would see a choppy output video. Instead, we want the video output to be updated in real-time, which means there is no delay between moving the camera and updating the screen. On an FPGA, we can exploit the parallelism inherent in the Canny edge detection algorithm and stream the incoming video pixels through a series of specialized hardware pipelines that perform the Canny edge algorithm in parallel stages.
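The streaming style described above can be sketched in C as a function that is called once per incoming pixel and keeps only two rows of history. This is an illustrative sketch rather than the code used in the demo, and it covers only the gradient stage of Canny (Gaussian smoothing, non-maximum suppression and hysteresis thresholding are omitted). The names sobel_state_t, sobel_reset and sobel_push and the tiny WIDTH are assumptions, and the gradient magnitude is approximated as |gx| + |gy|.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define WIDTH 8   /* hypothetical frame width; a real video frame is far wider */

typedef struct {
    uint8_t line[2][WIDTH];  /* the two previous rows (line buffers) */
    uint8_t win[3][3];       /* sliding 3x3 window; column 2 is the newest */
    int x, y;                /* coordinates of the incoming pixel */
} sobel_state_t;

void sobel_reset(sobel_state_t *s) {
    assert(s != NULL);
    memset(s, 0, sizeof *s);
}

/*
 * Consume one pixel in raster order and return the Sobel gradient
 * magnitude for the window centred one row and one column behind it.
 * Returns 0 until the 3x3 window holds valid data (x >= 2 and y >= 2).
 */
uint8_t sobel_push(sobel_state_t *s, uint8_t pixel) {
    int x = s->x;

    /* Shift the window left and load the new column from the line buffers. */
    for (int r = 0; r < 3; r++) {
        s->win[r][0] = s->win[r][1];
        s->win[r][1] = s->win[r][2];
    }
    s->win[0][2] = s->line[0][x];
    s->win[1][2] = s->line[1][x];
    s->win[2][2] = pixel;

    /* Age the line buffers for this column. */
    s->line[0][x] = s->line[1][x];
    s->line[1][x] = pixel;

    int valid = (s->x >= 2 && s->y >= 2);

    /* Horizontal and vertical Sobel responses. */
    int gx = (s->win[0][2] + 2 * s->win[1][2] + s->win[2][2])
           - (s->win[0][0] + 2 * s->win[1][0] + s->win[2][0]);
    int gy = (s->win[2][0] + 2 * s->win[2][1] + s->win[2][2])
           - (s->win[0][0] + 2 * s->win[0][1] + s->win[0][2]);
    int mag = abs(gx) + abs(gy);
    if (mag > 255) mag = 255;

    /* Advance to the next pixel position. */
    if (++s->x == WIDTH) { s->x = 0; s->y++; }

    return valid ? (uint8_t)mag : 0;
}
```

Because each call touches a fixed amount of state and does a fixed amount of work, a structure like this lends itself to a hardware pipeline that accepts one pixel per clock cycle, with later Canny stages chained behind it in the same way.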
The CL_DRAM_DMA example demonstrates many of the Shell/CL interfaces and functionality. This blog post will walk through the custom logic (CL) portion of the example. You may have found that this example has more than 6000 lines of SystemVerilog code but very few comments. To help you quickly understand the example at a high level, we created block diagrams that give an overview of the CL’s hierarchy, interfaces, connectivity, and functionality. We will also dive into some of the major modules and go through their implementations.
The “Hello World” example exercises the OCL Shell-to-CL AXI-Lite interface, the Virtual LED outputs and the Virtual DIP switch inputs. This blog post will walk through the custom logic RTL (Verilog), explain the AXI-Lite slave logic, and highlight the PCIe APIs that can be used for accessing the registers behind the AXI-Lite interface.