In our last blog post, we wrote about using LegUp to perform network processing on AWS cloud FPGAs (F1). In this post, we describe accelerating Memcached on AWS F1.
Memcached is a high-performance distributed in-memory key-value store, widely deployed by companies such as Flickr, Wikipedia and WordPress. Memcached is typically used to speed up dynamic web applications by caching chunks of data (strings, objects) in RAM, which alleviates the load on the back-end database.
The figure below shows a typical deployment where Memcached is used as a caching layer to provide fast access to data for front-end web servers. When the data is found on a Memcached server, trips to disk storage (i.e. the disk stores attached to the back-end databases) are avoided. Memcached is also used by Facebook, Twitter, Reddit, and YouTube.
Memcached Accelerator on AWS F1
We implemented Memcached on the AWS F1 instance. Memcached commands are sent over TCP, so the FPGA needs to contain a hardware network stack with TCP support. The block diagram of the Memcached accelerator on AWS F1 is as follows:
Since the network is not directly connected to the FPGA on AWS F1, we use the CPU to pass network packets to the FPGA over PCIe. The network stack on the FPGA then processes the packets and passes the Memcached requests to the Memcached accelerator. The accelerator processes the Memcached requests and generates corresponding responses, which are sent back to the network stack. The network stack forms network packets with the responses. The response packets are transferred over PCIe back to CPU, which sends the packets out to the network. The Memcached accelerator is fully pipelined so that it can have multiple Memcached requests in flight (multiple requests are processed in a streaming fashion).
Note that we have made some simplifications in this initial implementation of Memcached. We currently only support the set and get Memcached commands. Set and get are the core Memcached commands used to store/load key-value data to/from a Memcached server. The Memcached server currently only supports 1 client. Also, all data is currently stored in FPGA on-chip memory. We are working on supporting other Memcached commands, supporting multiple clients, and adding support for DDR.
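To make the two supported commands concrete, here is a minimal Python sketch of how a client could encode `set` and `get` and decode a `get` reply. It assumes the standard Memcached ASCII (text) protocol; this post does not state which protocol variant the accelerator parses, and the helper names `build_set`, `build_get`, and `parse_get_response` are hypothetical.

```python
def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Encode a text-protocol set: a header line followed by the data block."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode("ascii")
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    """Encode a text-protocol get for a single key."""
    return f"get {key}\r\n".encode("ascii")


def parse_get_response(data: bytes):
    """Return the value from a single-key get reply, or None on a cache miss."""
    if data == b"END\r\n":
        return None  # the server answers a miss with a bare END line
    header, rest = data.split(b"\r\n", 1)
    tag, _key, _flags, nbytes = header.split(b" ")
    if tag != b"VALUE":
        raise ValueError(f"unexpected reply: {header!r}")
    return rest[: int(nbytes)]


if __name__ == "__main__":
    print(build_set("user:42", b"hello"))
    print(build_get("user:42"))
    print(parse_get_response(b"VALUE user:42 0 5\r\nhello\r\nEND\r\n"))
```

A successful `set` is acknowledged with `STORED\r\n`, and a `get` hit returns a `VALUE` line, the data block, and a closing `END\r\n`.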
We used two EC2 instances to test our Memcached accelerator on AWS, one instance which runs a Memcached client (sends/receives Memcached requests/responses) on a CPU, and another instance (F1) which runs our Memcached server accelerator on FPGA to receive requests from the client and send responses back. For the Memcached client, we used a standard Memcached benchmark, memtier_benchmark, which allows one to select various configurations, such as the type of commands, data value sizes, and the number of clients. Once run, the memtier_benchmark reports the throughput (in terms of how many operations/second the Memcached server could handle) and latency (the amount of time consumed in sending a request to the Memcached server and receiving a response back).
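As a rough illustration of such a configuration, the Python sketch below assembles a memtier_benchmark command line for a single-client run with 100-byte values and a 1:1 set/get ratio, which are the settings used for the results in this post. The server address, the port (11211, the usual Memcached default), the choice of the `memcache_text` protocol, and the single-thread setting are assumptions, not details taken from our test setup.

```python
def memtier_command(server: str, pipeline: int, port: int = 11211,
                    protocol: str = "memcache_text") -> list:
    """Build an argv list for a single-client memtier_benchmark run."""
    return [
        "memtier_benchmark",
        f"--server={server}",
        f"--port={port}",
        f"--protocol={protocol}",   # assumed; memcache_binary is the other Memcached option
        "--threads=1",              # assumed: one thread ...
        "--clients=1",              # ... with one connection, i.e. a single client
        "--ratio=1:1",              # equal numbers of set and get commands
        "--data-size=100",          # 100-byte values
        f"--pipeline={pipeline}",   # number of requests in flight per connection
    ]


if __name__ == "__main__":
    # 203.0.113.10 is a placeholder documentation address, not a real server.
    print(" ".join(memtier_command("203.0.113.10", pipeline=16)))
```

The resulting list can be passed to `subprocess.run` on a machine where memtier_benchmark is installed.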
We compared our Memcached accelerator to ElastiCache, the managed Memcached offering from AWS. ElastiCache is a fully managed service with which one can deploy a Memcached or Redis server instantly. AWS provides an IP address to point your Memcached client to, and the rest is handled automatically. There are a number of different ElastiCache configurations to choose from, each of which comes with a different number of CPUs, amount of RAM, network bandwidth, and price ($/hour). For comparison, we chose the Cache.r4.4xlarge instance, which most closely matches the F1 instance in terms of price, RAM, and network bandwidth.
| Type | vCPU | RAM (GB) | Network Bandwidth | Cost ($/hour) |
|---|---|---|---|---|
| Cache.r4.4xlarge | 16 | 101.38 | Up to 10 Gbps | 1.82 |
| F1 | 8 | 122 | Up to 10 Gbps | 1.65 |
Using the same client instance running memtier_benchmark, we pointed the Memcached client first to an F1 instance running our Memcached accelerator, then to a Cache.r4.4xlarge ElastiCache instance, to compare throughput and latency. In memtier_benchmark, we used 100-byte data values and set the ratio of set to get commands to be equal (1:1). Throughput results are as follows:
Note that our Memcached accelerator currently supports 1 client only. The x-axis shows the amount of pipelining and the y-axis shows the throughput (number of operations per second). Pipelining is a technique used in Memcached and Redis to pack multiple requests into a network packet to reduce packet processing overheads and improve throughput. Results show that our Memcached accelerator consistently outperforms the ElastiCache instance. With a pipelining of 1, we see 69% higher throughput for LegUp, and with a pipelining of 16, we see 32% higher throughput.
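To illustrate the pipelining described above, the following Python sketch groups `get` requests into batches so that each batch can be handed to the network in a single write. It assumes the Memcached text protocol, and `pipeline_gets` is a hypothetical helper. Whether a batch actually travels as one packet depends on its size and on the TCP stack.

```python
def pipeline_gets(keys, depth):
    """Group get requests into batches of `depth`, one bytes buffer per batch."""
    if depth < 1:
        raise ValueError("pipelining depth must be at least 1")
    requests = [f"get {key}\r\n".encode("ascii") for key in keys]
    # Concatenate each run of `depth` requests so they can be sent together.
    return [b"".join(requests[i:i + depth]) for i in range(0, len(requests), depth)]


if __name__ == "__main__":
    for batch in pipeline_gets(["a", "b", "c", "d", "e"], depth=2):
        print(batch)
```

With a depth of 1, every request gets its own buffer, which corresponds to the leftmost point on the throughput graph.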
Latency is an important metric for key-value stores. In order to provide data quickly and reliably to the client, Memcached servers must provide low and deterministic latency. With CPUs, there can be spikes in latency due to context switches and interrupts. Our Memcached accelerator provides deterministic latency, as the entire server including the network stack is implemented in hardware and there is no operating system.
Results show that our Memcached accelerator again provides better results (lower latency) than ElastiCache. We see 35-40% lower latency for LegUp. Note that this is total round-trip latency, which includes network time (the time for a packet to travel over the network from the client to the server and back) as well as Memcached server compute time. The network time constitutes a large share of the total latency, so if we compare only the compute latency of the Memcached servers, we expect the latency improvement of our accelerator to be even greater.
Using the throughput and price data above, we also computed the throughput/dollar metric; that is, the number of operations per second per dollar of hourly instance cost. Results show that when using our Memcached accelerator, one can get 46-87% higher operations per second per dollar.
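The metric can be reproduced approximately from the numbers already given: the hourly prices in the table and the 69% and 32% throughput gains. The Python sketch below does this arithmetic; `ops_per_dollar_gain` is a hypothetical helper, and because the inputs are rounded percentages, the output may differ from the reported figures by about a percentage point.

```python
F1_PRICE = 1.65            # $/hour, from the table above
ELASTICACHE_PRICE = 1.82   # $/hour, Cache.r4.4xlarge


def ops_per_dollar_gain(throughput_gain: float,
                        accel_price: float = F1_PRICE,
                        baseline_price: float = ELASTICACHE_PRICE) -> float:
    """Fractional gain in ops/sec per dollar, given a fractional throughput gain."""
    # (accel_tput / accel_price) / (base_tput / base_price), minus 1
    return (1.0 + throughput_gain) * baseline_price / accel_price - 1.0


if __name__ == "__main__":
    for depth, gain in [(1, 0.69), (16, 0.32)]:
        print(f"pipelining {depth}: {ops_per_dollar_gain(gain):.1%} higher ops/sec per dollar")
```

This gives roughly 86% for a pipelining of 1 and roughly 46% for a pipelining of 16, in line with the range quoted above.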
A Memcached server typically serves multiple clients, which is how it reaches high aggregate throughput (1 million+ operations per second). We are enhancing our solution to support multiple clients. In the meantime, given the present state of development, we can make projections regarding how our Memcached accelerator will perform with multiple clients. With pipelined hardware, we expect the throughput to scale well as more clients communicate with the server. For a Memcached server running on a CPU, the throughput per client rapidly starts to diminish as more clients are added, as the CPU cannot keep up with processing network packets.
We gathered the throughput results for ElastiCache with 200 clients (other settings remain the same), then projected how our Memcached accelerator will perform.
Using pipelining of 10, ElastiCache achieves 1.15 million operations/sec. However, given 200 clients, this indicates 5,773 operations/sec per client. Given that 1 client achieved 33,177 operations/sec (shown in the first throughput graph above), the throughput per client drops drastically as more clients are added. Again, we expect this to scale up better on pipelined FPGA hardware and project that we can achieve more than 7 million operations/sec with 200 clients and pipelining of 10. This gives us a 6X advantage over the throughput of ElastiCache with 200 clients.
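The arithmetic behind these figures can be checked with a short Python sketch. It uses the rounded values quoted above (1.15 million and 7 million operations/sec), so the per-client result comes out at 5,750 rather than the 5,773 obtained from the unrounded measurement; the variable names are illustrative only.

```python
ELASTICACHE_TOTAL = 1.15e6      # ops/sec with 200 clients, pipelining of 10 (rounded)
ELASTICACHE_ONE_CLIENT = 33177  # ops/sec with 1 client
PROJECTED_ACCEL_TOTAL = 7e6     # projected lower bound for the accelerator
CLIENTS = 200

per_client = ELASTICACHE_TOTAL / CLIENTS
# Fraction of the single-client rate that each client still gets at 200 clients.
retained = per_client / ELASTICACHE_ONE_CLIENT
projected_speedup = PROJECTED_ACCEL_TOTAL / ELASTICACHE_TOTAL

if __name__ == "__main__":
    print(f"ElastiCache per-client throughput: {per_client:,.0f} ops/sec")
    print(f"Share of single-client rate retained: {retained:.1%}")
    print(f"Projected accelerator advantage: {projected_speedup:.1f}X")
```

Each ElastiCache client retains about 17% of the single-client rate, and the projected advantage works out to about 6.1X.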
Using the data/projection in the graph above, we also show the throughput/dollar metric for 200 clients. With pipelining of 10, the LegUp Memcached accelerator is projected to achieve more than 7X throughput/dollar of the ElastiCache instance.
If you are interested in our Memcached accelerator and would like a demo, please contact us at firstname.lastname@example.org.