Deep Dive into the World’s Fastest Cloud-Hosted Memcached

By Jongsok Choi

  Filed under: Tutorials

We attended RedisConf last week to learn more about Redis and to meet Redis and Memcached users. Many people were interested in our 11M+ Memcached ops/sec result, achieved with a single AWS EC2 F1 instance, but the one-page summary document was not comprehensive enough to answer all of their questions. Here we describe the hardware architecture in more detail and explain how we set up the experiment.

Hardware Architecture

We described the architecture in a previous blog post, but we briefly review it here. As shown in the figure below, the entire TCP/IP (and UDP) network stack, as well as the Memcached server logic, is implemented in FPGA hardware. We deeply pipeline the FPGA hardware (if this phrase isn’t clear, please see “What is FPGA Pipelining?” at the end of this post) so that many network packets and Memcached requests are processed in flight. This degree of pipelining is not possible on a CPU, which is why we achieve a large speedup over CPU-based Memcached servers. On F1, the FPGA is not directly connected to the network, so we use the CPU to transfer incoming network packets from the NIC to the FPGA and outgoing packets from the FPGA back to the NIC. Note that the Memcached accelerator is a prototype and currently supports only the get and set commands.

[Figure: memcached_f1 (hardware architecture of the F1 Memcached accelerator)]
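Because the accelerator implements standard Memcached get and set, any off-the-shelf client can talk to it. As a minimal sketch (assuming the prototype speaks the standard Memcached ASCII text protocol on the default port 11211; the host address and key below are placeholders), a set/get round trip over a raw socket looks like this:

```python
import socket

# Hypothetical address of the F1 Memcached server (placeholder values).
HOST, PORT = "10.0.0.42", 11211

with socket.create_connection((HOST, PORT)) as sock:
    # Standard text protocol: "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n"
    sock.sendall(b"set session:123 0 0 5\r\nhello\r\n")
    print(sock.recv(64))    # expected reply: b"STORED\r\n"

    # "get <key>\r\n" returns "VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n"
    sock.sendall(b"get session:123\r\n")
    print(sock.recv(256))   # expected reply: b"VALUE session:123 0 5\r\nhello\r\nEND\r\n"
```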

Experimental Setup

The figure below shows our experimental setup. We use an EC2 m4.16xlarge instance, which has 64 vCPUs, to run the Memcached client, memtier_benchmark (an open-source Memcached/Redis benchmarking tool from Redis Labs). From this client instance, we connect to our F1 Memcached server (in the same availability zone and placement group as the client) and to ElastiCache (also in the same availability zone).

[Figure: memcached_experimental_setup (experimental setup)]

Different ElastiCache instance types come with different RAM sizes, network bandwidths, and hourly prices. We chose the cache.r4.4xlarge instance since it is the closest to the f1.2xlarge in all three criteria.

Type               vCPU   RAM        Network Bandwidth   Cost
cache.r4.4xlarge   16     101.38 GB  Up to 10 Gbps       $1.82/hr
f1.2xlarge         8      122 GB     Up to 10 Gbps       $1.65/hr


memtier_benchmark lets you select the ratio of get to set commands, the value size, and the amount of pipelining, among many other options. For this experiment, we chose a 1:1 ratio between get and set commands and 100-byte values. With many small network packets, a CPU can become the performance bottleneck, because packet-processing overhead starts to dominate the actual Memcached processing. This is not the case for the FPGA: the pipelined architecture and the dedicated network-stack hardware allow our implementation to sustain high throughput by processing a continuous stream of network packets at a consistently high rate, regardless of packet size. Based on this paper and on conversations with Redis/Memcached users, we think a 100-byte value size is realistic for real-world use cases.

Pipelining in Memcached/Redis allows you to pack multiple requests into a single network packet. This reduces packet-processing overhead: with a pipelining depth of 10, each packet carries 10 requests, so only one packet has to be processed per 10 Memcached requests, whereas without pipelining a packet has to be processed for every single request. Pipelining is therefore an important feature for reducing packet-processing overhead. More importantly, AWS imposes a packets-per-second (PPS) limit on EC2 instances, since everything in the cloud is virtualized and the underlying hardware is shared among multiple users. With this PPS limit, if pipelining were not used, the maximum throughput of Memcached/Redis would be capped at the PPS value. We found the maximum PPS for the f1.2xlarge instance (a property of the instance, independent of Memcached) to be ~700K for each of inbound and outbound traffic when transmitting simultaneously, and our Memcached server reaches this maximum PPS when there are many connections. For this experiment, we use a pipelining depth of 16 (16 Memcached requests combined into each packet) for both ElastiCache and LegUp Memcached. To generate many connections (see results below), we ran memtier_benchmark with up to 40 threads (--threads=40) and up to 35 connections per thread (--clients=35), for a total of up to 1400 connections. Each result shown below is the average of 3 memtier_benchmark runs.
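For reference, the sketch below shows how such a run could be launched from Python. The server address is a placeholder, and the exact memtier_benchmark flag names and defaults may vary between versions, so treat this as an illustration of the parameters described above rather than our exact benchmarking script.

```python
import subprocess

SERVER = "10.0.0.42"  # placeholder address of the Memcached server under test

# Parameters described above: 1:1 set/get ratio, 100-byte values,
# pipelining depth of 16, and 40 threads x 35 clients = 1400 connections.
cmd = [
    "memtier_benchmark",
    f"--server={SERVER}",
    "--port=11211",
    "--protocol=memcache_text",
    "--ratio=1:1",        # set:get ratio
    "--data-size=100",    # 100-byte values
    "--pipeline=16",      # 16 requests per network packet
    "--threads=40",
    "--clients=35",       # connections per thread (40 * 35 = 1400 total)
]

subprocess.run(cmd, check=True)
```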


Results

The figure below shows the throughput (ops/sec) of ElastiCache and LegUp Memcached as we vary the number of connections from 5 to 1400. A Memcached server typically serves many clients, so understanding how performance scales with the number of connections is important. LegUp Memcached outperforms ElastiCache from 5 connections all the way to 1400. ElastiCache’s throughput is capped at around 1.4 million ops/sec with 80 connections and remains about the same as the number of connections increases to 1400, as the CPU can no longer keep up with the network traffic. LegUp Memcached’s throughput, in contrast, continues to increase, reaching 11.5 million ops/sec with 1400 connections, a 9X improvement over ElastiCache. At 1400 connections, the LegUp Memcached instance is hitting the 700K PPS limit.
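As a quick sanity check of our own (simple arithmetic on the numbers above, not an additional measurement), the peak throughput is consistent with the PPS limit once the pipelining depth is taken into account:

```python
# Numbers from the experiment above.
throughput_ops_per_sec = 11.5e6   # LegUp Memcached at 1400 connections
requests_per_packet = 16          # memtier pipelining depth used in the runs

# Each packet carries 16 requests, so the implied packet rate is:
packets_per_sec = throughput_ops_per_sec / requests_per_packet
print(f"{packets_per_sec:,.0f} packets/sec")  # ~719,000, in line with the ~700K PPS limit
```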

[Figure: throughput (ops/sec vs. number of connections)]


Latency is another key metric for Memcached, as databases need to respond as quickly as possible to front-end web servers in order to create a responsive experience for web users. In online shopping, for example, latency is a key factor in conversion rate: if a website takes too long to respond, the shopper will simply quit or go to another website. The figure below shows the latency of ElastiCache and LegUp Memcached as we vary the number of connections. Note that the reported latency is the total round-trip time (RTT) measured at the client side, which includes the round-trip network latency in addition to the Memcached latency. As the number of connections increases, ElastiCache’s latency spikes to 18.22 ms, as the CPU can no longer keep up with the network traffic. For LegUp Memcached, the RTT is sub-millisecond up to 400 connections and increases to 2.03 ms with 1400 connections, at which point LegUp’s latency is 9X better than ElastiCache’s. Again, we are reaching the maximum PPS at this point.

As mentioned, this latency is the total RTT, including network latency. The latency of the LegUp Memcached server instance alone is measured to be 0.2 to 0.3 ms with 1400 connections (this includes the network stack and Memcached server latency, but excludes the delay on the network link between the instances).
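A rough Little’s-law estimate (our own back-of-the-envelope calculation, assuming every connection keeps a full pipeline of 16 requests in flight) lines up with the measured round-trip latency at 1400 connections:

```python
# Little's law: requests in flight = throughput * latency,
# so latency ~ requests in flight / throughput.
connections = 1400
requests_in_flight_per_conn = 16   # pipelining depth (assumes each pipeline stays full)
throughput_ops_per_sec = 11.5e6    # LegUp Memcached at 1400 connections

latency_sec = (connections * requests_in_flight_per_conn) / throughput_ops_per_sec
print(f"{latency_sec * 1e3:.2f} ms")  # ~1.95 ms, close to the measured 2.03 ms RTT
```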

[Figure: latency (round-trip latency vs. number of connections)]


For many Memcached users, 11M ops/sec may be more than they need. However, the total cost of ownership (TCO) matters to everyone. To illustrate the cost efficiency of LegUp Memcached, the figure below shows throughput per dollar (ops/sec/$), based on the hourly prices in the table above. With 5 connections, LegUp Memcached is 77% more cost-efficient, and with 1400 connections it is 10X more cost-efficient.

In other words, if we allow multiple tenants on a single LegUp Memcached server, with the throughput and memory space split across users, the TCO can be greatly reduced.
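The 10X number at 1400 connections follows directly from the throughput ratio and the hourly prices in the comparison table above (a simple calculation of our own):

```python
# Hourly prices from the instance comparison table above.
price_f1 = 1.65           # $/hr, f1.2xlarge (LegUp Memcached)
price_elasticache = 1.82  # $/hr, cache.r4.4xlarge (ElastiCache)

# Throughput ratio at 1400 connections, from the results above.
throughput_ratio = 9.0    # LegUp vs. ElastiCache

# Cost efficiency is throughput / price, so the ratio of cost efficiencies is:
cost_efficiency_ratio = throughput_ratio * (price_elasticache / price_f1)
print(f"{cost_efficiency_ratio:.1f}x")  # ~9.9x, i.e. the ~10X figure quoted above
```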

[Figure: throughput_per_dollar (ops/sec/$ vs. number of connections)]


FPGA Area is Only 20% Utilized

Even at 11M+ ops/sec, the FPGA area on F1 is only about 20% utilized, leaving abundant room to add more computation. We could add an in-line compression module to increase the effective RAM size, or in-line encryption for security. Due to the pipelined nature of FPGA hardware, adding more computation in-line does not necessarily affect throughput (discussed below); it is possible to maintain 11M ops/sec. In the future, we would like to let users add custom functions compiled from C with LegUp. In Redis, there are modules that perform custom analytics on the stored data, such as RediSearch and Redis-ML. These custom modules fit very well with our vision of using LegUp to add custom compute to our network-attached cloud FPGA.

What is FPGA Pipelining?

For those not coming from a hardware background, the phrase “we deeply pipeline the FPGA hardware” may not be clear. Here we provide a high-level overview of pipelining to illustrate how an FPGA’s throughput can be much higher than a CPU’s. Note that the pipelining discussed here is hardware pipelining and is unrelated to Redis/Memcached pipelining (which refers to combining multiple requests into a single network packet).

Pipelining in hardware can be visualized as workers on an assembly line. Imagine a car factory with a single assembly line (figure a below) and a single worker, John, who performs three tasks: build the car frame, put on the doors, and put on the wheels to deliver a car. Assuming each task takes one hour, John can deliver a car every three hours by performing the three tasks one after another, over and over. This is analogous to sequential execution on a single CPU core. To increase the factory’s output, we can hire two more workers, Bob and Joe, and put in two more assembly lines (figure b below). With three workers working in parallel on three assembly lines, we can output three cars every three hours. This is analogous to multi-threading in software, where multiple cores execute tasks in parallel. The problem with this approach is that we quickly run out of room in the factory for more assembly lines, or out of CPU cores to run parallel tasks on. The degree of parallelism is limited by the number of CPU cores (among other things).

[Figure: non_pipelined2 (a: single assembly line; b: three parallel assembly lines)]


A better approach is to have one assembly line but have John, Bob, and Joe each perform one task (figure c below). Once John finishes the first frame and passes it to Bob, he immediately starts working on the next frame, without waiting for the first car to be completely built. Once the pipeline is full (after the initial three hours, every worker is busy), the factory outputs one car per hour, the same rate as the three-assembly-line scenario above, but with only one assembly line. Pipelining in FPGA hardware operates in an analogous fashion. As data streams in, a pipeline stage (worker) works on a new piece of data while the next pipeline stage (the next worker) works on the data item just handed off by the current stage. Hence the whole pipeline processes multiple data items concurrently and can output newly processed data every clock cycle.


[Figure: pipelined (c: pipelined assembly line)]

Pipelining also makes it easy to add new tasks. Say we now want to add the task of installing windshields. In the pipelined case, all we have to do is hire one more worker to follow Joe, without creating an entirely new assembly line. This is straightforward on an FPGA and takes far fewer resources than replicating the hardware (i.e., building a new assembly line). Assuming the new task also takes one hour, the throughput (production rate) remains the same: once the pipeline is full, it still produces a car every hour. We can add many tasks to the pipeline, hence the phrase “deeply pipeline the FPGA hardware”. The number of pipeline stages in FPGA hardware can typically be made much higher than the number of cores in a modern CPU, which is why an FPGA’s throughput can be higher than a CPU’s for streaming applications.

Note that an FPGA can also combine multi-core-style execution with pipelined execution by instantiating multiple copies of a pipeline. In the example above, if we replicated the pipelined assembly line three times, we could produce three cars every hour, tripling the production rate.
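To make the analogy concrete, here is a toy calculation (our own illustration, not part of the original example) of how long it takes to produce a given number of cars with one worker, with three parallel assembly lines, and with a pipelined line whose stages each take one hour:

```python
import math

HOURS_PER_TASK = 1  # each task (frame, doors, wheels) takes one hour

def sequential_hours(n_cars, n_tasks=3):
    """One worker performs every task for each car before starting the next (single core)."""
    return n_cars * n_tasks * HOURS_PER_TASK

def parallel_hours(n_cars, n_tasks=3, n_lines=3):
    """n_lines independent assembly lines, each staffed by one worker (multi-core)."""
    return math.ceil(n_cars / n_lines) * n_tasks * HOURS_PER_TASK

def pipelined_hours(n_cars, n_stages=3, n_pipelines=1):
    """One worker per stage; a new car enters each pipeline every hour.
    The first car takes n_stages hours (filling the pipeline), then each
    pipeline finishes one car per hour."""
    cars_per_pipeline = math.ceil(n_cars / n_pipelines)
    return (n_stages + cars_per_pipeline - 1) * HOURS_PER_TASK

for n in (1, 10, 100):
    print(f"{n:3d} cars: sequential={sequential_hours(n):3d} h  "
          f"3 parallel lines={parallel_hours(n):3d} h  "
          f"pipelined={pipelined_hours(n):3d} h  "
          f"3 pipelines={pipelined_hours(n, n_pipelines=3):3d} h")
```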


If you have any questions or would like a demo of LegUp Memcached, please contact us at info@legupcomputing.com.

