In this post we’re going to show you a video filtering demonstration on an Amazon cloud FPGA using LegUp. The specific video filter we will showcase is Canny edge detection, which we described in a previous post. The same approach could be used to implement any other image filter that uses convolution (blur, sharpen, emboss, etc.).
We’re from Toronto and fans of the Blue Jays, so we’ll use a slow-motion video of Josh Donaldson hitting a home run to demonstrate our filter:
Run the LegUp Video Streaming AFI on AWS
For this demonstration, we have used LegUp to create an Amazon FPGA image (AFI) containing the Canny edge detection example. We launch a new F1 instance and look up the LegUp video streaming AFI (we assume it has already been synthesized):
aws ec2 describe-fpga-images --fpga-image-ids agfi-XXXXXXXXX
We clear any AFI that might already be loaded in FPGA slot 0:
sudo fpga-clear-local-image -S 0
Now we load the LegUp image into FPGA slot 0:
sudo fpga-load-local-image -S 0 -I agfi-XXXXXXXXX
Now we compile the Canny edge detection runtime source code that will run on the processor:
cd ~/src/project/data/aws-fpga-legup/hdk/cl/examples/cl_canny/software/runtime
make all
Inside this same directory we have provided the home run video shown above, donaldson.avi, which can be played with a video player such as mplayer.
This video has 300 frames in total and each frame has a width of 440 pixels and height of 502 pixels.
Now we run Canny edge detection on the provided video. We specify “cpu” as the first argument to run it entirely in software on the CPU (not on the FPGA):
./canny cpu donaldson.avi edges_cpu.avi
We verify this output video using mplayer. Here’s how the video looks after applying Canny edge detection:
Now we run Canny edge detection on the FPGA using our LegUp AFI by specifying “fpga” as the first argument. The runtime software copies the video frames over PCIe to the FPGA, starts the Canny edge detection algorithm, then copies the processed frames back to the CPU and saves them to the video file edges_fpga.avi. The FPGA hardware runs at 250 MHz (clock recipe A1) and we transfer each video frame using one DMA request.
./canny fpga donaldson.avi edges_fpga.avi
We use mplayer to verify that the video matches the expected edge detection video, edges_cpu.avi.
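Alongside the visual check, a byte-wise comparison of the two output files can be scripted. The sketch below is only an illustration: the helper name videos_identical is hypothetical, and it assumes the CPU and FPGA runs write byte-identical files, which may not hold if the encoder or container embeds anything run-specific.

```shell
# Hypothetical helper (not part of the canny runtime): succeeds only when
# the two files are byte-for-byte identical.
videos_identical() {
    cmp -s "$1" "$2"
}

# Example use after both runs have finished:
#   videos_identical edges_cpu.avi edges_fpga.avi && echo "outputs match"
```

A mismatch here does not necessarily mean the edges differ, so the mplayer comparison remains the reference check.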
We have successfully run the Canny edge detection on an Amazon FPGA!
We benchmarked the FPGA implementation of Canny edge detection against the CPU and found the following results:
As you can see, the FPGA was 3x faster than the CPU. Note that we didn’t spend any time optimizing this.
We had to limit each DMA request to a single frame to avoid PCIS timeouts. The shell terminates any outstanding AXI transaction after 8 μs. You can check whether this is happening by running:
sudo fpga-describe-local-image -S 0 -M
Make sure you see dma-pcis-timeout-count=0. Otherwise you are running into this issue and the transferred data will be corrupted. Note that once you see timeouts, you will need to completely reprogram the FPGA.
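This check can be automated. The following is a minimal sketch, assuming the metrics output contains the field exactly as written above (dma-pcis-timeout-count=N); the helper name pcis_timeouts_clear is hypothetical and it reads the metrics text from standard input so it can be tried without an FPGA.

```shell
# Hypothetical helper: reads metrics text on stdin and succeeds only if a
# dma-pcis-timeout-count field is present and its value is 0.
pcis_timeouts_clear() {
    count=$(grep -o 'dma-pcis-timeout-count=[0-9]*' | head -n 1 | cut -d= -f2)
    [ -n "$count" ] && [ "$count" -eq 0 ]
}

# Example use on the F1 instance:
#   sudo fpga-describe-local-image -S 0 -M | pcis_timeouts_clear \
#       || echo "PCIS timeouts detected, reload the AFI"
```

The helper treats a missing field as a failure, so an unexpected output format is reported rather than silently passing.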
The CPU-to-FPGA memory bandwidth is only 1.5 GB/s according to the eDMA driver docs. The video is 66 MB (300 frames of 220 kB each), which would take 44 ms to transfer at that rate, close to what we are seeing. But this bandwidth is significantly less than the memory bandwidth of ~240 GB/s available on an AWS p2.xlarge Tesla K80 GPU. We hope this improves over time, since this is just the first version of the eDMA driver.
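The transfer-time estimate can be reproduced with a few lines of shell. This is a sketch of the arithmetic only: the frame size, frame count and bandwidth come from the figures in this post, while one byte per pixel is our assumption (it is what makes 440 x 502 come out to roughly 220 kB), and transfer_ms is a hypothetical helper name.

```shell
# Hypothetical helper: estimated one-way transfer time in milliseconds.
# Arguments: width height frames bandwidth_in_bytes_per_second
# Assumes one byte per pixel.
transfer_ms() {
    awk -v w="$1" -v h="$2" -v n="$3" -v bw="$4" \
        'BEGIN { printf "%.1f\n", w * h * n / bw * 1000 }'
}

# 440 x 502 pixels, 300 frames, 1.5 GB/s
transfer_ms 440 502 300 1500000000   # prints 44.2
```

The figure covers a single direction; copying the processed frames back to the CPU would add to it.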