Faster Facial Analytics using an FPGA

Introduction

The quality and price of image sensors have improved dramatically over the past decade, and cameras are now seeing increased adoption in the automotive sector. One new application is a driver-facing camera that monitors the driver for signs of drowsiness. If the driver is about to fall asleep, the system can trigger an alarm. Implementing a system like this requires a camera and an embedded processor to analyze the video stream, looking for the driver’s face and performing facial landmark detection to determine the location of their eyelids.

We recently worked with Eyeris, a company that specializes in facial analytics software algorithms. However, they had a problem: the software algorithms ran too slowly on an embedded processor. They came to Efinix, a company that specializes in programmable hardware acceleration platforms, who contacted us to help convert this software-based facial analysis into hardware that can run on an FPGA.

The video below shows three versions of the facial analytics demo: first running on the embedded processor (~5 frames per second), then running on an FPGA (~13 FPS), and finally on a smaller video canvas (~15 FPS). You can see that responsiveness improves noticeably when the FPGA accelerates this application, going from ~5 to ~13 FPS (roughly 2.6x). Eyeris showed this demo to some of their customers during CES this year:

 

Facial Analytics Pipeline

The facial analytics pipeline we used is shown in the figure below. The algorithm starts with face detection to find a bounding box surrounding the largest face in the current camera frame. Within the bounding box, we then find the locations of the facial landmarks (the dots in the image). Based on the facial landmark locations, several prediction models are then used to estimate the head pose (yaw, pitch and roll), eye-openness, gender, age, and emotion. For more details, please contact Eyeris directly.

[Figure: Eyeris facial analytics pipeline]
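To make the first stage concrete, here is a minimal C++ sketch of picking the largest face among several candidate bounding boxes. This is only an illustration: the `FaceBox` struct and the `box_area` and `largest_face_index` functions are hypothetical names, and "largest" is assumed to mean largest pixel area. It is not Eyeris's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bounding box type: top-left corner plus size, in pixels.
struct FaceBox {
    int x;
    int y;
    int width;
    int height;
};

// Area of a box in pixels; long avoids overflow for large frames.
inline long box_area(const FaceBox& b) {
    return static_cast<long>(b.width) * static_cast<long>(b.height);
}

// Returns the index of the candidate with the largest area,
// or -1 if the detector produced no candidates for this frame.
// On ties, the earliest candidate wins.
int largest_face_index(const std::vector<FaceBox>& candidates) {
    int best = -1;
    long best_area = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const long a = box_area(candidates[i]);
        if (a > best_area) {
            best_area = a;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

The later stages (landmark detection and the prediction models) would then operate only on the pixels inside the selected box.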

Target Platform

Eyeris was originally using a Raspberry Pi 3 board with the following specs:

  • ARM Cortex-A53 CPU, 64-bit Quad-Core
  • 1.2 GHz
  • 1GB RAM – LPDDR2

We sped up the application by targeting Aldec’s Tysom-1 Zynq board with a Xilinx Zynq-7000 XC7Z030:

  • ARM Cortex-A9 CPU, 32-bit Dual-Core
  • 667 MHz
  • 512 MB RAM – DDR3
  • Kintex-7 FPGA: 125K Logic Cells, 157K Flip-Flops, 9.3 Mb Block RAM, 400 DSP Slices

We used high-level synthesis (HLS) to move certain compute-intensive functions from the ARM processor onto the FPGA. With HLS, the optimized algorithms are still written in software (C/C++) and can be verified in a standard software development setup by running them on a CPU. Once the functionality is verified, we use the HLS-generated Verilog RTL to program the FPGA. The non-accelerated portion of the program runs on the ARM processor of the SoC FPGA.
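As a rough illustration of this workflow, the sketch below shows the kind of C++ function that tends to suit high-level synthesis, together with a software check that runs on an ordinary CPU. Everything here is assumed for the example: `fir_kernel`, `verify_fir_kernel`, and the sizes `kWidth` and `kTaps` are hypothetical, and the small filter stands in for whichever functions were actually accelerated. Tool-specific directives are left out because the HLS tool is not named above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical sizes, chosen only for this example.
constexpr std::size_t kWidth = 8;                      // input samples
constexpr std::size_t kTaps = 3;                       // filter coefficients
constexpr std::size_t kOutputs = kWidth - kTaps + 1;   // valid outputs

// Candidate for hardware acceleration: fixed loop bounds, fixed-size
// arrays, integer arithmetic, no dynamic memory and no recursion.
void fir_kernel(const std::int16_t in[kWidth],
                const std::int16_t coeff[kTaps],
                std::int32_t out[kOutputs]) {
    for (std::size_t i = 0; i < kOutputs; ++i) {
        std::int32_t acc = 0;
        for (std::size_t t = 0; t < kTaps; ++t) {
            acc += static_cast<std::int32_t>(in[i + t]) * coeff[t];
        }
        out[i] = acc;
    }
}

// Software testbench: runs the same C++ on a CPU and compares the
// result against values worked out by hand.
bool verify_fir_kernel() {
    const std::int16_t in[kWidth] = {1, 2, 3, 4, 5, 6, 7, 8};
    const std::int16_t coeff[kTaps] = {1, 0, -1};
    std::int32_t out[kOutputs] = {};
    fir_kernel(in, coeff, out);
    // With these coefficients each output is in[i] - in[i + 2] = -2.
    for (std::size_t i = 0; i < kOutputs; ++i) {
        if (out[i] != -2) {
            return false;
        }
    }
    return true;
}
```

The point of this style is that the same source file serves two purposes: it compiles and runs as normal software for verification, and it is also the input from which the HLS tool generates RTL.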

More Detailed Video

I have a more detailed video with further explanation below:
