# Introduction

In this post we will explain how to implement a Canny edge detector on an FPGA. We will describe the entire algorithm in C code using LegUp, allowing us to avoid the difficulty of implementing this design in Verilog or VHDL.

First, watch this quick video of the finished edge detector, running on an Altera DE1-SoC board, which costs $249. We also have this demo working on a Lattice HDR-60 board, which costs $425 and includes a built-in 720p camera.

The first thing you’ll notice is that the output is black and white, with each pixel indicating whether there is an edge at that location in the image. The brightness of the pixel represents how distinct the edge is.

The example below shows Canny edge detection performed on the Lenna test image:

You may be asking, why do edge detection on an FPGA instead of a standard microprocessor? The main reason is that Canny edge detection requires significant computation. Typically this is not a problem when working with static images, but for a video application the processor must keep up with the incoming video frame rate, otherwise we would see a choppy output video. Instead, we want the video output to be updated in **real-time**, which means there is no delay between moving the camera and updating the screen. On an FPGA, we can exploit the parallelism inherent in the Canny edge detection algorithm and stream the incoming video pixels through a series of specialized hardware pipelines that perform the Canny edge algorithm in parallel stages.

# Canny Edge Detection

Before starting Canny edge detection, we perform a grayscale conversion to convert the color input image to black and white. There are four stages of image filters required to perform Canny edge detection: Gaussian filter, Sobel filter, non-maximum suppression and hysteresis.
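The post doesn’t show the grayscale conversion code, but a common approach is a fixed-point approximation of the standard luma weights (0.299 R + 0.587 G + 0.114 B), scaled by a power of two so that the math stays in integer arithmetic, which maps well to FPGA logic. A minimal sketch (the function name and exact coefficients are our assumptions, not necessarily what the demo uses):

```c
/* Hypothetical grayscale conversion using a fixed-point
 * approximation of the standard luma weights, scaled by 256.
 * 77 + 150 + 29 = 256, so the result stays within 0..255. */
unsigned char rgb_to_gray(unsigned char r, unsigned char g, unsigned char b) {
    return (unsigned char)((77u * r + 150u * g + 29u * b) >> 8);
}
```

Because the weights sum to exactly 256, the shift by 8 normalizes the result without a hardware divider.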

First, we blur the image using a Gaussian blur convolution operation. Next, we calculate the magnitude of the brightness gradient in the up/down and right/left directions with a Sobel filter. Non-maximum suppression thins the edges and hysteresis removes the weak edges caused by noise.

All of these filters work by performing a convolution between a filter kernel and the input image. The following image illustrates the convolution operation for a single output pixel, in this case for an emboss image filter:

We perform convolution by sweeping the kernel across the input image to produce the output image. For example, during a Gaussian blur we take a weighted average of the input pixel with the neighboring pixels, giving the center input pixel the highest weight.
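The per-pixel step of this sweep can be sketched in C. The example below computes one output pixel of a 3 by 3 convolution, using a simple box-blur kernel for illustration (the image size, function name, and kernel are our assumptions, not the post’s actual filter):

```c
#define IMG_H 8  /* small illustrative image size */
#define IMG_W 8

/* Compute one output pixel of a 3x3 convolution centered at (i, j),
 * assuming (i, j) is not on the image border. The kernel here is a
 * box blur: every neighbor gets equal weight, divided by the kernel
 * sum (9) to normalize. */
unsigned char conv3x3(unsigned char img[IMG_H][IMG_W], int i, int j) {
    static const int kernel[3][3] = {
        { 1, 1, 1 },
        { 1, 1, 1 },
        { 1, 1, 1 },
    };
    int sum = 0;
    for (int m = -1; m <= 1; m++)
        for (int n = -1; n <= 1; n++)
            sum += img[i + m][j + n] * kernel[m + 1][n + 1];
    return (unsigned char)(sum / 9);
}
```

A Gaussian kernel follows the same structure, just with larger weights toward the center.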

# Hardware Implementation

From a hardware perspective, the video input is received as a stream of pixels. We receive a new pixel every single hardware clock cycle, as shown below. As new input pixels arrive, we perform the convolution operation by multiplying the values of the kernel, shown by the 3×3 blue matrix, by the image pixels below. Note: the blue kernel is actually much smaller than pictured here, taking up only a 3 by 3 pixel region of the image.

You may notice that we only need the previous two rows of pixels to perform the convolution and produce the output pixel (assuming a 3 by 3 convolution kernel). This means that we don’t need to keep older pixels around:

In fact, we can store the previous two rows of pixels in a shift register in hardware, also called a line buffer. This has a very efficient implementation on the FPGA:

Now we can efficiently shift a new pixel into the line buffer every clock cycle and discard the older pixel.
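As a small software sketch of this idea (with hypothetical names, and for a 3 by 3 kernel needing only two line buffers, rather than the 5 by 5 kernel used later), each new pixel pushes the buffers forward and the oldest pixel simply falls away:

```c
#define LB_WIDTH 8  /* image width; 8 here just for illustration */

/* Two line buffers holding the previous two image rows. Each call
 * shifts one new pixel in; the pixel that falls out of lb_row2 is
 * discarded, since a 3x3 kernel never needs anything older than
 * two rows back. */
static unsigned char lb_row1[LB_WIDTH];
static unsigned char lb_row2[LB_WIDTH];
static unsigned lb_idx = 0;

void line_buffer_push(unsigned char input_pixel) {
    lb_row2[lb_idx] = lb_row1[lb_idx]; /* older row inherits */
    lb_row1[lb_idx] = input_pixel;     /* newest row */
    lb_idx = (lb_idx == LB_WIDTH - 1) ? 0 : lb_idx + 1;
}
```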

We perform the convolution operation by reading from nine fixed locations in this shift register (as illustrated by the blue arrows above) and multiplying these pixel values by the corresponding coefficients of the image filter kernel. On the FPGA, we can perform these multiplies in parallel, meaning that every clock cycle we can read in a new input pixel and produce a valid output pixel. Compare this to a CPU, where we would need many instructions to calculate the output pixel from a new input pixel.

We can also take the output pixel from one image filter and pass this as an input to the next image filter. For example, the Gaussian filter output can feed directly into the Sobel filter. On the FPGA, each filter can be run independently in parallel, exploiting data flow parallelism using a streaming paradigm. We represent the flow of data between these functions in LegUp using the FIFO datatype provided by LegUp’s streaming library. A FIFO is a queue with first-in, first-out behavior and has an efficient hardware implementation on an FPGA. The final hardware implementation will look like the following:

We describe this streaming data flow by passing a FIFO as an argument to each C function implementing a convolution filter, as shown below:

```c
void sobel_filter(FIFO *input_fifo, FIFO *output_fifo) {
    unsigned char input_pixel = fifo_read(input_fifo);
    ...
    fifo_write(output_fifo, sobel_out);
}
```
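The FIFO datatype and the fifo_read()/fifo_write() calls come from LegUp’s streaming library. When testing the C code on a host machine, a FIFO can be modeled as a simple ring buffer. The stand-in below is a sketch of the concept, not LegUp’s actual implementation (it omits full/empty checks for brevity):

```c
#define FIFO_DEPTH 16

/* Minimal software model of a streaming FIFO: a ring buffer with
 * first-in, first-out behavior. Reads consume from head, writes
 * append at tail. */
typedef struct {
    unsigned char data[FIFO_DEPTH];
    unsigned head, tail;
} FIFO;

unsigned char fifo_read(FIFO *f) {
    unsigned char v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    return v;
}

void fifo_write(FIFO *f, unsigned char v) {
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
}
```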

# Software Implementation

Let’s now walk through some details of the code implementing the Canny edge detector in C using LegUp.

First, we will look at the initial implementation of the Gaussian image filter function in C:

```c
void gaussian_filter(unsigned char input[HEIGHT][WIDTH],
                     unsigned short output[HEIGHT][WIDTH]) {
    int i, j, m, n;
    unsigned short sum;

    // Gaussian Convolution Kernel Filter
    const unsigned int GAUSSIAN[5][5] = {
        { 2,  4,  5,  4, 2 },
        { 4,  9, 12,  9, 4 },
        { 5, 12, 15, 12, 5 },
        { 4,  9, 12,  9, 4 },
        { 2,  4,  5,  4, 2 }
    };
    const unsigned int DIVISOR = 159;

    for (i = 0; i < HEIGHT; i++) {
        for (j = 0; j < WIDTH; j++) {
            int outofbounds = 0;
            outofbounds |= (i < 2) | (i > (HEIGHT - 3));
            outofbounds |= (j < 2) | (j > (WIDTH - 3));
            if (outofbounds) {
                output[i][j] = input[i][j];
            } else {
                sum = 0;
                for (m = -2; m <= 2; m++) {
                    for (n = -2; n <= 2; n++) {
                        sum += input[i + m][j + n] * GAUSSIAN[m + 2][n + 2];
                    }
                }
                sum /= DIVISOR;
                output[i][j] = sum;
            }
        }
    }
}
```

The Gaussian blur convolution kernel is a 5 by 5 array, and the divisor (159) is the sum of all the coefficients in the kernel. We loop over the entire image, and if we are within two pixels of the image border, where the convolution is undefined, we set the output equal to the input. Otherwise, the two inner loops perform the multiplications required by the convolution, and we divide the accumulated sum by the divisor to normalize the result.
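The divisor of 159 can be sanity-checked by summing the kernel coefficients; this quick host-side check (not part of the hardware design) confirms the weighted average is properly normalized:

```c
/* Sum the 5x5 Gaussian kernel coefficients to confirm that
 * dividing by 159 normalizes the weighted average. */
int kernel_sum(void) {
    static const unsigned int GAUSSIAN[5][5] = {
        { 2,  4,  5,  4, 2 },
        { 4,  9, 12,  9, 4 },
        { 5, 12, 15, 12, 5 },
        { 4,  9, 12,  9, 4 },
        { 2,  4,  5,  4, 2 }
    };
    int sum = 0;
    for (int i = 0; i < 5; i++)
        for (int j = 0; j < 5; j++)
            sum += (int)GAUSSIAN[i][j];
    return sum; /* 159 */
}
```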

We can take this code as-is and compile with LegUp, but the FPGA circuit will be very inefficient. Instead, we perform a number of coding style changes that allow LegUp to create an efficient hardware pipeline. The final code is shown below:

```c
void gaussian_filter(FIFO *input_fifo, FIFO *output_fifo) {
    static int i = 0;
    static int j = 0;
    static int count = 0;
    static unsigned char window[5][5] = {0};

    unsigned char input_pixel = fifo_read(input_fifo);
    unsigned char ret = 0;

    gf_window_5x5_and_line_buffer(input_pixel, window);

    // wait for line buffers to fill before output
    if (count > 2 * WIDTH + 1) {
        int outofbounds = 0;
        outofbounds |= (i < 2) | (i > (HEIGHT - 3));
        outofbounds |= (j < 2) | (j > (WIDTH - 3));

        unsigned int sum = 0;
        int m, n;
        for (m = -2; m <= 2; m++) {
            for (n = -2; n <= 2; n++) {
                sum += ((unsigned int)window[m + 2][n + 2])
                       * GAUSSIAN[m + 2][n + 2];
            }
        }
        sum /= DIVISOR;

        if (outofbounds) {
            ret = window[2][2];
        } else {
            ret = sum;
        }

        fifo_write(output_fifo, ret);

        // keep track of row/column of image
        if (j < WIDTH - 1) {
            j++;
        } else if (i == HEIGHT - 1 && j == WIDTH - 1) {
            // end of the image frame
            i = 0;
            j = 0;
        } else {
            i++;
            j = 0;
        }
    }
    count++;
}
```

In this new function, we have modified the function arguments to use the FIFO datatype from LegUp’s streaming library. The input/output variables now more accurately model the hardware, where one new video pixel arrives every clock cycle. We receive that new pixel by calling the fifo_read() function to read the next pixel from the input FIFO. We declare many of the variables as static so that they retain their values across function invocations. The function gf_window_5x5_and_line_buffer() will be described in detail shortly; for now, you can think of window as representing the input pixels we need to multiply by the convolution kernel coefficients. A new variable, count, is checked to make sure the line buffers are full and the window holds valid pixels before we produce any output. The output pixel is written with the fifo_write() function. Finally, we keep track of the x/y coordinates of where we are in the image so that we can do bounds checking against the image border.

The code for the gf_window_5x5_and_line_buffer() function is below:

```c
void gf_window_5x5_and_line_buffer(unsigned char input_pixel,
                                   unsigned char window[5][5]) {
    // shift registers (line buffers holding the previous four rows)
    static unsigned prev_row_index = 0;
    static unsigned char prev_row1[WIDTH] = {0};
    static unsigned char prev_row2[WIDTH] = {0};
    static unsigned char prev_row3[WIDTH] = {0};
    static unsigned char prev_row4[WIDTH] = {0};

    // window buffer layout:
    // window[0][0], window[0][1], window[0][2], window[0][3], window[0][4]
    // window[1][0], window[1][1], window[1][2], window[1][3], window[1][4]
    // window[2][0], window[2][1], window[2][2], window[2][3], window[2][4]
    // window[3][0], window[3][1], window[3][2], window[3][3], window[3][4]
    // window[4][0], window[4][1], window[4][2], window[4][3], window[4][4]

    // shift existing window to the left by one
    window[0][0] = window[0][1]; window[0][1] = window[0][2];
    window[0][2] = window[0][3]; window[0][3] = window[0][4];
    window[1][0] = window[1][1]; window[1][1] = window[1][2];
    window[1][2] = window[1][3]; window[1][3] = window[1][4];
    window[2][0] = window[2][1]; window[2][1] = window[2][2];
    window[2][2] = window[2][3]; window[2][3] = window[2][4];
    window[3][0] = window[3][1]; window[3][1] = window[3][2];
    window[3][2] = window[3][3]; window[3][3] = window[3][4];
    window[4][0] = window[4][1]; window[4][1] = window[4][2];
    window[4][2] = window[4][3]; window[4][3] = window[4][4];

    int prev_row_elem1 = prev_row1[prev_row_index];
    int prev_row_elem2 = prev_row2[prev_row_index];
    int prev_row_elem3 = prev_row3[prev_row_index];
    int prev_row_elem4 = prev_row4[prev_row_index];

    // grab next column (the rightmost column of the sliding window)
    window[0][4] = prev_row_elem4;
    window[1][4] = prev_row_elem3;
    window[2][4] = prev_row_elem2;
    window[3][4] = prev_row_elem1;
    window[4][4] = input_pixel;

    // shift the line buffers: each row inherits from the row below
    prev_row1[prev_row_index] = input_pixel;
    prev_row2[prev_row_index] = prev_row_elem1;
    prev_row3[prev_row_index] = prev_row_elem2;
    prev_row4[prev_row_index] = prev_row_elem3;

    prev_row_index = (prev_row_index == WIDTH - 1) ? 0 : prev_row_index + 1;
}
```

This helper function updates the window array, which holds the image pixels that we will multiply by the convolution kernel; window corresponds to the blue matrix in the animations above. The prev_row array variables are the line buffers that hold the previous four rows of the image. Every time this function is called, we first shift the existing pixels in the window left by one position. Then we shift the next input pixel into the rightmost column of the window and update the line buffers.

You can download the source files for this project by emailing us.