Applying OpenCL as the Computing Engine for an Embedded, Real-Time Image Processing System
William DeCook and Andrew Dowd
This project was an R&D effort to examine the suitability of OpenCL and General Purpose computing on GPU (GPGPU) as a mobile computing technique. To keep the project grounded in practical terms, we implemented an image processing algorithm on a platform suitable for mobile deployment (ARM/Mali). The BHAG (Big Hairy Audacious Goal) was to achieve real-time video processing with size, weight, and power (SWaP) suitable for battery-powered systems. Lithe Technology is frequently tasked with creating computationally intensive solutions, and we wanted to understand the pros and cons of OpenCL for our customers. Can we leverage OpenCL to create better embedded systems?
1.1 Embedded Computing
Heterogeneous computing has been used in various forms on embedded systems for many years. Lithe Technology has designed and built numerous “roll-your-own” heterogeneous computing systems. These systems were based on FPGAs and used HDL to create semi-custom computing engines. While this approach can produce impressive performance, the cost to develop custom HDL is significant and the opportunities for reuse are limited. OpenCL can address some of these shortcomings by providing an open, cross-platform standard that can handle computationally intensive problems. To explore these possibilities, Lithe initiated this project to apply OpenCL to real-time image processing for embedded systems.
1.2 What is OpenCL?
OpenCL is a framework that allows for improved allocation of resources across a heterogeneous computing environment. In particular, tasks can be assigned to multi-core graphics processing units (GPUs) so that certain computations are executed in parallel rather than serially. This parallel processing offers significant speedup in execution times of programs that require similar calculations to be made on large arrays of data. The OpenCL framework allows for just-in-time (JIT) compilation of kernel source code which is necessary when creating a program that should be able to run on multiple platforms. Pre-compilation of kernel source code is also an option, but was not utilized in this particular project.
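For example, a minimal host-side sketch of JIT compilation (context, device, and kernelSource are assumed to have been created earlier; error handling is omitted):

/* JIT compilation: the kernel source string is compiled at run time
   for whatever device the program finds itself running on. */
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1,
        (const char **)&kernelSource, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);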
1.3 Embedded Target–Why ARM/Mali?
It would be ideal to simply pick the platform that has the highest computational performance combined with the lowest power, weight, and size (the ultimate goal of any embedded system). In practice, most platforms are not suitable for industrial applications, because the documentation is sparse and vendors have little interest in supporting this market. As a secondary requirement, a platform with the cleanest, most generic OpenCL implementation (i.e. with little or no target-specific optimization) would save on development and maintenance costs. Consequently, target selection was really about identifying a system vendor that might make this development possible. The goal of this project was to demonstrate a low-power video processing capability that was suitable for mobile deployment. Also, we had an institutional preference for Linux because of long experience working on embedded Linux systems. While several options were available, we selected the ARM/Mali, primarily because it had the best documentation.
1.3.1 The Development System
The quad-core Mali-T604 GPU is included in the Samsung Google Chromebook, which also features a small size, low power consumption, and a webcam. These qualities make this package ideal for easy demonstration of the technology. The Mali-T604 was designed to meet the needs of GPGPU. If a video enhancement algorithm could be run effectively in real-time on this device, this would suggest that similar applications could be implemented on other embedded devices. It is for these reasons that we decided the Chromebook would provide a good development platform that could also serve as a convenient demonstration of the technology.
2 Video Enhancement Algorithm
Video enhancement is a class of image processing algorithms that tries to improve the raw video stream by removing distortions and improving the visual quality of the video. For example, video captured through large telescopes is marred by the effects of atmospheric turbulence. By examining localized temporal fluctuations, it is possible to interpolate images to reduce the effects of turbulence. This is a computationally expensive class of algorithms and difficult to implement in real-time without significant computing power. It also provides many opportunities for parallelization and computation by a GPU. The rest of this paper explores a few key computation steps in this process and examines how we implemented them on the ARM/Mali platform. For this work, we used a grayscale video feed, so each pixel is represented by a single 8-bit value. While extending this to RGB images is certainly possible, we did not explore this.
2.1.1 Computing Average Frame
Video is processed as a stream of frames. A baseline image is created by averaging a number of frames together. If the current image is the first image of the process, the average frame is set equal to the current frame. Subsequently, a single-pole infinite impulse response (IIR) filter is applied to each pixel value of the image. Each pixel of the average frame is found by Equation 2-1. This is a pixel-by-pixel operation and can easily be parallelized with OpenCL.

A_{x,y}^{k} = (1 − α) · A_{x,y}^{k−1} + α · C_{x,y}    (2-1)

A_{x,y}^{k} is the value of the pixel at coordinates (x,y) of the new average frame, A_{x,y}^{k−1} is the value of the pixel at the same coordinates of the previous average frame, C_{x,y} is the value of the corresponding pixel in the current image, and α ∈ ℝ, 0 < α < 1, is the weight given to the pixels of the current image.
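A minimal OpenCL C sketch of such a kernel (the kernel name follows the get_ave_frame kernel discussed in Section 2.2.1; the image names and sampler settings are illustrative):

/* Sketch of get_ave_frame, implementing Equation 2-1 for one pixel. */
__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;

__kernel void get_ave_frame(__read_only  image2d_t currentFrame,
                            __read_only  image2d_t prevAveFrame,
                            __write_only image2d_t newAveFrame,
                            float alpha)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));
    float c = read_imagef(currentFrame, smp, pos).x;
    float a = read_imagef(prevAveFrame, smp, pos).x;
    /* Equation 2-1: A^k = (1 - alpha) * A^(k-1) + alpha * C */
    float newAve = (1.0f - alpha) * a + alpha * c;
    write_imagef(newAveFrame, pos, (float4)(newAve, 0.0f, 0.0f, 0.0f));
}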
2.1.2 Chipping

Next, the current and average frames are divided into “chips”. Chips are sub-images of each frame. The pixel dimensions of each chip are given by Equation 2-2:

(N + 2M) × (N + 2M)    (2-2)

N represents a value referred to as chip_size and M represents the value of chip_overlap. Figure 2-1 demonstrates this idea. The blue squares divide the image into N × N sub-images. The red squares are of size (N + 2M) × (N + 2M) and demonstrate the data represented by each chip. The number of chips that span the width of an image is found by dividing the image width by chip_size; likewise, the number of chips across the vertical dimension is found by dividing the image height by chip_size. The image in Figure 2-1 is 16 pixels wide by 12 pixels high and is represented by 4 chips across and 3 chips down, or 12 chips altogether. If a chip lies on the border of the image, the values of the border pixels are extended out to fill the chip_overlap region of that chip, as demonstrated in chip 1 of Figure 2-1.
Figure 2-1 Chips
2.1.3 Template Matching
Once the current and average frames have been “chipped”, the chips of the current frame are used as templates that are matched to the chips of the average frame by finding the minimum Sum of Squared Differences (SSD). The pixels of the current frame are first scaled to the pixels of the average frame. The scale value is found using Equation 2-3:

scale_{i,j} = avechipmean_{i,j} / thischipmean    (2-3)

In 2-3, thischipmean refers to the average pixel value of the inner N × N region of a chip of the current frame, and avechipmean_{i,j} refers to the average pixel value of an N × N region shifted by (i,j) from the center of the corresponding chip of the average frame.

Each current-frame chip’s center N × N region is compared with the entire area of the corresponding chip of the average frame to find the coordinates where the two sets of data match most closely. The SSD for each offset (i,j) is computed by

SSD(i,j) = Σ_{x,y} ( scale_{i,j} · C_{x,y} − A_{x+i, y+j} )²

where the sum runs over the N × N template region, C is the current-frame chip, and A is the average-frame chip.
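A sketch of the inner SSD loop for a single offset (i, j); here N and M are chip_size and chip_overlap, chips are stored row-major, and all other names are illustrative:

/* SSD between the scaled center N x N region of the current chip and
   the average chip shifted by (i, j). Chips are (N + 2M) x (N + 2M). */
float ssd = 0.0f;
int w = N + 2 * M;                 /* chip width, per Equation 2-2 */
for (int y = 0; y < N; y++) {
    for (int x = 0; x < N; x++) {
        float c = scale * curChip[(y + M) * w + (x + M)];
        float a = aveChip[(y + M + j) * w + (x + M + i)];
        float d = c - a;
        ssd += d * d;
    }
}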
Figure 2-2 shows how the current frame is matched to the average frame.
Figure 2-2 SSD
The location, or offset, of the best match is given by the coordinates (i,j) of the minimum SSD. These coordinates will be referred to as i_min and j_min. Using the four SSD values adjacent to location (i_min,j_min), two 1-D parabolic fits are made. From this, a tip and a tilt can be found.
The resulting tip value corresponds to the optimum amount of pixels that the current frame’s chip should be shifted in the y direction in order to most closely match the chip of the average frame. Likewise, the resulting tilt value gives the optimum shift in the x direction.
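Assuming the standard three-point parabolic interpolation, the sub-pixel offsets follow from the SSD values neighboring the minimum:

tilt = i_min + (SSD(i_min−1, j_min) − SSD(i_min+1, j_min)) / (2 · (SSD(i_min−1, j_min) − 2·SSD(i_min, j_min) + SSD(i_min+1, j_min)))

tip = j_min + (SSD(i_min, j_min−1) − SSD(i_min, j_min+1)) / (2 · (SSD(i_min, j_min−1) − 2·SSD(i_min, j_min) + SSD(i_min, j_min+1)))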
2.1.4 Assign Planes and Weights
Now that the tip and tilt of each chip have been determined, we must calculate how the pixel data from each chip should be shifted and merged with neighboring chips. The tips and tilts calculated in the template matching kernel will be real numbers. Since we cannot shift an image by a non-integer amount, we must combine weighted values of the four closest pixels to these non-integer coordinates. Chips are assigned to one of sixteen planes. These planes are ordered in 4-chip × 4-chip blocks across the entire image, where the top-left chip of each block is in Plane 0, the next chip to the right is in Plane 1, and the bottom-right chip of the block is in Plane 15. Figure 2-3 demonstrates the assignment of plane numbers to chips of an image that is represented by 8 chips across and 8 chips down.
Figure 2-3 Plane Assignments
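The plane number for a chip indexed by column cx and row cy can be computed as follows (a one-line sketch implied by the layout above; the variable names are illustrative):

/* Planes tile the chip grid in 4 x 4 blocks, row-major within a block:
   (0,0) -> Plane 0, (1,0) -> Plane 1, ..., (3,3) -> Plane 15. */
int plane = (cy % 4) * 4 + (cx % 4);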
The tip and tilt of each chip are divided into their integer and fractional components: an integer shift and a fractional shift in each direction. The fractional components of tip and tilt are then used to calculate four weights that will be applied to 4-pixel groups, or quartets. Writing f_x and f_y for the fractional components of tilt and tip, the weights are the bilinear interpolation coefficients:

f1 = (1 − f_x) · (1 − f_y)
f2 = f_x · (1 − f_y)
f3 = (1 − f_x) · f_y
f4 = f_x · f_y
A value is assigned to each pixel location within the given chip’s plane based on the sum of these weights multiplied by the values of the pixels in the quartet corresponding to that location: f1 corresponds to the given pixel, f2 to the pixel to its right, f3 to the pixel beneath it, and f4 to the pixel diagonally adjacent in the bottom-right direction. Furthermore, another weight is assigned to this new value, based on the sum of the fractional weights multiplied by each quartet pixel’s distance from the edge of the chip.
At this point, the value of each pixel is summed across all 16 planes. The weights for each pixel are also summed across all planes. The total pixel value is then divided by the total weight, giving the new value for the pixel. Using this data, the chips can be reassembled into an image of the same dimensions as the current frame. A kernel along the following lines accomplishes the reassembly of the processed image; each kernel instance operates on a single pixel location.
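A minimal sketch of such a reassembly kernel (buffer names and data layout are illustrative):

/* Sketch of a reassembly kernel: one instance per output pixel.
   planeValues and planeWeights hold one value/weight per pixel for
   each of the 16 planes, stored plane after plane. */
__kernel void reassemble(__global const float *planeValues,
                         __global const float *planeWeights,
                         __write_only image2d_t outFrame,
                         int width, int height)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int idx = y * width + x;
    int planeStride = width * height;

    float value = 0.0f;
    float weight = 0.0f;
    for (int p = 0; p < 16; p++) {        /* sum over all 16 planes */
        value  += planeValues[p * planeStride + idx];
        weight += planeWeights[p * planeStride + idx];
    }
    /* Normalize by the total weight (guarding against divide-by-zero). */
    float result = (weight > 0.0f) ? (value / weight) : 0.0f;
    write_imagef(outFrame, (int2)(x, y), (float4)(result, 0.0f, 0.0f, 0.0f));
}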
2.2 How Can the Algorithm Be Parallelized?
Many of the calculations done in this algorithm are embarrassingly parallel: they can be parallelized easily because they have no dependencies on other operations.
2.2.1 Average Frame
The calculation for each pixel in the average frame only requires the value of that pixel in the previous average frame and in the current frame. Each instance of the get_ave_frame kernel finds the average value for a pixel at a single location within the image by applying Equation 2-1 to the corresponding pixels of the current frame and the previous average frame. In implementing this kernel, a problem arises from a limitation of OpenCL: each frame is implemented as a 2-D image object, which cannot be both read from and written to in the same kernel instance. Since we need to read the values from the previous average frame and write the values of the new average frame, a ping-pong buffer is used. The ping-pong buffer requires that the data for the average frame be saved in two separate 2-D image objects, bufferFrameAveA and bufferFrameAveB. This allows the most recently updated average frame to be read from while the out-of-date average frame is written to. These roles alternate every frame.
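A sketch of the host-side ping-pong logic (the pingpong flag and the argument indices are illustrative and assume the kernel signature sketched in Section 2.1.1):

/* Alternate which average-frame image is read and which is written. */
cl_mem aveIn  = pingpong ? bufferFrameAveA : bufferFrameAveB;
cl_mem aveOut = pingpong ? bufferFrameAveB : bufferFrameAveA;
clSetKernelArg(frameAveKernel, 1, sizeof(cl_mem), &aveIn);
clSetKernelArg(frameAveKernel, 2, sizeof(cl_mem), &aveOut);
pingpong = !pingpong;   /* roles alternate every frame */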
2.2.2 Creating Chips

The process of creating chips is as simple as reading values from the given frame and assigning them to the appropriate locations in the chip data structure, which is implemented as a 3-D image object. The first dimension of this image object is the width of a chip, given by 2-2. The second dimension is the same, as each chip is square. The third dimension is equal to the number of chips that represent a single image, which is found by

num_chips = (image_width / chip_size) × (image_height / chip_size)
It is appropriate that a separate kernel instance is run for each chip. Thus, if there are 4 chips that span the x-axis of an image and 3 chips that span the y-axis, the work dimensions of the chip_it kernel will be 4 × 3 – for a total of 12 separate kernel instances.
2.2.3 Template Matching
The template matching kernel compares each chip of the current frame to each chip of the average frame, with no dependency between separate chips. It is natural then to run a separate kernel instance for each chip, which can be accomplished with a 1-dimensional work size. The global id of each kernel instance will correspond to the number of the chip that is being operated on by that kernel. The result of this kernel will be a tip and a tilt that is applied to each chip, so global float buffers newTip and newTilt are passed into the kernel. These buffers each have enough memory allocated to store a float value for each chip.
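A sketch of the buffer creation and 1-D enqueue for this kernel (templateMatchKernel and numChips are illustrative names, and error handling is omitted):

/* One float per chip for each of tip and tilt. */
cl_mem newTip  = clCreateBuffer(context, CL_MEM_READ_WRITE,
        numChips * sizeof(cl_float), NULL, &err);
cl_mem newTilt = clCreateBuffer(context, CL_MEM_READ_WRITE,
        numChips * sizeof(cl_float), NULL, &err);
/* ... set the kernel arguments with clSetKernelArg() ... */
size_t chip_global_work_size[1] = { numChips };  /* one instance per chip */
clEnqueueNDRangeKernel(commandQueue, templateMatchKernel, 1, NULL,
        chip_global_work_size, NULL, 0, NULL, NULL);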
2.2.4 Plane and Weight
As with finding the SSD, this operation has dependencies only within each chip, so each kernel instance should operate on a single chip.
2.2.5 Reassembly

Reassembling the image is a simple matter of summing the pixel data in each plane, then dividing each total pixel value by each total pixel weight. These operations apply to single pixels; therefore, a separate kernel instance can be run for each pixel of the image. The global work size of the reassemble kernel will therefore be 2-dimensional, with a width equal to the image width and a height equal to the image height.
2.2.6 Distance Matrix
A reference matrix referred to as the distance matrix is used in the assignment of weights to plane values. This distance matrix is 2-D and has the same dimensions as a chip. The value of each element of the matrix is the distance of that element from the closest edge of the matrix, starting with 1 as the value of border elements. None of the individual distance calculations are dependent on any of the other calculations, so the creation of this matrix is embarrassingly parallel. Furthermore, once calculated, the distance matrix is only used as a reference and is not modified. Thus, this operation needs only to be done once in setup.
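A sketch of a kernel that fills the distance matrix, run once during setup with a 2-D global work size equal to the chip dimensions (names are illustrative):

/* Distance of element (x, y) from the nearest edge of a W x H matrix;
   border elements receive the value 1. */
__kernel void make_distance_matrix(__global float *dist, int W, int H)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int d = min(min(x, W - 1 - x), min(y, H - 1 - y));
    dist[y * W + x] = (float)(d + 1);
}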
3 Target and Development Platform
3.1 ARM MALI
3.1.1 Development Tools
ARM offers the Mali OpenCL Software Development Kit (SDK) for OpenCL development on Mali devices.
The Mali OpenCL SDK is available at http://malideveloper.arm.com/develop-for-mali/sdks/mali-opencl-sdk/.
3.2 Chromebook (ARM)
3.2.1 Bootable SD Card
A bootable SD card was created to run Linux on the Chromebook XE303C12. This paper will not cover the entire process of creating this card, but only the points where our method deviates from the step-by-step instructions found at http://malideveloper.arm.com/develop-for-mali/features/graphics-and-compute-development-on-samsung-chromebook/.

Prior to building the SD card, make sure the following packages are installed:
When following the procedure outlined by the Mali Developer website, the following error was received when attempting to build vboot_reference:
cc: error: unrecognized command line option ‘-fuse-ld=bfd’
We commented this -fuse-ld=bfd build option out of the Makefile.
3.2.2 Webcam Support
The libwebcam-0.2.2 package was installed from source to provide the uvcdynctrl command-line tool, which was used to query the camera:
uvcdynctrl -d /dev/video0 -f
uvcdynctrl -d /dev/video0 -c
uvcdynctrl is a command-line tool for controlling V4L2 devices; the -f option above lists the camera’s supported frame formats, and -c lists its available controls. To capture video from the /dev/video0 device, the following ffmpeg command was issued:
ffmpeg -f video4linux2 -i /dev/video0
ffserver serves the video stream that has been encoded by ffmpeg.
3.3 Video Stream
To demonstrate a real-time video stream, the camera integrated into the Chromebook was used to capture live video. The implementation of the video enhancement process on the Chromebook consists of two separate programs, dstream and dewarp. The dstream program handles capturing images from the camera and displaying images on the screen. The OpenCL kernels for processing the image are managed by the dewarp program. These two programs share data via mapped memory. One option we considered was to stream the processed video to the laptop screen; however, we were concerned this might use up some of the available GPU resources.
3.3.1 OpenCL Setup/Compilation
One drawback of the OpenCL framework is that it requires a lot of overhead in the form of program setup. Before any kernels can be executed, steps must be taken to build the OpenCL program object, create kernels, instantiate OpenCL data structures, and set kernel arguments. Much of this setup only needs to be done once, prior to executing the rest of the program, so dewarp is split into two major functions: init() and process_frame(). init() takes care of all of the OpenCL setup and data initialization.
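A condensed sketch of init() using the standard OpenCL 1.x setup sequence (the identifiers are illustrative, and error handling is omitted):

/* One-time OpenCL setup performed by init(). */
void init(void)
{
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    commandQueue = clCreateCommandQueue(context, device, 0, &err);
    program = clCreateProgramWithSource(context, 1,
            (const char **)&kernelSource, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    frameAveKernel = clCreateKernel(program, "get_ave_frame", &err);
    /* ... create the remaining kernels, image objects, and buffers ... */
}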
3.3.2 Kernel Execution
Execution of the OpenCL kernels is initiated by the process_frame() function of dewarp. Commands to execute the kernels on the GPU are enqueued with a call to clEnqueueNDRangeKernel(). The following function call enqueues frameAveKernel.
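/* Enqueue frameAveKernel with a 2-D global work size; the argument
   values follow the description below. */
clEnqueueNDRangeKernel(commandQueue, frameAveKernel, 2, NULL,
        image2d_global_work_size, NULL, 0, NULL, NULL);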
commandQueue is of type cl_command_queue, an abstract built-in data type supported by OpenCL. This data type keeps track of the kernel tasks that will be completed by the GPU; tasks submitted to commandQueue are executed in the order that they are enqueued. The next argument, frameAveKernel, is the kernel object that has been created from the corresponding kernel defined in the .cl file. The third argument specifies the number of work dimensions used to specify the work-items in the work-group; each frame is a 2-D image, so there are two work dimensions.

NULL is passed as the global_work_offset parameter because OpenCL does not currently support any other value. Future revisions of the standard will support this offset, allowing work-item global IDs to begin at a specified unsigned value other than zero. image2d_global_work_size is an array of two (work_dim) values that specify the number of global work-items in each dimension that the kernel will operate on. Since each kernel instance operates on a single pixel to find the average frame, image2d_global_work_size[0] has been assigned a value equal to the image width and image2d_global_work_size[1] a value equal to the image height. The next parameter of clEnqueueNDRangeKernel() is local_work_size. It is not beneficial for us to choose the work-group size, so passing NULL lets the OpenCL implementation choose an appropriate size. The last three parameters are num_events_in_wait_list, event_wait_list, and event. A wait list was not used in this project, so values of 0, NULL, and NULL are passed, respectively.

From the above description, it may appear that the process of setting kernel arguments, defining work-size parameters, and enqueuing each kernel is a tedious and onerous task. It is. All of the other kernels are enqueued in a similar manner, with their appropriate argument values, of course. It should be noted that kernel arguments remain the same until they are changed by a call to clSetKernelArg(). Hence, some kernels may require some or all of their arguments to be modified before they can be enqueued. Such is the case for frameAveKernel, which uses a ping-pong buffer requiring the swapping of input buffers each time the kernel is enqueued.
3.3.3 Streaming Camera Input
Windowing, image capture, and output display are handled by the dstream program. This program operates in two modes, which can be toggled by the user. The default mode is to display an unprocessed image: as new frames are captured, dstream simply converts each image to grayscale and displays it until the next frame is ready. The second mode displays processed images. As each new frame comes into the program, dstream converts the image to a grayscale format suitable for the algorithm and sends the frame to the dewarp program. Once the algorithm has completed, dstream sends the processed image to the display. dstream discards any frames that are captured by the camera while dewarp is still busy operating on another image.
4 Results

4.1 Timing Performance
This work is ongoing, and results will be posted when available.
4.2 Portability

One of the most important reasons for exploring the capabilities of OpenCL is to find a heterogeneous computing solution that is highly portable, as OpenCL is touted to be. What we found is that only the most basic OpenCL code is portable. Many devices allow for their own architecture-specific optimizations; these optimizations, however, are not standard across all platforms. Therefore, OpenCL programming presents a trade-off between portability, program speed, program size, and the labor required to write the program.

The most basic OpenCL code that is highly portable will have a smaller program size and generally take less effort to create, but will run significantly slower than a program that has been fully optimized for a particular platform. On the other hand, a program that has been optimized for only one platform will not be portable and will require more work on the programmer’s part, but will see significant speedup. The size of this optimized code may be bigger or smaller, depending on what optimizations are offered by the platform. Smaller code may result from replacing multiple computations with built-in functions (this type of optimization often comes with another trade-off: speed vs. accuracy). Larger code may come from optimizations that require setting up specific data types that the platform is better suited to operate on, or from specifying flags in function calls to override defaults. Code size is also increased by function calls that query the system for information about data limits, data-type support, etc.

If a highly portable program that is also optimized for many devices is desired, this can be achieved at the high cost of man-hours and a large code size. Optimizations for every platform could be written into the program in such a way that, once the program has queried the platform, the proper optimizations are applied, either by compiling one of many kernel files or through a series of if/then statements in the setup of the program. The just-in-time compilation of OpenCL really shines here.
4.2.1 The Mali GPU Memory Model
Desktop systems generally have separate memories for the application processor and the GPU, which means that in order to pass data to the GPU, host memory must be allocated and initialized, then copied into the GPU’s memory. Mali GPUs instead have a unified memory system: the global and local OpenCL address spaces are mapped to the same physical memory, which eliminates the need to copy data between host and device memory. The following code shows the setup of mapped memory that will hold a frame of char data.
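/* Create a buffer in host-accessible memory and map it into the host
   address space. The exact flags shown are illustrative. */
cl_mem bufferCharFrame = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
        numPixels * sizeof(unsigned char), NULL, &err);
unsigned char *this_frame = (unsigned char *)clEnqueueMapBuffer(
        commandQueue, bufferCharFrame, CL_TRUE,
        CL_MAP_READ | CL_MAP_WRITE, 0,
        numPixels * sizeof(unsigned char), 0, NULL, NULL, &err);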
First, a buffer of size numPixels*sizeof(unsigned char) is created. The CL_MEM_ALLOC_HOST_PTR flag specifies that the OpenCL implementation should allocate the buffer from host-accessible memory. A call to clEnqueueMapBuffer() then returns an unsigned char pointer, this_frame, into the shared memory containing bufferCharFrame. At this point, any changes made to this_frame on the host side can be seen in bufferCharFrame by the OpenCL implementation, and vice versa. On a platform with separate host and GPU memory, the process of creating a data buffer to be sent to a kernel is a bit longer and requires more resources.
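A sketch of the equivalent sequence under a separate-memory model (names mirror the mapped-memory example above; error handling is omitted):

/* Separate-memory model: the host array and the device buffer both
   exist, and data must be explicitly copied between them. */
unsigned char *this_frame = malloc(numPixels * sizeof(unsigned char));
/* ... fill this_frame with image data ... */
cl_mem bufferCharFrame = clCreateBuffer(context, CL_MEM_READ_WRITE,
        numPixels * sizeof(unsigned char), NULL, &err);
clEnqueueWriteBuffer(commandQueue, bufferCharFrame, CL_TRUE, 0,
        numPixels * sizeof(unsigned char), this_frame, 0, NULL, NULL);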
As you can see, the separate memory model requires that twice as many resources be allocated for a given buffer size. In addition, each time the data are changed—either by the host or the GPU—the buffer must be re-copied if it is to be synced with the other side. From the point of view of a programmer who wants to optimize an OpenCL implementation on a Mali GPU and nothing else, the above information is good news. Additionally, the Mali GPU supports both of the above implementations. But if you were to try to run a Mali-optimized program that takes advantage of this shared memory model on a system that has separate host and GPU memories, the program simply will not work.
5 Work to be Done
-Perform benchmarking to compare against serial and parallel desktop implementations.
-Since images are monochrome, consider using buffers to hold image data rather than cl_images.
-Consider offline compilation for faster setup time and smaller program size, at the cost of less transparency (the .cl file is human-readable).
Acknowledgments

The authors would like to thank Dr. Michael Hart of HartSci at The University of Arizona for his assistance. Dr. Hart provided technical guidance and support on the image processing algorithm. His contribution was essential to implementing this project.