Image processing is increasingly important in areas such as computer vision, computational photography, and augmented reality, and there is great demand to deploy these algorithms on mobile platforms. A heterogeneous system that combines processors with custom accelerators is a promising platform because of its high performance and energy efficiency. However, designing and programming such systems is hard. We extend the image processing language Halide to map applications to an efficient heterogeneous system target. Based on these extensions, we build a system that offers a high-level interface for defining the hardware and software implementations simultaneously. This greatly raises the level of design automation and opens a large space for co-optimizing hardware and software.
Experimental results show that Xilinx ZYNQ designs produced by our system achieve up to 6x higher performance and 38x higher energy efficiency than the NVIDIA Tegra K1’s CPUs, and 3.5x higher performance and 12x higher energy efficiency than the K1’s GPU.
In our design methodology, an application is first prototyped in Halide with CPU schedules; this prototype serves as a gold model for the remaining design stages. Next, using a new scheduling extension, the algorithm in the prototype is mapped to a heterogeneous system. A compiler tool then produces HLS code for the specialized hardware engines, along with verification collateral. The compiler also produces the software that runs on the host Linux OS, which contains the sequential part of the workload and accesses the hardware for acceleration. After the design is implemented using third-party HLS/FPGA tools, the generated program appears to the user as a single C ABI function, callable from user space on the CPU cores.
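The role of the gold model in this flow can be illustrated with a minimal C++ sketch. It is not the code our compiler emits: `blur3x3_gold`, `blur3x3_accel`, and `gold_matches_accelerated` are hypothetical names, the 3x3 box blur is an assumed example workload, and the "accelerated" function is a pure software stand-in that only mimics the shape of a generated C ABI entry point (plain pointers in, status code out). It computes the same result with a different loop structure, which is the situation the gold model is meant to check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Gold model (assumed workload): a direct 3x3 box blur.
// Border pixels are copied through unchanged.
void blur3x3_gold(const uint8_t* in, uint8_t* out, int w, int h) {
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                out[y * w + x] = in[y * w + x];
                continue;
            }
            int sum = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += in[(y + dy) * w + (x + dx)];
            out[y * w + x] = static_cast<uint8_t>(sum / 9);
        }
    }
}

// Hypothetical stand-in for the generated entry point. In a real flow this
// symbol would drive the accelerator; here it is a software stub that
// restructures the computation into a horizontal pass followed by a
// vertical pass. Returns 0 on success, -1 on invalid arguments.
extern "C" int blur3x3_accel(const uint8_t* in, uint8_t* out, int w, int h) {
    if (in == nullptr || out == nullptr || w < 3 || h < 3) return -1;

    // Horizontal pass: 3-pixel row sums for interior columns.
    std::vector<uint16_t> rows(static_cast<std::size_t>(w) * h, 0);
    for (int y = 0; y < h; ++y)
        for (int x = 1; x < w - 1; ++x)
            rows[y * w + x] = static_cast<uint16_t>(
                in[y * w + x - 1] + in[y * w + x] + in[y * w + x + 1]);

    // Vertical pass: combine three row sums, then normalize.
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            if (x == 0 || y == 0 || x == w - 1 || y == h - 1) {
                out[y * w + x] = in[y * w + x];
                continue;
            }
            int sum = rows[(y - 1) * w + x] + rows[y * w + x] +
                      rows[(y + 1) * w + x];
            out[y * w + x] = static_cast<uint8_t>(sum / 9);
        }
    }
    return 0;
}

// Runs both implementations on a synthetic image and compares the outputs
// bit for bit, the way a gold model is used to validate later stages.
bool gold_matches_accelerated() {
    const int w = 16, h = 12;
    std::vector<uint8_t> in(static_cast<std::size_t>(w) * h);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            in[y * w + x] = static_cast<uint8_t>((x * 7 + y * 13) & 0xFF);

    std::vector<uint8_t> gold(in.size()), accel(in.size());
    blur3x3_gold(in.data(), gold.data(), w, h);
    if (blur3x3_accel(in.data(), accel.data(), w, h) != 0) return false;
    return gold == accel;
}
```

A caller would typically invoke the check once, for example `assert(gold_matches_accelerated());`, before relying on the accelerated path. The design point the sketch is meant to convey is that the caller sees only one ordinary function, regardless of how the work is split between software and hardware behind it.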