Convolutional Neural Networks (CNNs) are the state-of-the-art solution for many computer vision problems, and many researchers have explored optimized implementations. Most implementations heuristically block the computation to deal with the large data sizes and high data reuse of CNNs. This paper explores how to block CNN computations for memory locality by creating an analytical model for CNN-like loop nests. Using this model, we automatically derive optimized blockings for common networks that improve the energy efficiency of custom hardware implementations by up to an order of magnitude.
Compared to traditional CNN CPU implementations based on highly tuned, hand-optimized BLAS libraries, our x86 programs implementing the optimal blocking reduce the number of memory accesses by up to 90%.