Convolutional Neural Networks (CNNs) are the state-of-the-art solution
for many computer vision problems, and many researchers have explored
optimized implementations. Most implementations heuristically
block the computation to deal with the large data sizes and
high data reuse of CNNs. This paper explores how to block CNN
computations for memory locality by creating an analytical model for
CNN-like loop nests. Using this model, we automatically derive
optimized blockings for common networks that improve the energy
efficiency of custom hardware implementations by up to an order of
magnitude.
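To make the notion of blocking concrete, the sketch below shows a direct convolution loop nest with its output-map and output-row loops tiled for locality, so that a small slice of the output and the matching weights can stay resident in one level of the memory hierarchy. This is a minimal illustration, not the paper's implementation: the layer dimensions and the tile sizes TM and TY are hypothetical placeholders standing in for the values an analytical blocking search would choose.

/* Hypothetical layer dimensions and tile sizes; in the setting described
 * above, the tile sizes (TM, TY) would be derived by the analytical model
 * rather than fixed by hand as they are here. */
#define M  64   /* output feature maps         */
#define C  32   /* input feature maps          */
#define H  56   /* output height and width     */
#define K  3    /* kernel size                 */
#define TM 16   /* output-map tile (divides M) */
#define TY 8    /* output-row tile (divides H) */

/* Direct convolution with the two outer loops blocked: each (mm, yy)
 * tile touches only a TM x TY x H slice of the output and the TM x C
 * x K x K weights that produce it, improving reuse in nearby memory. */
void conv_blocked(const float in[C][H + K - 1][H + K - 1],
                  const float w[M][C][K][K],
                  float out[M][H][H]) {
    for (int mm = 0; mm < M; mm += TM)          /* tile over output maps */
        for (int yy = 0; yy < H; yy += TY)      /* tile over output rows */
            for (int m = mm; m < mm + TM; m++)
                for (int y = yy; y < yy + TY; y++)
                    for (int x = 0; x < H; x++) {
                        float acc = 0.0f;
                        for (int c = 0; c < C; c++)
                            for (int ky = 0; ky < K; ky++)
                                for (int kx = 0; kx < K; kx++)
                                    acc += w[m][c][ky][kx] *
                                           in[c][y + ky][x + kx];
                        out[m][y][x] = acc;
                    }
}

Different choices of TM and TY trade off how much of the output, input, and weight data must be re-fetched per tile, which is exactly the trade-off an analytical model of the loop nest can evaluate without exhaustive measurement.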
Compared to traditional CNN CPU implementations based on
highly tuned, hand-optimized BLAS libraries, our x86 programs
implementing the optimal blocking reduce the number of memory accesses
by up to 90%.