Benchmarks can either encourage or constrain innovation

The recent DeepBench proposal released by Baidu includes several core DSP routines commonly used for the training and inference of Deep Neural Networks (DNNs). On the surface, this is a well-intentioned and useful benchmark proposal. However, depending on how benchmarks like these are defined, they can either encourage or discourage innovation across the industry. The lofty and admirable goal set by the Baidu team is to encourage a 100X improvement in performance for deep learning, and the proposed benchmark provides a way to measure the performance gains delivered by a given hardware system.

I argue that a 100X improvement over state-of-the-art GPUs on these benchmarks is not achievable through regular “continuous” advances in technology or microarchitecture alone, without corresponding increases in power consumption and silicon area, a loss of programmability, or some combination of all of the above.

Achieving a 100X or similarly game-changing improvement over the current technological frontier requires what is called “discontinuous” innovation, and it must be applied at multiple levels: not only in transistor manufacturing (e.g. 16ff or FDSOI silicon), at the circuit level (e.g. dynamic logic versus gate-level synthesis), in microarchitecture (e.g. carry-save multiply-add or radix-8 multiplication), or even at the numerical-precision and algorithmic levels, but as a combination across all of these. To encourage discontinuous innovation, the innovator must not have their hands tied to the current paradigm for solving the problem, as so often happens when well-meaning groups define benchmarks or standards.

So if the objective is to benchmark 16ff versus 28nm GPUs, or to compare Intel’s OpenCL stack against NVIDIA’s cuDNN libraries, then this type of benchmark is great. But it will never encourage a 100X improvement.

 

Learning from the past:

I am reminded of the half-rate voice-coding algorithm used in the GSM standards in 1992. That standard defined the algorithm to be used and the bit patterns at both input and output. Defining the implementation in this way effectively tied the hands of the innovator: micro-architectural techniques like carry-save arithmetic (which improve the performance and computational efficiency of MAC units in silicon) could not be used, because they do not produce the intermediate bit patterns required by the standard. When standards instead define compliance in terms of bit error rate (BER) or packet error rate, all manner of innovation is unlocked, as innovators exploit everything in their bag of tricks to achieve maximum performance with minimum energy.
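To make the contrast concrete, here is a minimal Python sketch of the two styles of compliance test; the function names and the BER threshold are illustrative, not drawn from any actual standard:

```python
import numpy as np

def bit_exact_compliant(reference_bits: np.ndarray, candidate_bits: np.ndarray) -> bool:
    """Bit-exact compliance: every output bit must match the reference
    implementation, which rules out alternative arithmetic such as carry-save."""
    return np.array_equal(reference_bits, candidate_bits)

def error_rate_compliant(reference_bits: np.ndarray, candidate_bits: np.ndarray,
                         max_ber: float = 1e-3) -> bool:
    """Error-rate compliance: the candidate passes if its bit error rate stays
    below a threshold, leaving the implementation technique entirely open."""
    ber = float(np.mean(reference_bits != candidate_bits))
    return ber <= max_ber
```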

 

Back to Deep Learning

While the AlexNet benchmark is not ideal as a measure of DNN hardware performance, it does enable a wide range of innovation to be applied at the algorithmic, numerical-representation and micro-architectural levels in order to achieve faster training times. If AlexNet (and other more advanced DNN models like Google’s Inception) were defined in terms of the input data set for training, the compute budget for inference, and the inference accuracy required of the final model, the industry would actually be better off.

Suppliers can and will scour the academic literature and apply the very latest techniques to deliver a result that is actually meaningful to the customer: namely, “How long does it take to train my network to a desired level of accuracy or within a given compute budget?” The key point, and the lesson I learned from the evolution of cellular standards, is that the benchmark should define a target level of accuracy, not a bit pattern that must be matched.
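As an illustration of what such a measurement could look like, here is a minimal Python sketch of a time-to-accuracy benchmark loop; the training and evaluation callables, the accuracy target, and the time budget are all placeholders that a real benchmark definition would have to supply, not parts of any existing suite:

```python
import time
from typing import Callable, Optional

def time_to_accuracy(train_one_epoch: Callable[[], None],
                     evaluate_accuracy: Callable[[], float],
                     target_accuracy: float,
                     max_hours: float = 24.0) -> Optional[float]:
    """Train until the benchmark-defined validation accuracy is reached and
    return the wall-clock seconds it took. How the supplier gets there
    (precision, algorithm, hardware) is entirely their own business."""
    start = time.time()
    while (time.time() - start) / 3600.0 < max_hours:
        train_one_epoch()                           # supplier-defined training step
        if evaluate_accuracy() >= target_accuracy:  # benchmark-defined metric
            return time.time() - start
    return None  # target accuracy not reached within the time budget
```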

Consider a DNN system that can perform 4000 GEMM operations in parallel. Such a system might be used in a distributed training setup to train many small networks with 20 layers and several large networks with 200 layers. Does the distributed system utilize all GEMM units efficiently across variations in data size, number of layers, number of weights and so on? If each GEMM is individually fast, but the overall utilization of the 4000 GEMM units is around 20% when they are integrated into a distributed training system, then the user is not seeing the benefit promised by the benchmark that they used to procure the 4000-GEMM system in the first place.
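The arithmetic is simple but worth spelling out; the per-unit throughput and utilization figures below are purely illustrative:

```python
def delivered_throughput(num_units: int, peak_flops_per_unit: float,
                         busy_fraction: float) -> float:
    """Aggregate throughput a pool of GEMM units actually delivers, where
    busy_fraction is the average fraction of time a unit does useful work."""
    return num_units * peak_flops_per_unit * busy_fraction

# Hypothetical numbers: 4000 GEMM units, each nominally 10 TFLOP/s,
# but only 20% utilized once embedded in a distributed training system.
peak = delivered_throughput(4000, 10e12, 1.0)    # 4.0e16 FLOP/s on paper
actual = delivered_throughput(4000, 10e12, 0.2)  # 8.0e15 FLOP/s in practice
print(f"Delivered: {actual / peak:.0%} of the benchmarked peak")
```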

There are organizations, like BDTI via the Embedded Vision Alliance, that could define a meaningful set of benchmarks to measure the performance, cost and efficiency of DNN systems. This is among their core competencies, and the industry might consider motivating them to define benchmarks that enable and empower the innovator to do their best at every level, rather than just comparing the floating-point GEMM performance of a few competing GPUs. We might also need specific benchmarks for power-constrained embedded applications, and another set for distributed learning systems like those used in data centers.