Match Made in Heaven – Deep Learning and Dataflow Computing

The demand for Deep Learning (DL) systems has grown exponentially over the last four years. Market research firm Tractica projects that the DL systems market will grow to over $10.4 billion by 2025, and IBM’s CEO believes that DL and broader Artificial Intelligence (AI) will have a $2 trillion impact on businesses by 2025. However you slice and dice the market, the demand for better, faster DL systems is exploding.

The explosive growth in the DL market is due in large part to the emergence of data as the world’s new natural resource and the growing need across organizations to process and understand this data more effectively. There are important patterns to be discovered within the immense volumes of data being collected every day. Video data could contain patterns for a particular face or a particular car. Sensor data on networks could contain patterns for a cyber security threat or indicators of high-value equipment failure. Social network data could contain patterns indicating criminal behavior or new consumer purchasing behavior. DL methods have emerged as an incredibly effective tool for understanding these patterns in data.

At the same time, DL workloads are compute- and memory-bandwidth-intensive beyond what CPUs can deliver. Although the recent success of DL-based systems was due in part to the use of graphics processing units (GPUs) as compute accelerators for training DL models, general-purpose computing systems such as GPUs do not fully exploit the characteristics of DL workloads; software will always run more efficiently on hardware that is tuned for it. As a result, we are already reaching the point where, for a growing number of DL development use cases, “GPUs are simply not enough.”

Rather than using a general-purpose computing system, a purpose-built deep learning computing system can deliver far more compute power and efficiency through a co-optimized design of software and hardware. The DL computational workload has several interesting characteristics that a co-optimized hardware design can exploit to yield better results:

  • Deep neural networks and their associated training algorithms are remarkably resilient to error and noise.
  • Deep neural network training involves huge volumes of training data, huge numbers of trainable parameters, and many training iterations, all of which demand large memory bandwidth.

These characteristics of DL workloads can be exploited more effectively by a dataflow-based computing architecture than by traditional control-flow-based systems. The dataflow model, which represents a computation as a dataflow graph, is a sound, simple, and powerful model of parallel computation. In the dataflow model of computing there is no notion of a single point of control; computation is described in terms of locally controlled computational activities.

Unlike the traditional control-flow model of computing, dataflow execution imposes no constraints on sequencing except for the data dependencies in the program. In the dataflow model, an instruction executes as soon as its operands are available, and the synchronization of parallel activities is implicit. Because of this inherently parallel nature, dataflow computers provide an efficient and elegant solution to the two fundamental problems control-flow computers face, especially when handling DL workloads: memory latency and synchronization overhead. The dataflow model tolerates memory latency by switching dynamically between computation threads, and it supports low-overhead distributed synchronization in hardware, which makes it a better candidate for DL workloads.
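To make the model concrete, here is a minimal dataflow interpreter sketched in Python. The graph, node names, and scheduling loop are illustrative assumptions rather than the design of any particular dataflow machine: each node fires as soon as all of its operands have arrived, and the only ordering constraints are the data dependencies themselves.

```python
from collections import defaultdict

# A tiny dataflow interpreter. Each node is an operation that reads named
# input edges and writes one output edge. A node fires as soon as all of
# its input tokens have arrived; there is no program counter, and the only
# ordering constraint is the data dependencies between nodes.

class DataflowGraph:
    def __init__(self):
        self.nodes = {}                      # name -> (fn, input edges, output edge)
        self.consumers = defaultdict(list)   # edge -> nodes that read it

    def add_node(self, name, fn, inputs, output):
        self.nodes[name] = (fn, inputs, output)
        for edge in inputs:
            self.consumers[edge].append(name)

    def run(self, initial_tokens):
        tokens = dict(initial_tokens)        # edge -> data value
        ready = [n for n, (_, ins, _) in self.nodes.items()
                 if all(e in tokens for e in ins)]
        fired = set()
        while ready:
            node = ready.pop()               # any ready node may fire; real
            fired.add(node)                  # hardware fires them in parallel
            fn, ins, out = self.nodes[node]
            tokens[out] = fn(*(tokens[e] for e in ins))
            for consumer in self.consumers[out]:
                _, c_ins, _ = self.nodes[consumer]
                if consumer not in fired and all(e in tokens for e in c_ins):
                    ready.append(consumer)   # the new token enabled this node
        return tokens

# Compute (a + b) * (a - b): the add and sub nodes have no dependency on
# each other, so a dataflow machine is free to execute them concurrently.
g = DataflowGraph()
g.add_node("add", lambda x, y: x + y, ["a", "b"], "sum")
g.add_node("sub", lambda x, y: x - y, ["a", "b"], "diff")
g.add_node("mul", lambda x, y: x * y, ["sum", "diff"], "out")
print(g.run({"a": 5, "b": 3})["out"])        # 16
```

Note that the scheduler never consults a program counter: the loop simply drains whichever nodes have all their operands, which is the implicit synchronization described above.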

A dataflow computer that can be dynamically reconfigured to optimally support DL workloads is inherently better suited to them. One way to take advantage of a deep neural network’s tolerance for error and noise is native hardware support for variable-length fixed-point math, which is simpler and more efficient than floating-point hardware. As we pointed out earlier, a software workload will always run more efficiently on hardware that is tuned for it, and a tight co-optimized design of hardware and software is the best way to serve demanding compute workloads such as DL. A reconfigurable dataflow architecture that can deliver compute resources where they are needed can provide this tuned hardware and software co-design in a cost-effective manner.
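As a rough illustration of that error tolerance, the sketch below quantizes weights and activations to 8-bit fixed point and performs a dot product in integer arithmetic. The bit width, scale factor, and sample values are arbitrary assumptions chosen for illustration; the point is that the small rounding error fixed point introduces is exactly the kind of noise deep networks tolerate.

```python
import numpy as np

# Illustrative 8-bit fixed-point quantization: map each float to an integer
# with a fixed number of fractional bits, do the arithmetic in integers,
# then scale the result back.

FRAC_BITS = 5                  # assumed format: 5 fractional bits
SCALE = 1 << FRAC_BITS         # one integer unit represents 1/32

def to_fixed(x):
    # Round to the nearest representable value and clamp to the int8 range.
    return np.clip(np.round(x * SCALE), -128, 127).astype(np.int8)

weights = np.array([0.731, -0.052, 1.204, -0.998], dtype=np.float32)
inputs  = np.array([0.250, -1.000, 0.500,  0.125], dtype=np.float32)

# Dot product in 32-bit floating point vs. 8-bit fixed point.
exact = float(np.dot(weights, inputs))
q_dot = int(np.dot(to_fixed(weights).astype(np.int32),
                   to_fixed(inputs).astype(np.int32)))
approx = q_dot / (SCALE * SCALE)   # product of two fixed-point factors

print(f"float32: {exact:.4f}   8-bit fixed point: {approx:.4f}")
```

The fixed-point result lands within a few percent of the float32 value, yet an 8-bit integer multiplier is dramatically smaller and cheaper in silicon than a 32-bit floating-point unit, which is where the efficiency gain comes from.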

This is an exciting time for computer architecture and DL development. The nature of DL computation demands a level of parallelism beyond what traditional control-flow-oriented systems can provide, and research centers are making major advances in leveraging new computer architectures, from dataflow to neuromorphic, to deliver better hardware and software co-designs.