
Let’s Untangle the Mesh — Accelerate Deep Learning with Collective Streaming

Moving data across a mesh is like a high-speed train stopping at every station. Collective Streaming, by contrast, can unleash the collective horsepower of a sea of computing resources.

In a domain-specific parallel processing machine, the building blocks, often referred to as Processing Elements (PEs), are typically arranged in a mesh when you want to scale out to many PEs. The data and partial results are passed among the PEs as dataflows to reuse data and save bandwidth. Such a paradigm is so ubiquitous that it is hard to imagine there is an alternative.

We break away from this paradigm with Collective Streaming, reminiscent of Collective Communication in High Performance Computing (HPC). Shown below are the 4 types of collective streaming operations:

[Figure: the four collective streaming operations: broadcast, scatter, reduce, and gather]
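For concreteness, here is a minimal NumPy sketch of how I read the semantics of the four operations (the function names and shapes are illustrative, not from the post): broadcast sends the same block to every PE, scatter splits a block across the PEs, reduce sums per-PE partial results, and gather concatenates per-PE pieces back together.

```python
import numpy as np

def broadcast(block, num_pes):
    """Every PE receives its own copy of the same block."""
    return [block.copy() for _ in range(num_pes)]

def scatter(block, num_pes):
    """The block is split into disjoint pieces, one per PE."""
    return np.array_split(block, num_pes)

def reduce(partials):
    """Partial results from the PEs are summed into one result."""
    return np.sum(partials, axis=0)

def gather(pieces):
    """Per-PE pieces are concatenated back into one block."""
    return np.concatenate(pieces)

# Broadcast/reduce pair: every PE sees all of x, partial results are summed back.
copies = broadcast(np.arange(8.0), num_pes=4)
partials = [c * 0.25 for c in copies]            # stand-in for per-PE work
assert np.allclose(reduce(partials), np.arange(8.0))

# Scatter/gather pair: each PE works on its own slice, results are reassembled.
pieces = scatter(np.arange(8.0), num_pes=4)
assert np.allclose(gather([p * 2 for p in pieces]), np.arange(8.0) * 2)
```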

Analogous to how efficient Collective Communication is achieved with recursive algorithms, efficient Collective Streaming is supported recursively using the Collective Element (CE) as the building block. The PEs are organized hierarchically through the CEs as shown in the figure below:

[Figure: PEs organized hierarchically through CEs]
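To make the recursive structure concrete, here is a rough software model of a CE tree (my own illustration, assuming a fanout of 4 children per CE, not the actual hardware design): each CE fans data out to its children on the way down and combines partial results on the way back up, so the number of levels grows only logarithmically with the number of PEs.

```python
def _split_count(n, k):
    """Split n leaf PEs as evenly as possible among up to k children."""
    base, extra = divmod(n, k)
    shares = [base + (1 if i < extra else 0) for i in range(k)]
    return [s for s in shares if s > 0]

def ce_broadcast(data, num_pes, fanout=4):
    """Recursively fan `data` out to num_pes leaves through a tree of CEs.
    Returns one copy per PE; the recursion depth is O(log_fanout(num_pes))."""
    if num_pes <= 1:
        return [data]
    copies = []
    for share in _split_count(num_pes, fanout):
        copies.extend(ce_broadcast(data, share, fanout))
    return copies

def ce_reduce(partials, fanout=4):
    """Recursively combine per-PE partial results through a tree of CEs."""
    if len(partials) == 1:
        return partials[0]
    groups = [partials[i::fanout] for i in range(fanout)]
    return sum(ce_reduce(g, fanout) for g in groups if g)

copies = ce_broadcast(3.0, 10)                    # 10 PEs each receive the value
result = ce_reduce([c * 2 for c in copies])       # 10 partials combine to 60.0
assert result == 60.0
```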

The data are broadcast or scattered to the PEs, and the results are reduced or gathered from the PEs. There is no inter-PE dependency, and hence no dataflow among them. The PEs can therefore pack orders of magnitude more MAC (multiply-and-accumulate) units and don't have to be strictly placed and routed in a rectangular area.

In the first FPGA prototype of the Collective Streaming architecture, we can pack more than one hundred independent MAC units into one PE, and employ 8 such PEs to run VGG16 in 43 ms to support real-time face recognition.
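As a back-of-the-envelope check (my own arithmetic; the post does not give these numbers), one VGG16 inference is commonly quoted at roughly 15.5 billion MACs, which lets us estimate the aggregate throughput the 8 PEs would have to sustain:

```python
# Rough sanity check; both constants below are assumptions, not figures from the post.
vgg16_macs = 15.5e9      # commonly cited MAC count for one 224x224 VGG16 inference
latency_s = 43e-3        # reported end-to-end latency
num_pes = 8
macs_per_pe = 128        # "more than one hundred" MAC units per PE, assumed 128

required_throughput = vgg16_macs / latency_s                 # MACs per second overall
per_mac_rate = required_throughput / (num_pes * macs_per_pe)
print(f"{required_throughput / 1e9:.0f} GMAC/s overall, "
      f"~{per_mac_rate / 1e6:.0f} MHz effective rate per MAC unit")
```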

The Mesh-Centric Paradigm

As mentioned earlier, the building block of a domain-specific Deep Learning machine is often referred to as the PE. A mesh topology is a strikingly popular way to organize PEs (for example, Google’s TPU, the DianNao family, and MIT’s Eyeriss):

[Figure: PEs arranged in a mesh topology]

It seems logical to use a mesh topology to organize the PEs on a 2-dimensional chip when there are lots of PEs and regularity is desirable. Such an arrangement leads to the following two mesh-centric assumptions:

  1. The distance a piece of data travels across the mesh in one clock period is fixed at the distance between 2 neighboring PEs, even though it could travel much farther;
  2. A PE depends on its upstream neighboring PEs to compute, even though such a dependency comes from the spatial order of the PEs rather than from true data dependency.

The first assumption is actually a self-imposed constraint and is not true in practice. It is analogous to a high-speed train stopping at every single station on the way to its destination, as shown in the following figure:

[Figure: a high-speed train stopping at every station]

Within one clock period, a piece of data can travel a distance equal to hundreds of MAC-unit widths without having to hop over every single MAC unit in between. Restricting dataflows to PE hopping in a mesh topology causes an orders-of-magnitude increase in latency.
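To put rough numbers on that gap (illustrative figures of my own, not measurements from the post), compare hop-by-hop traversal of a mesh with a pipelined fan-out/fan-in tree:

```python
import math

side = 128                        # assume a 128 x 128 PE array (illustrative)
mesh_hops = 2 * (side - 1)        # worst-case Manhattan distance, 1 cycle per hop
tree_stages = math.ceil(math.log2(side * side))  # register stages in a binary tree

print(f"mesh: {mesh_hops} cycles, tree: {tree_stages} cycles "
      f"(~{mesh_hops / tree_stages:.0f}x fewer)")
```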

The second assumption is a legacy inherited from distributed parallel processors comprising many compute nodes. Each compute node not only handles computations but also plays a part in the distributed storage of the data. The nodes need to exchange data among themselves to make forward progress. For an on-chip processing mesh, however, the data come from the side interfacing with the memory. The data flow through the mesh, and the results are to be collected on some other side as shown below:

[Figure: data streaming into one side of an on-chip mesh from memory, with results collected on another side]

Due to the local topology, an internal PE has to get its data through the PEs sitting between it and the memory. Likewise, it has to pass its partial result through the intermediate PEs before reaching the memory. The resulting dataflows are due to the spatial order of the PEs in the mesh, not true data dependency.

Given the two mesh-centric assumptions, no matter how many PEs and how much bandwidth you have, the performance of solving a problem on a d-dimensional mesh is limited by the dimensionality d of the mesh, not by the number of PEs or the IO bandwidth (see Your favorite parallel algorithms might not be as fast as you think), as shown below:

[Figure: mesh performance limited by the dimensionality of the mesh]
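The limit is essentially the diameter of the mesh. Stated in my own notation (a standard argument, consistent with the post referenced above):

```latex
\text{diameter of a } d\text{-dimensional mesh of } p \text{ PEs} \;=\; \Theta\!\left(p^{1/d}\right)
\quad\Longrightarrow\quad
T_{\text{mesh}}(p) \;=\; \Omega\!\left(p^{1/d}\right)
\text{ for any computation whose result depends on data spread across the mesh,}
\text{ regardless of the number of PEs or the available IO bandwidth.}
```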

Matrix Multiplication on a Supercomputer

Let’s look at the most time-consuming part of Deep Learning: Matrix Multiplication, which has always been at the heart of HPC. State-of-the-art parallel matrix multiplication performance on modern supercomputers is achieved with the following two major advancements:

  1. Scalable matrix multiplication algorithms
  2. Efficient collective communications with logarithmic overhead

Shown below is a demonstration of matrix multiplication in outer products. The computations are 2-dimensional, but both the data and the communications among the nodes are 1-dimensional:

[Figure: matrix multiplication as a sum of outer products of block columns and block rows]
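In NumPy terms, the outer-product formulation accumulates rank-k updates: at each step one block column of A and one block row of B, both 1-dimensional slabs of the operands, update the whole 2-dimensional C. A small sketch (my own code, not from the post):

```python
import numpy as np

def matmul_outer_products(A, B, block=64):
    """C = A @ B computed as a sum of block outer products along the inner dimension."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n))
    for s in range(0, k, block):
        # One block column of A times one block row of B updates all of C.
        C += A[:, s:s + block] @ B[s:s + block, :]
    return C

A = np.random.rand(256, 512)
B = np.random.rand(512, 128)
assert np.allclose(matmul_outer_products(A, B), A @ B)
```

Note that the block width is just a tuning constant; nothing in the loop ties it to how many nodes the slabs would be distributed over.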

The width of a block column and a block row can be a constant and is independent of the number of nodes. On a systolic array, the computations are also broken down into outer products. However, the width of the block column/row must match the side length of the systolic array to achieve optimal performance. Otherwise, the array is poorly occupied for problems with a low inner dimension, as elaborated in my previous post, Shall We All Embrace Systolic Arrays, on the scalability of the matrix multiply unit of Google’s TPU. Outer-product-based matrix multiplication algorithms, such as the Scalable Universal Matrix Multiplication Algorithm (SUMMA), have proven to be very scalable both in theory and in practice on distributed systems.
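As a rough picture of how SUMMA arranges this on a 2-dimensional process grid (a serial simulation of the communication pattern only; a real implementation distributes the panels and uses MPI collectives): at each step, the current block column of A is broadcast along the grid rows, the current block row of B is broadcast along the grid columns, and every node accumulates into its own tile of C in place, so the grid size never has to match the inner dimension.

```python
import numpy as np

def summa(A, B, grid=4, block=32):
    """Serial simulation of SUMMA on a grid x grid logical process grid.
    Assumes the row and column counts divide evenly by the grid size."""
    m, k = A.shape
    _, n = B.shape
    mb, nb = m // grid, n // grid
    # Each node (i, j) owns one tile of C and accumulates into it in place.
    C_tiles = [[np.zeros((mb, nb)) for _ in range(grid)] for _ in range(grid)]
    for s in range(0, k, block):
        for i in range(grid):
            # "Broadcast" the current block column of A along grid row i.
            A_panel = A[i * mb:(i + 1) * mb, s:s + block]
            for j in range(grid):
                # "Broadcast" the current block row of B along grid column j.
                B_panel = B[s:s + block, j * nb:(j + 1) * nb]
                C_tiles[i][j] += A_panel @ B_panel   # local, in-place reduction
    return np.block(C_tiles)

A = np.random.rand(128, 96)
B = np.random.rand(96, 64)
assert np.allclose(summa(A, B), A @ B)
```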

The communication patterns in SUMMA and similar algorithms are based on the collective communications defined for parallel computing on distributed systems. Advances in collective communication for HPC with recursive algorithms reduce the communication overhead to be proportional to the logarithm of the number of nodes, and have been instrumental to the continuing performance growth in supercomputing.
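The recursive structure is what brings the cost down from linear to logarithmic: in a binomial-tree broadcast, every node that already holds the data forwards it to one new node each round, so p nodes are covered in ceil(log2 p) rounds. A sketch of that standard schedule (not tied to any particular MPI implementation):

```python
import math

def binomial_broadcast_schedule(p):
    """Rounds of a binomial-tree broadcast from node 0 to p nodes.
    Each round, every node that already has the data sends to one new node."""
    have = {0}
    rounds = []
    while len(have) < p:
        sends = []
        for src in sorted(have):
            dst = src + len(have)        # classic recursive-doubling partner
            if dst < p:
                sends.append((src, dst))
        have.update(dst for _, dst in sends)
        rounds.append(sends)
    return rounds

sched = binomial_broadcast_schedule(16)
print(len(sched), "rounds for 16 nodes")         # 4 == log2(16)
assert len(sched) == math.ceil(math.log2(16))
```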

It is interesting to compare how matrix multiplication is achieved with a systolic array and a supercomputer, even though they are at completely different scales: one is on-chip and each node is a PE; the other is at the scale of a data center and each node is a compute cluster:

[Figure: matrix multiplication on a systolic array vs. on a supercomputer]

In a systolic array, broadcasts are implemented by forwarding data rightward, while reductions (a synonym of “accumulate” in the terminology of collective communication) are implemented by passing partial sums downward and accumulating along the way. In comparison, with an algorithm like SUMMA, broadcasts on a supercomputer happen in two dimensions among the nodes, while reductions are performed in place at each node. There is no inter-node dependency, and thus no dataflow among the participating nodes, only collective communication. Since the reduction is in place, the number of nodes in either dimension is independent of the inner dimension of the matrices. In fact, the nodes don’t even have to be arranged physically in a 2-dimensional topology as long as collective communication can be supported efficiently.

Native Supercomputer

Today’s distributed supercomputers are descendants of the “Killer Micros,” which were considered aliens invading the land of supercomputing in the early 90s. In fact, early supercomputers were purpose-built to do vector/matrix computations. Imagine that we build a supercomputer-on-chip by

  1. Shrinking a compute cluster to a PE with MAC units densely packed
  2. Building on-chip data delivery fabric with CE as the building block to support collective streaming mimicking collective communication

CEs are inserted between the PEs and the memory to broadcast or scatter the data to the PEs, and to reduce or gather the results from the PEs. The 4 operations are analogous to their counterparts in collective communication for HPC, which the compute nodes use to exchange data among themselves, as shown below:

[Figure: collective streaming operations and their collective communication counterparts]
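Putting the four operations together, one layer’s worth of matrix multiplication could flow as: scatter rows of the activations across the PEs, broadcast the shared weights to all of them, let each PE drive its own MAC units with no inter-PE traffic, and gather the per-PE outputs back toward memory. The sketch below is my own reading of that flow, with plain NumPy standing in for the hardware:

```python
import numpy as np

def collective_matmul(X, W, num_pes=8):
    """X @ W with rows of X scattered across PEs and W broadcast to all of them."""
    x_slices = np.array_split(X, num_pes)        # scatter: each PE gets some rows
    w_copies = [W] * num_pes                     # broadcast: every PE sees all weights
    partials = [x @ w for x, w in zip(x_slices, w_copies)]  # independent per-PE MACs
    return np.concatenate(partials)              # gather: reassemble the output rows

X = np.random.rand(64, 256)    # activations
W = np.random.rand(256, 128)   # weights
assert np.allclose(collective_matmul(X, W), X @ W)
```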

Compared to systolic arrays, the PEs do not have to be interlocked in a 2-dimensional grid, and the latency can be within a constant factor of the logarithm of the number of PEs. Building a supercomputer-on-chip can be considered an effort to return to the matrix-centric roots of supercomputing. It is effectively a Native Supercomputer, domain-specifically designed for Deep Learning.

Let’s Untangle the Mesh

A Native Supercomputer leverages Collective Streaming to reuse data and save bandwidth, just as a distributed supercomputer system relies on Collective Communication to reduce the overhead to exchange data. The former could be a perfect building block for the latter.

There is no need to build a massive mesh, nor to pin MAC units to its grid points. Let’s untangle the mesh to unleash the collective horsepower of a sea of MAC units.

