
Differentiable Hardware

Photo by Paul Van Cotthem on Unsplash

Notes from Industry

How AI Might Help Revive the Virtuous Cycle of Moore’s Law

In the wake of the global chip shortage, TSMC has reportedly raised chip prices and delayed its 3nm process. Whether or not the news is accurate or indicative of a long-term trend, it should alert us to the worsening impact of the decline of Moore’s Law and compel a rethinking of AI hardware. Will AI hardware be subject to this decline, or can it help reverse it?

Suppose we want to revive the virtuous cycle of Moore’s Law, in which software and hardware propelled one another, making a modern smartphone more capable than a warehouse-filling supercomputer of past decades. The popularly accepted post-Moore virtuous cycle, in which bigger data leads to larger models requiring more powerful machines, is not sustainable. We can no longer count on transistor scaling to build ever-wider parallel processors unless we redefine parallelism. Nor can we rely on Domain-Specific Architecture (DSA) unless it facilitates and adapts to software advancement.

Instead of figuring out which hardware is for AI, a fast-moving target, we look at AI hardware from the perspective of AI having Differentiable Programming at its heart. Here, an AI software program is a computation graph consisting of computation nodes trained together to achieve an end-to-end objective. Deeply pipelined DSA hardware can serve as a computation node as long as it is differentiable. Software programmers can freely plug differentiable hardware into computation graphs for high-performance and creative problem-solving, like a pre-built, customizable software component. There should no longer be a “purity” check for AI hardware, which can now include differentiable hardware.


Hopefully, software and hardware will once again advance in parallel through a virtuous cycle as they did when Moore’s Law was in full swing.

The Agony of AI Hardware Architects

Amid a myriad of GPU contenders in the AI marketplace, Tesla unveiled the Dojo supercomputer. The Dojo seems to be a tour de force in networking, integration, and scalability. On the other hand, the D1 chip, the building block of the Dojo, is hardly an architectural breakthrough. We can group the GPU contenders into two camps, the Many-Core and the Many-MAC. The D1, exemplifying the Many-Core camp, is a “mesh” of many CPU cores. On the other hand, the Tesla FSD chip and the Google TPU, epitomizing the Many-MAC camp, feature a small number of large Matrix Multiplication (MM) accelerators, each packing many Multiply-Accumulate (MAC) units in a “mesh.” As we can see, the AI architecture debate is between the meshes and the GPU.

Under pressure from the sky-rocketing effort needed to build a chip, it must be nerve-wracking for AI hardware architects to follow media coverage of benchmarks and conferences. AI hardware designed to replace the GPU often struggles to run the benchmarks and hot-off-the-press NN models that, ironically, run well on the “old-fashioned” GPU. As shown in the diagram below, the Many-Core and the GPU differ essentially in how they exchange data: the former passes data through an interconnecting mesh, while the latter shares data through a memory hierarchy. This difference has little to do with AI. It remains to be seen whether the Many-Core, such as the D1 chip, will eventually outperform the GPU. I will cover the Many-MAC later.

Comparing data exchange in the Many-Core (left) and the GPU (right). (Image by Author)

Now, let’s take a quick history detour to trace the common roots of the meshes and the GPU in High-Performance Computing (HPC).

The HPC Legacy

HPC is used to solve computationally intensive problems in military research, scientific discovery, oil and gas exploration, and so on. The Supercomputer, or Super for short, has been the critical hardware solution for HPC. In contrast to a generic program dealing with pointer-rich data structures such as trees and linked lists, an HPC program spends most of its time repeating data-parallel computations in “loops.”

The Rise and Fall of the Vector Super

Through the 1970s and into the 1990s, the Vector Super, designed to speed up HPC programs by unrolling the data-parallel loops into vectors, dominated the HPC marketplace. During that time, a Supercomputer was by default a Vector Super.

In the 1990s, when Moore’s Law was in full force, it became viable to build a Supercomputer by arranging many off-the-shelf CPUs in a mesh or some similar topology. This trend gave rise to the Distributed Super, which the HPC community warily referred to as “The Attack of the Killer Micros,” where “Micro” meant Microprocessor. The sentiment arose because the Microprocessor was a CPU-on-a-chip at a time when a “CPU” was typically a system built from discrete components. Eventually, the Distributed Super superseded the Vector Super and became synonymous with the Supercomputer today.

The Return of the Vector Super in GPGPU

In the early 2000s, Moore’s Law showed signs of aging, bringing the race for CPU clock speed, the primary source of single-chip computing performance, to a halt. The industry responded by putting more than one CPU core on a chip, expecting parallelism to become the new primary source of performance. This trend led to the Dual-Core, the Quad-Core, and eventually the Many-Core, effectively a Distributed Super-on-a-chip, which commonly arranges the CPU cores in a mesh. Examples of the Many-Core include Intel’s two failed attempts to take on the GPU: Larrabee for the 3D marketplace and Larrabee’s descendants, the Xeon Phi series, for HPC.

The GPU traditionally unrolls the “loops” over Graphics entities such as vertices, triangles, and pixels. GPU architects extended this capability to the loops in HPC applications, making the GPU effectively a Vector Super-on-a-chip. They named this use of the GPU in HPC General-Purpose GPU (GPGPU) computing. Fatefully, having given way to the Distributed Super in the HPC marketplace, the Vector Super reincarnated as the GPU to avenge its rival. We can see the GPU’s commercial success in top Supercomputers such as the Titan Supercomputer at Oak Ridge National Lab and the Piz Daint at the Swiss National Supercomputing Center.

In A Nutshell

  • The Distributed Super kicked the Vector Super out of the HPC marketplace.
  • The Many-Core is a Distributed Super-on-a-chip, and the GPU is a Vector Super-on-a-chip for HPC.

Matrix Multiplication (MM) and AI

How did the meshes, old hammers in computer architecture, get rebranded and retrofitted for AI, the new nail?

MM and HPC

A timeless rule in computer architecture is that moving data is more expensive than computing on it, so architectures must perform more computation per unit of data moved. Fortunately, the HPC community learned from decades of practice that it could express most HPC problems in MM, which has a high compute-to-communication ratio: roughly speaking, N³ MAC operations on 2N² data. If implemented appropriately, problem-solving with MM can achieve high performance by hiding the data transfers behind computation. Therefore, an HPC programmer needs only a sound MM library, provided by Supercomputer vendors. When computing an MM, today’s Distributed Super can fully utilize hundreds of thousands of nodes spread over hundreds of thousands of square feet, effectively keeping every single node busy with computations.
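To see why MM hides data movement so well, here is a quick back-of-the-envelope sketch in Python (the matrix size is arbitrary):

```python
# Rough compute-to-communication ratio of a square matrix multiplication
# C = A @ B, where A, B, and C are all N x N.
N = 4096

macs = N ** 3           # one multiply-accumulate per (i, j, k) triple
inputs = 2 * N ** 2     # elements of A and B that must be moved in
outputs = N ** 2        # elements of C that must be moved out

print(f"MACs: {macs:,}")
print(f"Input elements: {inputs:,}")
print(f"MACs per input element: {macs / inputs:.0f}")  # ~N/2, grows with N
```

The ratio grows linearly with N, so the larger the tiles a node works on, the more room it has to overlap computation with data transfer.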

The Rise of MM in AI

Machine Learning (ML) using Neural Networks (NNs) characterizes modern AI. An NN model consists of deep layers of ML kernels. Before the Convolutional Neural Network (CNN), the most popular type of NN was the Multi-Layer Perceptron (MLP). The fundamental ML kernel of the MLP is Matrix-Vector Multiplication (MVM), which performs roughly N² MAC operations on N² data with almost no data reuse. On the other hand, the predominant primitive of the CNN is Tensor Convolution (TC). As I explained in my article “All Tensors Secretly Wish to Be Themselves,” MM and TC are structurally equivalent in terms of data movement and sharing, so we often use tensors and matrices interchangeably.
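To make the structural equivalence concrete, here is a minimal PyTorch sketch (shapes chosen arbitrarily) that lowers a small convolution to a single matrix multiplication via the classic im2col transformation and checks it against the framework's own convolution:

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)     # input: batch 1, 3 channels, 8x8
w = torch.randn(4, 3, 3, 3)     # weights: 4 output channels, 3x3 kernel

# Reference: the framework's own convolution.
ref = F.conv2d(x, w)            # shape (1, 4, 6, 6)

# im2col: unfold 3x3 patches into columns, then convolve as one MM.
cols = F.unfold(x, kernel_size=3)   # (1, 3*3*3, 36)
mm = w.view(4, -1) @ cols           # (1, 4, 36), a plain matrix multiply
out = mm.view(1, 4, 6, 6)

print(torch.allclose(ref, out, atol=1e-5))  # True: TC and MM agree
```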

The use of MM as a primitive has brought breakthroughs in HPC as well as AI. CNN, principally using MM, triggered the AI breakthrough in Computer Vision. The Transformer, which also uses MM extensively, ignited the AI breakthrough in Natural Language Understanding (NLP).

Thanks to AI and its heavy use of MM, the computer architecture community has a once-in-a-century opportunity to focus on the razor-sharp objective of optimizing MM while having a broad impact on computing in general — more bang for the buck.

The Many-Core can run the same MM algorithms developed for the Distributed Super. In a sense, the Many-Core for AI goes back to its HPC roots.

The Tide of the Many-MAC

The Systolic Array was introduced in 1982 to accelerate MM and other applications. If accelerating MM had been as fashionable then as it is in today’s AI context, the Systolic-Array researchers would not even have bothered with applications other than MM. The Systolic Array is a mechanism for packing MAC units much more densely than in a CPU core. The drawback, however, is that we cannot use the MM MAC units for anything else. With this lack of versatility, the Systolic Array did not see marketplace acceptance until AI became the killer application of MM, prompting Google to adopt it in the TPU as an MM accelerator. Since then, the marketplace has spawned variants to improve upon the original. Here, I refer to both the original Systolic Array and its variants as the Many-MAC. To handle non-MM operations, the Many-MAC adds companion processors.
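As a rough illustration of the mechanism, and not a model of any particular product, the following Python sketch simulates an output-stationary systolic array: operands enter skewed from the left and top edges, hop one processing element (PE) per cycle, and each PE only multiplies, accumulates, and forwards.

```python
import numpy as np

def systolic_matmul(A, B):
    """Behavioral sketch of an output-stationary systolic array computing C = A @ B."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))       # each PE (i, j) accumulates C[i, j] in place
    a_reg = np.zeros((M, N))   # value of A currently held in PE (i, j)
    b_reg = np.zeros((M, N))   # value of B currently held in PE (i, j)

    for t in range(M + N + K - 2):  # total cycles until the array drains
        # Update PEs far-corner first so neighbors still hold last cycle's values.
        for i in reversed(range(M)):
            for j in reversed(range(N)):
                # Row i of A enters PE (i, 0) skewed by i cycles; column j of B
                # enters PE (0, j) skewed by j cycles.
                a_in = a_reg[i, j - 1] if j > 0 else (A[i, t - i] if 0 <= t - i < K else 0.0)
                b_in = b_reg[i - 1, j] if i > 0 else (B[t - j, j] if 0 <= t - j < K else 0.0)
                C[i, j] += a_in * b_in   # the only arithmetic a PE performs
                a_reg[i, j] = a_in       # forward A to the right next cycle
                b_reg[i, j] = b_in       # forward B downward next cycle
    return C

A, B = np.random.rand(4, 5), np.random.rand(5, 3)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```

The point is that each PE needs only a MAC unit and two forwarding registers, which is why the MAC density can be so much higher than in a general-purpose core, and also why those MACs are useless for anything other than MM.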

On the other hand, a CPU core in a Many-Core such as the D1 chip, or a GPU Shader core, can embed much smaller Many-MACs, effectively becoming a Many-MAC container.

In A Nutshell

  • AI and HPC crossed paths as they are both dominated by MM.
  • The Many-Core and the Many-MAC are no more AI-specific than the GPU.

Domain Shift and Domain-Specific Parallelism

The Dark Silicon and the Power Wall

Shortly after 2010, the industry realized that doubling parallelism, the primary source of computing performance, by doubling the CPU cores could not keep the virtuous cycle going: each CPU core could not halve its power consumption to deliver twice the parallelism per watt. Over several iterations of core doubling, a majority of the cores would remain unpowered under the same power budget, resulting in Dark Silicon, or more accurately, Dark Cores. As shown in the conceptual graph below, when we go from 2 cores to 4 cores, only 3 of the 4 cores can be powered, and when we go from 4 to 8, only 4 can be powered. Finally, only 4 out of the 16 cores can be powered, rendering no benefit to going from 8 to 16 cores. We refer to this phenomenon as hitting the Power Wall.

For this reason, a sizable portion of the computer architecture community shied away from parallelism. Furthermore, the pessimists regarded parallelism-barren, pointer-rich computations as the mainstream and parallelism-abundant HPC as a niche. They believed the virtuous cycle would stop prematurely at Amdahl’s Ceiling, the limit on what parallelism can achieve.

Dark Silicon, or Dark Cores. (Image by Author)

AI to the Rescue

Coincidentally, AI emerged during this pessimism. According to the Stanford AI Index Report, AI has been advancing as if the Power Wall did not exist!

The key is that there can be domain shifts in mainstream software, resulting in different types of parallelism. As shown in the conceptual graph below, when mainstream software undergoes a domain shift from pointer-rich to data-parallel computations, it redefines one degree of parallelism as a Single-Instruction-Multiple-Data (SIMD) lane rather than a CPU core. We see a higher curve (labeled SIMD lanes for data-parallel) than the CPU-core curve. Next, when the mainstream software entered the MM-heavy AI space, an even higher curve (labeled MM MACs for MM-heavy) was added, with one MM MAC representing one degree of parallelism. As we can see, by exploring more efficient domain-specific parallelism and raising Amdahl’s Ceiling, computing performance continues growing behind the Power Wall.

By the way, MM-heavy AI has its own Amdahl’s Ceiling. An AI application needs loop frontends to distribute MM operations to parallel computing resources and loop backends to collect the results for serial operations such as normalization or softmax. Amdahl’s Law kicks in once there are enough MM MACs to speed up the MM itself, leaving the loop frontends and backends as the bottlenecks.
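A quick worked example of that ceiling (the 90/10 split below is illustrative, not a measurement):

```python
def amdahl_speedup(parallel_fraction, parallel_speedup):
    """Amdahl's Law: overall speedup when only a fraction of the work is parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)

# Suppose 90% of the runtime is MM and 10% is loop frontends/backends
# (normalization, softmax, data marshaling).
for mm_speedup in (10, 100, 1_000, 1_000_000):
    print(mm_speedup, round(amdahl_speedup(0.9, mm_speedup), 2))
# 10 -> 5.26, 100 -> 9.17, 1000 -> 9.91, 1e6 -> ~10.0
# No matter how many MM MACs we add, the ceiling is 1 / 0.1 = 10x.
```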

Moreover, as the decline of Moore’s Law worsens, it becomes questionable whether making ever-wider machines to accelerate MM can sustain the AI virtuous cycle. To solve this problem and further raise Amdahl’s Ceiling, we need to execute a new domain shift and explore new domain-specific parallelism. In other words, we need to add a new curve (???) to the following conceptual graph.

Conceptual graph of parallelism scaling by different domain shifts. (Image by Author)

In A Nutshell

  • We have stayed behind the Power Wall through domain shifts from pointer-rich to data-parallel to MM-heavy computations.

The Next Domain Shift

Differentiable Programming

According to Raja Koduri at Intel, “Neural nets are the new apps. What we see is that every socket, [whether] it is CPU, GPU, [or] IPU, will have matrix acceleration.”

Tesla’s Ganesh Venkataramanan describes the D1 chip as a “pure” ML machine that runs “ML kernels” without legacy hardware. Perhaps he implies the GPU is not as pure as the D1 since it has Graphics-specific hardware sitting idle during AI processing.

These two opinions raise two questions. Should the AI domain shift stop at accelerating Matrix Multiplication? Should AI hardware exclude legacy domain-specific designs?

Now, we explore a different view of AI hardware from the perspective of AI having Differentiable Programming (DP) at its heart. An AI software program is a computation graph, as illustrated below, consisting of parameterized computation nodes, each taking the outputs of upstream nodes as inputs and computing outputs to feed downstream nodes. We determine the parameters of all computation nodes through “training,” which first calculates the end-to-end loss using the final output and then the gradient of that loss with respect to that output. It then calculates the intermediate gradients repeatedly using the standard calculus chain rule, propagating in the direction opposite to the outputs.

DP requires only that a computation node be differentiable, allowing joint optimization with all other nodes to minimize the end-to-end loss through gradient descent. The differentiability of a computation node enables it to maintain a feedback path from its downstream to its upstream neighbors, completing an end-to-end feedback loop. Under DP, a computation node is not necessarily a conventional “ML kernel.” A computation graph can be heterogeneous, including non-ML software and hardware nodes, as long as they satisfy the differentiability requirement.

Computation Graphs

We show a conceptual diagram of a computation graph below.

A Conceptual Computation Graph (Image by Author)

A computation node that computes output y from input x using parameter w evaluates and remembers the output/input differential (∂y/∂x) used to calculate the input gradient. The blue dashed curves show the feedback path that propagates the input gradient to the upstream nodes. If necessary, the node also computes and remembers the output/parameter differential (∂y/∂w) for calculating the parameter gradient used to adjust the parameters. Let’s see some examples.
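Before those examples, here is a minimal PyTorch sketch of such a node. HardwareNode is a hypothetical stand-in for a differentiable hardware call; its toy forward pass is just y = w * x, but the bookkeeping of differentials and gradients mirrors the diagram above:

```python
import torch

class HardwareNode(torch.autograd.Function):
    """Toy computation node: y = w * x, standing in for a differentiable hardware call."""

    @staticmethod
    def forward(ctx, x, w):
        # Remember the output/input differential (dy/dx = w) and the
        # output/parameter differential (dy/dw = x) for the backward pass.
        ctx.save_for_backward(x, w)
        return w * x

    @staticmethod
    def backward(ctx, grad_output):
        x, w = ctx.saved_tensors
        grad_x = grad_output * w   # input gradient, fed back to upstream nodes
        grad_w = grad_output * x   # parameter gradient, used to adjust w
        return grad_x, grad_w

# Plug the node into a computation graph and run one backward pass.
x = torch.randn(8, requires_grad=True)   # pretend this is an upstream node's output
w = torch.ones(8, requires_grad=True)    # the node's adjustable parameters
loss = HardwareNode.apply(x, w).sum()    # a trivial end-to-end loss
loss.backward()
print(x.grad, w.grad)                    # gradients reach both upstream and the parameters
```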

Differentiable Graphics-in-the-Loop

More and more NN models have heterogeneous computation nodes, fitting the definition of Differentiable Programming. Good examples are those that solve the Inverse Graphics problem. In contrast to Forward Graphics, which generates 2D images from 3D scene parameters, Inverse Graphics recovers scene parameters from 2D images. Emerging AI-based Inverse Graphics solutions typically include a differentiable graphics renderer which, unlike a traditional one, back-propagates gradients to its upstream nodes and participates in gradient descent to minimize an end-to-end loss. The power of the Inverse Graphics pipeline with differentiable Graphics-in-the-loop lies in making Inverse Graphics “self-supervised,” as shown in the diagram below.

Image by Author

The reconstruction NNs obtain scene parameters from a real-world image, while the differentiable graphics renderer produces a virtual-world image from those scene parameters. Two identical downstream NNs prepare the real-world and virtual-world images for computing the end-to-end loss between them. Without the differentiable graphics in the loop, we would have to prepare 3D ground truths for the scene parameters. With it, the real-world image effectively serves as the ground truth for the virtual-world image, making the process self-supervised.
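A minimal PyTorch sketch of that loop, where the reconstruction network, the differentiable renderer, and the shared downstream network are all tiny hypothetical stand-ins rather than any particular library's models:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for the pipeline's components; shapes and sizes are arbitrary.
recon_net = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 16))  # image -> 16 scene params
renderer = nn.Linear(16, 64 * 64)      # placeholder for a differentiable renderer
feature_net = nn.Linear(64 * 64, 32)   # shared downstream NN, treated as fixed here

optimizer = torch.optim.SGD(
    list(recon_net.parameters()) + list(renderer.parameters()), lr=1e-3)

real_image = torch.rand(1, 64, 64)            # stands in for a camera frame
scene_params = recon_net(real_image)          # reconstruction NNs
virtual_image = renderer(scene_params)        # differentiable Graphics-in-the-loop
# The real-world image serves as the ground truth for the virtual-world image.
loss = nn.functional.mse_loss(feature_net(virtual_image),
                              feature_net(real_image.flatten(1)))
optimizer.zero_grad()
loss.backward()        # gradients flow back through the renderer into recon_net
optimizer.step()
```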

Current differentiable renderers, such as Soft Rasterizer, DIB-R, and those in AI frameworks like PyTorch3D and TensorFlow Graphics, are software renderers that do not use Graphics-specific hardware. Such software implementations are not as MM-heavy as typical ML kernels and therefore cannot leverage MM acceleration.

On the other hand, GPU architects design and provision Graphics-specific hardware with a sufficiently deep pipeline so that it is fast and rarely becomes a bottleneck. Now, imagine that we make such a pipeline “differentiable hardware.” Software programmers could effectively use the differentiable hardware in a computation graph, just as they use a pre-built software component. This hardware Graphics-in-the-Loop should be much faster than its software counterpart thanks to the deep pipeline parallelism of the Graphics-specific hardware.

Differentiable ISP-in-the-Loop

In addition to using differentiable hardware as a pre-built software component, we can “program” it by adjusting its parameters with gradient descent, just as we “train” an ML kernel. For example, an Image Signal Processor (ISP) captures images through the lens and processes them in a pipeline to produce pictures for human consumption or for downstream Image Understanding (IU) tasks, such as object detection or semantic segmentation. A traditional ISP has an ample parameter space that requires expert tuning for human consumption, yet this space remains mainly untapped by the experts who train downstream IU NN models. Instead, they train the NN models on images pre-captured and pre-processed by an ISP with fixed parameter settings. Furthermore, the lens systems that capture the images may suffer defects during manufacturing and operation. Without joint optimization and on-device adjustment with the ISPs, the IU NN models will not perform satisfactorily.

There have been flourishing proposals to replace certain ISP processing stages with NN models, although these are not necessarily practical or better under tight power and real-time constraints. On the other hand, emerging research strives to exploit the untapped parameter space of ISPs. Here are some examples:

  1. Non-differentiable ISP hardware-in-the-loop for parameter auto-tuning with non-ML optimization.
  2. An NN model trained to imitate an ISP as a differentiable proxy for parameter auto-tuning using ML.

The research above has shown that by setting up an end-to-end objective for specific IU tasks, an auto-tuned ISP outperforms one without auto-tuning.

The first approach cannot jointly optimize a non-differentiable ISP with other NN models. The second approach, using a differentiable proxy, is helpful for training, but its drawback is that we need to train the proxy separately in a carefully controlled setting.

Now, imagine making an ISP differentiable. We could compose an adaptive sensing pipeline with ISP-in-the-Loop, as shown in the diagram below. It could jointly tune itself on-device with pre-ISP and post-ISP NN models, adapting to the operating environment and the IU tasks. Note that we do not fix the pre-ISP and post-ISP NN models, just as GPU architects do not dictate the Graphics shaders (see my article Will the GPU Star in a Golden Age of Computer Architecture).

Image by Author
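As a sketch of such joint, on-device tuning, imagine a differentiable ISP exposed to the ML framework as a module with a handful of tunable parameters. The white-balance gains and gamma curve below are purely illustrative, as is the toy IU model:

```python
import torch
import torch.nn as nn

class DifferentiableISP(nn.Module):
    """Hypothetical differentiable ISP stage: white-balance gains plus a gamma curve."""
    def __init__(self):
        super().__init__()
        self.wb_gains = nn.Parameter(torch.ones(3))    # per-channel white-balance gains
        self.gamma = nn.Parameter(torch.tensor(2.2))   # tone-curve exponent

    def forward(self, raw):                            # raw: (B, 3, H, W) sensor data in [0, 1]
        balanced = raw * self.wb_gains.view(1, 3, 1, 1)
        return balanced.clamp(min=1e-6) ** (1.0 / self.gamma)

isp = DifferentiableISP()
iu_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # toy downstream IU model

# Jointly tune the ISP parameters and the IU model against the task loss.
optimizer = torch.optim.Adam(list(isp.parameters()) + list(iu_net.parameters()), lr=1e-3)
raw, labels = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
loss = nn.functional.cross_entropy(iu_net(isp(raw)), labels)
optimizer.zero_grad()
loss.backward()          # gradients reach both the IU model and the ISP parameters
optimizer.step()
```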

Conclusion

We have introduced the concept of differentiable hardware using the examples of Graphics-in-the-Loop and ISP-in-the-Loop. For the next level, imagine we already have both a differentiable ISP and a differentiable GPU on one chip, and we want both self-supervised Inverse Graphics and self-adjusting sensing. As shown below, we can potentially compose a new pipeline by joining the Graphics-in-the-loop and ISP-in-the-loop pipelines.

Image by Author

As we can see, a differentiable hardware unit is programmable in the following three aspects:

  1. AI programmers can use it in computation graphs, as they use a pre-built and customizable software component in software development.
  2. AI programmers can auto-tune the parameters of a differentiable hardware unit with the same ML frameworks used to train NN models.
  3. AI programmers have the freedom to choose from a variety of NN models to work with this differentiable hardware unit, just like Graphics programmers can freely program different types of shaders.

AI has shifted the domain of mainstream software to MM-heavy computations, and software programmers can reduce a wide range of applications to ML kernels. To revive the virtuous cycle of Moore’s Law, we will need another domain shift. Instead of figuring out which hardware is for AI, a fast-moving target, we should follow the heart of AI, Differentiable Programming, to change how we design and use computing hardware. There should no longer be a “purity” check for AI hardware, which can now include differentiable hardware.

Hopefully, hardware can extend its lifetime in innovative software, and software can leverage hardware as pre-built and customizable components. Both can propel each other through a new virtuous cycle, just as they did when Moore’s Law was in full swing.



