Hardware for Deep Learning. Part 4: ASIC

Grigory Sapunov
49 min read · Jan 12, 2021

This is a part about ASICs from the “Hardware for Deep Learning” series. The content of the series is here.

As of the beginning of 2021, ASICs are the only real alternative to GPUs for
1) deep learning training (definitely), and
2) inference (less so, because there are tools for using FPGAs with a not-so-steep learning curve, and there are ways to do efficient inference on CPUs).

Now, when every large company is launching its own DL or AI chip, it's impossible to stay silent. So, ASICs.

Table of Contents

· ASIC
· Google TPU
TPU v1
TPU v2
TPU v3
TPU v4
TPU POD (and Multi-POD)
· Practical considerations
Performance
Price
Code
· Habana (acquired by Intel)
Habana Goya
Habana Gaudi
· Intel (Nervana) NNP
NNP-T (discontinued)
NNP-I (discontinued)
· Graphcore IPU
Colossus MK2 GC200 IPU
Colossus MK1 GC2 IPU
IPU architecture
Systems and PODs
Benchmarks
Software
· Cerebras
Software platform
· AWS
AWS Inferentia
AWS Trainium
· Huawei Ascend
Atlas 300I Inference Card
Atlas 300T Training Card
DaVinci AI architecture
· Bitmain Sophon
· Alibaba Hanguang 800
· Baidu Kunlun
· Groq
· Qualcomm Cloud AI 100
· Others
· Other interesting links
· Summary
· Release Notes

ASIC

ASIC (application-specific integrated circuit) is an integrated circuit customized for a particular use, rather than intended for general-purpose use like a CPU.

ASICs are more specialized than GPUs because a GPU is still a massively parallel processor with thousands of computational units capable of executing many different algorithms, while an ASIC is a processor designed to do a very small set of computations (say, only matrix multiplications). But it does so extremely well.

However, some of the ASICs described here are rather universal computers capable of solving different tasks (and in AI there are many different complex tasks, not only neural networks), just with architectures that are uncommon for now. These processors go by different names: NPU, XPU, TPU, IPU, and so on, and maybe one day they will become as fundamental to the computing field as the GPU once did.

Compared to an FPGA, you can't reprogram an ASIC to do a different thing once you need it (but remember, some of them are rather universal): its logic is fixed at production time, while on an FPGA you can create another circuit that better suits your needs.

Due to this specialization, ASICs are usually much more energy efficient. AlphaGo is a good example. While the first versions of AlphaGo ran on a cluster of CPUs and GPUs (the distributed version in October 2015 used 1,202 CPUs and 176 GPUs), the later versions used TPUs and achieved better results. A truly clean comparison would also have to account for the computing requirements of the algorithms themselves, but I think overall performance plus power consumption is a good proxy here.

https://deepmind.com/blog/article/alphago-zero-starting-scratch

Developing specialized hardware is a pretty long-term play.

According to Google's Jeff Dean: "Chip design projects that are started today often take 18 months to 24 months to finish the design, fabricate the semiconductor parts and get them back and install them into a production data center environment. For these parts to be economically viable, they typically must have lifetimes of at least three years. So, the challenge for computer architects building ML hardware is to predict where the fast-moving field of machine learning will be in the 2 to 5-year time frame."

More on the economics of ASICs.

ASICs are very interesting from different points of view. From the technical point of view, they offer specialized architectures, engineering solutions, and tradeoffs, and that's the tasty part for technically inclined people. For practitioners, the most important points are performance, software support, and, last but not least, price.

There is a lot of movement toward ASICs right now. If you have heard about Google TPUs (and you could hardly have lived through the last few years without hearing of them), then you know what this is about. The TPU seems to be the most famous example of an ASIC.

Google TPU

Google TPU, or Tensor Processing Unit, now exists in its 4th generation (the Google Edge TPU is a bit of a different story; we'll talk about it in a separate post on Edge AI), yet the latest publicly available generation is still the 3rd. The first one, TPU v1, appeared in 2015 and was designed for INT8 inference only. TPU v2 and v3 are suitable for training as well.

https://www.youtube.com/watch?v=rP8CGyDbxBY&list=PLOU2XLYxmsIJ5Bl3HmuxKY5WE555cu9Uc&index=4

TPU v1

Google's TPU v1 was put into production in 2015 and was used internally by Google for its own applications. In 2017, Google finally published a technical description of the chip, "In-Datacenter Performance Analysis of a Tensor Processing Unit".

According to Google, the TPU was moved from research into production very quickly, within just 22 days of the first tested silicon. The work itself started in 2013 as a high-priority project, when Google projected that if users spoke to its speech services for just 3 minutes a day, its datacenters would have to double to meet the computation demand. The goal was to improve cost-performance over GPUs by 10x. The full cycle of design, verification, building, and deployment to the datacenters took 15 months.

No cloud TPU of this version was available to the public.

Google TPU v1 printed circuit card that fits into a SATA hard disk slot for drop-in installation.

The TPU was designed to be a coprocessor using a PCIe Gen3 x16 bus that provides 12.5GB/s of effective bandwidth. This choice allows plugging it into existing servers just like a GPU, up to 4 cards per server. So, TPU v1 is a kind of matrix accelerator on the I/O bus.

To simplify hardware design and debugging, the host server sends the TPU instructions to execute (as with an FPU, a floating-point unit) rather than letting the TPU fetch instructions by itself (as a GPU does).

TPU Block Diagram.

The heart of the TPU, the main computation part, is the yellow Matrix Multiply Unit in the upper right-hand corner. It contains a 256x256 array of "multiply and accumulate" units (MACs) that perform 8-bit multiply-and-adds on signed or unsigned integers, offering a peak throughput of 92 TeraOps/second (TOPS): 65,536 MACs × 700 MHz clock = 46 × 10¹² multiply-and-add operations, i.e. 92 × 10¹² operations per second.
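As a quick sanity check of that arithmetic, here is a back-of-the-envelope calculation in plain Python, counting a multiply-and-add as two operations:

```python
# Peak throughput of the TPU v1 matrix unit, from the numbers above.
macs = 256 * 256          # 65,536 multiply-and-accumulate units
clock_hz = 700e6          # 700 MHz
ops_per_mac = 2           # one multiply + one add per MAC per cycle

peak_tops = macs * clock_hz * ops_per_mac / 1e12
print(f"{peak_tops:.1f} TOPS")   # ~91.8 TOPS, i.e. the quoted 92 TOPS
```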

The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GB DRAM called Weight Memory (two 2133MHz DDR3 DRAM channels); for inference, weights are read-only. The 8 GB supports many simultaneously active models.

The 16-bit products are collected in the 4 MB of 32-bit Accumulators below the matrix unit. The 4MB represents 4096, 256-element, 32-bit accumulators.

The matrix unit produces one 256-element partial sum per clock cycle. When using a mix of 8-bit weights and 16-bit activations (or vice versa), the Matrix Unit computes at half-speed, and it computes at a quarter-speed when both are 16 bits.

The intermediate results are held in the 24 MB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit. A programmable DMA controller transfers data to or from CPU Host memory and the Unified Buffer.

The 24 MB Unified Buffer is almost a third of the die and the Matrix Multiply Unit is a quarter.

Floor Plan of TPU die.

The light (blue) data buffers are 37% of the die, the light (yellow) compute is 30%, the medium (green) I/O is 10%, and the dark (red) control is just 2%. Control is much larger (and much more difficult to design) in a CPU or GPU.

The TPU ASIC is built on a 28nm process, runs at 700MHz, and consumes 40W when running (75W TDP).

The TPU implements matrix multiplication with a systolic array, in a pipelined fashion: data from different directions arrives at the cells of the array at regular intervals and gets combined there.

The matrix unit uses “systolic execution” to save energy by reducing reads and writes of the Unified Buffer. It relies on data from different directions arriving at cells in an array at regular intervals where they are combined. A given 65,536-element vector-matrix multiply operation moves through the matrix as a diagonal wave-front. The weights are preloaded and take effect with the advancing wave alongside the first data of a new block. Control and data are pipelined to give the illusion to the programmer that the 256 inputs are read at once and instantly update one location of each of 256 accumulators. From a correctness perspective, the software is unaware of the systolic nature of the matrix unit, but, for performance, must account for the latency of the unit. (source)

Systolic data flow of the Matrix Multiply Unit

If you want to know more about systolic arrays, read the corresponding section from Google’s blog. Another good and more detailed description can be found here.
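To make the idea concrete, here is a toy, purely illustrative NumPy sketch of a weight-stationary systolic schedule: the weights stay fixed in the grid while inputs enter skewed by one cycle per column, so partial sums sweep through the array as a diagonal wave-front. This is not Google's implementation, just a functional model that produces the same result as an ordinary matrix multiply:

```python
import numpy as np

def systolic_matmul(x, w):
    """Toy weight-stationary systolic schedule for x @ w (INT8 inputs, INT32 accumulators)."""
    batch, K = x.shape
    _, N = w.shape
    acc = np.zeros((batch, N), dtype=np.int32)      # 32-bit accumulators, as in the TPU
    for t in range(K + N - 1):                      # cycles of the diagonal wave-front
        for j in range(N):                          # each column of the MAC array
            k = t - j                               # skewed (diagonal) schedule
            if 0 <= k < K:
                acc[:, j] += x[:, k].astype(np.int32) * np.int32(w[k, j])
    return acc

x = np.random.randint(-128, 127, size=(4, 8), dtype=np.int8)
w = np.random.randint(-128, 127, size=(8, 3), dtype=np.int8)
assert np.array_equal(systolic_matmul(x, w), x.astype(np.int32) @ w.astype(np.int32))
```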

Google’s benchmark figures all use the roofline performance model (see the section on GPU for more details) because it offers insights on the causes of performance bottlenecks.

The assumption behind the model is that applications don’t fit in on-chip caches, so they are either computation-limited or memory bandwidth-limited. For HPC, the Y-axis is the performance in floating-point operations per second, thus the peak computation rate forms the “flat” part of the roofline. The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed. Memory bandwidth is measured in bytes per second, which turns into the “slanted” part of the roofline since (FLOPS/sec)/ (FLOPS/Byte) = Bytes/sec. Without sufficient operational intensity, a program is memory bandwidth-bound and lives under the slanted part of the roofline.

The Roofline model for a single TPU die on log-log scales

The TPU has a long "slanted" part of its roofline, where performance is limited by memory bandwidth rather than by peak compute. Five of the six benchmarked applications are happily bumping their heads against that ceiling.

Yet, rooflines for a single Haswell die and for a single K80 die show that the six NN applications are generally further below their ceilings than was the TPU. Response time is the reason. Many of these NN applications are parts of end-user-facing services. Researchers have demonstrated that small increases in response time cause customers to use a service less. Hence, while training may not have hard response time deadlines, inference usually does. That is, inference prefers latency over throughput.

The Roofline model for TPU (blue), NVIDIA K80 GPU (red) and Intel Haswell CPU (yellow).

There was a revised TPU v1 in which the DDR3 memory was replaced by GDDR5 (as in the NVIDIA K80), which increased memory bandwidth (from 34 GB/s to 180 GB/s) and raised the roofline. The updated roofline is here.
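The roofline itself is trivial to compute once you have the two numbers; here is a small sketch using the TPU v1 figures quoted above:

```python
def roofline(peak_ops_per_s, mem_bw_bytes_per_s, operational_intensity):
    """Attainable throughput: the flat part (peak compute) or the slanted part (BW * intensity)."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * operational_intensity)

peak = 92e12                                 # 92 TOPS (INT8)
for bw in (34e9, 180e9):                     # DDR3 vs. the GDDR5 revision
    ridge = peak / bw                        # intensity where memory-bound turns compute-bound
    print(f"bandwidth {bw/1e9:.0f} GB/s -> ridge point ~{ridge:.0f} ops/byte")
    for oi in (100, 1000, 3000):
        print(f"  OI={oi}: {roofline(peak, bw, oi)/1e12:.1f} TOPS attainable")
```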

Remember, TPU v1 was designed for INT8 inference only (but also supports INT16 through software).

References:

TPU v2

TPU v2 was unveiled at Google I/O in May 2017, two years later.

While TPU v1 is a coprocessor, controlled by the host, TPU v2 and successors are Turing-complete and are suitable for both training and inference.

Importantly, TPU v2 and its successors were built for multi-chip configurations, because that became critical for heavy production workloads: some of them would take 60–400 days on a single TPU v2.

A "Cloud TPU" is a TPU board with 4 TPU chips, connected through PCIe to a host virtual machine.

Google TPU v2 card with 4 chips

Each TPU chip contains two cores, each with 8 GiB of HBM (high-bandwidth memory) and one matrix unit (MXU).

A single TPU v2 chip with two cores

TPU v2 is connected to the host by PCIe Gen3 x32.

The MXU provides the bulk of computing power in a TPU chip. Each MXU is capable of performing 16K (128x128) multiply-accumulate operations in each cycle at reduced bfloat16 (BF16) precision (more on bfloat16 and other formats here). It is supported by a vector processing unit that performs all the other computations in typical training workloads.

Matrix multiplications are performed with BF16 inputs but all accumulations happen in FP32 so the resulting matrix is FP32. All other computations are in FP32 except for results going directly to an MXU input, which are converted to BF16.
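You can see the same convention when targeting TPUs through XLA from JAX: feed BF16 operands and request an FP32 accumulation/result. A minimal sketch (the exact keyword may differ across framework versions):

```python
import jax.numpy as jnp
from jax import lax

a = jnp.ones((128, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)

# BF16 inputs, FP32 accumulation and output, mirroring the MXU behavior described above.
c = lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)   # float32
```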

Block diagram of a TPU v2 core

The Scalar Unit fetches VLIW (Very Long Instruction Word) instructions from the core’s on-chip, software-managed Instruction Memory (Imem), executes scalar operations using a 4K 32-bit scalar data memory (Smem) and 32 32-bit scalar registers (Sregs), and forwards vector instructions to the Vector Unit. The 322-bit VLIW instruction can launch eight operations: two scalar, two vector ALU, vector load and store, and a pair of slots that queue data to and from the matrix multiply and transpose units. The XLA compiler schedules loading Imem via independent overlays of code, as unlike conventional CPUs, there is no instruction cache.

The Vector Unit performs vector operations using a large on-chip vector memory (Vmem) with 32K 128 x 32-bit elements (16MB), and 32 2D vector registers (Vregs) each containing 128 x 8 32-bit elements (4 KB). The Vector Unit streams data to and from the MXU through decoupling FIFOs. The Vector Unit collects and distributes data to Vmem via data-level parallelism (2D matrix and vector functional units) and instruction-level parallelism (8 operations per instruction).

The Transpose, Reduction, Permute Unit performs efficient common matrix transformations.

ICI enables direct connections between chips (around 500 Gbit/s per link in TPU v2) to form a supercomputer using only 13% of each chip. Direct links simplify rack-level deployment, but in a multi-rack system, the racks must be adjacent.

TPU v2 chip floor plan

TPU v1 was memory bound for most of its applications. Engineers solved its memory bottleneck by using High Bandwidth Memory (HBM) DRAM in TPU v2 with 700 GB/s bandwidth instead of 34 GB/s bandwidth of DDR3 in TPU v1. More on memory speed here.

TPU v2 delivers 22.5 TFLOPS per core, so 45 TFLOPS per chip and 180 TFLOPS per Cloud TPU card.

To summarize, the significant changes from TPU v1 are:

  • switching from INT8 to BFLOAT16
  • MXU is 128x128 instead of 256x256 (but remember it’s now 16-bit)
  • two cores per chip vs a single one
  • each core still has 8 GB off-chip memory (remember, now there are two cores, so 16 GB of memory total), but now it’s high-bandwidth HBM
  • suitable for training as well, not only inference

References:

TPU v3

TPU v3 was announced in May 2018 at Google I/O, only a year after the TPU v2. It’s rather a gradual evolution of TPU v2, or maybe “TPU v2 done right” (in one deck Google called it “The Anti-Second System”).

A "Cloud TPU v3" is still a TPU board with 4 TPU chips. Each TPU v3 chip contains two cores, each with 16 GB of HBM (double the capacity, with bandwidth increased from 700 to 900 GB/s) and two matrix units (MXUs) per core (instead of a single one in TPU v2). The board is now liquid-cooled (hence the tubes in the photo).

A short visual summary of what changed in TPU v3 compared to TPU v2:

(source)

TPU v3 delivers 105 TFLOPS per chip and 420 TFLOPS per Cloud TPU card (due to twice the number of MXUs and an increased clock rate, 940 MHz vs 700 MHz).

TPUv1/v2/v3 and Volta feature comparison (source)

While TPU v2/v3 are mainly intended for heavy training, they can also be used for inference with BFLOAT16 (compared with INT8 on TPU v1):

(source)

References:

TPU v4

On July 29, 2020, Google announced the results of the MLPerf benchmark with the TPU v4, Google’s fourth-generation Tensor Processing Unit chip.

Google TPUv4 card

TPU v4 has more than double the matrix multiplication TFLOPS of TPU v3, a significant boost in memory bandwidth, and advances in interconnect technology.

The results demonstrate an average improvement of 2.7 times over TPU v3 performance at a similar scale in the last MLPerf Training competition.

(source)

Each TPU v4 chip contains two TensorCores. Each TensorCore has four MXUs (twice as many as in TPU v3), a vector unit, and a scalar unit. Peak compute is 275 TFLOPS per chip (BF16 or INT8), or 1100 TFLOPS (1.1 PFLOPS!) per card.

The TPU v4 chips have a unified 32 GB HBM memory space across the entire chip (instead of two separate 16 GB HBM modules), enabling better coordination between the two on-chip TensorCores. It also has improved HBM performance using the latest memory standards and speeds.

Here is the deep dive into TPU v4 systems by TheNextPlatform.

The comparison table by TheNextPlatform

TPU POD (and Multi-POD)

TPUs can be connected to each other, forming a supercomputer called a POD.

The v3 TPU POD

The critical feature of a modern supercomputer architecture is how its chips communicate: what is the speed of a link; what is the interconnect topology; does it have centralized versus distributed switches; and so on.

For training, most traffic is an all-reduce over weight updates from all nodes of the machine.

If switch functionality is distributed into each chip rather than placed in a stand-alone unit, the all-reduce can be built in a dimension-balanced, bandwidth-optimal way for a 2D torus topology. An on-device switch provides virtual-circuit, deadlock-free routing so that communication between chips does not require host CPU or host networking resources. To enable a 2D torus, the chip has four custom Inter-Core Interconnect (ICI) links, each running at 496 Gbit/s per direction in TPU v2.
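As a user you see this all-reduce as a single collective. For instance, a JAX sketch of data-parallel gradient summation across TPU cores looks roughly like this (psum is compiled down to the all-reduce over ICI; on a machine without TPUs it simply runs over whatever local devices exist):

```python
import jax
import jax.numpy as jnp

n = jax.local_device_count()                     # 8 on a v2-8/v3-8 Cloud TPU
grads = jnp.arange(float(n)).reshape(n, 1)       # one toy "gradient" per core

# Every core ends up with the sum of all per-core gradients (the all-reduce described above).
summed = jax.pmap(lambda g: jax.lax.psum(g, axis_name="i"), axis_name="i")(grads)
print(summed)
```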

A 2D-torus topology. TPU v2 uses a 16x16 2D torus.

The Next Platform hypothesizes that Google added a dimension to the toroidal hyper-mesh in TPU v3 and moved from a 2D toroidal mesh to a 3D toroidal mesh, but Google states 2D toroidal mesh for both versions.

The v2 TPU Pod provides a maximum configuration of 64 devices for a total of 256 chips/512 TPU v2 cores, and 4 TB of TPU memory delivering 11.5 PFLOPS peak performance.

The v3 TPU Pod provides a maximum configuration of 256 devices for a total of 1024 chips/2048 TPU v3 cores and 32 TB of TPU memory delivering 100+ PFLOPS peak performance.

You don’t need to use a full TPU POD, though. Google also lets developers rent “slices” of these machines. A Cloud TPU v3 POD slice can include 16, 64, 128, 256, 512, or 1024 chips.

TPU v4 PODs have a 3D torus topology and can include up to 4096 chips.

some common TPU v4 topologies (source)

Google is experimenting with TPU multipods. A recent paper from November 2020 called “Exploring the limits of Concurrency in ML Training on Google TPUs” presents techniques to scale ML models on the Google TPU Multipod, a mesh with 4096 TPU-v3 chips, achieving record training times from 16 to 28 seconds in four MLPerf models on the Google TPU-v3 Multipod machine.

Here the two pods are connected along the X-dimension of the mesh by the cross-pod optical links. These links are longer than standard TPU v3 within-pod links. The MLPerf benchmarking was done on a 4-pod multipod with 4096 chips in a 128x32 2-D mesh topology (with within-pod torus links at the Y edges). As the TPU v3 chip only had 1024 entries in the routing table, the authors used a sparse routing scheme where only neighbors along rows and columns were visible to each chip. This was sufficient for achieving peak throughput in all-reduce communication operations.

TPU-v3 4-pod configuration where cross-pod links connect neighboring TPU-v3 pods

The multipod can deliver 400+ PFLOPS. For comparison, NVIDIA’s DGX SuperPOD may contain 20–140 NVIDIA DGX A100 systems with a total peak performance of 100–700 PFLOPS.

References:

Practical considerations

Google TPUs are ready-to-use cloud-only alternatives to GPU. As of the beginning of 2021, TPU v2 and v3 are both available on the Google Cloud.

Performance

The minimal configuration you can get is a full card with 4 TPU chips/8 cores, v2–8 or v3–8.

The v2–8 can potentially deliver up to 180 TFLOPS of peak performance and has 16*4=64 GB of memory; the v3–8 delivers up to 420 TFLOPS with 32*4=128 GB of memory.

Remember: this peak performance is an ideal performance, and the performance on your real task can be significantly below this number due to many reasons (memory bottleneck, not fully exploiting the computational units, and so on).

These performance numbers are for BFLOAT16, so it's even harder to compare them with other solutions that do not support this format.

Anyway, we need some reference points to compare against. We'll use the nearest analog for other hardware solutions. For NVIDIA's cards, it's reasonable to use "Tensor FP16" performance for the Turing/Volta series, which typically means mixed-precision FP16/FP32 computation (FP16 math with FP32 accumulate). The Ampere series supports BFLOAT16 natively. We could compare with TF32 as well, but that's half the throughput. We do not compare against A100 sparse linear algebra performance (which is twice the dense performance) because current TPUs do not support sparse calculations.

(Again, here is a short article describing all these formats: FP32/FP16/BF16/TF32, etc)
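If you want to get a feel for the formats yourself, the key difference is that BF16 keeps the FP32 exponent range while trading away mantissa bits. A quick check (here via JAX, but any library exposing these dtypes will do):

```python
import jax.numpy as jnp

for name, dt in [("FP32", jnp.float32), ("FP16", jnp.float16), ("BF16", jnp.bfloat16)]:
    info = jnp.finfo(dt)
    # BF16 shares FP32's max (~3.4e38) but has a much coarser machine epsilon than FP16.
    print(f"{name}: max={float(info.max):.3e}, eps={float(info.eps):.3e}")
```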

So, some comparables:

  • TPU v2–8: 180 TFLOPS and 64 GB HBM memory (700 GB/sec bandwidth)
  • TPU v3–8: 420 TFLOPS and 128 GB HBM memory (900 GB/sec)
  • A100: 312 TFLOPS BFLOAT16 and 40 GB HBM2 memory (1.6 TB/sec)
  • RTX 3090: 71 TFLOPS BFLOAT16 (same for Tensor FP16) with FP32 accumulate (or 142 TFLOPS for Tensor FP16 with FP16 accumulate) and 24 GB GDDR6X (936.2 GB/s)
  • RTX 3080: 59.5 TFLOPS BFLOAT16 (same for Tensor FP16) with FP32 accumulate (or 119 TFLOPS for Tensor FP16 with FP16 accumulate) and 10 GB GDDR6X (760 GB/s)
  • Titan RTX: 130.5 TFLOPS Tensor FP16/FP32 and 24 GB GDDR6 (672 GB/s)
  • RTX 2080: 42.4 TFLOPS Tensor FP16/FP32 (or 84.8 TFLOPS for Tensor FP16 with FP16 accumulate) and 8 GB GDDR6 (448 GB/s)
  • Titan V: 110 TFLOPS Tensor FP16 and 12 GB HBM2 (652.8 GB/s)

It looks like for workloads that fit into one to several GPUs, the high-end consumer cards (Titan RTX, Titan V, or RTX 3090) are still a good option (at least they are comparable).

But the true power of TPU is in its multi-computer configuration (the same is true for A100).

Price

For single-device on-demand TPUs (not PODs), the price is $4.50/hour for TPU v2 (called v2–8, by the number of cores on a single card) and $8.00/hour for TPU v3 (v3–8).

Preemptible TPUs are cheaper (similar to AWS spot instances: a Cloud TPU can exit (be preempted) at any time if Cloud TPU requires the resources for another task). For preemptible TPUs the prices are $1.35/hour and $2.40/hour for v2–8/v3–8 respectively.

These are the prices in the US region. In other regions (the Netherlands or Taiwan) it costs more.

For POD-type TPUs, the available instances are v2–32, v2–128, v2–256, and v2–512 for TPU v2, and v3–32 and v3–64 up to v3–2048 for TPU v3. There is an evaluation price per hour, or 1-year/3-year commitments with a price per month.

The current list of prices is here.
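A quick back-of-the-envelope cost estimate using the single-device US-region prices above (a sketch only; always check the current price list):

```python
PRICES = {  # $/hour, US region, early 2021
    "v2-8": {"on_demand": 4.50, "preemptible": 1.35},
    "v3-8": {"on_demand": 8.00, "preemptible": 2.40},
}

def run_cost(tpu: str, hours: float, preemptible: bool = False) -> float:
    rate = PRICES[tpu]["preemptible" if preemptible else "on_demand"]
    return rate * hours

print(run_cost("v3-8", 100))                     # $800 for 100 on-demand hours
print(run_cost("v3-8", 100, preemptible=True))   # $240 if your job tolerates preemption
```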

There is also a TensorFlow Research Cloud (TFRC) program. Those enrolled in it are granted access to Cloud TPU v2 and v3 for a limited period of time free of charge. Specifically for the TFRC program, you are not charged for Cloud TPU as long as your TPU nodes run in the us-central1-f zone.

Some comparables again:

Code

TPU support is pretty vast among modern frameworks and languages.

You can use TPUs from TensorFlow, PyTorch, JAX, Julia, and Swift. They all use XLA to compile code for the TPU.

There is a set of supported models that are optimized for fast and accurate training on TPU. Google also has a rich set of tutorials on the topic.

Depending on your situation it might be worth optimizing code for the TPU. Here are some guidelines called “Cloud TPU programming model” and “Performance Guide”. If you want TPU-accelerated NumPy-like computations you can just use JAX.
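For example, a minimal JAX sketch on a Cloud TPU runtime (Colab or a TPU VM) looks like ordinary NumPy code, and XLA takes care of the TPU compilation:

```python
import jax
import jax.numpy as jnp

print(jax.devices())          # on a TPU runtime this lists the 8 TPU cores

@jax.jit                      # compiled via XLA for whatever backend is available
def predict(w, x):
    return jnp.tanh(x @ w)

w = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
x = jnp.ones((8, 1024), dtype=jnp.bfloat16)
print(predict(w, x).shape)    # (8, 1024)
```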

Habana (acquired by Intel)

On December 16, 2019, Habana Labs, an Israel-based developer of programmable deep learning accelerators, was acquired by Intel.

Habana provides products both for training (Gaudi) and inference (Goya).

Let’s start with the inference part.

Habana Goya

The Goya Inference Processor is based on the scalable architecture of Habana's proprietary Tensor-Processing Core (TPC) and includes a cluster of eight programmable cores. The TPC is a VLIW SIMD vector processor with an instruction set architecture and hardware tailored for deep learning workloads.

GOYA High-level Architecture

The information on the TPC architecture is very limited.

The TPC is C/C++ programmable, providing the user with maximum flexibility to innovate, coupled with many workload-oriented features such as General Matrix Multiply (GEMM) acceleration, dedicated special-function hardware, tensor addressing, and latency-hiding capabilities.

All Goya engines (TPCs, GEMM, and DMA) can operate concurrently and communicate via shared memory. As an external interface, the processor uses PCIe Gen4 x16, enabling communication with any host of choice. The processor includes two 64-bit channels of DDR4 memory with a maximum capacity of 16 GB.

The TPC natively supports these mixed-precision data types: FP32, INT32/16/8, UINT32/16/8.

There is a GOYA HL-100 PCIe card (Gen 4.0, 16 lanes) built with the Goya HL-1000 processor, with 16 GB of DDR4 memory with ECC on board and a 200W TDP. There is also a GOYA 8-card server.

HL-100 PCIe card

Habana provides its own SynapseAI compiler and runtime. SynapseAI can be interfaced directly using either a C or Python API; it natively supports ONNX and TensorFlow 2.2 today, with native PyTorch and ONNX Runtime support to follow.

GOYA Inference Platform — Software Stack

There are benchmarks on BERT, ResNet50, and other network architectures in the whitepapers. There are also MLPerf Inference 0.5 results from November 2019.

References:

Habana Gaudi

For training, there are the Gaudi accelerators: the HL-205 mezzanine card and the HL-200/202 PCIe cards.

Both use the Gaudi HL-2000 processor and have 32 GB of HBM2 memory with ECC (1 TB/s bandwidth) on board. The host interface is PCIe Gen 4.0 x16.

Gaudi HL-205 mezzanine card

Gaudi uses a cluster of 8 TPC 2.0 cores (the first generation of these cores was introduced in the Goya inference processor). It is also a VLIW SIMD vector processor.

Gaudi high-level architecture

The TPC core natively supports FP32, BF16, INT32/16/8, UINT32/16/8.

Gaudi also integrates RDMA over Converged Ethernet (RoCE v2) engines, supporting bi-directional throughput of up to 2 Tb/s. Each Gaudi has 10 ports of 100 Gb Ethernet, and each port can be used for either internal or external (scale-out) connectivity.

There are servers with eight (HLS-1) or four (HLS-1H) HL-205 cards. They can be gathered in a POD. Different topologies can be built with these servers.

On December 1st, 2020, AWS announced EC2 instances powered by Habana Gaudi. Habana Gaudi-based EC2 instances will be available in 2021.

Peak FLOPS for Habana products are unknown.

References:

Intel (Nervana) NNP

Intel acquired Nervana in 2016. Since then it kept promising deep learning processors "next year", up until 2019. But soon after acquiring Habana at the end of 2019, Intel decided to stop work on the Nervana line.

Anyway, the Nervana NNP processors are worth mentioning from a historical viewpoint, because they were in the spotlight for a long time.

NNP-T (discontinued)

NNP-T was the training processor. It was supposed to be useful for building PODs (say, a 10-rack POD with 480 NNP-Ts).

The processor had 24 Tensor Processing Clusters (TPCs) and 32 GB of HBM2, with a peak performance of 119 TFLOPS BF16. It was planned as a PCIe Gen 4 x16 accelerator card (300W) or an OCP Accelerator Module (375W).

It was known under the Spring Crest code name.

NNP-I (discontinued)

NNP-I was a processor for inference using mixed precision math, with a special emphasis on low-precision computations using INT8.

It contained 12 inference compute engines (ICE) plus 2 Intel architecture cores (AVX+VNNI). It was supposed to come in two variants: an M.2 form factor (1 chip) consuming 12W with up to 50 TOPS, and a PCIe card (2 chips) consuming 75W with up to 170 TOPS.

It was known as the Spring Hill microarchitecture.

References:

Graphcore IPU

Graphcore, a British semiconductor company, develops what it calls the Intelligence Processing Unit (IPU), a massively parallel processor to accelerate machine intelligence.

Colossus MK2 GC200 IPU

The GC200 is the second-generation Colossus MK2 IPU processor, announced in July 2020. Graphcore claims the second generation achieves an 8x step up in real-world performance compared to the MK1 IPU.

GC200 contains 59.4B transistors and is built using the very latest TSMC 7nm process. Each MK2 IPU has 1472 IPU-cores, running 8832 independent parallel program threads. Each IPU holds 900MB In-Processor-Memory with 47.5 TB/s bandwidth. It delivers up to 250 TFLOPS of AI compute at FP16.16 and FP16.SR (stochastic rounding).

Colossus MK1 GC2 IPU

The Colossus MK1 GC2 IPU is the first-generation IPU. It has 1216 IPU-cores, running 7296 parallel program threads. Each IPU holds 300MB In-Processor-Memory with 45 TB/s bandwidth.

The MK1 IPU offers up to 31.1 TFLOPS FP32 and 124.5 TFLOPS in mixed precision (FP16 with FP32 accumulation).

IPU architecture

The information on the Graphcore architecture is very limited. The most comprehensive description I found was Citadel’s paper called “Dissecting the Graphcore IPU Architecture via Microbenchmarking” describing the first-generation IPU. Most of the text is taken from there. If you want more details, I’d recommend diving into the paper.

IPU is a massively parallel platform. Each IPU contains processing elements called tiles (1472 in GC200, and 1216 in GC2); a tile consists of one computing core plus local memory (768KB in GC200, 256KB in GC2).

The IPU's approach to reducing memory latency is radical — it does away with shared memory entirely (though that's not true for the second generation, which supports DDR4 Streaming Memory). The IPU only offers small, distributed memories that are local and tightly coupled to each core. Each tile contains 256 KB of memory, totaling 304 MB per device (for GC2). IPU memory is implemented as SRAM, not DRAM, hence the lower latencies.

In addition to the tiles, the IPU processor contains the IPU-exchange, an on-chip interconnect that allows for high-bandwidth, low-latency communication among tiles (8 TB/s all-to-all).

Each IPU also contains ten IPU link interfaces; the IPU link is a Graphcore proprietary interconnect that enables low latency, high-throughput communication between IPU processors with 320 GB/s chip-to-chip bandwidth. This is close to NVIDIA V100 300 GB/s with NVLink 2.0 and lower than 600 GB/s for A100 with NVLink 3.0. IPU links make transfers between remote tiles as simple to the programmer as between local tiles.

Finally, the IPU contains two PCIe links for communication with CPU-based hosts.

The IPU’s emphasis on fine-grained parallelism means that the IPU can efficiently run applications that have irregular and sparse data access patterns and control flow. Unlike SIMD/SIMT architectures, IPUs don’t need large warps of threads consuming contiguous vector data to achieve high efficiency. Instead, IPUs can run individual processing threads on smaller data blocks, in a highly parallel MIMD fashion. Each thread can have completely distinct code and execution flow without incurring performance penalties.

IPU cores pay no penalty when their control flows diverge or when the addresses of their memory accesses diverge. In fact, they pay no penalty for running disjoint instruction flows that exhibit uncorrelated memory accesses. Cores access data from their respective local memory at a fixed cost that is independent of access patterns. This makes IPUs more efficient than GPUs at executing applications with irregular or random data access patterns and/or applications that are control-flow dominated, provided that working sets fit in IPU memory.

Similarly to CPUs and GPUs, IPUs achieve higher efficiency by oversubscribing threads to cores. Specifically, each IPU tile offers hardware support for 6 threads in a manner that is functionally similar to the SMT technique (Simultaneous Multi-Threading, or Intel’s Hyper-Threading) commonly found on CPUs. Each IPU tile maintains 6 resident execution contexts and multiplexes them onto shared resources, thus hiding instruction latencies (dependency, memory access and branch latencies), reducing corresponding pipeline stalls, and increasing aggregate throughput. Each tile rotates among threads according to a static, round-robin schedule. The entire IPU supports therefore 6×1216 = 7296 threads in GC2 and 6×1472=8832 for GC200. For maximum occupancy, software designers are encouraged to instantiate that many threads.

Specialized pipelines called Accumulating Matrix Product (AMP) units are present in each IPU tile. AMP units are used to accelerate matrix multiplication and convolution operations. An AMP unit can finish 64 mixed-precision or 16 single-precision floating-point operations per clock cycle.

The IPU’s memory organization requires software designers to partition their working set across the tiles’ memories appropriately and make tiles exchange data with each other when they need non-local operands. The programming model, along with Graphcore’s Poplar language and associated compiler, allows for the automatic orchestration of these data transfers.

Models that fit entirely on-chip benefit from the high bandwidth and low latency offered by local memory. Models of larger size can be sharded across IPU processors and IPU boards. The second-generation IPU can use external DDR4 Streaming Memory as well.

A system with multiple IPUs exposes the single IPU devices independently, but it also exposes Multi-IPUs. A Multi-IPU is a virtual IPU device that is comprised of multiple physical IPUs and offers all their memory and compute resources as if they belonged to a single device.

The Multi-IPU programming model is transparent to the developer. The underlying hardware makes the abstraction efficient and, in practice, no extra development effort is needed to scale applications onto large IPU systems. In contrast, CUDA applications do require extra development effort and complexity to parallelize across multiple GPUs, and even more to parallelize across hosts. The same is true for CPU parallel applications, especially across hosts.

The IPU programming paradigm is based on the Bulk Synchronous Parallel (BSP) model. The BSP model organizes computation in multiple sequential supersteps; a superstep is composed of a local computation phase, followed by a communication phase and a barrier synchronization phase.

(source)

The IPU is a true BSP machine. It faithfully incarnates hardware support, enforcement, and optimization for the three phases of each BSP superstep. Its programming model lets programmers specify processes in terms of graph vertices that compute on local data (and local data only).

In the IPU paradigm supported by the Poplar SDK, programmers describe computation as vertices, data as tensors, and data exchanges as static edges, without having to worry about the allocation of data at rest in local memories, allocation of input/output transfer buffers, or scheduling of transfers. All these tasks are carried out by the Poplar compiler. The compiler organizes the processing of machine intelligence models exclusively using this BSP paradigm.

Because the IPU implements the BSP model using its exchange and IPU links, the interconnect’s performance is the primary factor that determines the performance of BSP communication and synchronization phases and ultimately affects overall application performance.
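To make the three phases concrete, here is a tiny framework-free Python sketch of one BSP superstep; it is only an illustration of the model, not Graphcore's Poplar API:

```python
import numpy as np

def bsp_superstep(tiles, edges):
    """tiles: {tile_id: local array}; edges: static (src, dst) pairs declared up front."""
    # 1. Local computation phase: every tile computes on its own local memory only.
    results = {t: np.tanh(x) for t, x in tiles.items()}
    # 2. Communication phase: data moves only along the pre-declared static edges.
    inbox = {t: [] for t in tiles}
    for src, dst in edges:
        inbox[dst].append(results[src])
    # 3. Barrier synchronization: no tile starts the next superstep before every
    #    message has been delivered; in this sequential sketch the barrier is implicit.
    return results, inbox

tiles = {0: np.ones(4), 1: np.zeros(4)}
print(bsp_superstep(tiles, edges=[(0, 1), (1, 0)]))
```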

So, the architecture is interesting (yet we can actually say the same for many of the chips mentioned here).

Systems and PODs

The IPU-M2000 is Graphcore’s IPU system powered by 4 x Colossus MK2 GC200 IPU processors. It has 3.6GB In-Processor Memory plus up to 448GB Streaming Memory (outside of the chip) for larger models. The system delivers 1 PFLOPS FP16 peak performance.

The IPU-M2000 is Graphcore’s IPU system

The IPU-Machine features ultra-low latency 2.8Tbps IPU-Fabric to build scale-out IPU-POD data center solutions connecting up to 64,000 IPUs. The machine is designed in a slim 1U blade. The IPU-M2000 has a flexible, modular design, so you can start with one and scale to thousands.

The M2000 price was mentioned to be $32,450.

The IPU-POD64 has 16 IPU-M2000s in a standard rack. It offers seamless scale-out of up to 64,000 IPUs working as one integral whole or as independent subdivided partitions to handle multiple workloads and different users.

The Graphcore IPU-POD64

The Dell DSS8440, the first Graphcore IPU server, features 8 dual-IPU C2 PCIe cards based on the first-generation IPU. It's a standard 4U server. Dell also offers a 2-way PCIe-card IPU server for inference.

Dell DSS8440

Interestingly, Azure runs NDv3 Graphcore IPU-powered Azure VM instances in preview.

Cirrascale offers first-generation Graphcore IPUs in the cloud with the price starting at $9,999/month (or $13.70/hourly equivalent).

Benchmarks

There are benchmark results from Graphcore for 1xIPU-M2000, 4xIPU-M2000, and IPU-POD64 systems.

There are also comparisons of the IPU-M2000 with the NVIDIA A100, but to me they look like apples-to-oranges comparisons.

For example, BERT-Large inference is compared for a single A100 with 40 GB (312 TFLOPS) vs. an IPU-M2000 (1 PFLOPS), which has 4x Colossus MK2 GC200 IPU processors and 450 GB of memory; Graphcore wins with 3.4x higher throughput. Not surprising, given the similar difference in peak performance.

For BERT-Large training, there is a 5.3x difference in time-to-train, but the systems under comparison are an IPU-POD64 with 16 IPU-M2000s (which should be 16 PFLOPS, given that a single M2000 delivers 1 PFLOPS, with 450*16 GB of memory in total) and a DGX-A100 (8x NVIDIA A100, 5 PFLOPS total peak performance, 320 or 640 GB of memory). So the performance-normalized difference should be much smaller. It would be interesting to compare price-normalized numbers as well.

The difference in throughput on vision models (ResNet-50 training and inference, ResNeXt-101 training) is also between 2.6x and 4.6x for the A100 vs the M2000. For ResNeXt-101 inference, the numbers are strange and I cannot interpret them. The graphs against the "Latest GPU" are hard to interpret as well.

Citadel, a hedge fund, prepared a detailed analysis of the first-generation systems, the MK1 and M1000 machines.

Software

The Poplar SDK is a complete software stack, which was co-designed from scratch with the IPU. At a high level, Poplar is fully integrated with standard machine learning frameworks so developers can port existing models.

Poplar integrates with TensorFlow 1 & 2 via the TensorFlow XLA backend and with PyTorch via PopTorch; PopART (Poplar Advanced Runtime) supports Python/C++ model building plus ONNX model input for training and inference; support for PaddlePaddle and other frameworks is upcoming.

PopLibs is a complete set of libraries, available as open-source code (MIT License), that support common machine learning primitives and building blocks.

References:

Cerebras

We've mentioned the Cerebras 400,000-core wafer-scale processor in Part 2 of the series. But the true place for this processor is here, among the ASICs.

The Cerebras Wafer Scale Engine (WSE) is a huge monster!

WSE numbers:

  • 400,000 Sparse Linear Algebra Compute (SLAC) cores, fully programmable processing elements. (According to a recent October 2020 paper, the number is closer to 380,000)
  • 18 Gigabytes of On-chip Memory, all accessible within a single clock cycle, providing 9 PB/s of memory bandwidth.
  • 1.2 trillion (!) transistors (for comparison, NVIDIA’s A100 GPU contains 54 billion transistors, 22x less!)
  • TSMC 16nm process

The 400,000 cores on the WSE are connected via the Swarm communication fabric in a 2D mesh with 100 PB/s of bandwidth.

The most comprehensive technical details so far can be found in the October 2020 article. Below I copy some parts from its "Architecture" section.

The CS-1 wafer is a MIMD, distributed-memory machine with a 2D-mesh interconnection fabric. The repeated element of the architecture is called a tile. The tile contains one processor core, its memory, and the router that it connects to. The routers link to the routers of the four neighboring tiles.

The wafer contains a 7×12 array of 84 identical "dies". A die holds thousands of tiles. Ordinary chips are made by cutting the wafer into individual dies; in the WSE, the dies are instead connected by extending the interconnect across the "scribe lines", the spaces between dies.

CS-1 Wafer Scale Engine (WSE)

It is impossible to yield a full wafer with zero defects, so the design includes redundant cores and redundant fabric links. Redundant cores replace defective cores and extra links reconnect fabric to restore logical 2D mesh.

The memory, functional units, and instruction set are designed for high throughput numerical computation. The roughly 380,000 tiles each have their own fast SRAM memory. There is no shared memory. Local memory is 48 KB, which totals 18 GB across the wafer. The load-to-use latency is one cycle.

(Another source describes a 630x630 = 396,900 grid of processing elements. This number is much closer to 400,000. So maybe the 380,000 figure from the article reflects a WSE where more of the redundant cores are used to replace potentially defective ones? It's unclear.)

The instruction set supports operations on INT16, FP16, and FP32 types. Floating-point adds, multiplies, and fused multiply-accumulate (or FMAC, with no rounding of the product prior to the add) can occur in a 4-way SIMD manner for 16-bit operands. The instruction set supports SIMD operations across subtensors of four-dimensional tensors, making use of tensor address generation hardware to efficiently access tensor data in memory. These play the role of nested loops and eliminate any loop overhead.

In mixed precision with multiplications in FP16 and additions performed in FP32, the throughput is two FMACs per core per cycle. Purely FP32 computations run one FMAC per core per cycle. The theoretical peak performance of the system varies depending on the number of cores configured on the wafer, clock rate, and power settings.
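Since the clock rate isn't public, any peak-FLOPS estimate has to be parametric. Here is a sketch where every input, including the clock, is an assumption you supply yourself; nothing below is an official Cerebras spec:

```python
def wse_peak_tflops(cores, fmacs_per_core_per_cycle, clock_ghz):
    # 1 FMAC = 2 FLOPs (multiply + add); all arguments are user-supplied assumptions.
    return cores * fmacs_per_core_per_cycle * 2 * clock_ghz * 1e9 / 1e12

# Hypothetical example: ~380,000 cores, 2 FMACs/cycle (mixed precision), 1 GHz assumed clock.
print(wse_peak_tflops(380_000, 2, 1.0))   # 1520.0 TFLOPS under these assumptions
```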

The core supports nine concurrent threads of execution.

A stream of data to or from the fabric may be used as an input to a tensor operation, or as the destination for one. The hardware directly implements scheduling activities that would normally be performed by an operating system. This allows compact and efficient software implementations. For example, one core can be sending data from its local memory to another core; simultaneously it can receive data from another core while adding it to values stored in its local memory. All of this is accomplished using only two machine instructions that run as independent threads.

Code consists of tasks that react to events. Tasks are triggered by other tasks, or by arriving data words. The channel of the arriving word determines the code that is triggered. There is little delay between the completion of a task and the start of a subsequent task, as this is handled in hardware.

More technical details can be found in the article.

Swarm is a massive on-chip communication fabric that delivers breakthrough bandwidth and low latency at a fraction of the power draw of traditional techniques used to cluster graphics processing units. It is fully configurable; software configures all the cores on the WSE to support the precise communication required for training the user-specified model. For each neural network, Swarm provides a unique and optimized communication path.

The WSE uses a blend of parallel execution modes: a single algorithm can use both model and data parallelism during optimization.

From HotChips 2020 slides

Cerebras CS-1 is a system built on WSE.

CS-1 system

Cerebras calls it the "Cluster-scale deep learning compute in a single system". The CS-1 is 26 inches (15 rack units) tall and fits in one-third of a standard data center rack. It has 12 x 100 Gigabit Ethernet lanes.

The system consumes 20 kW of power. The CS-1 is internally water-cooled. Water circulates through a closed loop, fully self-contained within the system. Like a giant gaming PC, the CS-1 uses water to cool the WSE, and then air to cool the water.

You can join several CS-1s into a cluster.

Cerebras hasn’t released MLPerf results or any other independently verifiable apples-to-apples comparisons. Instead, the company prefers to let customers try out the CS-1 using their own neural networks and data.

There is an interesting use case from the HPC field described recently, where the solver on the CS-1 system achieved a performance of 0.86 PFLOPS using mixed precision (FP16/FP32) calculations. This should NOT be directly compared to other peak performance estimates mentioned here or there. But anyway, close to petaflops performance on a real task delivered by a single computing system is pretty impressive! I wish I had access to such a supercomputer :)

Software platform

The Cerebras software platform consists of four primary elements:

  1. The optimized Cerebras Graph Compiler (CGC)
  2. A flexible library of high-performance kernels and a kernel-development API
  3. Development tools for debugging, introspection, and profiling
  4. Clustering software

The Cerebras Graph Compiler (CGC) takes as input a user-specified neural network. Researchers can use both existing ML frameworks (TensorFlow, PyTorch) and well-structured graph algorithms written in other general-purpose languages, such as C and Python.

To translate a deep learning network into an optimized executable, CGC extracts a static graph representation of the problem from the source language and converts it into the Cerebras Linear Algebra Intermediate Representation (CLAIR).

Once the CLAIR graph has been extracted, CGC performs matching and covering operation that matches subgraphs to kernels from the Cerebras kernel library. These kernels are optimized to provide high-performance computing at extremely low latency on the fabric of the WSE. The result of this matching operation is a kernel graph. CGC then allocates compute and memory to each kernel in the graph and maps every kernel onto a physical region of the computational array of cores. Finally, a communication path, unique to each network, is configured onto the fabric.

The final result is a CS-1 executable, customized to the unique needs of each neural network so that all 400,000 SLAC cores and 18 Gigabytes of on-chip SRAM can be used at maximum utilization.

The Cerebras software platform includes a kernel API and C/C++ compiler based on the LLVM toolchain that allows users to program custom kernels for CGC.

A high-level overview of the compilation process for the WSE

Because of the massive size of the WSE, every layer in the neural network can be placed onto the fabric at once and run simultaneously. The computation is parallel at three levels: within a core, there is multiple-operation, per-cycle parallelism; across each fabric region, the cores can work in parallel on one layer; and all layers can run in parallel on separate fabric regions.

CGC can support any hybrid execution mode combining data-parallel, layer-parallel, and layer-pipelining techniques. It can run in a traditional, layer-sequential mode to support exceptionally large networks. It can combine model and layer-parallelism, with the entire model spread across the cores of the WSE, and each layer running pipeline-parallel. It can map multiple copies of a layer-parallelized model to the fabric at once, and train them all in a data-parallel fashion.

More on neural network parallelism on WSE can be found here.

It's not a practical choice for most of us (unless you're a national lab or a large enterprise, or unless Cerebras provides cloud access one day). Some estimates suggest a single system could cost near $2M.

But anyway this is an interesting solution that shows us what’s possible.

And Cerebras just announced their second generation of WSE:

  • 850,000 AI-optimized cores
  • 2.6 Trillion Transistors
  • TSMC 7nm Process

Waiting for more details, it should be cool!

References:

AWS

Amazon has its own solutions for both training and inference.

AWS Inferentia

AWS Inferentia was announced in November 2018. It was designed by Annapurna Labs, a subsidiary of Amazon.

Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine (like the Google TPU). NeuronCores are also equipped with a large on-chip cache (but the exact numbers are unknown). [source]

(source)

AWS Inferentia supports FP16, BF16, and INT8 data types. Furthermore, Inferentia can take a 32-bit trained model and automatically run it at the speed of a 16-bit model using BF16.

Each chip can deliver 64 TFLOPS on FP16 and BF16, and 128 TOPS on INT8 data. (source)

(source)

You can have up to 16 Inferentia chips per EC2 Inf1 instance. Inferentia is optimized for maximizing throughput for small batch sizes, which is beneficial for applications that have strict latency requirements.

The AWS Neuron SDK consists of a compiler, run-time, and profiling tools. It enables complex neural net models, created and trained in popular frameworks such as TensorFlow, PyTorch, and MXNet, to be executed using Inf1 instances. AWS Neuron also supports the ability to split large models for execution across multiple Inferentia chips using a high-speed physical chip-to-chip interconnect.
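The PyTorch flow, roughly as documented in the Neuron SDK at the time, is an ahead-of-time trace/compile step followed by loading the compiled model on an Inf1 instance. A sketch (package names and APIs may have changed since):

```python
import torch
import torch_neuron          # AWS Neuron SDK plugin for PyTorch (Inf1)
from torchvision import models

model = models.resnet50(pretrained=True).eval()
example = torch.zeros([1, 3, 224, 224])

# Compile the model for Inferentia NeuronCores; unsupported ops fall back to CPU.
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save("resnet50_neuron.pt")   # load later with torch.jit.load on the Inf1 instance
```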

The technical details on Inferentia are very scarce.

References:

AWS Trainium

On December 1st, 2020, Amazon announced its AWS Trainium chip.

AWS Trainium is the second custom machine learning chip designed by AWS and it’s targeted at training models in the cloud.

AWS Trainium shares the same AWS Neuron SDK as AWS Inferentia, so it’s integrated with TensorFlow, PyTorch, and MXNet.

AWS Trainium will be available in 2021.

For now, almost no technical details are available.

References:

Huawei Ascend

Huawei has its own solutions for both training and inference as well. Its lineup of AI products is pretty vast, but we'll focus mostly on the accelerator cards.

Atlas 300I Inference Card

The card is built with the Ascend 310 AI processor. The processor delivers 22 TOPS INT8 and 11 TFLOPS FP16 with 8W of power consumption.

The card is a PCIe x16 Gen3.0 device with a maximum consumption of 67 W. It supports one 100 Gbit/s RoCE high-speed port. The card contains 32 GB LPDDR4X with a bandwidth of 204.8 GB/s. A single card provides up to 88 TOPS INT8 computing power.

The pricing is not public for now; you have to request the information.

There are also Atlas 800 Inference Servers with up to 7 or 8 Atlas 300I inference cards for AI inference in data centers.

There are also the Atlas 200 AI Accelerator Module with the Ascend 310 AI processor, for image recognition and image classification on the device side, and the Atlas 200 DK AI Developer Kit with the Ascend 310 AI processor, for quick development and verification.

Atlas 200 Development Kit

References:

Atlas 300T Training Card

The training card is based on the Ascend 910 AI processor with 32 built-in Da Vinci AI Cores and 16 TaiShan Cores. The processor delivers 320 TFLOPS FP16 and 640 TOPS INT8 of computing performance with 310W of max power consumption. It’s pretty close to NVIDIA’s A100 BF16 peak performance of 312 TFLOPS. Other sources say it delivers up to 256 TFLOPS FP16/512 TOPS INT8.

Ascend 910 is a high-integration SoC processor. In addition to the Da Vinci AI cores, it integrates CPUs, DVPP, and Task Scheduler.

A single card provides 220 TFLOPS FP16, or 280 TFLOPS FP16 in the Pro version (?). These numbers are somewhat lower than the corresponding numbers for the Ascend 910 processor itself. The card has 32 GB of HBM or 16 GB of DDR4-2933 and is a PCIe x16 Gen4.0 device with up to 300W power consumption.

HCCS is Huawei’s in-house high-speed interface interconnecting Ascend 910s. On-chip RoCE interconnects nodes directly. PCIe 4.0 doubles the throughput of the previous generation.

The pricing is not public either.

There are also the Atlas 800 training server with 8 Ascend 910 processors and 32 GB of HBM (1200 GB/s), delivering 2.24 PFLOPS FP16; the Atlas 900 PoD with 64 Ascend 910 processors and 2048 GB of HBM, delivering up to 17.92 PFLOPS FP16 and supporting scaling to an AI cluster of up to 4096 Ascend 910 processors for up to 1 EFLOPS FP16; and the Atlas 900 AI Cluster, consisting of thousands of Ascend 910 AI processors and capable of delivering up to 256–1024 PFLOPS at FP16.

Atlas 900 AI Cluster

References:

DaVinci AI architecture

Both the Ascend 910 and Ascend 310 contain DaVinci AI cores: two cores in the Ascend 310 and 32 cores in the Ascend 910.

There is MindSpore, an open-source (Apache-2.0 license) deep learning training/inference framework from Huawei, with native support for Ascend AI processors and software-hardware co-optimization.

There are also tools for Caffe/TensorFlow model conversion.

References:

Bitmain Sophon

Sophon (Beijing) is a subsidiary of Bitmain (a company developing crypto mining products), focusing on the development of artificial intelligence chips and artificial intelligence products.

Sophon is named after a proton-size supercomputer described in Cixin Liu’s “The Three-Body Problem” novel.

Sophon now has several generations of their Tensor Computing Processors: the BM1680 (1st generation, 2 TFLOPS FP32, 32MB SRAM, 25W), BM1682 (2nd gen, 3 TFLOPS FP32, 16MB SRAM), BM1684 (3rd gen, 2.2TFLOPS FP32, 17.6 TOPS INT8, 32 MB SRAM) and BM1880 (1 TOPS INT8).

Compared to NVIDIA products, such FP32 peak performance is in the range of a GTX 1050 Ti (~2 TFLOPS) or a GTX 1060 (~3.4 TFLOPS).

Each BM1682 chip has 64 NPU processing units, and each NPU has 32 EU arithmetic units. BM1684 also has 64 NPUs.

SC5+

There are Deep Learning Acceleration PCIe cards: the SC3 with a BM1682 chip (8 GB DDR memory, 65W); the SC5 and SC5H with a BM1684 chip and 12 GB of RAM (up to 16 GB) at 30W max power consumption; and the SC5+ with 3x BM1684 and 36 GB of memory (up to 48 GB) at 75W max power consumption.

The Artificial Intelligence Server SA3 contains 6 or 18 BM1682 chips, has 48 or 144 GB of DDR memory, and can potentially deliver up to 18 or 54 TFLOPS.

Artificial Intelligence Server SA3

For comparison, a single NVIDIA Titan RTX has an FP32 peak performance of 16.3 TFLOPS, and a single RTX 3090 has 35.6 TFLOPS FP32 (note that these are not tensor core numbers, which are much higher but refer to mixed-precision FP16/32).

BMNNSDK (BitMain Neural Network SDK) is BitMain’s proprietary deep learning SDK based on BM AI chip. BMNNSDK is composed of BMNet Compiler and BMRuntime.

BMNet Compiler is responsible for optimizing various deep neural network models (Caffe, TensorFlow, PyTorch, MXNet, PaddlePaddle, and Darknet (?)) and converting them to the bmodel format supported by the Bitmain TPU. BMRuntime is responsible for driving the TPU chip and providing a unified programming interface for the upper-layer application.

Alibaba Hanguang 800

Alibaba unveiled its first AI inference chip, the Hanguang 800, at the Apsara Conference on September 25, 2019.

The chip is optimized for computer vision tasks, but GEMM is also accelerated.

A trained neural network model must be converted to INT8; the software tools support models from Caffe, MXNet, and TensorFlow, as well as models in ONNX format.

The chip is SRAM-only: it has 192 MB of local memory and no DDR. Keep in mind that the models are compressed and quantized, so 192 MB is actually not that small, but large models need to be split across several chips.

There are 4 cores with a ring bus. Each core has three engines: Tensor, Pooling, and Memory. The host interface is PCIe 4.0 x16.

The chip has a peak performance of 825 TOPS INT8. That's a really huge number for INT8! NVIDIA A100's INT8 performance is 624 TOPS, and twice that for sparse calculations.

The chip has configurable frequency and voltage and can work in different modes optimizing for performance or lower power consumption, from 25W to 280W.

Interestingly, its performance is independent of the batch size.

Instances with Hanguang 800 are available in Alibaba Public Cloud.

Alibaba Group also has a RISC-V Xuantie910 processor with a Vector Engine for AI acceleration. It is more relevant to our CPU section because it's not an ASIC but a general-purpose processor. Its performance is more than 300 GFLOPS FP16 (32 FLOPS/core/cycle × 2.5 GHz × 4 cores), and half that for FP32. For comparison, the lowest consumer RTX-series card, the NVIDIA RTX 2060, has 10,000 GFLOPS of pure FP16 performance and even 51,600 GFLOPS for FP16 on tensor cores, so that's a different niche.

References:

Baidu Kunlun

In July 2018, Baidu announced their AI chip called Kunlun. It is built on Baidu’s own XPU neural processor architecture.

The 14nm chip is produced at the Samsung Foundry. Kunlun offers 16GB of HBM memory with 512 GB/s bandwidth and supplies up to 260 TOPS INT8 (twice the INT8 performance of the NVIDIA Tesla T4) or 64 TFLOPS INT16/FP16 at 150W. This looks like an inference chip.
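A useful way to read such specs is the ratio of peak compute to memory bandwidth, a rough roofline-style estimate (my own back-of-the-envelope math, not from Baidu’s materials):

```python
# How many INT8 operations per byte of HBM traffic are needed to reach peak compute.
peak_int8_ops_per_s = 260e12    # 260 TOPS
hbm_bytes_per_s = 512e9         # 512 GB/s
print(peak_int8_ops_per_s / hbm_bytes_per_s)   # ~508 ops per byte
# Layers with lower arithmetic intensity than this are bandwidth-bound, which is
# why INT8 weights and on-chip data reuse matter so much for chips like this.
```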

The 2018 announcement mentioned an inference chip, Kunlun 818–100, and a training chip, Kunlun 818–300. Their characteristics are unclear, but they are both mentioned among the supported hardware of Paddle Lite (a framework for inference, which seems to be analogous to TensorFlow Lite).

A Baidu Kunlun K200 board contains a single processor.

The Baidu Kunlun software stack supports PaddlePaddle, TensorFlow, and PyTorch via a graph compiler. It also supports new operators through user-written kernels in the XPU C/C++ programming language. The deep learning library has APIs for common operators used in deep learning networks.
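A minimal sketch of what running a model on Kunlun through PaddlePaddle could look like, assuming a Paddle build with XPU (Kunlun) support; the device string and installation details are on Baidu’s side, so treat this as an illustration rather than official sample code:

```python
import paddle
from paddle.vision.models import resnet50

# Assumes a PaddlePaddle build compiled with Kunlun (XPU) support;
# without it, "gpu" or "cpu" would be used instead.
paddle.set_device("xpu")

model = resnet50(pretrained=False)
model.eval()

x = paddle.randn([1, 3, 224, 224], dtype="float32")
with paddle.no_grad():
    logits = model(x)
print(logits.shape)   # [1, 1000]
```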

The processor can be accessed via Baidu Cloud.

In September 2020, Baidu announced Kunlun 2. The new chip uses a 7 nm process technology, and its computational capability is over three times that of the previous generation. Mass production of the chip is expected to begin in early 2021.

References:

Groq

Groq develops its own Tensor Streaming Processor (TSP). Jonathan Ross, Groq’s CEO, had co-founded Google’s first TPU project before that.

Instead of creating a small programmable core and replicating it dozens or hundreds of times, the TSP houses a single enormous processor that has hundreds of functional units.

The 14nm chip contains 26.8B transistors. It has 220MB SRAM with 80TB/s on-die memory bandwidth and no board memory (it lacks DRAM controllers and interfaces) and supports PCIe Gen4 x16 with 31.5 GB/s in each direction (to download instructions and data to the TSP).

The chip delivers up to 1000 TOPS INT8 and 250 TFLOPS FP16 (with FP32 accumulation). For comparison, the NVIDIA A100 has 312 TFLOPS on dense FP16 calculations with FP32 accumulation, and 624 TOPS INT8. Groq’s number is even larger than the 825 TOPS INT8 of Alibaba’s Hanguang 800. Yet in the Linley Group report, Groq’s ResNet-50 inference performance in inferences/second is significantly lower than Hanguang’s (20,400 vs 78,563).

The description of the architecture is taken from the Linley Group report.

Groq allows instructions to execute at different times in different units. They flow first into a set of function units called Superlane 0, which executes these instructions. In the next cycle, they execute in Superlane 1, while Superlane 0 executes the second group of instructions. This technique simplifies the design and routing, eliminates the need for synchronization, and is easily scalable; the TSP chip features 20 superlanes. Within each superlane, data flows horizontally. In fact, the TSP continually pushes it across the chip on every clock cycle.

TSP conceptual diagram. Instructions flow downward through identical function units, pipelining the operations. Data flows across the processor, allowing the program to perform different operations.

Memory is embedded with the function units, providing a high-bandwidth data source and eliminating the need for external memory. The result is somewhat like a systolic array of heterogeneous function units, but the data only moves horizontally while the instructions move vertically.

The entire TSP executes a single instruction stream, so we consider it one processor core. But it’s effectively a 144-wide VLIW architecture, issuing one gross instruction per cycle to control the superlane. The TSP superlane is actually two sets of mirrored function units divided into what Groq calls the east hemisphere and the west hemisphere. Each function unit contains multiple subunits that accept instructions. For example, the vector unit contains 16 ALUs that are individually controlled.

A superlane comprises 16 lanes. Each instruction is performed on all 16 lanes at once, and then in the next superlane in the subsequent cycle, and so forth. Thus, over 20 cycles, each instruction executes on all 320 lanes across the 20 superlanes, so it effectively becomes a 320-byte SIMD operation having a 20-cycle pipeline.

TSP superlane block diagram. Every superlane is bilaterally symmetric with an east side and a west side. It contains 16 lanes, each of which is 8 bits wide. Data flows from east to west and from west to east.

From a programmer’s perspective, data is organized into streams, which physically comprise one byte per lane (320 bytes). The architecture supports 32 eastward streams and 32 westward streams. Each stream automatically progresses in its designated direction on every cycle, moving 32 bytes per lane. An instruction typically operates on data from different streams. For example, ADD S1, S2, S3 adds each value in stream 1 to the corresponding value in stream 2 and stores the results in stream 3. Thus, instead of a fixed set of 32 registers, each function unit operates on a moving set of 32 values.
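To make the stream view concrete, here is a toy model of the ADD S1, S2, S3 example in NumPy, using the lane counts from the description above (20 superlanes x 16 lanes = 320 lanes, one byte per lane per stream); it only illustrates the data layout, not the hardware timing:

```python
import numpy as np

LANES = 20 * 16   # 20 superlanes x 16 lanes = 320 lanes, one byte per lane per stream

# Two INT8 streams flowing across the chip.
s1 = np.random.randint(-128, 128, LANES, dtype=np.int8)
s2 = np.random.randint(-128, 128, LANES, dtype=np.int8)

# ADD S1, S2, S3: the vector unit adds the two streams lane-wise and the result
# continues flowing as stream 3. (NumPy int8 addition wraps modulo 256; whether
# the real ALUs wrap or saturate isn't specified here.)
s3 = s1 + s2
print(s3[:8])
```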

The lane structure is optimized for INT8 data, but larger operands (INT16, INT32, FP16, or FP32) can be formed by combining streams. The superlane applies a 9-bit error-correction code (ECC) across all 16 lanes. The SRAM is also ECC protected.

The central vector unit contains 16 ALUs per lane. Each ALU can perform a 32-bit calculation using aligned groups of four stream bytes as operands. In addition to the usual arithmetic and logical operations, these ALUs can convert between integer and floating-point formats, perform the common activation functions (ReLU, tanh), and other math functions (exponentiation, reciprocal square roots).

The matrix units handle the heavy computations. Each one contains 320 MAC units per lane that can be grouped into 20 supercells. On each cycle, each MAC multiplies its stored weight values by a pair of activation values from the streaming data. Each 16x16 supercell can compute an integer partial sum in one cycle and a complete 320-element fused dot-product in 20 cycles. Each hemisphere has 320x320 MAC units producing 409,600 INT8 operations or 102,400 FP16 operations per cycle. Using all 32 streams in each direction, the TSP can load all 409,600 weight registers in less than 40 cycles.
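These per-cycle figures are consistent with the quoted peak numbers if we assume a clock around 1.25 GHz (the clock rate isn’t stated above, so it is an assumption in this sketch):

```python
# Per-cycle throughput from the description above times an assumed ~1.25 GHz clock.
int8_ops_per_cycle_per_hemisphere = 409_600
fp16_ops_per_cycle_per_hemisphere = 102_400
hemispheres = 2
clock_hz = 1.25e9                                  # assumption, not stated in the article

int8_tops = int8_ops_per_cycle_per_hemisphere * hemispheres * clock_hz / 1e12
fp16_tflops = fp16_ops_per_cycle_per_hemisphere * hemispheres * clock_hz / 1e12
print(int8_tops)     # ~1024, matching the "up to 1000 TOPS INT8" figure
print(fp16_tflops)   # ~256, matching the "250 TFLOPS FP16" figure
```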

The switch units can reshape tensor data to better suit the compute units (rotate, transpose, etc). The switch units also perform an important function as the only units that can communicate between superlanes.

Because the architecture lacks register files, the compiler must ensure the streaming data is available to the function unit at the designated time to execute the designated instruction.

The compiler must schedule all data movement, manage the memory and function units, and even manually fetch instructions. The Groq compiler orchestrates everything: Data flows into the chip and is plugged in at the right time and the right place to make sure calculations occur immediately, with no stalls. This allows Groq’s chip performance to be deterministic. The compiler reconfigures the hardware dynamically to perform each calculation so there is no abstraction between the compiler and the chip.

The software tools accept models developed in TensorFlow, and the company is developing drivers for other popular frameworks. Groq provides software that runs on an x86 host processor to download the neural network program (instructions) and data to the TSP; the accelerator can then autonomously execute the model and return results to the host.

The single-core Groq chip achieves peak throughput while processing one image at a time. Like the Hanguang 800, it has no performance penalty for small batch sizes: the Groq architecture is hyper-focused on low-latency, single-thread performance at batch size 1.

Image recognition inference performance for ResNet-50 v2 benchmarking at small batch sizes (source)

The Groq TSP accelerator is available on the Nimbix Cloud. It’s in early access; you need to apply.

References:

Qualcomm Cloud AI 100

The Qualcomm Cloud AI 100 processor is designed for AI inference acceleration in the cloud.

It’s a 7 nm processor with up to 16 cores. It has 144MB of on-die SRAM (9MB for each AI Core) and up to 32GB of LPDDR4x at 2.1GHz. The processor supports INT8, INT16, FP16, and FP32, and in the PCIe card form factor it delivers up to 400 TOPS at a 75W TDP. That’s a good number for just 75W…
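The SRAM and efficiency numbers are easy to cross-check; for the comparison I use the A100’s 624 TOPS dense INT8 peak (quoted earlier) and its 400W SXM TDP, which is not from this article:

```python
cores, sram_per_core_mb = 16, 9
print(cores * sram_per_core_mb)      # 144 MB of on-die SRAM in total

ai100_tops, ai100_watts = 400, 75
a100_tops, a100_watts = 624, 400     # dense INT8 peak; 400 W SXM TDP (assumption)
print(ai100_tops / ai100_watts)      # ~5.3 TOPS/W
print(a100_tops / a100_watts)        # ~1.6 TOPS/W (peak-spec comparison only)
```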

The Cloud AI 100 chip has a PCIe Gen3/4 x8 interface. There are also M.2 variants with lower TDP and performance.

The technical details are pretty scarce now.

The commercial launch is 1H 2021.

References:

Others

There are other players as well.

There were mentions of the Dataflow Processing Unit (DPU) by Wave Computing (the new owner of the MIPS architecture, BTW). [1][2][3] But it’s unclear whether the company’s datacenter or desktop solutions are actually available.

There are Arm solutions, but they seem to target the edge. And, BTW, NVIDIA has announced the acquisition of Arm.

There is SambaNova, which came out of stealth mode in December 2020 with its Reconfigurable Dataflow Architecture (RDA) delivering “100s of TFLOPS”.

There is a company called Mythic that focuses on Compute-in-Memory, Dataflow Architecture, and Analog Computing.

We will be updating the article, adding new players here, and expanding them into separate sections.

Other interesting links

Summary

We are currently in a kind of Cambrian explosion situation. The amount of innovation is huge. Many interesting solutions have appeared and are already on the market, many have been announced, and many others are probably in stealth mode or yet to be developed. Some have even died (say, Nervana).

The landscape is becoming more and more complex, and it’s hard to compare different solutions. In many cases, it’s not meaningful to compare them based on technical characteristics alone (e.g. peak performance), because that gives only very rough estimates. Open benchmarks like MLPerf do provide good common ground here. But in the end, what you really want to know is the performance on your particular task. So, cloud solutions seem to be a good way to try new systems and evaluate your own tasks without heavy investment in advance.

But this is still not the whole picture.

We haven’t yet described FPGAs (coming soon), mobile/edge AI processors (a very interesting niche in its own right), and completely new architectures (neuromorphic processors, memristors, photonic devices). Quantum computers are somewhere here as well. So, stay tuned.

Release Notes

2021/01/12: Published

2022/12/26: Added TPUv4 details
