Hardware for Deep Learning. Part 2: CPU
This is the part about CPUs in the series “Hardware for Deep Learning”.
The table of contents for the series is here.
Table of Contents
- x86 Family
- — Processors, cores, threads, instructions
- — Libraries, etc
- — PCI Express
- — Memory
- — Xeon Phi
- ARM
- POWER9
- RISC-V
- Others
- Conclusions
- Release Notes
x86 Family
Processors, cores, threads, instructions
CPUs are Central Processing Units, the ordinary processors we are used to. They are typically multi-core even on the desktop market (usually from 2 to 10 cores in modern Core i3–i9 Intel CPUs, but up to 18 cores/36 threads in high-end Intel CPUs like the i9–7980XE, i9–9980XE or i9–10980XE, and up to 32 cores/64 threads in AMD Ryzen Threadripper).
On the server market there are Intel Xeon/AMD EPYC processors, usually with more cores (56 cores/112 threads in the Intel Xeon Platinum 9282 or 64 cores/128 threads in the AMD EPYC 7742) and some other useful capabilities (support for more RAM, multi-processor configurations, ECC memory, etc).
CPUs are NOT the current workhorses for DL. GPUs have many more (much simpler and more specialized) cores (up to 5120 in the latest NVIDIA Volta V100 GPUs), and matrix operations (which is what DL is mostly about at the low level) are parallelized much better on GPUs.
There are some attempts to use clusters of CPUs for DL (like BigDL from Intel), to optimize DL libraries for CPUs (Caffe con Troll), and other hacks (see “Improving the speed of neural networks on CPUs”), but they do not look very promising right now and can only be useful if you already have a cluster of machines without GPUs.
Among the interesting results there is a November 2017 paper called “ImageNet Training in Minutes” that finished 100-epoch ImageNet training with AlexNet in 11 minutes on 1024 CPUs. There is an Intel article, “Intel Processors for Deep Learning Training”, exploring the main factors contributing to this record-setting speed: 1) the compute and memory capacity of Intel Xeon Scalable processors; 2) software optimizations in the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) and in the popular deep learning frameworks; 3) recent advancements in distributed training algorithms for supervised deep learning workloads.
In July 2017, Intel launched the Intel Xeon Scalable processor family built on 14nm process technology. The Intel Xeon Scalable processors can support up to 28 physical cores (56 threads) per socket (up to 8 sockets) at 2.50 GHz base frequency and 3.80 GHz max turbo frequency, and six memory channels with up to 1.5 TB of 2,666 MHz DDR4 memory. The Recommended Customer Price for the 28-core Xeon Platinum 8180 processor is about $10,009.
Intel introduced AVX-512 instructions, available on the latest Xeon Phi (see below) and Skylake-X CPUs, including the Core-X series (excluding the Core i5–7640X and Core i7–7740X), as well as the new Xeon Scalable Processor Family and Xeon D-2100 Embedded Series. See the list of CPUs with AVX-512 here.
Intel AVX-512 is a set of new instructions that can accelerate performance for heavy computational workloads including deep learning. Here is a January 2018 whitepaper with details on how the 512-bit wide Fused Multiply Add (FMA) core instructions, part of the AVX-512, accelerate deep learning by enabling lower-precision operations (more on lower precision computations in the next part of the series dedicated to GPUs).
Sep 26, 2019 update: Among the interesting additions to AVX-512 is a new set of instructions called Vector Neural Network Instructions (AVX512 VNNI), designed to accelerate convolutional neural network-based algorithms. These are four instructions for integer multiply-and-add, present in Intel Xeon Scalable CPUs since 2019.
This VNNI feature, along with the Brain floating-point format (bfloat16), is part of what Intel calls DL Boost (Deep Learning Boost), a set of technologies designed for inference acceleration.
DL Boost is not limited to the Xeon family. This technology is also present in the 10th Gen Ice Lake processors (2019+), including Core i7, i5, and even i3.
Here is a recent comparison of an Intel CPU with DL Boost and an NVIDIA Turing Titan RTX GPU on different AI tasks.
More technical details on VNNI instructions.
On June 30, 2021, Intel published AVX512-FP16 Architecture Specification.
AMD supports AVX2 (256-bit vectors), but does not support the wider AVX-512. More details on AMD vector instructions here and here.
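If you want to check which of these vector extensions a particular machine actually exposes, here is a minimal sketch. It assumes Linux, where the kernel exports CPU feature flags in /proc/cpuinfo; the flag names (avx2, avx512f, avx512_vnni) follow the kernel's naming.

```python
# Check which vector extensions the CPU advertises (Linux only:
# reads the "flags" line from /proc/cpuinfo).
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":")[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx2", "avx512f", "avx512_vnni"):
    print(f"{feature}: {'yes' if feature in flags else 'no'}")
```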
Libraries, etc
Intel also provides a Deep Learning Inference Engine, a part of the Deep Learning Deployment Toolkit. The current version of the Inference Engine supports inference for multiple image classification networks, including the AlexNet, GoogLeNet, VGG and ResNet families, fully convolutional networks like FCN8 used for image segmentation, and object detection networks like Faster R-CNN. RNNs do not seem to be supported. The Engine supports models from Caffe, TensorFlow, and MXNet.
The current version of the Inference Engine supports inference on Xeons with AVX2 and AVX-512, Core processors with AVX2, Atom processors with SSE, Intel HD Graphics, and Arria 10 FPGA discrete cards. The Inference Engine can run inference on models in FP16 and FP32 formats (support depends on the configuration).
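To give a feel for how CPU inference looks with this stack, here is a minimal sketch assuming the OpenVINO-era Python API (IECore); "model.xml"/"model.bin" are placeholder IR files produced by the Model Optimizer, and the exact calls differ between toolkit versions.

```python
# Minimal CPU inference sketch with the Inference Engine Python API
# (OpenVINO, circa 2020); the model paths are placeholders.
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

input_name = next(iter(net.input_info))
output_name = next(iter(net.outputs))

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # dummy NCHW input
result = exec_net.infer({input_name: image})
print(result[output_name].shape)
```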
Intel also provides recipes for system-level optimizations (targeting Xeon and Xeon Phi processors) that, without a single line of code change in the framework, can boost deep learning training performance by up to 2x and inference by up to 2.7x on top of the current software optimizations available in TensorFlow and Caffe. A more popular write-up on it is here.
And here is an Intel presentation from March 2017, called “Intel and ML”.
Sep 26, 2019 update: Intel DL Boost technology can be used with modern DL frameworks for converting a pre-trained FP32 model to a quantized INT8 model. Here you can find links related to TensorFlow, PyTorch, MXNet and other frameworks.
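For a concrete feel of such a conversion, here is a minimal sketch of FP32-to-INT8 post-training (dynamic) quantization with PyTorch (available since PyTorch 1.3); the toy model is my own example, and on VNNI-capable CPUs the resulting INT8 kernels can take advantage of DL Boost.

```python
# Minimal sketch: FP32 -> INT8 post-training dynamic quantization in PyTorch.
import torch
import torch.nn as nn

model_fp32 = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model_fp32.eval()

# Quantize the weights of Linear layers to INT8; activations are
# quantized dynamically at runtime.
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
with torch.no_grad():
    print(model_int8(x).shape)  # torch.Size([1, 10])
```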
Jan 19, 2020 update: as of the end of 2019 there is a set of libraries for DL on CPU:
- BigDL: distributed deep learning library for Apache Spark
- DNNL 1.2.0, Deep Neural Network Library. The library includes basic building blocks for neural networks optimized for Intel Architecture Processors and Intel Processor Graphics.
- OpenVINO Toolkit for computer vision based on CNNs.
- PlaidML, an advanced and portable tensor compiler for enabling deep learning on laptops, embedded devices, or other devices. Supports Keras, ONNX, and nGraph.
- Intel Caffe (optimized for Xeon).
- Caffe Con Troll (a research project with the latest commit in 2016) seems to be dead.
- Intel DL Boost can be used in many popular frameworks: TensorFlow, PyTorch, MXNet, PaddlePaddle, Intel Caffe.
- nGraph, an end-to-end deep learning graph compiler for inference and training with extensive framework and hardware support.
Looking wider, graph compilers have become a hot topic, both in the TensorFlow and PyTorch ecosystems. Watch especially the MLIR project from Chris Lattner (the author of LLVM and Swift, now on the TensorFlow team).
Keeping in mind that current DL systems are mostly built using GPUs (see the next section), the CPU is still an indispensable part of them, and the requirements for the CPU have shifted towards a better fit with such systems.
The number of cores is no longer the main parameter, but you still need enough cores, at least as many as the number of GPUs you have. Other important characteristics are the number of PCIe lanes and the memory speed.
PCI Express
You need to transfer data between CPU host memory and GPU memory. In most x86 systems this is done using PCI Express bus (PCIe).
There are different revisions of PCIe, v3 is the most common now.
PCIe v.3 allows for 985 MB/s per 1 lane, so 15.75 GB/s for x16 links.
PCIe v.4 is twice as fast, so 31.51 GB/s for x16 (supported in AMD's X570 chipset and Radeon cards).
PCIe v.5 is twice as fast again (the spec is released, no products expected before 2020), so 63 GB/s for x16.
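As a back-of-the-envelope check of these numbers, here is a tiny sketch. It assumes the 128b/130b encoding used by PCIe 3.0 and later; real-world throughput is lower due to protocol overhead.

```python
# Rough per-link PCIe throughput, assuming 128b/130b encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}  # raw transfer rate per lane

def pcie_bandwidth_gb_s(version, lanes=16):
    per_lane = GT_PER_S[version] * 128 / 130 / 8  # GB/s after encoding
    return per_lane * lanes

for v in (3, 4, 5):
    print(f"PCIe v{v} x16: {pcie_bandwidth_gb_s(v):.2f} GB/s")
# PCIe v3 x16: 15.75 GB/s, v4 x16: 31.51 GB/s, v5 x16: 63.02 GB/s
```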
As for the number of PCI Express lanes, a typical GPU card works in x16 mode, but it may also work in x8 or x4 mode at a lower speed.
It means that to support several GPU cards at full speed your processor has to support a proportional number of PCIe lanes. For example, if you are using 2 NVIDIA GTX 1080 Ti cards, both in x16 mode, then your system has to support at least 32 PCIe lanes, preferably more (because the chipset and peripherals, like M.2 SSDs, use PCIe as well).
The caveat here is that many Intel processors support fewer PCIe lanes than that (28 for the Core i7–7820X, or even 16 in the case of the Core i7–7700K or i7–8700K). In such a case some of your GPUs will work in x8 mode, decreasing the speed of data exchange between system and GPU memory. So, be careful.
A good option here could be an older Core i7–6850K processor with 40 PCIe lanes, or a newer AMD Ryzen Threadripper with as many as 64 PCIe lanes. Some people prefer Xeons or look at EPYCs (128 lanes), and those will probably be the main option if you build a system with, say, 4+ GPUs (a typical ATX motherboard does not help here either, but special solutions exist).
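A quick, hedged way to estimate the lane budget for a build is sketched below; the helper and its defaults (16 lanes per GPU, 4 per NVMe drive) are my own assumptions, so check the actual specs of your CPU and chipset.

```python
# Back-of-the-envelope CPU PCIe lane budget for a multi-GPU box
# (hypothetical helper; verify against real CPU/chipset specs).
def lanes_needed(num_gpus, lanes_per_gpu=16, nvme_drives=1, lanes_per_nvme=4):
    return num_gpus * lanes_per_gpu + nvme_drives * lanes_per_nvme

print(lanes_needed(2))                  # 36 -> a 40-lane CPU (e.g. i7-6850K) fits
print(lanes_needed(4, nvme_drives=2))   # 72 -> look at Threadripper/EPYC/Xeon
```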
As a short summary:
- A typical Intel mainstream desktop processor has 16 PCIe lanes (e.g. the i7–7700K, i7–8700K or even i9–9900K).
- A high-end desktop (HEDT) processor has 28 to 44 lanes (e.g. the i7–7820X has 28, the rather old i7–6850K has 40, the i9–9980XE has 44, and the upcoming i9–10940X and higher will have 48).
- Xeons have up to 64 lanes (PCIe v.3).
- AMD Ryzen Threadripper has 64 PCIe lanes, EPYC has 128 lanes (PCIe v.4).
- Be careful: Intel sometimes quotes “platform PCIe lanes”, which is CPU+PCH, but the PCH lanes share a single uplink to the CPU!
- Always check specs at https://ark.intel.com/
More about PCIe lanes.
One number I found regarding x8/x16 modes is that in the case of x16→x8 switching, “performance decrease is much lower than 10% for most tasks/networks.” [source]
If you find any other comprehensive benchmark regarding x8/x16 modes, let me know.
Memory
Another important dimension is the memory configuration. It includes both the memory speed (e.g. DDR4–3200 could be twice as fast as DDR4–1600) and the multi-channel mode support (for example, the well-known i7–7700K supports only 2 memory channels maximum, while the i7–6850K or AMD Ryzen Threadripper support 4 channels, so the latter could be twice as fast with the same memory; see the list of processors supporting quad-channel mode). Six- and more-channel modes are coming.
It would be interesting to benchmark the memory configuration as well, so let me know if you make or find a comprehensive benchmark on real-life ML/DL tasks.
As a reference, here are DDR4 single-channel data transfer rates (multiply by 2 or 4 for dual- or quad-channel modes), with PCIe v.3/v.4 for comparison (a small bandwidth sketch follows below):
- PCIe v.3 x4: 3.94 GB/s
- PCIe v.3 x8: 7.88 GB/s
- DDR4 1600: 12.8 GB/s
- DDR4 1866: 14.93 GB/s
- PCIe v.3 x16: 15.75 GB/s
- PCIe v.4 x8: 15.75 GB/s
- DDR4 2133: 17 GB/s
- DDR4 2400: 19.2 GB/s
- DDR4 2666: 21.3 GB/s
- DDR4 3200: 25.6 GB/s
- PCIe v.4 x16: 31.51 GB/s
Here is a more comprehensive list for many different transports and buses.
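To sanity-check the numbers above and see how the channel count multiplies them, here is a tiny sketch (theoretical peak only; sustained bandwidth is lower).

```python
# Theoretical DDR4 bandwidth: transfer rate (MT/s) x 8 bytes per transfer
# per channel.
def ddr4_bandwidth_gb_s(mt_per_s, channels=2):
    return mt_per_s * 8 * channels / 1000  # GB/s

print(ddr4_bandwidth_gb_s(3200, channels=1))  # 25.6  (matches the list above)
print(ddr4_bandwidth_gb_s(3200, channels=4))  # 102.4 (quad-channel Threadripper)
```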
There are many manuals on how to assemble your own DL machine, find one of them if you want more details on particular hardware choices.
Some of Intel's recent plans for upcoming CPU platforms have leaked. They are interesting, but do not look like a game-changer for ML/DL. Nevertheless, Intel plays in almost all of the other interesting niches, and we will return to it several times. Starting right now :)
Xeon Phi
UPD: Seems to be dead now. So the text below is for history:
There are still attempts to make heavily multi-core processors like the Intel Xeon Phi with up to 72 cores. One of the most powerful existing Phi processors, the 7290F, is a 72-core (288-thread, 4 threads per core!) processor with a peak performance of 3456 GFLOPS DP=FP64 (so probably 2×3456 GFLOPS SP=FP32), a $3368.00 recommended price, and a 260 W TDP. By peak performance it is roughly comparable to an NVIDIA GTX 1060 (or a 1070 Ti, if the FP32/FP64 calculations are correct), but not by price and power ($350–500 depending on memory size, 3 or 6 GB, 120 W TDP for the 1060; $800/180 W for the 1070 Ti). Nonetheless, to meaningfully compare these different processors we have to benchmark them on real-life ML/DL tasks.
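For reference, the peak number above can be reproduced from first principles; the sketch below assumes the 7290F's 1.5 GHz base clock and two AVX-512 FMA units per core (my assumptions, so check the exact specs).

```python
# Rough peak-FLOPS estimate for a Xeon Phi 7290F, assuming 1.5 GHz base clock,
# 2 AVX-512 FMA units per core, 8 FP64 lanes each, 2 FLOPs per FMA.
def peak_gflops(cores, ghz, fma_units=2, simd_lanes=8, flops_per_fma=2):
    return cores * ghz * fma_units * simd_lanes * flops_per_fma

print(peak_gflops(72, 1.5))                 # 3456.0 GFLOPS FP64
print(peak_gflops(72, 1.5, simd_lanes=16))  # 6912.0 GFLOPS FP32 (16 FP32 lanes)
```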
Knights Hill was cancelled, and Intel is targeting exascale computing. “One step we’re taking is to replace one of the future Intel Xeon Phi processors (code name Knights Hill) with a new platform and new micro-architecture specifically designed for exascale,” said Intel’s Trish Damkroger, a data center group veep.
Intel announced a Knights Mill series specialized for deep learning, and there is a 72-core 7295 processor in it. It uses the LGA3647 socket, and there are no PCI Express coprocessor card versions anymore. The performance and prices are still unknown.
It seems that the Xeon Phi line will be succeeded by a family of chips codenamed Knights Cove. These will have 38 or 44 cores each, 32 GB of integrated HBM2 memory, and will be based on the Ice Lake Scalable Xeons due to arrive in 2019 or 2020. The 44-core part may well be two 22-core chips combined.
Knights Cove will be succeeded by chips codenamed Ice Age and Knights Run, which are understood to be the processors that will go into the one-exaflops Aurora machine.
But that is still far in the future.
In short, Xeon Phi doesn’t seem to be a useful option for DL.
ARM
ARM servers are coming.
UPD: This particular project has been stopped.
Qualcomm manufactured and is shipping an ARM server processor for cloud applications called the Centriq 2400. It has up to 48 single-threaded cores running at 2.2 GHz, with peaks up to 2.6 GHz. It is 64-bit only: there is no 32-bit mode. Here are some more technical details.
“Previous ARM-compatible server CPUs have failed, notably the Calxeda parts, because, basically, they were 32-bit. Qualcomm’s Centriq is, crucially, 64-bit as well as ARMv8-A compatible, multicore, draws up to just 120W, has suitably fat caches, and server-friendly IO and memory interfaces, and is aimed at data-center workloads” said TheRegister.
There is a Qualcomm Centriq 2400 Motherboard server specification submitted to Open Compute Project.
Qualcomm is already talking about what’s next. Qualcomm Firetail will be the next-gen Centriq, and it will be built around the Saphira core.
Another startup called Ampere Computing just emerged from stealth mode in February 2018, announcing it will be delivering a 64-bit ARM processor for the hyperscale market.
There are many other efforts as well. We’ll probably see many more ARM servers in the public clouds.
Again, it is not about DL for now, but who knows. Qualcomm is adding ML/DL capabilities to its own mobile chips (more on this later in a separate part of the series); maybe one day they will add similar capabilities to their server solutions.
(Nov 22, 2019 update) Qualcomm got rid of its server chip division that was in charge of building Centriq 2400.
Qualcomm’s Centriq IP will continue to live under the Thang Long 4800 brand name used by HXT, a joint venture set up between the Chinese Guizhou Province and Qualcomm, with 55 percent of the joint venture owned by the Guizhou Province.
The Thang Long 4800 seems to be almost identical to the Centriq 2400, with one exception. The chip uses a different encryption module that aims to comply with China’s encryption regulations. (source)
(Jan 19, 2020 update) Yet, there is a lot of other movement in the ARM universe.
- Single-board computers: Raspberry Pi, part of Jetson Nano, and Google Coral Dev Board.
- Mobile: Qualcomm, Apple A11, etc
- Server: Marvell ThunderX, Ampere eMAG, Amazon A1 instance, etc; NVIDIA announced GPU-accelerated Arm-based servers.
- Laptops: Microsoft Surface Pro X
- ARM also has the Ethos ML/AI NPUs and Mali GPUs
Some links for example:
- ARM announces Neoverse N1 platform (scales up to 128 cores)
- Ampere eMAG ARM server microprocessors (up to 32 cores, up to 3.3 GHz)
- Marvell ThunderX ARM processors (up to 48 cores, up to 2.5 GHz). They support NVIDIA GPUs.
- Amazon Graviton ARM processor (16 cores, 2.3 GHz)
- Huawei Kunpeng 920 ARM Server CPU (64 cores, 2.6 GHz)
POWER9
There is the POWER9 architecture from IBM. It seems POWER9 is targeted to be a better fit with GPUs/FPGAs/whatever, and that sounds reasonable.
“With POWER9, we’re moving to a new off-chip era, with advanced accelerators like GPUs and FPGAs driving modern workloads, including AI. We see that the era of the on-chip microprocessor — with processing integrated on one chip — is dying as well as Moore’s Law lapsing. POWER9 gives us an opportunity to try new architectural designs to push computing beyond today’s limits by maximizing data bandwidth across the system stack.
In a way, POWER9 is a modern tribute to classic computer designs, where a bunch of chips interacted to handle CPU, graphics and floating-point operations. The bedrock of POWER9 is an internal “information superhighway” that decouples processing and empowers advanced accelerators to digest and analyze massive data sets.” [source]
POWER9 has 24 cores and offers the industry's only CPU-to-GPU NVIDIA NVLink connection. That is interesting because POWER9 systems with, say, a Tesla V100 could be more productive than x86 systems with the V100.
POWER9 is mostly used in supercomputers right now, but it may appear in the cloud.
There are some POWER8 servers in the IBM cloud (the previous generation of the POWER architecture). There is some movement targeting IBM's current customers. Google and Rackspace are working together on POWER9 server blueprints for the Open Compute Project.
The current fastest supercomputer in the world, Summit, is based on POWER9, while also using Nvidia’s Volta GPUs as accelerators.
POWER10 is expected in 2020–2021:
- 48 cores
- PCIe v.5
- NVLink 3.0
- OpenCAPI 4.0
- …
Keep an eye on it.
RISC-V
There are some other interesting movements, one of them is RISC-V.
The project began in 2010 at the University of California, Berkeley, with David Patterson among the participants. Patterson, a Professor of Computer Science at Berkeley since 1976, is a pioneer of RISC processor design who coined the term RISC and authored famous books on computer architecture; he is now a distinguished engineer at Google, where he participated in the development of the TPU (we will discuss the TPU in the part about ASICs).
First of all, RISC-V is a completely open-source instruction set. Remembering the recent Meltdown and Spectre attacks, that could be important.
Next, it can be tailored specifically to a particular application workload. For example, Western Digital announced plans to use the RISC-V ISA across its existing product stack as well as for future products that will combine processing and storage.
WDC says: “RISC-V will allow the entire industry to realize the benefits of next-generation architectures while also enabling us to create more purpose-built devices, platforms and storage systems for Big Data and Fast Data applications. We are moving beyond just storing data to now creating entire environments that will enable users to realize the value and possibilities of their data.”
“The first products from Western Digital with RISC-V cores will ship in late 2019, or early 2020, says Western Digital without going into details.” writes AnandTech.
“Western Digital also disclosed that it had made a strategic investment in Esperanto Technologies, a developer of RISC-V-based SoCs. In the meantime, Esperanto’s ongoing projects demonstrate the potential of the RISC-V ISA in general. So far, Esperanto has developed the ET-Maxion core with maximized single-thread performance as well as the ET-Minion energy-efficient core with a vector FPU. These cores will be used for an upcoming 7 nm SoC for AI and machine learning workloads. In addition, these are the cores that Esperanto will license to other companies.”
“The company’s first RISC-V AI system on chips (SoCs) will leverage 16 ET-Maxion 64-bit RISC-V cores, 4,096 energy-efficient ET-Minion RISC-V cores (each with a vector floating point unit), and be designed on 7 nm CMOS process technology.” [source] [source]
No doubt the chip is targeting the machine learning field: “It’s worth noting that for this processor, Esperanto armed the Et-Minion, which already incorporates the new vector extension, with two additional domain-specific extensions for Machine Learning. One such extension is a tensor extension to augment the vector instructions. Ditzel claims those extensions greatly improve the energy efficiency in machine learning workloads. All the cores share the same address space and likely make use of TileLink. TileLink, which was also presented at this year’s RISC-V workshop, is a free and open source scalable cache-coherent fabric for RISC-V.” [source]
The proposed chip blurs the lines between CPUs, GPUs, and DSPs, as does another (non-RISC-V) chip, the PEZY SC2 (2,048 cores, 180 W, 8.192 TFLOPS FP32, x2 for FP16).
There is another interesting case: Microsemi has announced that its FPGA (one of the next parts will be dedicated to FPGA) devices can be configured with a processor core in the Open RISC-V Architecture.
Another good article about RISC-V: “11 Myths About the RISC-V ISA”.
Some more RISC-V related news:
- SiFive U5, U7 and U8 cores
- Alibaba’s RISC-V processor design, the Xuantie 910 (XT 910): 12 nm, 64-bit, 16 cores clocked at up to 2.5 GHz, the fastest RISC-V processor to date
- Western Digital SweRV Core, designed for embedded devices supporting data-intensive edge applications
- Esperanto Technologies is building an AI chip with 1k+ cores
It’s too early to say something practical regarding DL, but keep an eye on this technology too.
Others
There will be a special part dedicated to mobile processors (you will see Snapdragons, the A11 Bionic, Samsung, Xiaomi and others there). Technically many of them are CPUs as well, but mobile AI is worth a separate talk.
There is an interesting KiloCore project by a team at the University of California, Davis, which appeared in June 2016. It is a microchip with 1,000 independent programmable processors and a maximum computation rate of 1.78 trillion instructions per second (if those were floating-point instructions, that would equal 1.78 TFLOPS). The KiloCore chip was fabricated by IBM using their 32 nm CMOS technology.
Each processor core can run its own small program independently of the others, which is a fundamentally more flexible approach than the Single-Instruction-Multiple-Data approach utilized by processors such as GPUs; the idea is to break an application up into many small pieces, each of which can run in parallel on a different processor. Sounds like a natural fit for the Erlang/Elixir ecosystem.
Another cool thing about the KiloCore is its energy efficiency. “The chip is the most energy-efficient “many-core” processor ever reported, Baas said. For example, the 1,000 processors can execute 115 billion instructions per second while dissipating only 0.7 Watts, low enough to be powered by a single AA battery.”
Good start, waiting for the next steps.
Startup Cerebras Systems has unveiled the world’s largest microprocessor, a waferscale chip (8.46x8.46 inches) custom-built for machine learning. The 1.2 trillion transistor silicon wafer incorporates 400,000 cores, 18 GB of SRAM memory, and a network fabric with over 100 Pb/sec of aggregate bandwidth. And yes, those numbers are correct. [source]
Maybe it’s more relevant to the ASIC section actually…
Let me know if I forgot something important.
Conclusions
For the current state-of-the-art neural networks, it is mostly about matrix multiplications, optimized convolutions, and so on. So the more simple computing elements (able to do different kinds of multiplications and additions) you have, the better.
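To make this concrete, here is a minimal illustration (plain NumPy, not tied to any particular framework) of how a fully connected layer reduces to one matrix multiplication plus cheap element-wise operations, which is why matmul throughput dominates.

```python
# A fully connected layer is essentially one matmul plus a bias and a ReLU.
import numpy as np

batch, d_in, d_out = 32, 784, 256
x = np.random.randn(batch, d_in).astype(np.float32)   # input activations
W = np.random.randn(d_in, d_out).astype(np.float32)   # layer weights
b = np.zeros(d_out, dtype=np.float32)                 # bias

y = np.maximum(x @ W + b, 0.0)  # matmul + bias + ReLU
print(y.shape)  # (32, 256)
```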
But the situation can change in the future. We all know that a biological neuron is much more complex than an artificial one. A real neuron is a kind of computer itself (some even think it is a kind of quantum computer). In this sense, a network of processing elements more powerful than just addition and multiplication could give rise to different architectures better suited to different hardware. Maybe a massively multi-core distributed environment (which is more about CPUs) will play its role.
Even now, RNNs/LSTMs are not very efficient on current hardware because of their inherently sequential nature. Hence the trend to re-implement successful architectures using convolutions and other simpler methods.
Other more complex neural networks (MANNs — memory-augmented neural networks and their most known implementations like Neural Turing Machines, NTM and Differentiable Neural Computer, DNC; Capsule Networks; and so on) can be even less efficient.
I suppose these more complex types of NNs are in the same position CNNs were in 10–20 years ago: the algorithms were almost ready, but the hardware lagged behind.
The first major achievement to fix this situation happened in 2009, when Rajat Raina, Anand Madhavan and Andrew Y. Ng suggested in their paper “Large-scale Deep Unsupervised Learning using Graphics Processors” to use GPUs for DL (then a single GTX 280 graphics card with 240 cores, 1 GB of memory and 933 GFLOPS FP32, the best card of 2008):
“We argue that modern graphics processors far surpass the computational capabilities of multicore CPUs, and have the potential to revolutionize the applicability of deep unsupervised learning methods. We develop general principles for massively parallelizing unsupervised learning tasks using graphics processors. We show that these principles can be applied to successfully scaling up learning algorithms for both DBNs and sparse coding. Our implementation of DBN learning is up to 70 times faster than a dual-core CPU implementation for large models.”
Then, in 2012 GPUs came to the ImageNet competition, and Alex Krizhevsky won with his AlexNet trained on two GTX 580 in parallel (see “ImageNet Classification with Deep Convolutional Neural Networks” by Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton).
This started the GPU advent.
The next part of the series will tell you about GPUs.
Release Notes
2018/02/27: fix in Xeon Phi performance, original comparison with NVIDIA was incorrect because double precision FLOPS were compared to single precision FLOPS.
2018/02/27: added info about KiloCore 1024-core chip.
2018/02/28: added info about performance decrease when using GPUs in x8 mode instead of x16, more info on PCIe lanes added.
2018/02/29: added info about
- PEZY-SC2 2048-core chip
- Esperanto Technologies RISC-V 4096-core chip
It seems the next addition should be about an 8192-core chip :)
2018/03/01: added info about Ampere Computing ARM processors
2018/03/04: added info about “ImageNet Training in Minutes” paper, Intel Xeon Scalable processors, Intel AVX-512 instructions and Intel Deep Learning Inference Engine
2019/09/26: added info about Intel DL Boost and VNNI
2019/11/22: added info that Qualcomm stopped working on its Centriq 2400 processor.
2020/01/19: many updates: new Intel processors, killed Xeon Phi, libraries for CPU, restructured x86 section, added many links to ARM section, added many links to RISC-V section, some updates to POWER section, added Cerebras