Machine learning (ML) inference is being built into almost every new system-on-chip (SoC) design for automotive applications, wireless systems, smartphones, tablets, security cameras, and more. Software for ML applications keeps evolving and becoming more capable, and it demands more ML processing power every year. Silicon design teams are scrambling to find ML processing power to add to their designs, which already include CPUs, DSPs, and GPUs.
ML applications have their own requirements, with workloads that differ sharply from other tasks. ML inference workloads are dominated by matrix computations (convolutions) on N-dimensional tensor data. These convolutions are a poor fit for legacy compute architectures: CPUs excel at running multiple threads of control code, DSPs are built for vector math on 1-D and 2-D arrays of data, and GPUs were designed to draw polygons in graphics applications.
So how have designers been solving this ML matrix compute problem? Typically, they’ve been force-fitting the new workload into old platforms by adding offload engines (accelerators) that efficiently execute the most frequently occurring, compute-intensive operators. These offload accelerators are called NPUs, or neural processing units. By offloading the 10 or 20 most common ML graph operators, the NPU lets the CPU or DSP orchestrate the rest of the graph execution.
This division of labor between the NPU accelerator and the CPU or DSP is often called “Operator Fallback”. The majority of the computational workload, perhaps 95%, runs on the non-programmable ML accelerator, but the program “falls back” to the fully programmable CPU or DSP when needed.
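To make the mechanism concrete, here is a minimal sketch of how a compiler or runtime might partition an ML graph between the NPU and the fallback processor. The structures and the `is_supported_by_npu()` check are hypothetical illustrations, not any specific vendor’s toolchain:

```cpp
// Minimal sketch of how an ML compiler or runtime might partition a graph
// between an NPU and a fallback CPU/DSP. Names and structures are
// hypothetical, not any specific vendor's toolchain.
#include <string>
#include <unordered_set>
#include <vector>

struct Op {
    std::string type;   // e.g., "Conv2D", "Softmax", "NewAttentionVariant"
};

enum class Target { NPU, CPU_FALLBACK };

// The fixed-function NPU only recognizes the operator types hardwired into it.
bool is_supported_by_npu(const Op& op,
                         const std::unordered_set<std::string>& npu_ops) {
    return npu_ops.count(op.type) > 0;
}

// Every operator the accelerator doesn't recognize "falls back" to the CPU/DSP.
std::vector<Target> partition(const std::vector<Op>& graph,
                              const std::unordered_set<std::string>& npu_ops) {
    std::vector<Target> placement;
    placement.reserve(graph.size());
    for (const Op& op : graph) {
        placement.push_back(is_supported_by_npu(op, npu_ops)
                                ? Target::NPU
                                : Target::CPU_FALLBACK);
    }
    return placement;
}
```

Every operator that lands in the `CPU_FALLBACK` bucket runs at the speed of the host processor, not the accelerator, which is where the trouble starts.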
The fatal flaw with this approach is the assumption that Operator Fallback is rare and not performance critical. A closer look reveals that it is neither. Consider an SoC with an applications-class CPU, a vector DSP engine optimized for vision processing, and a 4 TOPS ML accelerator. The compute resources available in each engine are shown in the table below:
Engine Type | Number of Available MACs | Performance on Matrix-Oriented ML Operators
ML Offload Accelerator – 4 TOPS | 2048 | Excellent
Wide Vector DSP – 512-bit | 64 | Poor
Application-Class CPU with 128-bit vector extensions | 16 | Abysmal
Matrix operations running on the ML offload accelerator are fast because they can take advantage of all 2048 multiply-accumulate units in the accelerator. Similar operators running on the DSP are 32x slower, and on the CPU they are 128x slower. If 98% of the computation speeds through the NPU accelerator but a complex final layer of the graph, such as Softmax, executes 100x or 1000x slower on the CPU, the slow CPU performance dominates the entire inference time.
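A quick Amdahl’s-law style estimate shows why. The sketch below uses illustrative, assumed numbers (98% of the work offloaded, a 128x NPU advantage over the CPU) rather than measured data:

```cpp
// Back-of-the-envelope Amdahl's-law estimate of how Operator Fallback
// dominates inference time. All numbers are illustrative assumptions.
#include <cstdio>

int main() {
    // Fraction of the workload (by baseline CPU compute time) the NPU handles.
    const double npu_fraction = 0.98;
    // Speedup of the NPU over the CPU for its portion of the graph,
    // e.g., 2048 accelerator MACs vs. 16 CPU MACs.
    const double npu_speedup = 128.0;

    // Relative inference time if everything ran on the CPU = 1.0.
    // Time with fallback = accelerated NPU portion + CPU fallback portion.
    const double time_with_fallback =
        npu_fraction / npu_speedup + (1.0 - npu_fraction);

    std::printf("Effective end-to-end speedup: %.1fx\n",
                1.0 / time_with_fallback);
    return 0;
}
```

Even with 98% of the compute offloaded, the end-to-end speedup collapses to roughly 36x, far below the 128x the accelerator alone suggests, because the 2% that falls back to the CPU dominates total latency.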
With the passage of time, Fallback only gets worse. ML models are rapidly evolving, so the reference models of 2022 or 2023 will be replaced by newer, more accurate, and more complex ML models in 2025 or 2026, just as today’s silicon enters volume production. These new ML models will likely have operator variants or network topologies that aren’t hardwired into the accelerator, necessitating even more Fallback onto the programmable CPU or DSP.
Total performance of the overall multicore, heterogeneous accelerator-based architecture will degrade even further, so the chip will severely underperform and may be unsuitable for the task. Fallback becomes the design’s undoing.
What is the Alternative?
Clearly, relying on an inflexible NPU accelerator architecture is the problem. It might be good for a year, but in a few years it will be obsolete, just as the silicon reaches the market.
What’s needed is an accelerator architecture that is just as programmable as the CPU or DSP. It must be programmable in the field, using a language like C++, so engineers can easily add new operators as ML tasks evolve.
As ML workloads evolve, the importance of being able to quickly add new ML operators cannot be overstated.
This transforms the old NPU into a general-purpose NPU, or GPNPU.
That’s exactly what Quadric’s engineers have done: created the first GPNPU, the Chimera™ GPNPU. Available in 1 TOPS, 4 TOPS, and 16 TOPS variants, the Chimera GPNPU delivers the performance expected of an ML-optimized compute engine while also being fully programmable in C++ by software developers. New ML operators can be quickly written and will run just as fast as the “native” operators written by Quadric engineers.
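As an illustration of what field programmability makes possible, here is a hypothetical new operator, a hard-swish activation, written in plain C++. This is only a conceptual sketch, not Quadric’s Chimera SDK API:

```cpp
// Hypothetical example of adding a new ML operator in plain C++.
// This illustrates the concept of field-programmability only; it is not
// Quadric's actual Chimera SDK API.
#include <cmath>
#include <cstddef>
#include <vector>

// A newer activation variant that a fixed-function NPU may not support:
// hard-swish, computed element-wise over a tensor stored as a flat vector.
void hard_swish(const std::vector<float>& input, std::vector<float>& output) {
    output.resize(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        const float x = input[i];
        // hard_swish(x) = x * relu6(x + 3) / 6
        const float relu6 = std::fmin(std::fmax(x + 3.0f, 0.0f), 6.0f);
        output[i] = x * relu6 / 6.0f;
    }
}
```

The point of the GPNPU approach is that an operator like this can run on the accelerator itself, alongside the native operators, rather than falling back to a host CPU.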
There’s no need for Fallback with a Chimera programmable GPNPU. There’s only fast execution, even with new forms of operators or graphs. Instead of Fallback, you get Fast, Future-Proof, and Fantastic!