Change is constant in the AI world. Several new and complex transformer models have emerged in the past 18 to 24 months as the new “must have” networks in advanced automotive use cases.  The problem is that these novel architectures often introduce new network operators, or novel ways of combining tensors – often from different types of sensors – to enhance detection and recognition of objects in L3 / L4 / L5 ADAS and autonomous driving software stacks. 

Here are just a few examples: Deformable Attention Transformer (DAT) and DAT++; SegFormer; Detection Transformer (DETR); and Bird’s Eye View Depth Estimation – BEVdepth.  The new functions and operations in these advanced networks can pose extreme porting challenges for SoC programmers trying to run the models on existing NPU accelerators!

The Challenge of Fixed-Function Accelerators

If the car platform or SoC team is lucky, a new graph operator in a novel AI network can be supported by the existing hardwired, less-than-fully-programmable NPU. Often this is NOT the case, and the operation needs to fall back to run on the slow companion CPU or DSP – which usually means the performance is so poor as to be unviable.   

But what if the breakthrough function is so new, so novel, that it doesn’t exist in the common graph manipulation tools?   What if the new function is not a graph operator at all?

A Look at a Custom CUDA Function - Voxel Pooling

One such totally new function is the Voxel Pooling operation found in the BEVdepth network (available on Megvii’s public Github repo).   A trusty AI search engine describes Voxel Pooling as: “a technique where 3D point cloud data from multiple camera views is aggregated into a bird's-eye view (BEV) feature map by dividing the space into voxels (3D grid cells) and combining the features of points falling within each voxel, essentially creating a unified representation of the scene in a 2D grid format, which is then used for further processing like object detection.”  Sounds complicated. In fact, in the specific case of BEVdepth, the complex voxel pooling function is written as a custom CUDA function, because it has no equivalent built-in graph operator in PyTorch today – and certainly not in the interchange formats commonly used by NPU vendor toolchains, ONNX and TFLite.

So how does the algorithm porting team port Voxel Pooling onto an embedded NPU if the function is not represented in the graph interchange format supported by the NPU vendor’s graph compiler toolchain? With most NPUs, the team is stuck – there’s simply no way to get that important new function ported quickly to the NPU, and very likely it can never be ported.  BEVdepth might simply not be able to run on the platform at all.

Only One Embedded AI Processor that Can Run This

There is only one embedded AI processor on the market that supports novel, custom operator functions written in C++.  Quadric’s Chimera GPNPU – general purpose NPU – is programmed in C++.   The Chimera Graph Compiler (CGC) ingests networks in the ONNX format – from PyTorch, TensorFlow, or any other training framework.  CGC then converts the entire network into optimized C++ for subsequent compilation by the LLVM compiler. 

But for functions like Voxel Pooling that are written in pure CUDA, no direct compilation is possible – not with our CGC graph compiler, nor with any other NPU graph compiler.  The solution for Voxel Pooling is to capture the function in a custom C++ representation using our CCL dialect of C++ – exactly the way BEVdepth’s Voxel Pooling was originally written as custom CUDA for the Nvidia platform. 

A skilled programmer who can already write CUDA code should have no trouble writing Voxel Pooling or other similar functions.   The entire Voxel Pooling kernel is some sixty lines of CCL code (including comments!).
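The actual CCL source is available in the DevStudio tutorial mentioned below. For readers who just want the gist of the algorithm, here is a simplified plain-C++ sketch of sum-pooling points into a BEV grid – an illustration only, with made-up data structures, not the actual CCL kernel:

```cpp
#include <vector>
#include <cstddef>

// Simplified sketch of BEV voxel pooling: scatter-add per-point features into
// a 2D bird's-eye-view grid. Illustrative only; the real BEVdepth kernel also
// handles depth bins, batching, and multiple camera views.
struct Point {
    float x, y;               // position in the ego/BEV frame (meters)
    const float* feature;     // pointer to this point's C-channel feature vector
};

void voxel_pool(const std::vector<Point>& points,
                int channels,
                float x_min, float y_min, float voxel_size,
                int grid_w, int grid_h,
                std::vector<float>& bev /* grid_h * grid_w * channels, zero-initialized */)
{
    for (const Point& p : points) {
        // Map the point to a 2D voxel (grid cell) index.
        int vx = static_cast<int>((p.x - x_min) / voxel_size);
        int vy = static_cast<int>((p.y - y_min) / voxel_size);
        if (vx < 0 || vx >= grid_w || vy < 0 || vy >= grid_h)
            continue;                           // point falls outside the BEV grid

        // Accumulate (sum-pool) the point's feature into that cell.
        float* cell = &bev[(static_cast<size_t>(vy) * grid_w + vx) * channels];
        for (int c = 0; c < channels; ++c)
            cell[c] += p.feature[c];
    }
}
```

The scatter-add structure – data-dependent indexing plus accumulation – is exactly the kind of operation that has no single off-the-shelf ONNX or TFLite operator.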

Quickly Stitch It Back Together

That custom C++ kernel is quickly stitched together with the auto-generated C++ output of the CGC graph compiler, and the full network is then ported completely onto the Chimera GPNPU – none of the graph needs to run on any other IP block: no CPU load, no companion DSP needed.

The Chimera GPNPU is Faster Than an Nvidia GPU

You are probably already thinking: “OK, fine, you can run Voxel Pooling. But at what speed? If it’s not fast, it’s not useful!”   Fear not!  The implementation of Voxel Pooling on the Chimera GPNPU is actually faster than the original CUDA code running on a 450 Watt Nvidia RTX 3090 board (with the GA102 GPU chip).  The Chimera GPNPU core – burning just a couple of Watts of power dissipation (your mileage will vary depending on the process node, clock frequency, and choice of off-chip memory interface) – outperforms the 450 W full-chip GPU by a factor of 2X.

You might also wonder, “How can that be true?”  Here’s why.  GPUs – such as the RTX 3090 – are designed for ease of use in model training and for drawing polygons, and are mostly aimed at datacenters where electricity use is not priority #1.  GPUs use cache-based architectures (caches burn power!) and hardware-heavy warp-and-thread programming models.  The Chimera GPNPU processor is, by contrast, designed for embedded use. Quantized models, ahead-of-time offline compilation, and a DMA-based memory hierarchy all save gobs and gobs of power.

Want to know more about the implementation of Voxel Pooling on Chimera cores?  We have a detailed Tutorial – including source, performance profiling and more data – for registered users of the Quadric DevStudio.

Chimera is a Universal, High-Performance GPNPU Solution

The Chimera GPNPU runs all AI/ML graph structures.  And it runs non-graph code – so you can run things currently captured in CUDA, C++ (for DSP functions) and even some Python!  The revolutionary Chimera GPNPU processor integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines in a fine-grained architecture.  Up to 1024 ALUs in a single core, with only one instruction fetch and one AXI data port.  That’s over 32,000 bits of parallel, fully programmable performance. Scalable up to 864 TOPS for bleeding-edge ADAS applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run, they all run fast, low-power and highly parallel. 

What’s a VLM?

Vision Language Models (VLMs) are a rapidly emerging class of multimodal AI models that is becoming much more important in the automotive world, particularly in the ADAS sector. VLMs are built from a combination of a large language model (LLM) and a vision encoder, giving the LLM the ability to “see.” This new technology is really shaking up the industry.

The introduction of DeepSeek in January 2025 captured headlines in every newspaper and online outlet, inviting comparisons to the Sputnik moment of 1957. But rapid change is also happening in many areas that are hidden from the general public – and VLMs are a great example of this rapid change.

Why are VLMs so Important?

The big thing with VLMs is their ability to describe in text form – hence understandable by the human driver – what a camera or radar sensor “sees” in a car’s driver assistance system.  A properly calibrated system deploying VLMs and multiple cameras could therefore be designed to generate verbal warnings to the driver, such as “A pedestrian is about to enter the crosswalk from the left curb 250 meters down the road.”  Or such scene descriptions from multiple sensors, gathered over several seconds of data analysis, could be fed into other AI models that make decisions to automatically control an autonomous vehicle.

How New are VLMs?

VLMs are very, very new.   A Github repo that surveys and tracks automotive VLM evolution (https://github.com/ge25nab/Awesome-VLM-AD-ITS) lists over 50 technical papers from arxiv.org describing VLMs in the autonomous driving space, with 95% of the listings coming from 2023 and 2024.   In our business at Quadric – where we have significant customer traction in the AD / ADAS segment – vision language models were rarely mentioned 18 months ago by customers doing IP benchmark comparisons.  By 2024 LLMs in cars became a “thing” and designers of automotive silicon began asking for LLM performance benchmarks.  Now, barely 12 months later, the VLM is starting to emerge as a possible benchmark litmus test for AI acceleration engines for auto SoCs.

What’s the Challenge of Designing for VLMs on Top of All the Other Models?

Imagine the head-spinning changes faced by the designers of hardware “accelerators” over the past four years.  In 2020-2022, the state-of-the-art benchmarks that everyone tried to implement were CNNs (convolutional neural networks).  By 2023 the industry had pivoted to transformers – such as the SWIN transformer (shifted window transformer) – as the must-have solution.  Then last year it was newer transformers – such as BEVFormer (bird’s eye view transformer) or BEVdepth – plus LLMs such as Llama 2 and Llama 3.  And today, pile on VLMs in addition to needing to run all the CNNs and vision transformers and LLMs.  So many networks, so many machine learning operators in the graphs!  And, in some cases such as the BEV networks, the functions are so new that the frameworks and standards (PyTorch, ONNX) don’t support them and hence the functions are implemented purely in CUDA code.

Run all networks.  Run all operators.  Run C++ code, such as CUDA ops?   No hardwired accelerator can do all that.  And running on a legacy DSP or legacy CPU won’t yield sufficient performance.  Is there an alternative?

Why You Need a Fully Programmable, Universal, High-Performance GPNPU Solution

Yes, there is a solution that has been shown to run all those innovative AI workloads, and run them at high speed! The revolutionary Chimera GPNPU processor integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines.  Up to 1024 ALUs in a single core, with only one instruction fetch and one AXI data port.  That’s over 32,000 bits of parallel, fully programmable performance. Scalable up to 864 TOPS for bleeding-edge ADAS applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run – or whatever style of network gets invented in 2026 – they all run fast, low-power and highly parallel. 

A common approach in the industry to building an on-device machine learning inference accelerator has relied on the simple idea of building an array of high-performance multiply-accumulate circuits – a MAC accelerator. This accelerator was paired with a highly programmable core to run other, less commonly used layers in machine learning graphs. Dozens of lookalike architectures have been built over the past half-decade with this same accelerator plus fallback core concept to solve the ML inference compute problem.

Yet, those attempts haven’t solved the ML inference compute problem. There’s one basic reason. This partitioned, fallback-style architecture simply doesn’t work for modern networks. It worked quite well in 2021 to run classic CNNs from the Resnet era (circa 2015-2017). But now it’s failing to keep up with modern inference workloads.

CNNs and all the new transformers are composed of far more varied ML network operators and far more complex, non-MAC functions than the simple eight (8) operator ResNet-50 backbone of long ago.

For five years, Quadric has been saying that the key to future-proofing your ML solution is to make it programmable in a much more efficient way than just tacking a programmable core onto a bunch of MACs.

There were times when it felt quite lonely being the only company telling the world: “Hey, you’re not looking at the new wave of networks.  MACs are not enough! It’s the fully functional ALUs that matter!” So imagine our joy at finding one of the market titans – Qualcomm – belatedly coming around to our way of thinking!

In November last year at the Automotive Compute Conference in Munich, a Qualcomm keynote speaker featured this chart in his presentation.

Qualcomm’s Analysis Proves it’s Much More Than MACs

The chart above shows Qualcomm’s analysis of over 1200 different AI/ML networks that they have profiled for their automotive chipsets.  The Y axis shows the breakdown of the performance demands for each network, color-coded by how much of the compute is multiply-accumulate (MAC) functions (blue), other vector DSP-type calculations – ALU operations (grey, orange) – and pure scalar/control code (yellow).   Qualcomm’s three-segment breakdown corresponds to how Qualcomm’s Hexagon AI accelerator is designed – with three unique computing blocks.

The first 50+ networks on the left-hand side are indeed MAC-dominated, requiring MAC-accelerated hardware more than 90% of the time.  But less than half of the total networks are more than 50% MAC-dominated.  And some of the newer networks have few or no classical MAC layers at all! 

Can You Afford to Use 64 DSPs?

At first blush, the argument made by Qualcomm seems plausible: there is a MAC engine for matrix-dominated networks, and a classic DSP core for networks that are mostly ALU-operation bound.   Except they leave out two critical pieces of information:

  1. First, the argument in favor of a three-core heterogeneous solution ignores the delay and power involved in shuttling data between multiple cores as the computing needs change from operator to operator in the ML graph. 
  2. Second, and most importantly, there is an imbalance in size/performance between the matrix accelerator and the programmable DSP engines.  The DSP is orders of magnitude smaller than the accelerator engine! 

Today’s largest licensable DSPs have 512-bit-wide vector datapaths. That translates into 16 parallel 32-bit ALU operations occurring simultaneously.  But common everyday devices – AI PCs and mobile phones – routinely feature 40 TOPS matrix accelerators.  That 40 TOPS accelerator can produce 2000+ results of 3x3 convolutions each clock cycle.  2000 versus 16.   That is a greater than two orders of magnitude mismatch.  Models that run on the DSP are thus two orders of magnitude slower than similar models run on the accelerator, and models that need to ping-pong between the different types of compute get bottlenecked on the slow DSP.
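The arithmetic behind those numbers is easy to check. A small sketch (assuming a 1 GHz clock and counting each multiply-accumulate as two operations – both assumptions of this illustration, not vendor data):

```cpp
#include <cstdio>

int main() {
    // Assumptions for this back-of-the-envelope sketch.
    const double tops            = 40e12;  // 40 TOPS INT8 accelerator
    const double clock_hz        = 1e9;    // 1 GHz clock
    const double ops_per_mac     = 2;      // one MAC = 1 multiply + 1 add
    const double macs_per_result = 9;      // one 3x3 convolution output

    double ops_per_cycle     = tops / clock_hz;                                   // 40,000
    double results_per_cycle = ops_per_cycle / (ops_per_mac * macs_per_result);   // ~2,222

    double dsp_alu_lanes = 512.0 / 32.0;   // 512-bit vector datapath, 32-bit lanes = 16

    std::printf("Accelerator: ~%.0f conv results per cycle\n", results_per_cycle);
    std::printf("DSP: %.0f ALU ops per cycle -> ~%.0fx mismatch\n",
                dsp_alu_lanes, results_per_cycle / dsp_alu_lanes);
    return 0;
}
```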

Chimera GPNPU – A Better Way!

If silicon area and cost are not a concern, we suppose you could put a constellation of 64 DSPs on chip to make non-MAC networks run as fast as the MAC dominated networks. 64 discrete DSPs with 64 instruction caches, 64 data caches, 64 AXI interfaces, etc.   Good luck programming that! 

Or, you could use one revolutionary Chimera GPNPU processor that integrates a full-function 32-bit ALU with each cluster of 16 or 32 MACs.  Up to 1024 ALUs in a single core, with only one instruction fetch and one AXI data port.  Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU compute so no matter what type of network you choose to run, they all run fast and highly parallel.

No Gimmicks, Just Data

If you want unfiltered, full disclosure of performance data for a leading machine learning processing solution, head over to Quadric’s online DevStudio tool.  In DevStudio you will find more than 120 AI benchmark models, each with links to the source model, all the intermediate compilation results and reports ready for your inspection, and data from hundreds of cycle-accurate simulation runs on each benchmark model – so you can compare all the possible permutations of on-chip memory size options and off-chip DDR bandwidth assumptions, all presented under a fully transparent Batch=1 set of test conditions.  And you can easily download the full SDK to run your own simulation of a complete signal chain, or a batched simulation to match your system needs.  

If you are comparing alternatives for an NPU selection, give special attention to clearly identifying how the NPU/GPNPU will be used in your target system, and make sure the NPU/GPNPU vendor is reporting the benchmarks in a way that matches your system needs. 

When evaluating benchmark results for AI/ML processing solutions, it is very helpful to remember Shakespeare’s Hamlet, and the famous line: “To be, or not to be.”    Except in this case the “B” stands for Batched.  

To batch or not to batch

Batch-1 Mode Reflects Real-World Cases

There are two different ways in which a machine learning inference workload can be used in a system.  A particular ML graph can be used one time, preceded by other computation and followed by still other computation in a complete signal chain.  Often these complete signal chains combine multiple different AI workloads in sequence, stitched together by conventional code that prepares or conditions the incoming data stream and by decision-making control code in between the ML graphs.   

An application might first run an object detector on a high-resolution image to find pertinent objects (people, cars, road obstacles) and then subsequently call on a face identification model (to ID particular people) or a license plate reader to identify a specific car.  No single ML model runs continuously because the compute resource is shared across a complex signal chain. An ML workload used in this fashion is referred to as being used in Batch=1, or Batch-1 mode.  Batch-1 mode is quite often reflective of real-world use cases.
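A minimal sketch of such a Batch-1 signal chain is shown below; the types and function names are hypothetical placeholders, not any real API:

```cpp
#include <vector>

// Placeholder types and model wrappers -- hypothetical names for illustration only.
struct Image {};
enum class ObjectType { Person, Car, Other };
struct Detection { ObjectType type; /* bounding box, score, ... */ };

Image  resize_and_normalize(const Image&);                  // conventional pre-processing
Image  crop(const Image&, const Detection&);
std::vector<Detection> run_object_detector(const Image&);   // ML graph #1
void   run_face_id(const Image&);                           // ML graph #2
void   run_plate_reader(const Image&);                      // ML graph #3

// Batch-1 signal chain: each ML graph runs once per frame, interleaved with
// pre-processing and decision-making control code on the same compute resource.
void process_frame(const Image& frame) {
    Image prepped = resize_and_normalize(frame);
    for (const Detection& obj : run_object_detector(prepped)) {
        if (obj.type == ObjectType::Person)
            run_face_id(crop(prepped, obj));
        else if (obj.type == ObjectType::Car)
            run_plate_reader(crop(prepped, obj));
    }
}
```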

The Batched Mode Alternative

The alternative is Batched mode, wherein the same network is run repeatedly (endlessly) without interruption. When used in Batched mode, many neural processing units (NPUs) can take advantage of selected optimizations – such as reuse of model weights or pipelining of intermediate results – to achieve much higher throughput than the same machine running the same model in Batch-1 mode.  Batched mode can be seen in real-world deployments when a device has a known, fixed workload and all the pre- and post-processing can be shifted to a different compute resource, such as a CPU/DSP or a second ML NPU. 
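Sketched in the same hypothetical style, a Batched-mode loop amortizes the one-time cost of loading weights over every inference in the batch:

```cpp
#include <vector>
#include <cstddef>

// Hypothetical placeholder types and runtime calls, for illustration only.
struct Model {};
struct Input {};
struct Output {};
void   load_weights_on_chip(Model&);
Output infer(Model&, const Input&);

// Batched mode: weights are loaded into on-chip memory once and reused across
// every input, so weight movement (and pipeline fill) is amortized over N inferences.
void run_batched(Model& model, const std::vector<Input>& inputs,
                 std::vector<Output>& outputs) {
    load_weights_on_chip(model);                 // paid once for the whole batch
    for (std::size_t i = 0; i < inputs.size(); ++i)
        outputs[i] = infer(model, inputs[i]);    // back-to-back, no other work interleaved
}
```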

Check The Fine Print

Many NPU vendors will only report their Batched results, because there can be substantial inferences-per-second differences between Batch=1 and Batch=N measurements.   An example screenshot of a popular silicon vendor’s website shows how this batching “trick” can be presented.  Only when the user clicks the Info popup does it become clear that the reported inference time is “estimated” and assumes Batched mode over an unspecified number of repetitions.

Inference metrics

If you have to do a lot of digging and calculations to discover the unbatched performance number, perhaps the NPU IP vendor is trying to distract you from the shortcomings of the particular IP?

How to Get Gimmick-Free Data

If you want unfiltered, full disclosure of performance data for a leading machine learning processing solution, head over to Quadric’s online DevStudio tool.  In DevStudio you will find more than 120 AI benchmark models, each with links to the source model, all the intermediate compilation results and reports ready for your inspection, and data from hundreds of cycle-accurate simulation runs on each benchmark model – so you can compare all the possible permutations of on-chip memory size options and off-chip DDR bandwidth assumptions, all presented under a fully transparent Batch=1 set of test conditions.  And you can easily download the full SDK to run your own simulation of a complete signal chain, or a batched simulation to match your system needs.

There are a couple of dozen NPU options on the market today, each with competing and conflicting claims about efficiency, programmability and flexibility.  One of the starkest differences among the choices is the seemingly simple question of the “best” placement of memory relative to compute in the NPU system hierarchy.

Some NPU architecture styles rely heavily on direct or exclusive access to system DRAM, banking on the cost-per-bit advantage of high-volume commodity DRAM over other memory choices, but they are subject to the partitioning problem across multiple die. Other NPU choices rely heavily or exclusively on on-chip SRAM for speed and simplicity, but at a high cost in silicon area and a loss of flexibility.  Still others employ novel memory types (MRAM) or novel analog circuitry structures, both of which lack proven, widely-used manufacturing track records.  In spite of the broad array of NPU choices, they generally align to one of three styles of memory locality.  Three styles that bear (pun intended) a striking resemblance to a children’s tale of three bears!

The children’s fairy tale of Goldilocks and the Three Bears describes the adventures of Goldi as she chooses among three options for bedding, chairs, and bowls of porridge. One meal is “too hot”, another “too cold”, and finally one is “just right”.  If Goldi were faced with making architecture choices for AI processing in modern edge/device SoCs, she would also face three choices regarding placement of the compute capability relative to the local memory used for storage of activations and weights.

In, At or Near?

The terms compute-in-memory (CIM) and compute-near-memory (CNM) originated in discussions of architectures in datacenter system design. There is an extensive body of literature discussing the merits of various architectures, including this recent paper from a team at TU Dresden published earlier this year.  All of the analysis boils down to trying to minimize the power consumed and the latency incurred while shuffling working data sets between the processing elements and storage elements of a datacenter.

In the specialized world of AI inference-optimized systems-on-chip (SoCs) for edge devices the same principles apply, but with 3 levels of nearness to consider: in-memory, at-memory and near-memory compute. Let’s quickly examine each.

In Memory Compute: A Mirage

In-memory compute refers to the many varied attempts over more than a decade to meld computation into the memory bit cells or memory macros used in SoC design.  Almost all of these approaches employ some flavor of analog computation inside the bit cells of the DRAM or SRAM (or more exotic memories, such as MRAM) under consideration.  In theory these approaches speed computation and lower power by performing compute – specifically multiplication – in the analog domain and in a widely parallel fashion. While a seemingly compelling idea, all have failed to date.

The reasons for failure are multiple. Start with the fact that widely used on-chip SRAM has been perfected and optimized for nearly 40 years, as has DRAM for off-chip storage. Messing around with these highly optimized structures leads to area and power inefficiencies compared to the unadulterated starting point, and injecting such a new approach into the tried-and-true standard-cell design methodologies used by SoC companies has proven to be a non-starter. The other main shortcoming of in-memory compute is that these analog approaches only perform a very limited subset of the compute needed in AI inference – namely the matrix multiplication at the heart of the convolution operation. But no in-memory compute scheme can build in enough flexibility to cover every possible convolution variation (size, stride, dilation) and every possible MatMul configuration. Nor can in-memory analog compute implement the other 2300+ operators found in the universe of PyTorch models. Thus in-memory compute solutions also need full-fledged NPU computation capability on top of the baggage of the analog enhancements to the memory – “enhancements” that become an area and power burden whenever that memory is used in the conventional way for all the computation occurring on the companion digital NPU.

In the final analysis, in-memory solutions for edge device SoCs are “too limited” to be useful for the intrepid chip designer Goldi.

Near-Memory Compute: Near Is Still Very Far Away

At the other end of the spectrum of design approaches for SoC inference is the concept of minimizing on-chip SRAM use and maximizing the utilization of mass-produced, low-cost bulk memory, principally DDR chips.  This approach focuses on the cost advantages of the massive scale of DRAM production. It assumes that, with minimal on-SoC SRAM and sufficient bandwidth to low-cost DRAM, the AI inference subsystem can lower SoC cost yet maintain high performance by leaning on a fast connection to external memory – often a dedicated DDR interface solely for the AI engine to manage.

While at first glance successful at reducing the SoC area devoted to AI and thus slightly reducing system cost, near-memory approaches have two main flaws that cripple system performance.  First, the power dissipation of such systems is out of whack.  Consider the table below, showing the relative energy cost of moving a 32-bit word of data to or from the multiply-accumulate logic at the heart of every AI NPU:

Each transfer of data between the SoC and DDR consumes between 225 and 600 times the energy of a transfer that stays locally adjacent to the MAC unit.   Even an on-chip SRAM a considerable “distance” away from a MAC unit is still 3 to 8 times more energy efficient than going off-chip.  With most such SoCs power-dissipation constrained in consumer-level devices, the power cost of relying principally on external memory renders the near-memory design point impractical.  Furthermore, the latency of always relying on external memory means that as newer, more complicated models evolve – models with more irregular data access patterns than the old-school ResNet stalwarts – near-memory solutions will suffer significant performance degradation. 

The double whammy of too much power and low performance means that the near-memory approach is “too hot” for our chip architect Goldi.

At-Memory Compute: Just Right

Just as the children’s Goldilocks fable always presents a “just right” alternative, the at-memory compute architecture is the solution that is just right for edge and device SoCs.   Referring again to the table of data transfer energy costs above, the obvious best choice for memory location is the immediately adjacent on-chip SRAM.  A computed intermediate activation value saved into a local SRAM burns up to 200X less power than pushing that value off-chip.  But that doesn’t mean you want to have only on-chip SRAM. Doing so creates hard upper limits on the model sizes (weight sizes) that can fit in each implementation.

The best alternative for SoC designers is a solution that both takes advantage of small local SRAMs – preferably distributed in large numbers among an array of compute elements – and intelligently schedules data movement between those SRAMs and off-chip DDR storage in a way that minimizes system power consumption and data access latency.
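One common way such compiler-scheduled movement is realized is double-buffered (ping-pong) DMA between DDR and a small local SRAM. The sketch below uses hypothetical DMA primitives purely for illustration:

```cpp
// Hypothetical primitives for illustration only -- not an actual vendor API.
struct Tile {};
void dma_start(Tile* dst_sram, const Tile* src_ddr);  // kick off an async copy
void dma_wait(Tile* dst_sram);                        // block until the copy completes
void compute_tile(const Tile* sram_tile);             // all compute reads local SRAM

// Double-buffered (ping-pong) tiling: while the compute engine works on the tile
// in one SRAM buffer, the DMA engine prefetches the next tile from DDR into the
// other buffer, hiding most of the off-chip latency and energy behind compute.
void process_tiles(const Tile* ddr_tiles, int num_tiles, Tile sram_buf[2]) {
    int cur = 0;
    dma_start(&sram_buf[cur], &ddr_tiles[0]);              // prefetch first tile
    for (int i = 0; i < num_tiles; ++i) {
        dma_wait(&sram_buf[cur]);                          // tile i now resident in SRAM
        int nxt = cur ^ 1;
        if (i + 1 < num_tiles)
            dma_start(&sram_buf[nxt], &ddr_tiles[i + 1]);  // overlap the next fetch
        compute_tile(&sram_buf[cur]);
        cur = nxt;
    }
}
```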

The Just Right Solution

Luckily for Goldilocks and all other SoC architects in the world, there is such a solution: Quadric’s fully programmable Chimera GPNPU – general purpose NPU – a C++ programmable processor combining the power-performance characteristics of a systolic-array “accelerator” with the flexibility of a C++ programmable DSP.   The Chimera architecture employs a blend of small, distributed local SRAMs along with intelligent, compiler-scheduled use of layered SRAM and DRAM to deliver a configurable hardware-software solution that is “just right” for your next SoC.  Learn more at: www.quadric.io.

The biggest mistake a chip design team can make in evaluating AI acceleration options for a new SoC is to rely entirely upon spreadsheets of performance numbers from the NPU vendor without going through the exercise of porting one or more new machine learning networks themselves using the vendor toolsets.

Why is this a huge red flag?  Most NPU vendors tell prospective customers that (1) the vendor has already optimized most of the common reference benchmarks, and (2) the vendor stands ready and willing to port and optimize new networks in the future.   It is an alluring idea – but it’s a trap that won’t spring until years later.  Unless you know today that the average user can port their own network, you might find yourself trapped in the years to come!

Rely on the NPU Vendor at Your Customers’ Customers’ Expense!

To a chip integration team that doesn’t have a data science cohort on staff, the daunting thought of porting and tuning a complex AI graph for a novel NPU accelerator is off-putting. The idea of doing it for two or three leading vendors during an evaluation is simply a non-starter!   Implicit in that idea is the assumption that the toolsets from NPU vendors are arcane, and that the multicore architectures they are selling are difficult to program. That happens to be true for most “accelerators,” where the full algorithm must be ripped apart and mapped onto a cluster of scalar compute, vector compute and matrix compute engines.  Truly it is better to leave that type of brain surgery to the trained doctors!

But what happens after you’ve selected an AI acceleration solution?  After your team builds a complex SoC containing that IP core?  After that SoC wins design sockets in the systems of OEMs?  What happens when those systems are put to the test by buyers or users of the boxes containing your leading-edge SoC? 

Who Ports and Optimizes the New AI Network in 2028?

The AI acceleration IP selection decision made today – in 2024 – will result in end users in 2028 or 2030 needing to port the latest and greatest breakthrough machine learning models to that vintage 2024 IP block.  Relying solely on the IP vendor to do that porting adds unacceptable business risks!

Layers Of Risk

Business Relationship Risk: Will the end user – likely with a proprietary trained model – feel comfortable relying on a chip vendor or an IP vendor they’ve never met to port and optimize their critical AI model?   And what about the training dataset? If model porting requires substituting operators to fit the limited set of hardwired operators in the NPU accelerator, the network will need to be retrained and requantized – which means getting the training set into the hands of the IP vendor.

Legal Process Risk:  How many NDAs will be needed – with layer upon layer of lawyer time and data security procedures – to get the new model into the hands of the IP vendor for creation of new operators or tuning of performance?  

Porting Capacity Risk:  Will the IP vendor have the capacity to port and optimize all the models needed?  Or will the size of the service team inside the IP vendor become the limiting factor in the growth of the business of the system OEM, and hence the sales of the chips bearing the IP?

Financial Risk:  IP vendor porting capacity can never be infinite, no matter how eager they are to take on that kind of work – so how will they prioritize porting requests?  Will the IP vendor auction off porting priority to the highest bidder?  In a game of ranking customers by importance or willingness to write service-fee checks, someone gets shoved to the end of the list and goes home crying.

Survivor Risk:   The NPU IP segment is undergoing a shakeout. The herd of more than twenty would-be IP vendors will be thinned considerably within four years.  Can you count on the IP vendor of 2024 to still have a team of dedicated porting engineers in 2030?

In What Other Processor Category Does the IP Vendor Write All the End User Code?

Here’s another way to look at the situation: can you name any other category of processor – CPU, DSP, GPU – where the chip design team expects the IP vendor to “write all the software” on behalf of the eventual end user?  Of course not!  A CPU vendor delivers world-class compilers along with the CPU core, and years later the mobile phone OS or the viral mobile app software products are written by the downstream users and integrators.  Neither the core developer nor the chip developer gets involved in writing a new mobile phone app!   The same needs to be true for AI processors – toolchains are needed that empower the average user – not just the super-user core developer – to easily write or port new models to the AI platform.

A Better Way – Leading Tools that Empower End Users to Port & Optimize

Quadric’s fully programmable GPNPU – general purpose NPU – is a C++ programmable processor combining the power-performance characteristics of a systolic-array “accelerator” with the flexibility of a C++ programmable DSP.  The toolchain for Quadric’s Chimera GPNPU combines a Graph Compiler with an LLVM C++ compiler – a toolchain that Quadric delivers to the chip design team and that can be passed all the way down the supply chain to the end user.  End users six years from now will be able to compile brand new AI algorithms from the native PyTorch graph into C++ and then into a binary running on the Chimera processor.   End users do not need to rely on Quadric to be their data science team or their model porting team.  But they do rely on the world-class compiler toolchain from Quadric to empower them to rapidly port their own applications to the Chimera GPNPU core. 

On Chimera QC Ultra, ConvNext Runs 28x Faster Than on a DSP+NPU

ConvNext is one of today’s leading new ML networks. Frankly, it just doesn’t run on most NPUs. So if your AI/ML solution relies on an NPU plus a DSP for fallback, you can bet ConvNext needs to run on that DSP as a fallback operation. Our tests of expected performance on a DSP + NPU combination are dismal – less than one inference per second trying to run the ConvNext backbone if the novel operators – GeLU and LayerNorm – are not present in the hardwired NPU accelerator and those ops instead run on the largest and most powerful DSP core on the market today.

This poor performance makes most of today’s DSP + NPU silicon solutions obsolete. Because the NPU is hardwired, not programmable, new networks cannot be run efficiently.

So what happens when you run ConvNext on Quadric’s Chimera QC Ultra GPNPU, a fully programmable NPU that eliminates the need for a fallback DSP? The comparison to the DSP+NPU fallback performance is shocking:  Quadric’s Chimera QC Ultra GPNPU is 28 times faster (2,800%!) than the DSP+NPU combo for the same large input image size: 1600 x 928.  This is not a toy network with a toy input, but a real, useful image size.   And Quadric has posted results for not just one but three different variants of ConvNext: both the Tiny and Small variants, with input image sizes of 640x640, 800x800, and 928x1600.   The results are all available for inspection online inside the Quadric Developers’ Studio.     

Fully Compiled Code Worked Perfectly

Why did we deliver three different variants of ConvNext all at one time?  And why were those three variants added to the online DevStudio environment at the same time as more than 50 other new network examples, a mere 3 months (July 2024) after our previous release (April)?   Is it because Quadric has an army of super-productive data scientists and low-level software coders who sat chained to their computers 24 hours a day, 7 days a week?  Or did the sudden burst of hundreds of new networks happen because of major new feature improvements in our market-leading Chimera Graph Compiler (CGC)?    It turns out that it is far more productive – and ethical – for us to build world-class graph compilation technology rather than violate labor laws.

Our latest version of the CGC compiler fully handles all the graph operators found in ConvNext, meaning that we are able to take model sources directly from the model repo on the web and automatically compile the source model to target our Chimera GPNPU.  In the case of ConvNext, that was an OpenMMLab repo on Github with a source PyTorch model.  Log in to DevStudio to see the links and sources yourself!  Not hand-crafted.  No operator swapping.  No time-consuming surgical alteration of the models.  Just a world-class compilation stack targeting a fully programmable GPNPU that can run any machine learning model.

Lots of Working Models

Perhaps more impressive than our ability to support ConvNext – at a high 28 FPS frame rate – is the startling increase in the sheer number of working models that compile and run on the Chimera processor as our toolchain rapidly matures (see chart).

Faster Code, Too

Not only did we take a huge leap forward in the breadth of model support in this latest software release, but we also continue to deliver incremental performance improvements on previously running models as the optimization passes inside the CGC compiler continue to expand and improve.   Some networks saw improvements of over 50% in just the past 3 months.  We know our ConvNext numbers will rapidly improve.  

The performance improvements will continue to flow in the coming months and years as our compiler technology continues to evolve and mature. Unlike a hardwired accelerator, whose feature set and performance are fixed the day you tape out, a processor can continue to improve as the optimizations and algorithms in the toolchain evolve.

Programmability Provides Futureproofing

By merging the best attributes of dataflow array accelerators with full programmability, Quadric delivers high ML performance while maintaining endless programmability.   The basic Chimera architecture pairs a block of multiply-accumulate (MAC) units with a full-function 32-bit ALU and a small local memory.  This MAC-ALU-memory building block – called a processing element (PE) – is then tiled into a matrix of PEs to form a family of GPNPUs offering from 1 TOP to 864 TOPS.  Part systolic array and part DSP, the Chimera GPNPU delivers what others fail to deliver – the promise of efficiency and complete futureproofing. 

What’s the biggest challenge for AI/ML? Power consumption. How are we going to meet it?  In late April 2024, a novel AI research paper was published by researchers from MIT and Caltech proposing a fundamentally new approach to machine learning networks – the Kolmogorov Arnold Network, or KAN.  In the two months since its publication, the AI research field has been ablaze with excitement and speculation that KANs might be a breakthrough that dramatically alters the trajectory of AI models for the better – dramatically smaller model sizes delivering similar accuracy at orders of magnitude lower power consumption, both in training and inference.

However, almost every semiconductor built for AI/ML can’t efficiently run KANs. But Quadric can.

First, some background.

The Power Challenge to Train and Infer

Generative AI models for language and image generation have created a huge buzz.  Business publications and conferences speculate on the disruptions to economies and the hoped-for benefits to society.  But the enormous computational cost of training and then running ever larger models also has policy makers worried.  Various forecasts suggest that LLMs alone may consume more than 10% of the world’s electricity in just a few short years, with no end in sight.  No end, that is, until the idea of KANs emerged!  Early analysis suggests KANs can be 1/10th to 1/20th the size of conventional MLP-based models while delivering equal results.

Note that it is not just the datacenter builders grappling with enormous compute and energy requirements for today’s SOTA generative AI models.   Device makers seeking to run GenAI on device are also grappling with compute and storage demands that often exceed the price points their products can support.   For the business executive trying to figure out how to squeeze 32 GB of expensive DDR memory into a low-cost mobile phone in order to run a 20B parameter LLM, the idea of a 1B parameter model that neatly fits into the existing platform with only 4 GB of DDR is a lifesaver!

Built Upon a Different Mathematical Foundation

Today’s AI/ML solutions are built to efficiently execute matrix multiplication. This author is more aspiring comedian than aspiring mathematician, so I won’t attempt to explain the underlying mathematical principles of KANs and how they differ from conventional CNNs and transformers – and there are already several good, high-level explanations published for the technically literate, such as this and this.  But the key takeaway for the business-minded decision maker in the semiconductor world is this: KANs are not built upon the matrix-multiplication building block.  Instead, executing a KAN inference consists of computing a vast number of univariate functions (think polynomials such as 8x³ – 3x² + 0.5x) and then ADDING the results.  Very few MatMuls.
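To make the contrast concrete, here is a minimal sketch with made-up coefficients (and simple polynomials standing in for the learned splines real KANs use): a KAN-style neuron sums learned univariate functions of each input, while an MLP neuron is a dot product of MACs:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Evaluate a learned univariate polynomial via Horner's rule, e.g. coefficients
// {8, -3, 0.5, 0} give 8x^3 - 3x^2 + 0.5x. (Illustrative, not from any trained model.)
double edge_function(double x, const std::vector<double>& coeff) {
    double y = 0.0;
    for (double c : coeff) y = y * x + c;
    return y;
}

// KAN-style neuron: a sum of univariate function evaluations -- additions and
// scalar polynomial/spline math, with no matrix multiply.
double kan_neuron(const std::vector<double>& inputs,
                  const std::vector<std::vector<double>>& edge_coeffs) {
    double sum = 0.0;
    for (size_t i = 0; i < inputs.size(); ++i)
        sum += edge_function(inputs[i], edge_coeffs[i]);
    return sum;
}

// Contrast: an MLP neuron is a dot product (one MAC per input) plus an activation.
double mlp_neuron(const std::vector<double>& inputs, const std::vector<double>& w) {
    double acc = 0.0;
    for (size_t i = 0; i < inputs.size(); ++i) acc += w[i] * inputs[i];
    return std::max(0.0, acc);  // ReLU
}
```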

Toss Out Your Old NPU and Respin Silicon?

No matrix multiplication in KANs?   What about the large silicon area you just devoted to a fixed-function NPU accelerator in your new SoC design?  Most NPUs have hardwired state machines for executing matrix multiplication – the heart of the convolution operation in current ML models.  And those NPUs have hardwired state machines to implement common activation functions (such as ReLU and GeLU) and pooling functions.  None of that matters in the world of KANs, where solving polynomials and doing frequent high-precision ADDs is the name of the game.

GPNPUs to the Rescue

The semiconductor exec shocked to see KANs might well exclaim, “If only there was a general-purpose machine learning processor with massively parallel, general-purpose compute that could perform the ALU operations demanded by KANs!”   

But there is such a processor!   Quadric’s Chimera GPNPU uniquely blends the matrix-multiplication hardware needed to efficiently run conventional neural networks with a massively parallel array of general-purpose, C++ programmable ALUs capable of running any and all machine learning models.   Quadric’s Chimera QB16 processor, for instance, pairs 8192 MACs with a whopping 1024 full 32-bit fixed-point ALUs, giving the user a massive 32,768 bits of parallelism that stands ready to run KAN networks – if they live up to the current hype – or whatever the next exciting breakthrough invention turns out to be in 2027.  Future-proof your next SoC design – choose Quadric!  

Not just a little slowdown. A massive failure!

Conventional AI/ML inference silicon designs employ a dedicated, hardwired matrix engine – typically called an “NPU” – paired with a legacy programmable processor – either a CPU, or DSP, or GPU.  You can see this type of solution from all of the legacy processor IP licensing companies as they try to reposition their cash-cow processor franchises for the new world of AI/ML inference.  Arm offers an accelerator coupled to their ubiquitous CPUs.  Ceva offers an NPU coupled with their legacy DSPs. Cadence’s Tensilica team offers an accelerator paired with a variety of DSP processors.  There are several others, all promoting similar concepts.

The common theory behind these two-core (or even three core) architectures is that most of the matrix-heavy machine learning workload runs on the dedicated accelerator for maximum efficiency and the programmable core is there as the backup engine – commonly known as a Fallback core – for running new ML operators as the state of the art evolves.

As Cadence boldly and loudly proclaims in a recent blog: “The AI hardware accelerator will provide the best speed and energy efficiency, as the implementation is all done in hardware. However, because it is implemented in fixed-function RTL (register transfer level) hardware, the functionality and architecture cannot be changed once designed into the chip and will not provide any future-proofing.”   Cadence calls out NPUs – including their own! – as rigid and inflexible, then extols the virtues of their DSP as the best of the options for fallback.  But as we explain in detail below, being the best at fallback is sort of like being the last part of the Titanic to sink: you are going to fail spectacularly, but you might buy yourself a few extra fleeting moments to rue your choice as the ship goes down.

In previous blogs, we’ve highlighted the conceptual failings of the Fallback concept. Let’s now drill deeper into an actual example to show just how dreadfully awful the Fallback concept is in real life!

Let’s compare the raw computational capacity of the conventional architecture concept – using Cadence’s products as an example – versus that of Quadric’s dramatically higher performance Chimera GPNPU.  Both the Cadence solution chosen for this comparison and the Quadric solution have 16 TOPS of raw machine learning horsepower – 8K multiply-accumulate units each, clocked at 1 GHz.  Assuming both companies employ competent design teams, you would be right to assume that both solutions deliver similar levels of performance on the key building block of neural networks – the convolution operation.   

But when you compare the general-purpose compute capability of the two solutions, the differences are stark.  Quadric’s solution has 1024 individual 32-bit ALUs, yielding an aggregate of over 32,000 bits of ALU parallelism, compared to only 1024 bits for the Cadence product.  That means any ML function that needs to “fall back” to the DSP in the Cadence solution runs 32 times slower than on the Quadric solution.  Other processor vendors’ NPU solutions are even worse: Cadence’s 1024-bit DSP is twice as wide as the simpler 512-bit DSPs, and four times as wide as CPU vector extensions that only deliver 256-bit SIMD parallelism.  Those other products are 64X or 128X slower than Quadric!
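The ratio falls straight out of the bit widths – a simplifying sketch that assumes every ALU lane can be kept busy:

```cpp
#include <cstdio>

int main() {
    // ALU parallelism available for "fallback" (non-MAC) operators, assuming
    // full utilization of every lane -- a best case for each fallback engine.
    const int gpnpu_bits = 1024 * 32;  // 1024 x 32-bit ALUs = 32,768 bits
    const int wide_dsp   = 1024;       // 1024-bit VLIW/SIMD DSP
    const int mid_dsp    = 512;        // 512-bit vector DSP
    const int cpu_simd   = 256;        // 256-bit CPU vector extension

    std::printf("vs 1024-bit DSP : %dx\n", gpnpu_bits / wide_dsp);  // 32x
    std::printf("vs 512-bit DSP  : %dx\n", gpnpu_bits / mid_dsp);   // 64x
    std::printf("vs 256-bit SIMD : %dx\n", gpnpu_bits / cpu_simd);  // 128x
    return 0;
}
```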

ConvNext – A Specific, Current Day Example

One of the more recent new ML networks is ConvNext, boasting top-1 accuracy levels of as much as 87.8% – nearly 10 full percentage points higher than the ResNet architectures of just a few years ago.  Two key layers in the ConvNext family of networks that typically are not present in fixed-function NPUs are the LayerNorm normalization function and the GeLU activation, used in place of the previously more common ReLU activation.   In the conventional architectures promoted by other IP vendors, the GeLU and LayerNorm operations would run on the fallback DSP, not the accelerator.  The result – shown in the calculations in the illustration below – is shocking!
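Both layers are dominated by element-wise and reduction arithmetic rather than matrix multiplies, which is why a MAC-only accelerator cannot absorb them. A scalar reference sketch (LayerNorm simplified here to a single scale and shift rather than per-channel vectors):

```cpp
#include <cmath>
#include <vector>

// Reference (scalar) versions of the two ConvNext layers that typically fall
// back: both are dominated by ALU work -- adds, multiplies by scalars, erf and
// sqrt -- not by the matrix multiplies a fixed-function NPU accelerates.
float gelu(float x) {
    return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
}

void layer_norm(std::vector<float>& x, float gamma, float beta, float eps = 1e-6f) {
    float mean = 0.0f;
    for (float v : x) mean += v;
    mean /= x.size();

    float var = 0.0f;
    for (float v : x) var += (v - mean) * (v - mean);
    var /= x.size();

    float inv_std = 1.0f / std::sqrt(var + eps);
    for (float& v : x) v = gamma * (v - mean) * inv_std + beta;
}
```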

Fallback Fails – Spectacularly

Our analysis of ConvNext on an NPU+DSP architecture suggests a throughput of less than 1 inference per second.  Note that these numbers for the fallback solution assume perfect 100% utilization of all the available ALUs in an extremely wide 1024-bit VLIW DSP.  Reality would undoubtedly fall below that speed-of-light 100% mark, and the FPS would suffer even more.   In short, fallback is unusable.  Far from being a futureproof safety net, the DSP or CPU fallback mechanism is a death trap for your SoC.

Is ConvNext an Outlier?

A skeptic might be thinking: “Yeah, but Quadric probably picked an extreme outlier for that illustration. I’m sure there are other networks where fallback works perfectly OK!”   Actually, ConvNext is not an outlier – it has only two layer types requiring fallback. The extreme outlier is provided to us by none other than Cadence, in the previously mentioned blog post.   Cadence wrote about the SWIN (shifted window) transformer:

“When a SWIN network is executed on an AI computational block that includes an AI hardware accelerator designed for an older CNN architecture, only ~23% of the workload might run on the AI hardware accelerator’s fixed architecture. In this one instance, the SWIN network requires ~77% of the workload to be executed on a programmable device.”

Per Cadence’s own analysis, 77% of the entire SWIN network would need to run on the fallback legacy processor. At that rate, why even bother to have an NPU taking up area on the die – just run the entire network – very, very slowly – on a bank of DSPs.  Or throw away your chip design and start over again – if you can convince management to give you the $250M budget for a new tapeout in an advanced node!

Programmable, General Purpose NPUs (GPNPU)

There is a new architectural alternative to fallback – the Chimera GPNPU from Quadric.  By merging the best attributes of dataflow array accelerators with full programmability (see last month’s blog for details), Quadric delivers high ML performance while maintaining endless programmability.   The basic Chimera architecture pairs a block of multiply-accumulate (MAC) units with a full-function 32-bit ALU and a small local memory.  This MAC-ALU-memory building block – called a processing element (PE) – is then tiled into a matrix of 64, 256 or 1024 PEs to form a family of GPNPUs offering from 1 TOP to 16 TOPS each, and multicore configurations reaching hundreds of TOPS.  Part systolic array and part DSP, the Chimera GPNPU delivers what fallback fails to deliver – the promise of efficiency and complete futureproofing.  See more at quadric.io.
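Conceptually – as a mental model of the description above, not actual RTL or vendor code – a PE and its tiling can be pictured like this:

```cpp
// Conceptual model only: the building blocks are placeholders that stand in
// for the hardware units described in the text.
struct MacBlock {};   // cluster of multiply-accumulate units
struct Alu32 {};      // full-function 32-bit ALU
struct LocalMem {};   // small local memory for operands and activations

// One processing element pairs MACs, an ALU and local memory.
struct ProcessingElement {
    MacBlock macs;
    Alu32    alu;
    LocalMem mem;
};

// PEs are tiled into a 2D matrix (e.g. 8x8 = 64, 16x16 = 256, 32x32 = 1024),
// all sharing a single instruction stream and a single AXI data port.
template <int Rows, int Cols>
struct GpnpuCore {
    ProcessingElement pe[Rows][Cols];
};
```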

The ResNet family of machine learning algorithms, introduced to the AI world in 2015, pushed AI forward in new ways. However, today’s leading-edge classifier networks – such as the Vision Transformer (ViT) family – have Top-1 accuracies a full 10 percentage points ahead of the top-rated ResNet on the leaderboard. ResNet is old news. Surely other new algorithms will be introduced in the coming years.

Let’s take a look back at how far we’ve come. Shortly after ResNet’s introduction, new variations were rapidly discovered that pushed the accuracy of ResNets close to the 80% threshold (78.57% Top-1 accuracy for ResNet-152 on ImageNet).  This state-of-the-art performance at the time, coupled with a rather simple operator structure that was readily amenable to hardware acceleration in SoC designs, turned ResNet into the go-to litmus test of ML inference performance.  Scores of design teams built ML accelerators in the 2018-2022 time period with ResNet in mind. 

These accelerators – called NPUs – shared one common trait: the use of integer arithmetic instead of floating-point math. Integer formats are preferred for on-device inference because an INT8 multiply-accumulate (the basic building block of ML inference) can be 8X to 10X more energy efficient than executing the same calculation in full 32-bit floating point.  The process of converting a model’s weights from floating point to integer representation is known as quantization.  Unfortunately, some degree of fidelity is always lost in the quantization process. 
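As a simple illustration of what quantization does, here is a generic symmetric per-tensor INT8 sketch – not any particular vendor’s scheme, which would typically add per-channel scales, zero-points, calibration and retraining:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Generic symmetric per-tensor quantization sketch: map float weights into
// int8 using a single scale factor. Dequantization is approximately q[i] * scale,
// and the gap between that and the original float is the "fidelity" lost.
std::vector<int8_t> quantize_int8(const std::vector<float>& w, float& scale_out) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    float scale = max_abs / 127.0f;          // one int8 step per 'scale' float units
    if (scale == 0.0f) scale = 1.0f;         // guard the all-zero case
    scale_out = scale;

    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); ++i) {
        int v = static_cast<int>(std::lround(w[i] / scale));
        q[i] = static_cast<int8_t>(std::clamp(v, -127, 127));
    }
    return q;
}
```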

NPU designers have, in the past half-decade, spent enormous amounts of time and energy fine-tuning their weight quantization strategies to minimize the accuracy loss of an integer version of ResNet compared to the original Float-32 source (often referred to as Top-1 loss).   For most builders and buyers of NPU accelerators, a loss of 1% or less is the litmus test of goodness.  Some even continue to fixate on that number today, even in the face of dramatic evidence suggesting that now-ancient networks like ResNet should be relegated to the dustbin of history.  What evidence, you ask?  Look at the “leaderboard” for ImageNet accuracy on the Papers With Code website:

https://paperswithcode.com/sota/image-classification-on-imagenet

It’s Time to Leave the Last Decade Behind

As the leaderboard chart aptly demonstrates, today’s leading-edge classifier networks – such as the Vision Transformer (ViT) family – have Top-1 accuracies exceeding 90%, a full 10 percentage points ahead of the top-rated ResNet on the leaderboard.  For on-device ML inference, performance, power consumption and inference accuracy must all be balanced for a given application.  In applications where accuracy really matters – such as automotive safety applications – design teams should be rushing to embrace these new network topologies to gain the extra 10+% in accuracy. 

Surely it makes more sense to embrace 90% accuracy thanks to a modern ML network – such as ViT, or the SWIN transformer, or DETR – rather than fine-tuning an archaic network in pursuit of its original 79% ceiling?  Of course it does.  So then why haven’t some teams made the switch?

Why Stay with an Inflexible Accelerator?

Perhaps those who are not Embracing the New are limited because they chose accelerators with limited support for new ML operators. If a team implemented a fixed-function accelerator in an SoC four years ago and that accelerator cannot add new ML operators, then many newer networks – such as transformers – cannot run on those fixed-function chips today.  A new silicon respin – which takes 24 to 36 months and millions of dollars – is needed. 

But one look at the leaderboard chart tells you that picking a fixed set of operators now, in 2024, and waiting until 2027 to get working silicon only repeats the same cycle of being trapped as new innovations push state-of-the-art accuracy – or retain accuracy with less computational complexity.  If only there were a way to run both today’s known networks efficiently and tomorrow’s SOTA networks on device, with full programmability!

Use Programmable, General Purpose NPUs (GPNPU)

Luckily, there is now an alternative to fixed-function accelerators.  Since mid-2023 Quadric has been delivering its Chimera General Purpose NPU (GPNPU) as licensable IP.  With a TOPS range scaling from 1 TOP to 100s of TOPS, the Chimera GPNPU delivers accelerator-like efficiency while maintaining C++ programmability to run any ML operator – and we mean any.  Any ML graph.  Any new network. Even ones that haven’t been invented yet.  Embrace the New!

