In today’s disaggregated electronics supply chain, the (1) application software developer, (2) ML model developer, (3) device maker, (4) SoC design team and (5) NPU IP vendor often work for as many as five different companies.  It can be difficult or impossible for the SoC team to know or predict actual AI/ML workloads and full system behaviors two or three years in advance of actual deployment.  How, then, can that SoC team make good choices when provisioning compute engines and memory resources for an unknown future without defaulting to “Max TOPS / Min Area”?

There has to be a smarter way to eliminate bottlenecks while determining the optimum local memory for AI/ML subsystems.

Killer Assumptions

“I want to maximize the MAC count in my AI/ML accelerator block because the TOPS rating is what sells, but I need to cut back on memory to save cost,” said no successful chip designer, ever.

Emphasis on “successful” in the above quote.  We’ve heard comments like this many times.  Chip architects – or their marketing teams – try to squeeze as much brag-worthy horsepower into a new chip design as they can while holding down silicon cost. 

Many SoC teams deploy home-grown accelerators for machine learning inference.  Those internal solutions often lack accurate simulation models that can be plugged into larger SoC system simulations, so they must be simulated at the logic level in Verilog to determine memory access patterns to the rest of the system resources.  Given slow gate-level simulation speeds, the data set gathered is often small and limited to one or two profiled workloads.  This lack of detailed information can entice designers into several deadly assumptions.

Killer assumption number one is that the memory usage patterns of today’s reference networks – many of which are already five or more years old (think ResNet) – will remain largely unchanged as networks evolve.

The second risky assumption is to budget a fixed percentage of available external system bandwidth without accounting for resource contention over time.

Falling prey to those two common traps can lead the team to declare design goals met with just a small local memory buffer dedicated to the NPU accelerator.  Unfortunately, they find out after silicon is in hand that newer models have different data access patterns requiring smaller, more frequent random accesses to off-chip memory – a real performance killer in systems that perform best with large burst transfers.

If accelerator performance depends on the next activation data element or network weight being ready for computation by your MAC array within a few clock cycles, waiting 1000 cycles to gain access to the DDR channel only to underutilize it with a tiny data transfer can wreak havoc with achievable performance.
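A bit of back-of-the-envelope arithmetic shows why those small, latency-bound transfers hurt so much.  The sketch below is illustrative only – the bus width and wait times are assumptions, not measurements of any particular system:

```python
# Rough effective-bandwidth arithmetic (illustrative numbers only): a 64-byte
# random access that waits 1000 cycles for the DDR channel wastes nearly all of
# the channel's capability, while a 4 KB burst amortizes the same wait.
BUS_BYTES_PER_CYCLE = 32          # assumed AXI data width (256-bit)

def effective_bandwidth(transfer_bytes, wait_cycles):
    transfer_cycles = transfer_bytes / BUS_BYTES_PER_CYCLE
    return transfer_bytes / (wait_cycles + transfer_cycles)   # bytes per cycle

for size in (64, 4096):
    bw = effective_bandwidth(size, wait_cycles=1000)
    print(f"{size:>5}-byte transfer: {bw:5.2f} bytes/cycle "
          f"({bw / BUS_BYTES_PER_CYCLE:.1%} of peak)")
```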

Using Big Memory Buffers Might Not Solve the Problem

You might think the obvious remedy to the conundrum is to provision more on-chip SRAM buffer memory than the bare minimum required.  That might help in some circumstances.  Or it might just add cost without solving the problem, if a hardwired state-machine accelerator with inflexible memory access and stride patterns continues to issue an excessive number of tiny block transfer requests to DDR over the chip’s AXI fabric.

The key to finding the Goldilocks amount of memory – not too small and not too much – is twofold:

  1. Pick a machine learning inference processing solution that smartly manages local SRAM memory with flexible, code-driven implementations of new networks that minimize the sheer number of external requests in addition to the absolute bandwidth consumed (GB/sec).
  2. Pick an acceleration solution that smartly prefetches data anticipated to be needed ahead in the graph execution, such that most requests for new data are made in advance and thus the subsystem can tolerate variable response times from on-chip and off-chip memory resources.
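
The second point is easiest to see in code.  The sketch below is a minimal illustration of double-buffered prefetching in Python-flavored pseudocode; the dma and compute objects and their fetch_async/wait methods are hypothetical stand-ins, not Quadric APIs:

```python
# Minimal sketch of double-buffered prefetching (hypothetical interfaces).
# While the compute engine works on tile i, the DMA engine fetches tile i+1,
# so a slow DDR response delays a prefetch that is already in flight rather
# than stalling the MAC array.
def run_layer(tiles, dma, compute):
    pending = dma.fetch_async(tiles[0])              # start the first transfer early
    for i in range(len(tiles)):
        current = pending.wait()                     # usually already complete
        if i + 1 < len(tiles):
            pending = dma.fetch_async(tiles[i + 1])  # overlap the next fetch with compute
        compute(current)                             # MAC-heavy work on local SRAM data
```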

Solving the Memory Challenge - Smartly

Quadric has written extensively on techniques that analyze data usage across large swathes of an ML graph to ease memory bottlenecks.  Our blog on advanced operator fusion –  Fusion In Local Memory (FILM) – illustrates these techniques.

Additionally, the extensive suite of system simulation capability that accompanies the Chimera core provides a rich set of data that helps the SoC designer understand code and memory behavior.  And the ability of the Chimera Graph Compiler to smartly schedule prefetching of data provides tremendous resiliency to aberrant system response times, even when a Chimera GPNPU is configured with relatively small local memories.

Quadric Chimera GPNPUs can be configured with a local buffer memory ranging from 1 MB to 32 MB, depending on system requirements.  Some chip architects assume that a fully C++ programmable processor like the Chimera core needs “large” local memories to deliver good performance.  But as the chart below shows, smartly managing local memory with a code-driven graph compilation technology delivers excellent resiliency to system resource contention even with very small local memory configurations.

For this analysis we sampled five different ML networks (ResNet50, two Yolo variants, UNet and CenterNet) across all three Chimera core sizes (1 TOP, 4 TOP, 16 TOP).  In all scenarios we modeled core performance with a relatively small 4 MB of local SRAM (called L2 memory in our architecture) and assumed peak AXI system bandwidth to DDR memory of 32 GB/sec (an LPDDR4 or LPDDR5 connection). 

Thanks to extensive FILM region optimization and smart data prefetching, all five scenarios show remarkable tolerance of sluggish system response, as the Chimera core performance degrades less than 1% even when average DDR response time is 10X worse than the system ideal of 100 cycles (1000 cycles instead of 100).
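
Why does prefetching buy that much tolerance?  A simple model makes the intuition concrete.  The numbers below are illustrative assumptions, not measured Chimera data: as long as a fetch is scheduled further ahead than the memory latency, the extra latency never reaches the compute pipeline.

```python
# Back-of-envelope stall model (illustrative numbers, not measured data).
# A tile's data is requested `lead_cycles` before it is needed; the core only
# stalls when the memory latency exceeds that scheduling lead.
def stall_per_tile(latency_cycles, lead_cycles):
    return max(0, latency_cycles - lead_cycles)

COMPUTE_CYCLES_PER_TILE = 20_000   # assumed MAC work per prefetched tile
PREFETCH_LEAD_CYCLES    = 15_000   # assumed scheduling lead from the graph compiler

for latency in (100, 1_000, 10_000):
    stall = stall_per_tile(latency, PREFETCH_LEAD_CYCLES)
    print(f"DDR latency {latency:>6} cycles -> "
          f"{stall / COMPUTE_CYCLES_PER_TILE:.1%} slowdown per tile")
```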

The bottom line:  Don’t agonize over how much or how little memory to choose, and don’t squeeze memory and simply hope for the best.  Choose an ML solution with the kind of programmability, modeling capability, and smart memory management that Quadric offers.  Instead of agony, you’ll know you made the right resource choices before you tape out – and you’ll be saying “Thanks for the memories!”  See for yourself at www.quadric.io

Wait! Didn’t That Era Just Begin?

The idea of transformer networks has existed since the seminal publication of the Attention is All You Need paper by Google researchers in June 2017.  And while transformers quickly gained traction within the ML research community, and in particular demonstrated superlative results in vision applications (ViT paper), transformer networks were definitely not a topic of trendy conversation around the family holiday dinner table.  Until late 2022, that is. 

On Nov 30, 2022, OpenAI released ChatGPT and within weeks millions of users were experimenting with it.  Soon thereafter the popular business press and even the TV newscasts were both marveling at the results as well as publishing overwrought doomsday predictions of societal upheaval and chaos.

ChatGPT ushered in the importance of Transformer Networks in the world of Machine Learning / AI.

During the first few months of Transformer Fever, the large language models (LLMs) based on transformer techniques were the exclusive province of cloud compute centers because model sizes were much too large to contemplate running on a mobile phone, a wearable device, or an embedded appliance.  But by midsummer 2023 the narrative had shifted, and in the second half of the year every silicon vendor and NPU IP vendor was talking about chip and IP core changes that would support LLMs on future devices.

Change is Constant

Just six weeks after the release of the Llama 2 LLM, in September 2023, Quadric was the first to demonstrate it running on an existing IP core.  But we were quick to highlight that Llama 2 wasn’t the end of the evolution of ML models and that hardware solutions needed to be fully programmable to react to the massive rate of change occurring in data science.  No sooner had that news hit the street – and our target customers in the semiconductor business started ringing our phone off the hook – than we started seeing the same media sources that hyped transformers and LLMs in early 2023 begin to predict the end of the lifecycle for transformers!

Too Early to Predict the End of Transformers?

In September 2023, Forbes was first out of the gate predicting that other ML network topologies would supplant attention-based transformers.   Keep in mind, this is a general business-oriented publication aimed at the investor-class, not a deep-tech journal.  More niche-focused, ML-centric publications such as Towards Data Science are also piling on, with headlines in December 2023 such as A Requiem For the Transformer.

Our advice: just as doomsday predictions about transformers were too hyperbolic, so too are predictions about the imminent demise of transformer architectures.   But make no mistake, the bright minds of data science are hard at work today inventing the Next New Thing that will certainly capture the world’s attention in 2024 or 2025 or 2026 and might one day indeed supplant today’s state of the art. 

What Should Silicon Designers Do Today?

Just as today’s ViT and LLM models didn’t completely send last decade’s Resnet models to the junk heap, tomorrow’s new hero model won’t eliminate the LLMs that are consuming hundreds of billions of dollars of venture capital right now.  SoC architects need compute solutions that can run last year’s benchmarks (Resnet, etc) plus this year’s hot flavor of the month (LLMs) as well as the unknown future champion of 2026 that hasn’t been invented yet.

Should today’s chip designer choose a legacy hardwired NPU optimized for convolutions?  Terrible idea – you already know that first-generation accelerator is broken.  Should they adapt and build a second generation, hardwired accelerator evolved to support both CNNs and transformers?  No way – still not fully programmable – and the smart architect won’t fall for that trap a second time. 

Quadric offers a better way – the Chimera GPNPU.  “GP” because it is general purpose: fully C++ programmable to support any innovation in machine learning that comes along in the future.  “NPU” because it is massively parallel and matrix-optimized, offering the same efficiency and throughput as hardwired “accelerators” combined with ultimate flexibility.  See for yourself at www.quadric.io

It Will Make for a Much Better NPU Vendor Evaluation.

We’ve written before about the ways benchmarks for NPUs can be manipulated to the point where you just can’t trust them. There are two common major gaps in collecting useful comparison data on NPU IP: [1] not specifically identifying the exact source code repository of a benchmark, and [2] not specifying that the entire benchmark code be run end-to-end, with any omissions reported in detail. Our blog explains these gaps.

However, there is a straightforward, low-investment method to short-circuit all the vendor shenanigans and get a solid apples-to-apples result: Build Your Own Benchmarks.  BYOB!

This might sound like a daunting task, but it isn’t. At the very beginning of your evaluation, it’s important to winnow the field of possible NPU vendors. This winnowing is essential now that a dozen or more IP companies are offering NPU “solutions.” At this stage, you don’t need to focus on absolute inference accuracy as much as you need to judge key metrics of [1] performance, [2] memory bandwidth, [3] breadth of NN model support, [4] breadth of NN operator support, and [5] speed and ease of porting of new networks using the vendors’ toolsets. Lovingly crafted quantization can come later.

The Problem with Most ML Reference Models

All machine learning models begin life in a model training framework using floating point data representation. The vast majority of published reference models are found in their native FP32 format – because it’s easy for the data science teams that invent the latest, greatest model to simply publish the original source.   But virtually all device-level inference happens in quantized, INT-8 or mixed INT-8/16 formats on energy-optimized NPUs, GPNPUs, CPUs and GPUs that perform model inference at 5X to 10X lower energy than a comparable FP32 version of the same network. 

Quantizing FP32 models to INT8 in a manner that preserves inference accuracy can be a daunting prospect for teams without spare data scientists looking for something to keep them busy.  But do you really need a lovingly crafted, high-accuracy model to judge the merits of 10 vendor products clamoring for your attention? In short, the answer is: NO. 

First, recognize that the benchmarks of today (late 2023, early 2024) are not the same workloads that your SoC will run in 2026 or 2027 when it hits the market – so any labor spent fine-tuning model weights today is likely wasted effort.  Save that effort for perhaps a final step in the IP vendor selection, not early in the process.

Secondly, note that the cycle counts and memory traffic data that will form the basis for your PPA (power, performance, area) conclusions will be exactly the same whether the model weights were lovingly hand-crafted or simply auto-converted.  What matters for the early stages of an evaluation are: can the NPU run my chosen network? Can it run all the needed operators? What is the performance?

Making Your Own Model

Many IP vendors have laboriously hand-tuned, hand-pruned and twisted a reference benchmark beyond recognition in order to win the benchmark game.  You can short-circuit that gamesmanship by preparing your own benchmark model.  Many training toolsets have existing flows that can convert floating point models to quantized integer versions (e.g., TensorFlow QAT, PyTorch Quantization).  Or you can use the quantization tooling that accompanies interchange formats, such as ONNX Runtime, to do the job.  Don’t worry about how well the quantization was performed – just be sure that all layers have been converted to an integer format.
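
As one example of such an auto-conversion flow, the snippet below uses ONNX Runtime’s dynamic quantization utility to convert an FP32 ONNX model’s weights to INT8.  The file names are placeholders, and static quantization with a calibration data reader is an equally valid route; the point is simply to produce an integer model with minimal effort:

```python
# One possible auto-conversion flow using ONNX Runtime's quantization utilities.
# File names are placeholders; no accuracy tuning is attempted because the goal
# is an identical integer yardstick to hand to every NPU vendor.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="resnet50_fp32.onnx",    # FP32 reference model exported to ONNX
    model_output="resnet50_int8.onnx",   # weights converted to INT8
    weight_type=QuantType.QInt8,
)
```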

Even if all you do is take a handful of common networks with some variety – for example: Resnet50, an LLM, VGG16, a Vision Transformer, and a pose network – you can know that the yardstick you give to the IP vendors is the same for all of them.  Ask them to run the model through their tools and give you both the performance results and the final form of the network they ran, so you can double-check that all of the network ran on the NPU – and none of the operators had to fall back onto a big, power-guzzling CPU.

Better yet – ask the vendor if their tools are robust enough to let you try to run the models yourself!

Getting the Information That Really Counts

BYOB will allow you to quickly ascertain the maturity of a vendor toolchain, the breadth of operator coverage, and the PPA metrics you need to pare that list of candidates down to a Final Two that deserve the full, deep inspection to make a vendor selection.

If you’d like to try running your own benchmarks on Quadric’s toolchain, we welcome you to register for an account in our online Developer’s Studio and start the process!  Visit us at www.quadric.io

Llama2, YOLO Family and MediaPipe family of networks now available

October 31, 2023

Quadric today delivered a Halloween treat for all registered users of the Quadric Developers’ Studio.  This latest update of the industry’s flagship GPNPU software tools showcases a significant new batch of machine learning benchmark networks and several big enhancements to the underlying Chimera™ software development toolkit that powers Quadric’s Chimera general purpose neural processor.

Llama2 Code Available

Quadric’s first-to-market implementation of the Llama2 large language model is now available for customer viewing and inspection in the benchmarks section of the Studio.  Users can explore all aspects of the 15M parameter “baby” Llama2 implementation, including all source code.

MediaPipe Network Family, Yolo Networks

A suite of the Google MediaPipe family of networks is now available: Hand Landmark (full, lite), Palm Detection (full, lite), Face Landmark, Face Detection (short range), and SSDLite Object Detection.  All run on the Chimera GPNPU and can be benchmarked and viewed by prospective licensees.

Additionally, full implementations of the YOLO V3 and YOLO V4 networks are also released to DevStudio users.  These YOLO model ports are of the complete YOLO networks – all layers, all operators – running entirely on the Chimera GPNPU processor.  As with all networks that have been ported to the Chimera family of GPNPUs (spanning a range from 1 TOPs to 16 TOPs) the entire network always runs 100% on the GPNPU and there is no companion CPU or DSP required. 

Significant Compiler and Tool Advancements in Chimera SDK 23.10 Release

The quality and performance of code generated by the Chimera Graph Compiler and Chimera C++ compiler continues to make substantial strides with each release.  This latest SDK release adds a further 33% performance improvement to Quadric’s already market-leading performance levels on the ViT_B vision transformer, and a 13% performance improvement on the older RESNET18 benchmark network.  A full performance improvement chart showing the speedups for more than a dozen networks is available inside DevStudio.

A host of new and enhanced code development and debug tools are also now available to further speed the porting and debug process, including a new tool for numerical validation/correlation of results between Chimera code and ONNX Runtime, and enhanced performance visualization tools at both the operator and subgraph levels.

Immediate Availability

Current and prospective users can access these new features and benchmarks at https://studio.quadric.io/login

Your Spreadsheet Doesn’t Tell the Whole Story.

Thinking of adding an NPU to your next SoC design? Then you’ll probably begin the search by sending prospective vendors a list of questions, typically called an RFI (Request for Information) or you may just send a Vendor Spreadsheet. These spreadsheets ask for information such as leadership team, IP design practices, financial status, production history, and – most importantly – performance information, aka “benchmarks”.

It's easy to get benchmark information on most IP because these benchmarks are well understood. For an analog I/O cell you might collect jitter specs. For a specific 128-pt complex FFT on a DSP there’s very little wiggle room for the vendor to shade the truth.  However, benchmarking machine learning inference IP – usually called an NPU or NPU accelerator – is a real challenge.

Why is it such a challenge for NPUs? There are two common major gaps in collecting useful “apples to apples” comparison data on NPU IP: [1] not specifically identifying the exact source code repository of a benchmark, and [2] not specifying that the entire benchmark code be run end to end, with any omissions reported in detail.   

Specifying Which Model

It’s not as easy as it seems. Handing an Excel spreadsheet to a vendor that asks for “Resnet50 inferences per second” presumes that both parties know exactly what “Resnet50” means.  But does that mean the original Resnet50 from the 2015 published paper?   Or from one of thousands of other source code repos labeled “Resnet50”? 

The original network was trained using FP32 floating point values.  But virtually every embedded NPU runs quantized networks (primarily INT8 numerical formats, some with INT4, INT16 or INT32 capabilities as well), which means the thing being benchmarked cannot be “the original” Resnet50.  Who should do the quantization – the vendor or the buyer?  What is the starting point framework – a PyTorch model, a TensorFlow model, TFLite, ONNX – or some other format?  Is pruning of layers or channels allowed?  Is the vendor allowed to inject sparsity (driving weight values to zero in order to “skip” computation) into the network?  Should the quantization be symmetric, or can asymmetric methods be used?

You can imagine the challenge of comparing the benchmark from vendor X if they use one technique but Vendor Y uses another. Are the model optimization techniques chosen reflective of what the actual users will be willing and able to perform three years later on the state-of-the-art models of the day when the chip is in production?  Note that all of these questions are about the preparation of the model input that will flow into the NPU vendor’s toolchain.  The more degrees of freedom that you allow an NPU vendor to exploit, the more the comparison yardstick changes from vendor to vendor.
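
To see why the symmetric-versus-asymmetric question alone can move results, here is a small illustrative sketch of the two schemes; the formulas are the textbook INT8 mappings, and the random weights are simply a stand-in for real model tensors:

```python
# Illustrative comparison of symmetric vs. asymmetric INT8 quantization.
import numpy as np

def symmetric_int8(x):
    # Zero-point fixed at 0; the scale covers the largest magnitude.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def asymmetric_int8(x):
    # The zero-point shifts the range so min(x) maps to -128 and max(x) to 127.
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

weights = np.random.default_rng(0).normal(0.1, 0.5, 4096).astype(np.float32)
_, sym_scale = symmetric_int8(weights)
_, asym_scale, asym_zp = asymmetric_int8(weights)
print(f"symmetric scale={sym_scale:.5f}  asymmetric scale={asym_scale:.5f}, zero_point={asym_zp:.0f}")
```

Two vendors picking different schemes will quantize the same tensor onto different integer grids, which is exactly why the yardstick has to be pinned down before the comparison starts.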

Figure 1. So many benchmark choices.

Using the Entire Model

Once you get past the differences in source models there are still more questions. You need to ask [a] does the NPU accelerator run the entire model – all layers - or does the model need to be split with some graph layers running on other processing elements on chip, and [b] do any of the layers need to be changed to conform to the types of operators supported by the NPU?  

Most NPUs implement only a subset of the thousands of possible layers and layer variants found in modern neural nets.  Even for old benchmark networks like Resnet50 most NPUs cannot perform the final SoftMax layer computations needed and therefore farm that function out to a CPU or a DSP in what NPU vendors typically call “fallback”.  

This fallback limitation magnifies tenfold when newer transformer networks (vision transformers, LLMs) that employ dozens of NMS or Softmax layer types are the target.  One of the other most common limitations of NPUs is a restricted range of supported convolution layer types.  While virtually all NPUs very effectively support 1x1 and 3x3 convolutions, many do not support larger convolutions or convolutions with unusual strides or dilations.  If your network has an 11x11 Conv with stride 5, do you accept that this layer needs to fall back to the slower CPU, or do you have to engage a data scientist to alter the network to use one of the known Conv types that the high-speed NPU can support?

Taking both of these types of changes into consideration, you need to carefully look at the spreadsheet answer from the IP vendor and ask “Does this Inferences/Sec benchmark data include the entire network, all layers?  Or is there a workload burden on my CPU that I need to measure as well?” 

Power benchmarks also are impacted: the more the NPU offloads back to the CPU, the better the NPU vendor’s power numbers look – but the actual system power numbers look far, far worse once CPU power and system memory/bus traffic power numbers are calculated.

Gathering and analyzing all the possible variations from each NPU vendor can be challenging, making true apples to apples comparisons almost impossible using only a spreadsheet approach. 

The Difference with a Quadric Benchmark

Not only is the Quadric general purpose NPU (GPNPU) a radically different product – running the entire NN graph plus pre- and post-processing C++ code – but Quadric’s approach to benchmarks is different also.  Quadric pushes the Chimera toolchain out in the open for customers to see at www.Quadric.io.   Our DevStudio includes all the source code for all the benchmark nodes shown, including links back to the source repos.  Evaluators can run the entire process from start to finish – download the source graph, perform quantization, compile a graph using our Chimera Graph Compiler and LLVM C++ compilers, and run the simulation to recreate the results.  No skipping layers.  No radical network surgery, pruning or operator changes.  No removal of classes.  The full original network with no cheating.  Can the other vendors say the same?

In July of 1886, Carl Benz held the first public outing for his “vehicle powered by a gas engine” – the first automobile ever invented. Nine short months after the first car was publicly displayed, the first auto race was staged on April 28, 1887, spanning a distance of 2 kilometres (1.2 mi) from Neuilly Bridge to the Bois de Boulogne in Paris.

As cars became more dependable and commonplace as a means of convenient transportation, so too grew the desire to discover the technology’s potential. Auto racing formally became an organized sport in the early 1900s, and today it’s a multi-billion-dollar-per-year industry that has produced vehicles capable of top speeds of almost 500 kph (over 300 mph).

Similar to that first auto race, Artificial Intelligence (AI) applications are in their nascent years, but many speculate that their impact may be even greater than that of the automobile, and many are racing to test their current limits. Just as race cars today look nothing like the original three-wheeled Motor Car, AI platforms are changing to meet the performance demands of this new class of programs.

If you’ve been following any of the companies that are joining this race, you may have heard them use the term “operator fusion“ when bragging about their AI acceleration capabilities. One of the most brag-worthy features to come out of the PyTorch 2.0 release earlier this year was the ‘TorchInductor’ compiler backend that can automatically perform operator fusion, which resulted in 30-200% runtime performance improvements for some users.

From context, you can probably infer that “operator fusion“ is some technique that magically makes Deep Neural Network (DNN) models easier and faster to execute… and you wouldn’t be wrong. But what exactly is operator fusion and how can I use operator fusion to accelerate my AI inference applications?

AI applications, especially those that incorporate Deep Neural Network (DNN) inference, are composed of many fundamental operations such as Multiply-Accumulate (MAC), Pooling layers, activation functions, etc.

Conceptually, operator fusion (sometimes also referred to as kernel or layer fusion) is an optimization technique for reducing the costs of two or more operators in succession by reconsidering their scope as if they were a single operator.

You intuitively do this type of “fusion“ routinely in your everyday life without much thought. Consider for example that you have two tasks in a given day: going to the office for work and going to the grocery store for this week's groceries.

Figure 1. Reducing instances of driving by one by fusing two tasks – “going to work“ and “going to the grocery store“ – by reconsidering them as a single task. Image by Author.

If your office and local grocery store are in the same part of town, you might instinctively stop by the grocery store on your way home from work, reducing the number of legs of your journey by one, which might result in less total time spent driving and less total distance driven.

Operator fusion optimizes AI programs in a similar way, but the performance benefits are measured in fewer memory reads and writes, which means fewer clock cycles spent on program overhead and a more efficient program binary.

Let’s look at a practical example. Here’s a visual diagram of a network before and after operator fusion performed by NVIDIA’s TensorRT™ Optimizer:

Figure 2. A GoogLeNet Inception module network before and after layer fusion performed by NVIDIA’s TensorRT™ Optimizer. Original image available in this blog post by NVIDIA.

In the example in Figure 2, operator fusion was able to reduce the number of layers (blocks not named “input“ or “next input“) from 20 to 5 by fusing combinations of operators such as the convolution, bias, and activation layers discussed below.

The performance benefits vary depending on the hardware platform being targeted, but operator fusion can provide benefits for nearly every runtime target. Because of its universality, operator fusion is a key optimization technique in nearly all DNN compilers and execution frameworks.

If reducing the number of kernels reduces program overhead and improves efficiency, and these benefits are applicable universally, this might lead us to ask how far fusion can be taken and which platforms are best equipped to exploit it.

To better understand operator fusion's benefits and limitations, let’s take a deeper dive into the problem it’s solving.

A Practical Example: Fusing Convolutions, Bias Adds, & Activation Functions

Fusing convolutional layers, bias adds, and activation function layers, like NVIDIA’s TensorRT tool did in Figure 2, is an extremely common choice for operator fusion. Convolutions and activation functions can be decomposed into a series of matrix multiplication operations followed by element-wise operations that look something like:

Figure 3. Tensors used in a 3x3 Conv., Bias Add, and ReLU activation function sequence of graph operators. Image by Author.

If these operations are performed as three separate, sequential operators, the graph computation looks something like this (left side of Figure 4):

  1. Input patches (x) and kernel weights (w) are loaded from global into local memory
  2. Tensor product of input patches (x) and weights (w) are computed, outputs are stored in intermediate tensor (m) in local memory
  3. Intermediate tensor output (m) is written from local memory into global memory
  4. Bias value (b) and intermediate tensor output (m) are loaded from global memory into local memory
  5. Bias value (b) is summed with the intermediate tensor (m) to produce the convolution outputs (z)
  6. Convolution outputs (z) are written to global memory
  7. Convolution outputs (z) are loaded into local memory
  8. ReLU activation function is computed for convolution outputs (z), activation outputs (y) are stored in local memory
  9. Activation outputs (y) are written to global memory to be used by next operator(s) in the graph

Figure 4. Local memory loads and stores for tensors used in a 3x3 Conv., Bias Add, and ReLU activation function sequence of graph operators compared to a single fused operator. Image by Author.

If the three operations are fused, the graph computation simplifies to something like this (right side of Figure 4):

  1. Input patches (x), kernel weights (w), and bias values (b) are loaded from memory
  2. Tensor product of input patches (x) and weights (w) are computed, outputs are stored in intermediate tensor (m) in local memory
  3. Bias values (b) are summed with the intermediate tensor (m) to produce the convolution outputs (z)
  4. ReLU activation function is computed for convolution outputs (z), activation outputs (y) are stored in local memory
  5. Activation outputs (y) are written to global memory to be used by next operator(s) in the graph

By representing these three operations as a single operation, we are able to remove four steps: the writes and re-reads of the intermediate tensor (m) and the convolution output tensor (z) out to and back from global memory. In practice, operator fusion is typically achieved when a platform can keep intermediate tensors in local memory on the accelerator platform.
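
To make the before-and-after concrete, here is a minimal NumPy sketch of the same sequence. NumPy has no local/global memory hierarchy, so the round trips are only indicated in comments; the point is that the fused form never materializes m or z as results that must leave the accelerator:

```python
# Unfused vs. fused 3x3 Conv -> Bias Add -> ReLU (single channel, "valid" conv).
import numpy as np

def conv2d(x, w):
    # Naive 2-D convolution, for illustration only.
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

x = np.random.rand(6, 6).astype(np.float32)   # input patches
w = np.random.rand(3, 3).astype(np.float32)   # kernel weights
b = np.float32(0.5)                           # bias value

# Unfused: each intermediate would be written to and re-read from global memory.
m = conv2d(x, w)                 # intermediate tensor (round trip #1)
z = m + b                        # convolution outputs  (round trip #2)
y_unfused = np.maximum(z, 0.0)   # activation outputs

# Fused: intermediates live only as local temporaries; one write of y at the end.
y_fused = np.maximum(conv2d(x, w) + b, 0.0)

assert np.allclose(y_unfused, y_fused)
```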

There are three things that dictate whether the intermediate tensors can remain in local memory:

  1. the compute generality of the platform – can it execute every operator in the fused sequence?
  2. the memory availability – is the local memory large enough to hold the intermediate tensors?
  3. the memory organization – can the data stay where the next operator needs it, without a trip through shared memory?

As we mentioned before, some accelerator platforms are better suited to capitalize on the performance benefits of operator fusion than others. In the next section, we’ll explore why some architectures are more equipped to satisfy these requirements.

The Accelerator Architectural Dilemma: Optimize for Compute or Memory?

DNN models are becoming increasingly deep with hundreds or even thousands of operator layers in order to achieve higher accuracy when solving problems with increasingly broader scopes. As DNNs get bigger and deeper, the memory and computational requirements for running inference also increase.

There are numerous hardware platforms that optimize performance for these compute and memory bottlenecks in a number of clever ways. The benefits of each can be highly subjective to the program(s) being deployed.

We’re going to look at three different types of accelerator cores – GPUs, NPUs, and GPNPUs – that can be coupled with a host CPU on a System-on-Chip (SoC) and targeted for AI inference, and see how capable each architecture is of operator fusion:

GPUs: Throwing Cache at the Problem

GPUs, like the Arm Mali-G720, address the memory bottlenecks of AI and abstract the programming complexity of managing memory by using L2 cache. They address the compute bottlenecks with many general compute cores that operate in parallel but cannot communicate directly with one another.

Figure 5. Arm Mali-G720 GPU Memory Hierarchy. Adapted from a diagram in this blog post

In the context of a GPU’s memory hierarchy, operator fusion is possible if the intermediate tensors can fit entirely in the Local Register Memory (LRM) of a given Shader Core without needing to communicate back out to L2 cache.

Figure 6. Tensor locations for 3x3 Conv., Bias Add, and ReLU activation function sequence of graph operators in an Arm Mali-G720 GPU memory hierarchy. Image by Author.

Since the GPU cores cannot share data directly with one another, they must write back to L2 Cache in order to exchange intermediate tensor data between processing elements or reorganize to satisfy the requirements of the subsequent operators. This inability to share tensor data between compute cores without writing to and from L2 cache prevents them from fusing multiple convolutions together without duplicating weight tensors across cores which taxes their compute efficiency and memory utilization. Since the memory hierarchy is cache-based, the hardware determines when blocks of memory are evicted or not which can make operator fusion nondeterministic if the available memory is approaching saturation.

For these reasons, GPUs satisfy the compute generality requirement for operator fusion, but are occasionally vulnerable to the memory availability and memory organization requirements.

NPU Accelerators: Optimized Compute at the Cost of System Flexibility

NPUs, like the Arm Ethos-N78, are customized hardware-accelerator architectures for accelerating AI and, therefore, the way they accelerate AI varies; however, the most performant ones today typically use hard-wired systolic arrays to accelerate MAC operations.

Figure 7. Diagram of a systolic array for MAC operations. Original image adapted from two images in this blog post.

MAC operations are the most frequent and, therefore, are often the most expensive operations found in DNN inference. Since NPUs that use systolic arrays are designed to accelerate this compute, they are ultra-efficient and high-performance for 90%+ of the compute needed to run DNN inference and can be the logical choice for some AI applications.

Although NPUs are extremely performant for MAC operations, they’re not capable of handling all operators and must be coupled with a more general processing element to run an entire program. Some NPUs will off-load this compute to the host CPU. The more performant ones have dedicated hardware blocks next to the systolic arrays to perform limited operator fusion for things like bias adds and activation functions.

Figure 8. Memory hierarchy for a MAC acceleration systolic array NPU. Image by Author.

In the context of an NPU’s memory hierarchy, operator fusion is possible if the intermediate tensors can continuously flow through the systolic array and on-chip processing elements without needing to communicate back to the host’s shared memory.

The most performant NPUs are designed with operator fusion in mind for a limited set of permutations of operators. Since they are hard-wired and therefore cannot be programmed, they cannot handle memory reshaping or reorganization, or even fuse some activation functions that aren’t handled by their coupled hardware blocks. To perform these operations, the programs must write back out to some shared memory: either on the device or shared memory on the SoC.

Figure 9. Tensor locations for 3x3 Conv., Bias Add, and ReLU activation function sequence of graph operators in an NPU memory hierarchy. Image by Author.

Their lack of programmability and sensitivity to some model architectures, like the increasingly popular transformers, make them inaccessible to many AI applications. Shameless plug: if you’re interested in learning more about this subject, check out our recent blog post on this topic.

For these reasons, custom NPU accelerators do not satisfy the compute generality and memory organization requirements for broadly-applicable operator fusion.


GPNPUs: Maximizing Memory and Compute Utilization

GPNPUs, like the Quadric Chimera QB16 processor, are also purpose-built processor architectures and, therefore, the way they accelerate AI varies; however, most use distributed compute elements in a mesh network for parallelization of compute and more efficient memory management.

GPNPUs strike a middle ground between NPUs and GPUs by being both optimized for AI workloads and generally programmable. Using thoughtful hardware-software co-design, they reduce the amount of hardware resources consumed while remaining fully programmable.

The Chimera family of GPNPUs are fully C++ programmable processor cores containing a networked mesh of Processing Elements (PEs) within the context of a single-issue, in-order processor pipeline. Each PE has its own local register memory (LRM), and collectively the PEs are capable of running scalar, vector, and matrix operations in parallel. They’re mesh-connected and thus capable of sharing intermediate tensor data directly with neighboring PEs without writing to shared L2 memory.

Figure 10. Memory hierarchy for a Quadric GPNPU. Image by Author.

In the context of a GPNPU's memory hierarchy, two successive operations are considered to be fused if the intermediate tensor(s) between those two operations do not leave the distributed LRM, the memory block within each PE.

The difference between the PEs in a systolic array and the array of PEs in the GPNPU is that the GPNPU’s PEs are collectively programmed to operate in parallel using the same instruction stream. Since the dataflow of most DNN operators can be statically defined using access patterns, simple APIs can be written to describe the distribution and flow of the data through the array of PEs. This greatly simplifies the developer experience because complex algorithms can be written without needing to explicitly program every internal DMA transfer. In the instance of Quadric’s Chimera processor, the Chimera Compute Library (CCL) provides those higher-level APIs to abstract the data movement, so the programmer does not need deep knowledge of the two-level DMA within the architecture.

For these reasons, GPNPUs combine thoughtfully designed hardware for AI compute with the programmability and flexibility of a general parallel compute platform like a GPU. GPNPUs satisfy the compute generality and memory organization requirements for operator fusion and are only limited occasionally by the memory availability requirement.

Let’s take a look at a similar practical example of operator fusion performed by a GPNPU and its compiler, the Chimera Graph Compiler (CGC).

Quadric’s Chimera Graph Compiler Does Operator Fusion Better

Here’s a visual diagram of a MobileNetV2 network before and after operator fusion performed by Quadric’s Chimera™ Graph Compiler (CGC) in the style of NVIDIA’s diagram from Figure 2:

Figure 11. A MobileNetV2 module network before and after layer operator fusion performed by Quadric’s Chimera™ Graph Compiler (CGC).

In the example above, operator fusion performed by CGC was able to reduce the number of layers from 17 to 1. This feat is even more impressive when you consider the sequence of layers that were fused.

The fused layer produced by CGC contains four convolutional operators with varying numbers of channels and filter sizes. Since the GPNPU uses a networked mesh of PEs, data can be shared by neighboring PEs without writing back out to shared L2 memory. These data movements are an order of magnitude faster than writing out to L2 memory and allow data to “flow” through the PEs similar to how a systolic array computes MAC operations, but in a programmable way.

Figure 12. Tensor locations for 3x3 Conv., Bias Add, and ReLU activation function sequence of graph operators in an Quadric Chimera GPNPU memory hierarchy. Image by Author.

Since the PEs are collectively programmed with a single instruction stream, the APIs available to represent these operators are remarkably simple, which can result in very short but performant code. Below is the auto-generated C++ code snippet for the MobileNetV2 operators from Figure 11 generated by Quadric’s CGC. Without the comments, which have been added, it is 37 lines of code for all 17 operators:

The total size of the intermediate tensors that were fused in this code snippet and therefore not moved between LRM and L2 memory was 6,623.2 KB or ~6.6 MB. For context, the input data passed into the first layer of this block was 150.5 KB and the intermediate tensor data that was finally moved out to L2 memory was 1.2 MB. By aggressively leveraging operator fusion, CGC was able to reduce the total memory movement overhead of this section of the MobileNetV2 kernel by 83%.
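
For readers who want to sanity-check that figure, one way to arrive at a number close to 83% is to assume the unfused schedule would have moved the input, every intermediate tensor, and the output between LRM and L2, while the fused schedule moves only the input and the final output. That assumption is ours for illustration, not a statement of how CGC accounts for traffic internally:

```python
# Rough reconstruction of the ~83% savings figure under the stated assumption.
input_kb         = 150.5           # input data passed into the first layer
intermediates_kb = 6623.2          # fused intermediate tensors that never left LRM
output_kb        = 1.2 * 1024      # 1.2 MB final tensor written out to L2, in KB

unfused_traffic = input_kb + intermediates_kb + output_kb
fused_traffic   = input_kb + output_kb
print(f"reduction = {1 - fused_traffic / unfused_traffic:.0%}")   # ~83%
```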

For high-performance edge applications that are optimizing performance for TOPS/W, these data movement savings translate directly to power savings. Below in Figure 13 is a table of the relative costs associated with moving a 32b data element from the ALU/MAC engine register file memory on a Chimera GPNPU to each of the other levels of memory:

By performing operator fusion and not moving those 6.6 MB of intermediate tensor data, an application developer can expect a ~185x reduction in power consumption.

If you’re a hardware designer looking to enable high-performance, lower power AI applications for your developers and are interested in the potential performance benefits that Quadric’s Chimera Graph Compiler (CGC) can provide, consider signing up for a Quadric DevStudio account to learn more.

Graph Compilers are just getting started!

The GNU C Compiler – GCC – was first released in 1987.  36 years ago.  Several version streams are still actively being developed and enhanced, with GCC13 being the most advanced, and a GCC v10.5 released in early July this year.

You might think that with 36 years of refinement by thousands of contributors, the ultimate in performance has been achieved – that all that could be discovered has been discovered.  You’d be wrong.  As Figure A (source: openbenchmarking.org) shows, incremental performance is still being squeezed out of GCC despite it being old enough to be a grandparent.  The geometric mean improvement on the set of 70 benchmarks from Phoronix is 0.2% over the preceding version, and a full 1.6% compared to GCC 5.5 from 2017 – which came 30 full years after the initial release.  36 years after the birth of GCC and still there are meaningful gains to be had.

Figure A:  GCC Performance Benchmarks

Gains in GCC are no longer coming in leaps and bounds.  But gains are still coming.  The asymptote of “perfectly generated code” will likely always be just out of reach.

Graph Compiler Infancy

Compared to the mature 36-year-old GCC, the TVM compiler project – an open-source graph compiler project managed by the Apache software foundation – is in its infancy.  First described in a 2017 research paper from University of Washington researchers, the TVM project was adopted by the Apache foundation in 2020 and has gained notable traction in the machine learning inference world. 

Quadric’s Chimera Graph Compiler (CGC) is based in part on TVM and has been heavily extended to optimize for the Chimera architecture.  CGC is less than a year old, with beta deliveries to Quadric customers beginning in late 2022.  CGC is therefore very, very early on the compiler maturity curve, as suggested by Figure B.

Figure B:  Quadric’s Graph Compiler is only just beginning to shine.

Big Leaps in Performance

Unlike fixed-function CNN accelerators deployed in many conventional SoCs today, the Quadric Chimera GPNPU is driven by compiled code.  C++ code is generated from Machine Learning (ML) graphs by the CGC graph compiler and by human programmers, then compiled by the LLVM compiler to create the executable binary running on the Chimera core.

With each new major release of CGC we are seeing big improvements in total performance.  How big?  This chart shows the detailed performance improvements that Quadric delivered from the June 2023 release to the latest August 2023 release:

The most recent CGC update delivered over 50% improvement on one of the common Resnet18 benchmarks, and even a 17% performance improvement on the Vision Transformer (ViT_B).  While most fixed-function NN accelerators cannot run Vision Transformers at all, Quadric not only runs transformers, but is poised to continue to deliver large increases in performance – without hardware changes - as the CGC to LLVM compiler stack continues to mature and improve.  

35 More Years of Improvement?

Will 17-50% boosts in performance happen with every quarterly incremental release?  Perhaps not that extreme, but we are also nowhere near the end of the curve.    New optimizations. New re-orderings. Smarter memory layout of tensors.  More refined prefetching. Multi-tiered fusions.  And more still to come.

Can your fixed-function convolution accelerator be tweaked as knowledge of algorithms grows? No.  Can you easily add new graph operators to a finite state machine hard-wired in silicon?  No.  With a conventional accelerator the functionality is frozen the minute the mask set is created – so you’d better hope that you found those 36 years’ worth of optimizations before paying for that mask set.

Transformers were first introduced by the team at Google Brain in 2017 in their paper, “Attention is All You Need“. Since their introduction, transformers have inspired a flurry of investment and research which has produced some of the most impactful model architectures and AI products to date, including ChatGPT, whose name stands for Chat Generative Pre-trained Transformer.

Transformers are also being employed for vision applications (ViTs). This new class of models was empirically proven to be viable alternatives to more traditional Convolutional Neural Networks (CNNs) in the paper “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale“, published by the team at Google Brain in 2021.

Vision transformers are of a size and scale that are approachable for SoC designers targeting the high-performance, edge AI market. There’s just one problem: vision transformers are not CNNs and many of the assumptions made by the designers of first-generation Neural Processing Unit (NPU) and AI hardware accelerators found in today’s SoCs do not translate well to this new class of models.

What makes Vision Transformers so special?

ViTs garnered a lot of hype because the team at Google Brain proved that they were viable alternatives to CNNs. CNNs, as their name suggests, are built using convolutional filters. In Figure 1 below, we have a 6x6 input matrix on the left, a 3x3 convolutional filter in the middle, and a 4x4 output tensor on the right. The output tensor’s values are calculated by multiplying each 3x3 section of the input matrix on the left with the 3x3 convolutional filter.

This particular convolutional filter, with positive 1 values in its left column, 0 values in its middle column, and negative 1 values in its right column, produces positive output values where vertical edges are found in the original matrix, i.e. in the middle of the example input matrix.

Figure 1: Convolutional filter for detecting vertical edges

The important thing to note from the above example is that convolutional filters, like this vertical edge detection filter, learn local features within an image. CNNs have many of these filters and each filter learns what values will extract the most meaningful information from the input image, but each filter only considers information in a localized window of the input image, e.g. a 3x3 crop of the image.
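
Here is a small NumPy reproduction of that filter in action. The input is a toy 6x6 image with a bright left half and dark right half, so the only structure to find is a single vertical edge; the shapes match the 6x6 input, 3x3 filter, and 4x4 output described above:

```python
# Vertical-edge convolution on a toy 6x6 input (bright left half, dark right half).
import numpy as np

x = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=np.float32)   # 6x6 input matrix

k = np.array([[1, 0, -1],                                     # +1 left column,
              [1, 0, -1],                                     #  0 middle column,
              [1, 0, -1]], dtype=np.float32)                  # -1 right column

out = np.zeros((4, 4), dtype=np.float32)                      # 4x4 "valid" output
for i in range(4):
    for j in range(4):
        out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)

print(out)   # positive values appear only in the columns that span the edge
```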

By stacking layers of these convolutional filters on top of one another, i.e. creating deep neural networks (DNNs), these local filters gradually gain greater attention over more abstract patterns that exist in larger sections of the image because they are consuming as inputs the filter outputs from a collection of adjacent local filters. We can see this progression of learned abstractions – from edges, to textures, to patterns, to object parts, to full objects – by inspecting the intermediate layers of DNNs at different depths as depicted below in Figure 2.

Figure 2: Progression of learned abstractions by visualizing features of a pre-trained DNN at increasingly deep layers. Original image available in this blog post by Google Research team.

Vision transformers are revolutionary because they employ global attention at each layer. Attention, as introduced in the paper “Attention is All You Need“, is a dense mapping of weights between different elements in a sequence. These weights represent the relative importance of each element in the sequence to all other elements in the sequence.

To more intuitively understand attention, take a look at the sentence below:

"I poured water from the bottle into the cup until it was full."

We might infer that the “it“ pronoun is referring to the “cup“ noun in this sentence because of the adjective “full“; however, by changing the word “full” to “empty”, the reference object for “it” changes from “cup” to “bottle“. We change this inference without much thought because of the innate knowledge we have about how the verb “pour” works, i.e., the act of pouring implies that the bottle is losing water and the cup is gaining water. This example demonstrates the relative importance of the word “full” to the context of the word “it” in this sentence.

Notice that this example does not consider groups of three words at a time, but instead considers the entire sentence at once. Conceptually, this is what it means to have global attention, and it can be very useful compared to local attention when inferring context within a problem space.

There’s just one significant problem with the concept of global attention employed by transformers: it’s a dense mapping and dense mappings scale quadratically.

In the above sentence, there are 13 words, i.e. N=13 elements in the sequence. To achieve global attention on this sequence, we need W=N*(N-1) or W=13*12=156 weights to represent the relative importance of each element to each other element (excluding ground truth class labels and patch delineators).

Figure 3: Visualization of an attention map.

This operation is expensive, but feasible for this two-dimensional data. Unfortunately, global attention becomes untenable when we try to adapt to higher dimensional data like RGB images used in computer vision applications, i.e. when N=224x224=50,176 pixels in an image and we need W=50176*(50175)=2,517,580,800 weights for global attention.
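
The quadratic blow-up is easy to check with the same W = N*(N-1) formula used above:

```python
# Dense pairwise attention map: every element attends to every other element.
def global_attention_weights(n):
    return n * (n - 1)

print(global_attention_weights(13))          # 156 weights for the 13-word sentence
print(global_attention_weights(224 * 224))   # 2,517,580,800 weights for a 224x224 image
```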

To solve this problem, ViTs preprocess the three-dimensional image data into a two-dimensional representation. They accomplish this by:

  1. splitting the inputs into patches,
  2. creating a linear projection or two-dimensional “embedding” of each patch, and
  3. linearly combining a constant positional vector with the patch embedding to retain the patch’s position within the original image.

Figure 4: Depiction of image preprocessing required for Vision Transformers (ViT). Original image pulled from this paper.

To put it more simply, in traditional natural language transformers, the input sequences of data are sentences composed of words. Analogously in vision transformers, each image is a “sentence” and each patch embedding is a “word”.
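
A minimal NumPy sketch of those three steps might look like the following; the 224x224 input, 16x16 patches, and 768-wide embedding follow the common ViT-Base configuration, and the random projection and positional vectors are stand-ins for learned parameters:

```python
# Sketch of ViT-style preprocessing: patchify, linearly project, add positions.
import numpy as np

rng        = np.random.default_rng(0)
image      = rng.random((224, 224, 3)).astype(np.float32)   # H x W x C input
patch_size = 16
embed_dim  = 768                                            # assumed embedding width

# 1. Split the image into 16x16 patches and flatten each (196 patches of 768 values).
patches = image.reshape(224 // patch_size, patch_size,
                        224 // patch_size, patch_size, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * 3)

# 2. Linear projection of every flattened patch into the embedding space.
projection = rng.normal(size=(patch_size * patch_size * 3, embed_dim)).astype(np.float32)
embeddings = patches @ projection                           # (196, 768)

# 3. Add a positional vector so each "word" remembers where it came from.
positions = rng.normal(size=embeddings.shape).astype(np.float32)
tokens    = embeddings + positions

print(tokens.shape)   # (196, 768): a "sentence" of 196 patch "words"
```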

At face value, these concepts do not seem to be so revolutionary. Dense or fully-connected layers were implemented as a part of Multilayer Perceptrons (MLP), the earliest proof-of-concept of neural networks. Similarly, the preprocessing needed for image vectorization is, fundamentally, just a form of mathematical embedding learned by a neural network.

Why is it so hard to get ViT to run on my AI accelerator?

To understand why it's so hard to run ViT on most AI accelerators, we need to understand:

  1. the sequence and type of operators that make up a transformer encoder, and
  2. the architectural assumptions made by NPU and AI accelerator designers.

Transformer Encoder vs. CNN

In Figure 4, we looked at an image focusing on the pre-processing needed to adapt three-dimensional image data to work with a transformer architecture. Below, in Figure 5, we zoom out to see what happens after the image data is preprocessed:

Figure 5: Entire Vision Transformer (ViT) architecture. Original image pulled from this paper.

Specifically, we want to look at the Transformer Encoder block on the right side of the image above. These encoder blocks are stacked L times for different sizes of ViT models, just like how ResNet-18 and ResNet-50 models are the same architecture with different numbers of stacked residual blocks.

The key difference to note between the ViT encoder block and most CNN blocks is that it has normalization (represented as Norm layers in Figure 5) and softmax layers in the middle of the network. In almost all CNN architectures, normalization is performed once at the beginning of inference and softmax is performed once at the end of the network.

Normalization and softmax are mathematically simple operations, but in the context of DNNs they operate on large tensors. The challenge they pose to many AI SoCs targeting the edge is that they cannot be accelerated by linear algebra accelerators, so in heterogeneous compute platforms they must be processed by a DSP, GPU, or CPU.

Architectural Assumptions Made by NPU Designers

Heterogeneous compute nodes are computing devices with different architectures optimized for specific tasks, e.g. an AI SoC might include a CPU, a DSP, and an NPU like the design on the left in Figure 6 below:

Figure 6: A heterogeneous AI SoC design with a dedicated NPU, DSP and CPU (left) compared with a homogeneous SoC design with a single, Chimera general-purpose NPU (GPNPU) processor core (right). Original image pulled from Quadric website.

Heterogeneous computing, as a design principle for AI, requires that programs be segmented into their component tasks and each task must target its most optimal compute node for runtime. If programmed or compiled incorrectly to target an inefficient compute node, e.g. the CPU instead of the AI accelerator, the runtime performance of the program can suffer greatly.

Heterogeneous computing platforms, and the NPU cores used within them, have been optimized for performance on most CNNs. Since most CNNs do not have any softmax or normalization operators in the middle of the network, most NPUs have been designed to optimize for only the convolutional compute which is just basic linear algebra.

NPUs have optimized for the multiply-accumulate (MAC) operations that constitute linear algebra math with great success, and heterogeneous computing platforms that use these NPUs have excelled at running CNNs because there is very infrequent, if any, data movement between compute nodes during inference. The entire inference program can be easily pipelined into three stages:

  1. Input data is color converted, reshaped, formatted, and normalized by a GPU or DSP,
  2. formatted data is off-loaded to the NPU for the MAC linear algebra operations like convolutions and fully-connected layers, and
  3. convolutional outputs are off-loaded to the GPU or DSP for softmax activation.

Heterogeneous computing platforms can hide most of the expensive memory-movement operations in these types of programs by pipelining the compute. Latency, or the time it takes to run the first inference, may be long, but throughput, the time it takes to run inference on average, is only limited by the slowest stage in this pipeline.
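
A tiny model of that pipeline makes the latency/throughput distinction concrete. The per-stage times below are invented for illustration only:

```python
# Illustrative three-stage CNN inference pipeline (times in ms, invented numbers).
stages_ms = {
    "pre-process (GPU/DSP)": 2.0,
    "conv + dense layers (NPU)": 5.0,
    "softmax (GPU/DSP)": 1.0,
}

latency_ms    = sum(stages_ms.values())   # first result: 8.0 ms end-to-end
throughput_ms = max(stages_ms.values())   # steady state: one result every 5.0 ms
print(f"latency = {latency_ms} ms, one inference every {throughput_ms} ms")
```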

This runtime strategy, when applied to ViT architectures, creates a pipeline that requires frequent data movement between the different compute nodes:

  1. Input data is color converted, reshaped, formatted, and normalized by a GPU or DSP,
  2. formatted data is off-loaded to the NPU for the linear projection of the flattened image patches into embeddings,
  3. the projected patches are sent back to the GPU or DSP for normalization,
  4. normalized patches are sent back to the NPU for attention mapping,
  5. attention outputs go back to the GPU or DSP for normalization,
  6. normalized tensors go back to the NPU for the MLP layers,
  7. MLP outputs go back to the GPU or DSP for softmax activation, and
  8. steps 3-7 repeat for all L stacked transformer encoder blocks. (The smallest "Base" ViT model has L=12, "Large" has L=24, and "Huge" has L=32.)
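
A rough count based on the steps above (our arithmetic, for illustration only): step 2 plus the five hand-offs in each encoder block add up to roughly 5L + 1 cross-node tensor transfers per inference, on the order of 60 for a 12-layer ViT-Base, versus only two or three hand-offs for the CNN pipeline described earlier.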

This frequent movement of intermediate tensors between compute nodes requires complex scheduling algorithms and adds significant overhead, which substantially reduces a model's runtime efficiency and burns excessive power. In AI SoCs targeting power-sensitive edge applications, those extra memory-movement operations may render the system unviable.

Optimizing AI SoCs for performance on CNNs has bred a lack of curiosity about how to accelerate inference more broadly. Heterogeneous computing platforms reuse existing hardware IP and optimize it for AI tasks with complex software tricks. The only new hardware block invented specifically for AI applications is the NPU, and it was assumed that the only operations it would need to accelerate were the MAC operations that make up the convolutional and dense layers in the middle of CNNs. The pervasiveness of this mindset can be seen in some NPU developers reporting model complexity as a count of MAC operations. If MAC counts alone were indicative of a model's complexity, ViTs would not be so challenging to run on AI SoCs optimized under these assumptions.

Conclusion

The SoC for AI applications that most easily adapts to new model architectures, like vision transformers, will win in the market long-term because:

  1. ViTs are a clever adaptation of the popular transformer architecture that works for Computer Vision applications,
  2. ViTs employ a unique permutation of common ML operators in comparison to the previously most popular computer vision architectures, CNNs, and
  3. ViTs are problematic for heterogeneous AI SoCs with NPUs that were designed to only accelerate MAC operations.

We demonstrated our Quadric Developer Studio at Embedded Vision Summit on Tuesday and Wednesday, May 22 and 23, 2023, at the Santa Clara Convention Center.

We’d like to say the crowds went wild, but when’s the last time you saw wild crowds at an industry trade show? However, we can say that we had wonderful discussions about the industry’s need for an integrated ML plus DSP development system. Until now, most neural processing units (NPUs) used for AI have been hard-coded inflexible hardware, and any programming changes had to be offloaded to a much slower DSP or CPU core.

Our DevStudio features a GUI for constructing complex signal chains that mix classic C++ code with neural-net graph code, for uploading and compiling machine learning ONNX graphs and C++ code, and for simulating entire workloads. By merging neural network graphs and C++ code into one application, only one tool is required for scalar, vector, and matrix computations.

Visitors enjoyed the opportunity to see the DevStudio up close and watch a demo. They had lots of questions because no other tool does this. They understood that it's essentially impossible to get a hardwired NPU to run new code, and that offloading that code to a DSP or CPU is inherently much, much slower.

A big reason purely electric cars reached only 6% of new vehicle sales in Q3 2022 is the fear of running low on battery power combined with the lack of readily available fast-charging infrastructure. This "Range Anxiety" has a close parallel in the semiconductor market.

Machine learning (ML) processing power is being built into new chip designs for nearly every end market. To build in ML capabilities, designers are using mixtures of programmable cores (CPUs, GPUs and DSPs) along with dedicated ML inference accelerators (NPUs) to run today’s latest ML inference models.

However, ML algorithms are rapidly changing as data scientists discover new, more efficient and powerful techniques. The ML benchmarks used today to evaluate new building blocks for SoC designs did not exist three or four years ago. The silicon being designed today to 2023 standards will be deployed in 2025 and 2026, by which time ML algorithms will have changed and improved again. Those changes include fundamentally new ML operators as well as new topologies that rearrange known operators into ever deeper and more complex networks. The creation of new ML operators is a huge concern for SoC designers: what if today's ML accelerator can't support them? The term "Operator Anxiety" describes the concern that the accelerator selected now might not support future operators. This anxiety is as powerful for chip designers as range anxiety is for electric car buyers.

Today’s SoC architects typically pair a fully programmable DSP, GPU or CPU with an NPU accelerator for machine learning. NPU accelerators are typically hardwired for maximum efficiency for the most common ML operators, including activations, convolutions, and pooling. Most NPU accelerators have no ability to add new functions once implemented in silicon. A few offer limited flexibility for the IP vendor to hand write micro-coded command streams, but the actual software developer for the chip doesn’t have this flexibility.

With NPU accelerators, the CPU, DSP or GPU must be used to run the new operators in what is called a Fallback Mode. The challenge is that the CPU, DSP or GPU may be an order of magnitude slower than the NPU, particularly at running these types of operators. This is the source of the “Operator Anxiety.” Overall system performance usually suffers when the function must move out of the accelerator to run on a much slower processor.
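
A quick back-of-the-envelope illustration (our numbers, not a benchmark): suppose 10% of a network's compute falls back to a CPU that runs those operators ten times slower than the NPU would. The NPU portion still takes 0.9 units of time, but the fallback portion grows from 0.1 to 1.0 units, so the whole inference now takes 1.9 units, and nearly half the achievable performance is lost to a small slice of the graph.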

Just as electric car owners suffer long delays when they must charge their cars from a 110V wall socket with an extension cord, chip performance suffers when a new ML operator must be diverted to a slow DSP or CPU.

Electric vehicle Range Anxiety can be solved by deploying a widespread network of fast chargers. Operator Anxiety can be solved by using a processor that can run any operator with the performance efficiency of a dedicated NPU.

Curing Operator Anxiety

Yes, there is a processor that can run any new operator with the performance efficiency of a dedicated NPU. The Chimera™ GPNPU from Quadric, available in 1 TOPS, 4 TOPS, and 16 TOPS variants, is that sought-after anxiety-relieving solution. The Chimera GPNPU delivers the matrix-optimized performance of a dedicated NPU while remaining fully C++ programmable by the software developer, long after the chip is produced.

New ML operators can be written in C++ and will run just as fast as the native operators that ship with the Chimera GPNPU. The Chimera core eliminates Operator Anxiety: there is no need to fall back to the CPU or DSP, even if operators or graph topologies change significantly in the future.
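
As a flavor of what that looks like, the sketch below is a generic C++ implementation of a newer activation operator, GELU (tanh approximation). The function name and signature are ours for illustration and are not Quadric's kernel API; the point is that such an operator is ordinary scalar math applied across a tensor, exactly the kind of C++ a developer can write and deploy after tape-out:

    #include <cmath>
    #include <cstddef>

    // Illustrative only: a "new" activation operator (GELU, tanh approximation)
    // written as ordinary C++. A C++-programmable core can run this natively
    // instead of falling back to a CPU or DSP.
    void gelu(const float* in, float* out, std::size_t n) {
        const float k = 0.7978845608f;  // sqrt(2 / pi)
        for (std::size_t i = 0; i < n; ++i) {
            float x = in[i];
            out[i] = 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
        }
    }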
