Why You Can’t Trust Your NPU Vendor’s Benchmarks

October 15, 2023

Your Spreadsheet Doesn’t Tell the Whole Story.

Thinking of adding an NPU to your next SoC design? Then you’ll probably begin the search by sending prospective vendors a list of questions, typically called an RFI (Request for Information) or you may just send a Vendor Spreadsheet. These spreadsheets ask for information such as leadership team, IP design practices, financial status, production history, and – most importantly – performance information, aka “benchmarks”.

It's easy to get benchmark information on most IP – these benchmarks are well understood. For an analog I/O cell you might collect jitter specs. For a specific 128-pt complex FFT on a DSP there’s very little wiggle room for the vendor to shade the truth. However, it’s a real challenge for benchmarks for machine learning inference IP, which is usually called an NPU or NPU accelerator.

Why is it such a challenge for NPUs? There are two common major gaps in collecting useful “apples to apples” comparison data on NPU IP: [1] not specifically identifying the exact source code repository of a benchmark, and [2] not specifying that the entire benchmark code be run end to end, with any omissions reported in detail.

Specifying Which Model

It’s not as easy as it seems. Handing an Excel spreadsheet to a vendor that asks for “Resnet50 inferences per second” presumes that both parties know exactly what “Resnet50” means. But does that mean the original Resnet50 from the 2015 published paper? Or from one of thousands of other source code repos labeled “Resnet50”?

The original network was trained using FP32 floating point values. But virtually every embedded NPU runs quantized networks (primarily INT8 numerical formats, some with INT4, INT16 or INT32 capabilities as well.) which means the thing being benchmarked cannot be “the original” Resnet50. Who should do the quantization – the vendor or the buyer? What is the starting point framework – a PyTorch model, a TensorFlow Model, TFLite, ONNX – or some other format? Is pruning of layers or channels allowed? Is the vendor allowed to inject sparsity (driving weight values to zero in order to “skip” computation) into the network? Should the quantization be symmetric, or can asymmetric methods be used?

You can imagine the challenge of comparing the benchmark from vendor X if they use one technique but Vendor Y uses another. Are the model optimization techniques chosen reflective of what the actual users will be willing and able to perform three years later on the state-of-the-art models of the day when the chip is in production? Note that all of these questions are about the preparation of the model input that will flow into the NPU vendor’s toolchain. The more degrees of freedom that you allow an NPU vendor to exploit, the more the comparison yardstick changes from vendor to vendor.

Figure 1. So many benchmark choices.

Using the Entire Model

Once you get past the differences in source models there are still more questions. You need to ask [a] does the NPU accelerator run the entire model – all layers - or does the model need to be split with some graph layers running on other processing elements on chip, and [b] do any of the layers need to be changed to conform to the types of operators supported by the NPU?

Most NPUs implement only a subset of the thousands of possible layers and layer variants found in modern neural nets. Even for old benchmark networks like Resnet50 most NPUs cannot perform the final SoftMax layer computations needed and therefore farm that function out to a CPU or a DSP in what NPU vendors typically call “fallback”.

This Fallback limitation magnifies tenfold when newer transformer networks (Vision Transformers, LLMs) that employ dozens of NMS or Softmax layer types are the target. One of the other most common limitations of NPUs is a restricted range of supported convolution layer types. While virtually all NPUs very effectively support 1x1 and 3x3 convolutions, many do not support larger convolutions or convolutions with unusual strides or dilations. If your network has an 11x11 Conv with Stride 5, do you accept that this layer needs to Fallback to the slower CPU, or do you have to engage a Data Scientist to alter the network to use one of the known Conv types that the high-speed NPU can support?

Taking both of these types of changes into consideration, you need to carefully look at the spreadsheet answer from the IP vendor and ask “Does this Inferences/Sec benchmark data include the entire network, all layers? Or is there a workload burden on my CPU that I need to measure as well?”

Power benchmarks also are impacted: the more the NPU offloads back to the CPU, the better the NPU vendor’s power numbers look – but the actual system power numbers look far, far worse once CPU power and system memory/bus traffic power numbers are calculated.

Gathering and analyzing all the possible variations from each NPU vendor can be challenging, making true apples to apples comparisons almost impossible using only a spreadsheet approach.

The Difference with a Quadric Benchmark

Not only is the Quadric general purpose NPU (GPNPU) a radically different product – running the entire NN graph plus pre- and post-processing C++ code – but Quadric’s approach to benchmarks is different also. Quadric pushes the Chimera toolchain out in the open for customers to see at www.Quadric.io. Our DevStudio includes all the source code for all the benchmark nodes shown, including links back to the source repos. Evaluators can run the entire process from start to finish – download the source graph, perform quantization, compile a graph using our Chimera Graph Compiler and LLVM C++ compilers, and run the simulation to recreate the results. No skipping layers. No radical network surgery, pruning or operator changes. No removal of classes. The full original network with no cheating. Can the other vendors say the same?