Why You Should Create Your Own NPU Benchmarks

January 2, 2024

It Will Make for a Much Better NPU Vendor Evaluation.

We’ve written before about the ways benchmarks for NPUs can be manipulated to the point where you just can’t trust them. There are two common major gaps in collecting useful comparison data on NPU IP: [1] not specifically identifying the exact source code repository of a benchmark, and [2] not specifying that the entire benchmark code be run end-to-end, with any omissions reported in detail. Our blog explains these gaps.

However, there is a straightforward, low-investment method to short-circuit all the vendor shenanigans and get a solid apples-to-apples result: Build Your Own Benchmarks. BYOB!

This might sound like a daunting task, but it isn’t. At the very beginning of your evaluation, it’s important to winnow the field of possible NPU vendors. This winnowing is essential now that a dozen or more IP companies are offering NPU “solutions.” At this stage, you don’t need to focus on absolute inference accuracy as much as you need to judge five key metrics: [1] performance, [2] memory bandwidth, [3] breadth of NN model support, [4] breadth of NN operator support, and [5] speed and ease of porting new networks using the vendors’ toolsets. Lovingly crafted quantization can come later.

The Problem with Most ML Reference Models

All machine learning models begin life in a model training framework using floating point data representation. The vast majority of published reference models are found in their native FP32 format – because it’s easy for the data science teams that invent the latest, greatest model to simply publish the original source. But virtually all device-level inference happens in quantized INT8 or mixed INT8/INT16 formats on energy-optimized NPUs, GPNPUs, CPUs and GPUs that perform model inference at 5X to 10X lower energy than a comparable FP32 version of the same network.

Quantizing FP32 models to INT8 in a manner that preserves inference accuracy can be a daunting prospect for teams without spare data scientists looking for something to keep them busy. But do you really need a lovingly crafted, high-accuracy model to judge the merits of 10 vendor products clamoring for your attention? In short, the answer is: NO.

First, recognize that the benchmarks of today (late 2023, early 2024) are not the same workloads that your SoC will run in 2026 or 2027 when it hits the market – so any labor spent fine-tuning model weights today is likely wasted effort. Save that effort for perhaps a final step in the IP vendor selection, not early in the process.

Second, note that the cycle counts and memory traffic data that will form the basis for your PPA (power, performance, area) conclusions will be exactly the same whether the model weights were lovingly hand-crafted or simply auto-converted. What matters in the early stages of an evaluation is: Can the NPU run my chosen network? Can it run all the needed operators? What is the performance?
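To see why the weight values don't move the numbers, here is a minimal sketch that counts multiply-accumulates (MACs) and weight storage for a single convolution layer. The `Conv2d` helper is hypothetical, not part of any vendor toolchain, and it assumes dense execution with no sparsity or weight-compression tricks. Only the layer's shape appears in the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conv2d:
    """Shape of one convolution layer (hypothetical helper, bias ignored)."""
    in_ch: int
    out_ch: int
    kernel: int
    out_h: int
    out_w: int

    def macs(self) -> int:
        # One multiply-accumulate per kernel tap, per input channel,
        # per output channel, per output pixel.
        taps = self.kernel * self.kernel
        return taps * self.in_ch * self.out_ch * self.out_h * self.out_w

    def weight_bytes(self, bytes_per_weight: int = 1) -> int:
        # INT8 weights occupy one byte each; pass 2 for INT16.
        return self.kernel * self.kernel * self.in_ch * self.out_ch * bytes_per_weight


# ResNet50-style stem layer: 7x7 kernel, 3 -> 64 channels, 112x112 output.
stem = Conv2d(in_ch=3, out_ch=64, kernel=7, out_h=112, out_w=112)
print(stem.macs())          # 118013952
print(stem.weight_bytes())  # 9408
```

Nothing in either function reads a weight value, so a carefully tuned INT8 model and a quick auto-converted one present the same workload to the hardware under these assumptions.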

Making Your Own Model

Many IP vendors have laboriously hand-tuned, hand-pruned and twisted a reference benchmark beyond recognition in order to win the benchmark game. You can short-circuit that gamesmanship by preparing your own benchmark model. Many training toolsets have existing flows that can convert floating point models to quantized integer versions (e.g., TensorFlow QAT or PyTorch Quantization). Or you can work from an interchange format such as ONNX and use the quantization utilities in ONNX Runtime to do the job. Don’t worry about how well the quantization was performed – just be sure that all layers have been converted to an integer format.
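For a sense of how little is involved in a "good enough" conversion, here is a pure-Python sketch of symmetric per-tensor INT8 quantization applied to a toy weight list. It is a stand-in for illustration only, not what the TensorFlow, PyTorch or ONNX Runtime flows actually do; real flows also handle activations, calibration data and per-channel scales. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the integer codes."""
    return [v * scale for v in q]


fp32_weights = [0.50, -1.27, 0.03, 1.00]
q, scale = quantize_int8(fp32_weights)
print(q)  # [50, -127, 3, 100]

# The check that matters for an early benchmark: every value is an integer.
print(all(isinstance(v, int) for v in q))  # True
```

The round trip through `dequantize` loses a little precision, which is exactly the accuracy question you can defer until the final stage of vendor selection.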

Even if all you do is take a handful of the common networks with some variety – for example: ResNet50, an LLM, VGG16, Vision Transformer, and a Pose network – you can know that the yardstick you give to the IP vendor is the same for all vendors. Ask them to run the model through their tools and give you both the performance results and the final form of the network they ran, so you can double-check that all of the network ran on the NPU – and that none of the operators had to fall back onto a big, power-guzzling CPU.
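The double-check can be as simple as scanning the vendor's operator placement list for anything that did not land on the NPU. The sketch below assumes a made-up report format (a list of entries with `name`, `type` and `device` fields); every vendor's real output will look different, so treat the field names as placeholders.

```python
def cpu_fallback_ops(placement):
    """Return the names of operators that did not run on the NPU."""
    return [op["name"] for op in placement if op["device"].upper() != "NPU"]


# Hypothetical excerpt of a vendor report: one entry per operator
# in the final form of the network they ran.
report = [
    {"name": "conv_0", "type": "Conv", "device": "NPU"},
    {"name": "relu_0", "type": "Relu", "device": "NPU"},
    {"name": "softmax_0", "type": "Softmax", "device": "CPU"},
]

fallbacks = cpu_fallback_ops(report)
print(fallbacks)  # ['softmax_0']
print(f"{len(report) - len(fallbacks)} of {len(report)} operators ran on the NPU")
```

Any non-empty result is a prompt to ask the vendor what the fallback costs in cycles, bandwidth and power.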

Better yet – ask the vendor if their tools are robust enough to let you try to run the models yourself!

Getting the Information That Really Counts

BYOB will allow you to quickly ascertain the maturity of a vendor toolchain, the breadth of operator coverage, and the PPA metrics you need to pare that list of candidates down to a Final Two that deserve a full, deep inspection before you make a vendor selection.

If you’d like to try running your own benchmarks on Quadric’s toolchain, we welcome you to register for an account in our online Developer’s Studio and start the process! Visit us at www.quadric.io