Let's Visit the Zoo

November 17, 2025

The term “model zoo” first gained prominence in the world of AI and machine learning in the 2016–2017 timeframe.  Originally used to describe open-source public repositories of working AI models – the most prominent of which today is Hugging Face – the term has since been adopted by nearly all vendors of AI chips and licensable neural processors (NPUs) to describe the set of working models that have been ported to each vendor’s specific platform.

Why Vendor-Specific Zoos In the First Place?

Understanding model zoos starts with understanding the origin of the “animals” in the zoos.  Data scientists create new AI models by designing network architectures and training them.  That training process is dominated almost entirely by researchers using Nvidia GPUs as the training hardware and software frameworks such as PyTorch and TensorFlow. Almost 100% of such models are created and published using floating-point data types (FP32).  FP32 delivers the highest possible accuracy and is a native data type for Nvidia GPUs.

While FP32 is great for ease of training, it is not an energy-efficient data representation for real-world AI inference.  NPUs – neural processors – in edge devices and embedded systems typically perform most AI inference computations using 8-bit integer data formats (INT8), and some systems are pushing to even smaller formats (INT4 and others).  The advantages of INT8 versus FP32 are stark: 10X to 20X lower energy consumption per unit of computation, plus dramatic model size reduction from converting model weights from 32 bits per element to only 8 or fewer bits per element.  Take, for example, the idea of running an 8-billion-parameter Large Language Model (LLM) on a phone or laptop: executing the model in its original FP32 format would require 8 billion x 4 bytes = 32 GBytes of DRAM on your phone just to hold the model weights, and would require streaming that entire model from DDR into the phone’s applications processor SoC repeatedly for each token generated.  32 GB of DDR is cost- and power-prohibitive on a phone.  Compressing the model weights to INT8 or INT4 formats makes it practical to execute such a model on your phone.  As a result, every commercially available NPU uses one or more data types that are dramatically smaller than FP32, with the majority of NPUs using INT8 as the basic building block.
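To make that arithmetic concrete, here is a minimal, toolchain-agnostic sketch that tallies the weight-storage requirement of an 8-billion-parameter model at a few common precisions (weight storage only; real deployments also need memory for scale factors, activations, and any KV cache):

```cpp
// Back-of-the-envelope weight-storage calculation for an 8-billion-parameter model.
// Illustrative only; activations, scale factors, and KV-cache memory are ignored.
#include <cstdio>

int main() {
    const double params = 8e9;  // 8 billion weights

    struct Format { const char* name; double bytes_per_weight; };
    const Format formats[] = {
        {"FP32", 4.0},   // 32-bit floating point
        {"INT8", 1.0},   // 8-bit integer
        {"INT4", 0.5},   // 4-bit integer (two weights per byte)
    };

    for (const auto& f : formats) {
        double gigabytes = params * f.bytes_per_weight / 1e9;
        std::printf("%s: %.1f GB of weight storage\n", f.name, gigabytes);
    }
    return 0;
}
```

Running this prints 32 GB for FP32, 8 GB for INT8, and 4 GB for INT4, which is the difference between an impossible phone bill of materials and a practical one.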

Sadly, converting a model to these smaller formats – a process known as quantization – is neither pushbutton nor uniform.  Blindly converting every layer and every model weight from high-precision floating point to the much coarser INT8 format often reduces model accuracy unacceptably.  Some operators need to be preserved in higher-precision data formats, and that is where the conversion process gets tricky.  Each NPU and each SoC architecture offers different compute resources with different data types – INT4, INT8, INT16, FP16, FX32, FP32, and others. Picking which model layers to keep at higher precision, and then choosing an available data format on the target AI chip, takes engineering effort and powerful analytical tooling.
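To illustrate what quantization actually does to the numbers, here is a minimal, self-contained sketch of symmetric per-tensor INT8 quantization, one of the simplest strategies a porting flow might apply to a single layer's weights. Note how one large-magnitude weight stretches the scale and erodes the resolution left for the small weights; real tools mitigate this with per-channel scales, calibration, and mixed-precision layer selection.

```cpp
// Simplified symmetric per-tensor INT8 quantization of one layer's weights.
// Real porting tools add per-channel scales, calibration data, and mixed-precision
// layer selection; this sketch only shows the core conversion and its rounding error.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> weights = {0.81f, -0.43f, 0.07f, -1.92f, 1.25f, -0.002f};

    // The scale maps the largest-magnitude weight onto the INT8 range [-127, 127].
    float max_abs = 0.0f;
    for (float w : weights) max_abs = std::max(max_abs, std::fabs(w));
    float scale = max_abs / 127.0f;

    float worst_error = 0.0f;
    for (float w : weights) {
        int8_t q = static_cast<int8_t>(std::lround(std::clamp(w / scale, -127.0f, 127.0f)));
        float dequantized = q * scale;  // the value the NPU effectively computes with
        worst_error = std::max(worst_error, std::fabs(w - dequantized));
        std::printf("%+.3f -> %4d -> %+.3f\n", w, q, dequantized);
    }
    std::printf("scale = %.5f, worst-case rounding error = %.5f\n", scale, worst_error);
    return 0;
}
```

In this toy tensor the outlier at -1.92 forces a coarse scale, so the -0.002 weight quantizes all the way to zero; multiplied across millions of weights, effects like this are why some layers must be kept in higher precision.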

The net result: because this “porting” of published models to a specific platform takes effort, AI chip and NPU vendors publish collections of pre-ported models that are known to work on their platforms with acceptable accuracy loss compared to the original model.

Zoo Population Size – a Proxy for Ease of Use?

The process of porting a new AI model to a given NPU accelerator is not uniformly speedy or straightforward.  The effort involves both a data science component (choosing the optimal quantization strategies and data precisions) and an embedded-software performance-tuning component. That performance tuning is straightforward if the target platform has an all-in-one, general-purpose NPU such as the Quadric Chimera GPNPU processor.  But if the target silicon employs a heterogeneous architecture combining a CPU, a DSP, and a hard-wired NPU accelerator, then many AI model graph layers could be executed on two or three different compute engines, and the software engineer has to select which layers run on which resources, and tune the use of memory accordingly, to achieve acceptable inferences-per-second performance.  A toy sketch of that assignment problem follows.
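The sketch below is purely hypothetical: the engine names, layer list, and cycle costs are invented for illustration, and a real toolchain must also model the inter-engine data transfers that frequently dominate latency. It simply shows the kind of per-layer assignment decision a heterogeneous platform forces on the software engineer.

```cpp
// Hypothetical illustration of the partitioning problem on a heterogeneous SoC:
// each graph layer is assigned to whichever engine supports it and runs it fastest.
// Engine names, layers, and timings are invented; real tools must also account for
// the cost of moving tensors between engines, which often dominates.
#include <cstdio>
#include <limits>
#include <string>
#include <vector>

struct Layer {
    std::string name;
    // Estimated execution time (ms) on each engine; negative means "unsupported".
    double cpu_ms, dsp_ms, npu_ms;
};

int main() {
    std::vector<Layer> graph = {
        {"Conv2D_1",   12.0,  3.0,  0.4},
        {"Softmax",     0.6,  0.2, -1.0},   // not supported by the fixed-function NPU
        {"CustomOp_A",  2.5, -1.0, -1.0},   // only the CPU can run this proprietary layer
        {"Conv2D_2",   15.0,  4.0,  0.5},
    };

    double total_ms = 0.0;
    for (const auto& layer : graph) {
        struct { const char* engine; double ms; } options[] = {
            {"CPU", layer.cpu_ms}, {"DSP", layer.dsp_ms}, {"NPU", layer.npu_ms}};
        const char* best_engine = "none";
        double best_ms = std::numeric_limits<double>::max();
        for (const auto& opt : options) {
            if (opt.ms >= 0.0 && opt.ms < best_ms) { best_ms = opt.ms; best_engine = opt.engine; }
        }
        total_ms += best_ms;
        std::printf("%-10s -> %s (%.1f ms)\n", layer.name.c_str(), best_engine, best_ms);
    }
    std::printf("Estimated latency: %.1f ms (ignoring inter-engine transfers)\n", total_ms);
    return 0;
}
```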

It is therefore reasonable to conclude that the size of the available model zoo – i.e. the number of animals in the zoo – is a reflection of two main factors: [1] the ease of use of the vendor-specific porting tools; and [2] the relative popularity of the platform – i.e. how many engineers are working to port and publish new models in that zoo.

Mine is Bigger than Yours!

Let us take a look at several public zoos / code repositories to judge this Productivity x Popularity metric.  The following table shows a snapshot of public model repos as of September 2025, along with the approximate funding level (data from Crunchbase) or employee headcount of each vendor.

NPU Vendor | Funding Raised / Employee Headcount | NPU Models in the Zoo (September 2025)
---------- | ----------------------------------- | --------------------------------------
Well-known RISC-V NPU proponent | Funding > $1B; headcount > 1,000 | 49 models
Market-leading CPU & NPU licensor | Market capitalization > $140B; headcount > 8,000 | 43 models
Publicly listed Asian GPU/NPU licensor and ASIC provider | Market capitalization > $15B | 72 models
Quadric | Funding (through mid-2025): $43M; headcount < 100 | 352 model examples (110 models added in 2025)

Our attorneys recommended that we not call out those competitors by name and not include clickable hyperlinks to their underpopulated zoos.  But we think you can do your own homework on that topic.

Simply Better Architecture and Tools!

The comparison is shocking.  Tiny – but rapidly growing – Quadric is running laps around the Other Guys in the race to populate the model zoo.  It’s certainly not because we spend more money or have more engineers porting models.  The reason for the disparity is simply our vastly superior underlying architecture and accompanying toolchain.  Quadric’s Chimera GPNPU natively executes every layer of every AI model without needing to partition the model and fall back to a legacy CPU or DSP.   Quadric’s Chimera Graph Compiler (CGC) automatically lowers (compiles) AI models from their native ONNX graph format into a C++ representation of the graph, which is then compiled by the LLVM C++ compiler to run end-to-end on the Chimera processor.  And, most critically, when a new AI model cannot be compiled automatically – usually because it contains a proprietary graph structure that its creator wants to keep private – users can rapidly write the missing functionality themselves in common, everyday C++ and complete the model port without any intervention from Quadric.
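To give a flavor of what “writing the missing functionality in everyday C++” can look like, here is a generic, self-contained sketch of a hard-sigmoid activation implemented as a plain function over a flat buffer. The function name and signature are invented for illustration only and are not Quadric’s actual SDK interface.

```cpp
// Generic illustration of implementing a "missing" graph operator in plain C++.
// The function name and buffer layout are invented for this sketch; they do not
// represent Quadric's actual SDK interface.
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hard-sigmoid: y = clamp(x / 6 + 0.5, 0, 1). Some exported graphs express this as a
// custom or composite node that a graph compiler may not recognize out of the box.
void hard_sigmoid(const float* input, float* output, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        output[i] = std::clamp(input[i] / 6.0f + 0.5f, 0.0f, 1.0f);
    }
}

int main() {
    std::vector<float> in = {-4.0f, -1.5f, 0.0f, 2.0f, 7.0f};
    std::vector<float> out(in.size());
    hard_sigmoid(in.data(), out.data(), in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::printf("hard_sigmoid(%+.1f) = %.3f\n", in[i], out[i]);
    }
    return 0;
}
```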

To Solve the AI Compute Challenge, Choose a Purpose-Built NPU

Do you want to see the Quadric model zoo for yourself?  Check out the online Quadric DevStudio at studio.quadric.io to see the full list of classic models, transformers, LLMs, and DSP pre- and post-processing code.

© Copyright 2025  Quadric
All Rights Reserved