How Big is Your Model Zoo?

November 7, 2025

The term “model zoo” first gained prominence in the world of Artificial Intelligence/Machine Learning (AI/ML) beginning in the 2016-2017 timeframe.  Originally used to describe open-source public repositories of working AI models, the most prominent of which today is Hugging Face, the term has since been adopted by nearly all vendors of AI chips and licensable Neural Processing Units (NPUs) to describe the set of working models that have been ported to those vendors’ specific platforms.

Why Vendor-Specific Zoos in the First Place?

Understanding model zoos starts with understanding where the “animals” in the zoos come from.  Data scientists create new AI models by inventing model architectures and training them, a process dominated almost entirely by researchers using NVIDIA Graphics Processing Units (GPUs) as the training hardware together with software frameworks such as PyTorch and TensorFlow.  Almost 100% of such models are created and published using 32-bit floating point (FP32) data types.  FP32 delivers the highest possible accuracy and is a native data type for NVIDIA GPUs.

While FP32 is great for ease of training, it is not an energy-efficient data representation for real-world AI inference.  NPUs – neural processors – in edge devices and embedded systems typically perform most AI inference computations using 8-bit integer (INT8) data formats, and some systems are pushing to even smaller formats (INT4 and others).  The advantages of INT8 versus FP32 are stark: 10X to 20X lower energy consumption per unit of computation, plus a dramatic reduction in model size from converting model weights from 32 bits per element to only 8 or fewer bits per element.  Take, for example, the idea of running an 8-billion-parameter Large Language Model (LLM) on a phone or laptop: executing the model in the original FP32 format would require 4 x 8 = 32 Gigabytes (GB) of Dynamic Random-Access Memory (DRAM) on your phone just to hold the model weights, and would mean streaming that entire model from DRAM into the phone’s application processor System on Chip (SoC) for every token generated.  32 GB of DRAM is cost- and power-prohibitive on a phone.  Compressing the model weights to INT8 or INT4 formats makes it practical to execute such a model on your phone.  As a result, every commercially available NPU uses one or more data types that are dramatically smaller than FP32, with the majority of NPUs using INT8 as the basic building block.
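The weight-storage arithmetic above is easy to reproduce.  The minimal sketch below simply multiplies the parameter count from the text by bytes-per-weight for each precision; the structure and names are ours, used only to illustrate the numbers.

```cpp
// Sketch: weight-storage footprint of an 8-billion-parameter model at
// different precisions. The figures match the arithmetic in the text;
// the struct and variable names are illustrative only.
#include <cstdio>

int main() {
    const double params = 8e9;  // 8-billion-parameter LLM

    struct Format { const char* name; double bytes_per_weight; };
    const Format formats[] = {
        {"FP32", 4.0},   // 32 bits per weight
        {"INT8", 1.0},   // 8 bits per weight
        {"INT4", 0.5},   // 4 bits per weight (two weights per byte)
    };

    for (const Format& f : formats) {
        double gigabytes = params * f.bytes_per_weight / 1e9;
        std::printf("%s: %.1f GB of DRAM just for the weights\n", f.name, gigabytes);
    }
    return 0;
}
// Prints roughly: FP32: 32.0 GB, INT8: 8.0 GB, INT4: 4.0 GB
```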

Unfortunately, converting a model – a process known as quantization – is neither push-button nor uniform.  Blindly converting every layer and every model weight from high-precision floating point to the coarser INT8 format often reduces model accuracy unacceptably.  Some operators need to be preserved in higher-precision data formats.  That’s where the conversion process gets tricky.  Each NPU and each SoC architecture has different compute resources with different data types – INT4, INT8, INT16, FP16, FX32, FP32, and others.  Picking and choosing which model layers to keep at higher precision, and then choosing an available data format in the target AI chip, takes engineering effort and powerful analytical tooling.
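To make “quantization” concrete, here is a minimal per-tensor affine INT8 scheme – the textbook scale/zero-point approach, not Quadric’s or any other vendor’s actual tooling.  Real porting flows calibrate the value range layer by layer and keep accuracy-sensitive layers in higher precision.

```cpp
// Textbook per-tensor affine INT8 quantization (illustrative sketch only;
// not any specific vendor's toolchain). Production flows calibrate the
// min/max range per layer and keep sensitive layers in FP16/FP32.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct QuantParams {
    float scale;        // real-valued size of one integer step
    int32_t zero_point; // integer value that represents real 0.0
};

QuantParams chooseParams(float min_val, float max_val) {
    // Extend the range to include zero so 0.0 is exactly representable.
    min_val = std::min(min_val, 0.0f);
    max_val = std::max(max_val, 0.0f);
    QuantParams qp;
    qp.scale = (max_val - min_val) / 255.0f;   // map the range onto [-128, 127]
    if (qp.scale == 0.0f) qp.scale = 1.0f;     // degenerate all-zero tensor
    qp.zero_point = static_cast<int32_t>(std::lround(-128.0 - min_val / qp.scale));
    return qp;
}

int8_t quantize(float x, const QuantParams& qp) {
    int q = static_cast<int>(std::lround(x / qp.scale)) + qp.zero_point;
    return static_cast<int8_t>(std::clamp(q, -128, 127));
}

float dequantize(int8_t q, const QuantParams& qp) {
    return qp.scale * static_cast<float>(q - qp.zero_point);
}
```

The round-trip error of quantize/dequantize is what the accuracy analysis tooling measures layer by layer when deciding which operators must stay in a wider format.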

The net result is that because this “porting” of published models to a specific platform takes effort, the various AI chip and NPU vendors publish collections of pre-ported models that are known to work on their platforms with acceptable accuracy loss compared to the original model.

Zoo Population Size – a Proxy for Ease of Use?

The process of porting a new AI model to a given NPU accelerator is not uniformly speedy or straightforward.  The effort involves both a data science component (choosing the optimal quantization strategies and data precisions) and an embedded software performance-tuning component.  That performance tuning is simple if the target platform has an all-in-one, general-purpose NPU (GPNPU) such as the Quadric Chimera processor.  But if the target silicon employs a heterogeneous architecture combining a Central Processing Unit (CPU), a Digital Signal Processor (DSP), and a hard-wired NPU accelerator, then many AI model graph layers could be executed on two or three different compute engines, and the software engineer has to select which layers run on which resource and tune the use of memory accordingly to achieve acceptable inferences-per-second performance, as the sketch below illustrates.
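The sketch below is a hypothetical illustration of that partitioning burden on a heterogeneous CPU + DSP + fixed-function NPU SoC; the engine names and rules are invented for illustration.  On a GPNPU that runs every layer natively, this mapping step disappears entirely.

```cpp
// Hypothetical layer-to-engine mapping on a heterogeneous CPU + DSP + NPU SoC.
// Engine names and selection rules are invented for illustration only.
#include <string>
#include <vector>

enum class Engine { NPU, DSP, CPU };

struct Layer {
    std::string op;      // e.g. "Conv", "Softmax", "CustomOp"
    bool npu_supported;  // does the hard-wired accelerator implement it?
    bool dsp_supported;  // can the vector DSP run it acceptably fast?
};

// Every unsupported layer falls back to a slower engine and adds a memory
// hand-off between engines that must also be scheduled and tuned.
Engine assignEngine(const Layer& layer) {
    if (layer.npu_supported) return Engine::NPU;
    if (layer.dsp_supported) return Engine::DSP;
    return Engine::CPU;  // last-resort fallback
}

std::vector<Engine> partitionGraph(const std::vector<Layer>& graph) {
    std::vector<Engine> plan;
    plan.reserve(graph.size());
    for (const Layer& layer : graph) plan.push_back(assignEngine(layer));
    return plan;
}
```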

It is therefore reasonable to conclude that the size of the available model zoo – i.e., the number of animals in the zoo – is a reflection of two main factors: [1] the ease of use of the vendor-specific porting tools; and [2] the relative popularity of the platform – i.e., how many engineers are working to port and publish new models in that zoo.

Mine is Bigger than Yours!

Let us take a look at several public zoos and code repositories to judge the [Productivity x Popularity] metric.  The following table shows a snapshot of data taken from public model repositories as of September 2025.  It also shows the approximate funding level (data from Crunchbase) or employee headcount for each vendor.

| NPU Vendor | Funding Raised / Employee Headcount | NPU Models in the Zoo – September 2025 |
| --- | --- | --- |
| Well-known RISC-V NPU proponent | Funding > $1B; Headcount > 1000 | 49 models |
| Market-leading CPU & NPU licensor | Market capitalization > $140B; Headcount > 8000 | 43 models |
| Publicly listed GPU/NPU licensor and Application-Specific Integrated Circuit (ASIC) provider | Market capitalization > $15B | 72 models |
| Quadric | Funding (through mid-2025): $43M; Headcount < 100 | 352 model examples (110 models added in 2025) |

Our attorneys recommended that we not call out those competitors by name and not include clickable hyperlinks to their underpopulated zoos.  But we think you can do your own homework on that topic.

Simply Better Architecture and Tools!

The comparison is shocking.  Tiny – but rapidly growing – Quadric is running laps around the Other Guys in the race to populate the model zoo.  It’s certainly not because we spend more money and have more engineers porting models.  The reason for the disparity is simply our vastly superior underlying architecture and accompanying toolchain.  Quadric Chimera processors natively execute every layer of every AI model without needing to partition the model and fall back to a legacy CPU or DSP.  The Quadric Chimera Graph Compiler (CGC) automatically lowers (compiles) AI models from their native Open Neural Network Exchange (ONNX) graph format into a C++ representation of the graph, which is then compiled by the LLVM C++ compiler to run end-to-end on the Chimera processor.  And most critically, when a new AI model cannot be compiled automatically – usually because of a proprietary graph structure that its creators want to keep private – users can rapidly implement the missing functionality themselves using standard C++ to complete the model port without any intervention from Quadric.
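As a rough illustration of what “implement the missing functionality in standard C++” can look like, here is a generic element-wise custom operator.  The function signature, flat-buffer tensor layout, and the “knee” parameter are assumptions for this sketch only and are not the actual Chimera SDK operator interface.

```cpp
// Illustrative only: a proprietary element-wise operator written in plain C++.
// The signature and flat-buffer layout are assumptions for this sketch and do
// not represent the real Chimera SDK interfaces.
#include <algorithm>
#include <cstddef>

// Hypothetical proprietary activation: a hard-swish variant with a tunable
// "knee" parameter (knee > 0) that the model's creators keep private.
void customHardSwish(const float* input, float* output, std::size_t count,
                     float knee) {
    for (std::size_t i = 0; i < count; ++i) {
        float x = input[i];
        float gate = std::clamp(x + knee, 0.0f, 2.0f * knee) / (2.0f * knee);
        output[i] = x * gate;  // element-wise gating applied across the tensor
    }
}
```

Because the whole graph lowers to C++ anyway, a hand-written operator like this slots into the compiled model the same way the automatically generated layers do.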

To Solve the AI Compute Challenge, Choose a Purpose-Built NPU

Do you want to see the Quadric model zoo for yourself?  Check out the online Quadric DevStudio at studio.quadric.io to see the full list of classic models, transformers, LLMs, and DSP pre- and post-processing code.

