Cars that can see the road

Does Your NPU Do VLMs? It Better!

March 1, 2025

What’s a VLM? Vision language models are a rapidly emerging class of multimodal AI models that is becoming increasingly important in the automotive world, particularly in the ADAS sector. A VLM is built from the combination of a large language model (LLM) and a vision encoder, giving the LLM the ability to “see.” This new technology is reshaping the industry.

The introduction of DeepSeek in January 2025 captured headlines in newspapers and online, inviting comparisons to the Sputnik moment of 1957. But equally rapid change is happening in many areas hidden from the general public – and VLMs are a prime example.

Why Are VLMs So Important?

The big thing with VLMs is their ability to describe in text form – hence understandable by the human driver – what a camera or radar sensor “sees” in a driver-assistance system. A properly calibrated system pairing VLMs with multiple cameras could therefore be designed to generate verbal warnings to the driver, such as “A pedestrian is about to enter the crosswalk from the left curb 250 meters down the road.” Or scene descriptions from multiple sensors, gathered over several seconds of data analysis, could be fed into other AI models that make the control decisions for an autonomous vehicle.
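To make that concrete, below is a minimal sketch of VLM-style scene description. It assumes the Hugging Face transformers library and the public BLIP captioning checkpoint – both illustrative choices for this example, not part of any particular ADAS stack, which would instead run a quantized model on the in-vehicle accelerator:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a small, publicly available vision-language captioning model
# (an illustrative stand-in for an automotive-grade VLM).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# A single saved dashcam frame stands in for the live camera feed.
image = Image.open("dashcam_frame.jpg").convert("RGB")

# The vision encoder turns pixels into embeddings the language model can
# attend to; generate() then emits a natural-language scene description.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```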

How New Are VLMs?

VLMs are very, very new. A GitHub repo that surveys and tracks automotive VLM evolution (https://github.com/ge25nab/Awesome-VLM-AD-ITS) lists over 50 technical papers from arxiv.org describing VLMs in the autonomous-driving space, with 95% of the listings dating from 2023 and 2024. In our business at Quadric – where we have significant customer traction in the AD/ADAS segment – vision language models were rarely mentioned 18 months ago by customers doing IP benchmark comparisons. By 2024, LLMs in cars became a “thing,” and designers of automotive silicon began asking for LLM performance benchmarks. Now, barely 12 months later, the VLM is emerging as a possible litmus-test benchmark for AI acceleration engines in automotive SoCs.

What’s the Challenge of Designing for VLMs on Top of All the Other Models?

Imagine the head-spinning changes faced by the designers of hardware “accelerators” over the past four years. In 2020–2022, the state-of-the-art benchmarks that everyone tried to implement were CNNs (convolutional neural networks). By 2023 the industry had pivoted to transformers – such as the Swin transformer (shifted-window transformer) – as the must-have solution. Then last year it was newer transformers – such as BEVFormer (bird’s-eye-view transformer) or BEVDepth – plus LLMs such as Llama 2 and Llama 3. And today, pile on VLMs in addition to needing to run all the CNNs and vision transformers and LLMs. So many networks, so many machine learning operators in the graphs! And in some cases, such as the BEV networks, the functions are so new that the frameworks and standards (PyTorch, ONNX) don’t support them, and hence the functions are implemented purely in CUDA code – a simplified sketch of one such operation follows below.
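To see why such operators fall outside the standard export path, consider a simplified, hypothetical sketch of the BEV “lift” step used in BEVDepth-style networks (the module name and tensor shapes here are invented for illustration). The outer product below is plain PyTorch, but the voxel-pooling scatter that follows it in real networks is typically a hand-written CUDA kernel with no standard ONNX operator:

```python
import torch
import torch.nn as nn

class BevLift(nn.Module):
    """Hypothetical, simplified BEV 'lift' step: combine per-pixel depth
    distributions with camera features to form frustum features."""
    def forward(self, cam_feats, depth_probs):
        # cam_feats:   (B, C, H, W) image features from the camera backbone
        # depth_probs: (B, D, H, W) per-pixel distribution over D depth bins
        # Broadcasted outer product -> (B, C, D, H, W) frustum features.
        return cam_feats.unsqueeze(2) * depth_probs.unsqueeze(1)

frustum = BevLift()(torch.randn(1, 64, 32, 88), torch.randn(1, 59, 32, 88))
print(frustum.shape)  # torch.Size([1, 64, 59, 32, 88])
# The next step in real BEV networks -- scattering these frustum features
# into bird's-eye-view voxels -- is the part usually written in custom CUDA.
```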

Run all the networks. Run all the operators. Run C++ code, such as CUDA ops? No hardwired accelerator can do all of that. And running them on a legacy DSP or legacy CPU won’t deliver sufficient performance. Is there an alternative?

Why You Need a Fully Programmable, Universal, High-Performance GPNPU Solution

Yes, there is a solution that has been shown to run all of those innovative AI workloads – and run them at high speed! The revolutionary Chimera GPNPU processor integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines: up to 1,024 ALUs in a single core, with only one instruction fetch and one AXI data port. That’s over 32,000 bits of parallel, fully programmable compute. Scalable up to 864 TOPS for bleeding-edge ADAS applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run – or whatever style of network gets invented in 2026 – they all run fast, at low power, and with high parallelism.
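For the curious, the “32,000 bits” figure follows directly from the ALU count – a quick back-of-envelope check, assuming all 1,024 ALUs operate on 32-bit words in parallel:

```python
# Back-of-envelope check of the parallel-datapath claim above.
alus = 1024          # ALUs in a single Chimera core
bits_per_alu = 32    # each ALU is 32-bit
print(alus * bits_per_alu)  # 32768 -> "over 32,000 bits" in parallel
```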

