At some point in everyone’s teenage years of schooling, we were all taught in a nature or biology class about cycles of population surges followed by inevitable population collapses. Whether the example was an animal, plant, insect or bacterium, some external event triggers a rapid surge in the population of a species, which leads to overpopulation and competition for resources (food, space, shelter). At that point either the surging population exhausts the food supply, or the species itself becomes food for some predator that decimates the local population. The cycle is consistent: external trigger, population surge, resource exhaustion, and finally a collapse back to a population level that the native habitat can sustain over the long term.

The same cycle of Boom and Bust applies to the business enterprises that we humans create in response to economic opportunities. New-home construction, oil production and even trendy boba tea merchants all see waves of new ventures followed by inevitable market shakeouts that thin the herd: weaker players disappear while strong competitors survive and thrive. The semiconductor and semiconductor IP businesses are no different, as observers with long memories can no doubt recall.

IP Industry Shakeouts of the Past

For the past 25 years, every time a new interface standard has emerged or a new design trend has become mainstream, we’ve witnessed a surge of both IP suppliers and chip startups racing to market to grab a piece of the emerging pie. In the world of processor IP, we’ve seen this happen with CPUs, DSPs, GPUs and even esoteric categories such as packet processors.

Consider, for instance, the huge population of CPU and DSP architectures that existed in the year 2000. The image below shows the three public processor IP companies from late 2000 as well as some of the more than 40 other companies either licensing cores or building silicon and systems with competing CPUs or DSPs.

The semiconductor world did not need – and could not support – all 40 architectures. That surge of market participants occurred because it was the beginning of the age of System-on-Chip design. SoCs need processors, and thus many processors were either born or spun out of existing tech companies. Within a few years of that Y2K peak population, the number of CPU and DSP licensing companies collapsed to fewer than 10. Most of the names and logos in the image above ceased to exist by 2005, while several of the survivors (such as Tensilica and Arc) were gobbled up by bigger companies.

Following the analogy of the natural world, it’s fair to ask: what “resource scarcity” did those CPU/DSP companies run up against? Two primary ingredients ran short: investment capital and compiler talent. Investors do not have infinite patience, and thus many enterprises begun during the population-explosion years of 1998-2000 (which coincided with one of the biggest stock valuation bubbles of all time!) could not sustain themselves beyond 2005. Less competitive architectures that were difficult to program suffered as their backers struggled to hire enough compiler talent to build tools advanced enough to compensate for the hardware’s shortcomings. The clock ran out for most of those names.

NPUs Today - We’ve Seen this Movie Before!

25 years removed from Peak CPU/DSP, we are seeing the same movie replayed with different actors in the world of NPU architectures. The external trigger: the meteoric rise of AI and the rush to embed AI horsepower in every device from your smartwatch to the giant datacenter. A tidal wave of opportunity drew a flood of investment dollars from 2018 through 2024.

This second image captures a snapshot of some of the competing NPU accelerator offerings – both as licensable IP and embedded in silicon – from just two years ago, when the wave of NPUs was at its peak.

Already we’ve seen several of these names disappear as each successive rapid change in AI – the emergence of transformers, then LLMs – rendered wave after wave of fixed-function accelerators obsolete. Today, the same scarcities that doomed the herd of CPU startups a generation ago are rapidly thinning the herd of AI acceleration startups. Silicon startups are failing or being acqui-hired for their engineers (the latest: Untether). NPU licensing companies struggle to build sophisticated compilers that map ever more complex AI algorithms onto unnecessarily complex architectures that bolt legacy processors together with matrix accelerators. And most ominously, venture investors are no longer willing to endlessly write big checks for round after round of frenzied investment. Instead, they demand to see market traction – either silicon volume ramps or increasing licensing success.

How Many Will Survive?

The world doesn’t need, and cannot support, 50+ NPUs. Nor does the world want to see only one survivor – no one likes having an 800-lb gorilla dominate a market. Those 50+ contenders will dwindle to 5 to 10 winners. 2025 will be an inflection point in the NPU world as the population of contenders collapses. The winners will be marked by: (1) superior software toolchains (compilers) that can handle thousands or tens of thousands of AI models; (2) tooling that empowers end users to easily program new AI models onto silicon as data scientists continue to innovate at a rapid pace; and (3) business traction that attracts the fresh capital needed to continue to invest and grow.

Five or six years ago, CPU IP providers jumped into the NPU accelerator game to try to keep their CPUs relevant with a message of “use our trusted CPU and offload those pesky, compute-hungry matrix operations to an accelerator engine.” DSP IP providers did the same. As did configurable processor IP vendors. Even GPU IP licensing companies did the same thing.

The playbook for those companies was remarkably similar: (1) tweak the legacy offering’s instruction set a wee bit to boost AI performance slightly, and (2) offer a matrix accelerator to handle the most common one or two dozen graph operators found in the ML benchmarks of the day: ResNet, MobileNet, VGG.

The result was a partitioned AI “subsystem” that looked remarkably similar across the 10 or 12 leading IP company offerings: a legacy core plus a hardwired accelerator.

The fatal flaw in these architectures: they always need to partition the algorithm to run across two engines. As long as the number of “cuts” in the algorithm remained very small, these architectures worked well for a few years. For a ResNet benchmark, for instance, usually only one partition is required, at the very end of the inference, so ResNet can run very efficiently on this legacy architecture. But then along came transformers, with a very different and much wider set of required graph operators, and suddenly the “accelerator” accelerated little, if any, of the new models and overall performance became unusable. NPU accelerator offerings needed to change, and customers with silicon had to eat the cost – a very expensive cost – of a silicon respin.
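To make the partitioning penalty concrete, here is a minimal sketch – in Python, with a purely hypothetical operator list that stands in for no particular vendor’s accelerator – of how the number of graph “cuts” grows once a model uses operators the hardwired block does not cover:

```python
# Minimal sketch of the two-engine partitioning problem. The operator support
# list below is hypothetical and for illustration only.
ACCELERATOR_OPS = {"conv2d", "relu", "maxpool", "add", "matmul"}

def count_segments(graph_ops):
    """Count contiguous execution segments when every operator the accelerator
    cannot handle forces a fallback to the legacy core (each engine switch
    starts a new segment and adds data movement across the bus)."""
    segments, current_engine = 0, None
    for op in graph_ops:
        engine = "accelerator" if op in ACCELERATOR_OPS else "legacy_core"
        if engine != current_engine:
            segments += 1
            current_engine = engine
    return segments

# ResNet-style graph: one unsupported op (softmax) at the very end -> 2 segments.
resnet_like = ["conv2d", "relu", "maxpool", "conv2d", "add", "relu", "matmul", "softmax"]

# Transformer-style graph: layernorm, softmax and gelu keep bouncing execution
# between engines -> many small segments, little real acceleration.
transformer_like = ["matmul", "layernorm", "matmul", "softmax", "matmul",
                    "gelu", "layernorm", "matmul", "softmax", "matmul"]

print(count_segments(resnet_like))       # 2
print(count_segments(transformer_like))  # 9
```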

An Easy First Step That Quickly Became Outdated

Today these IP licensing companies find themselves trapped – trapped by their decisions five years ago to take an “easy” path towards short-term solutions. The reasons all of the legacy IP companies took this same path have as much to do with human nature and corporate politics as they do with technical requirements.

When what was then generally referred to as “machine learning” workloads first burst onto the scene in vision processing tasks less than a decade ago, the legacy processor vendors were confronted with customers asking for flexible solutions (processors) that could run these new, fast-changing algorithms. Caught flat-footed with processors (CPU, DSP, GPU) ill-suited to these new tasks, the quickest short-term technical fix was the external matrix accelerator. The longer-term technical solution – a purpose-built programmable NPU capable of handling all 2000+ graph operators found in the popular training frameworks – would take far longer to deliver and incur much more investment and technical risk.

The Not So Hidden Political Risk

But let us not ignore the human-nature side of the equation faced by these legacy processor IP companies. A legacy processor company choosing a strategy of building a completely new architecture – including new toolchains and compilers – would have to explicitly declare, both internally and externally, that the legacy (CPU, DSP, GPU) IP core was simply not as relevant to the modern world of AI as it once was. The breadwinner of the family – the product that currently pays all the bills – would need to fund the salaries of a new team of compiler engineers working on an architecture that effectively competes against the legacy star IP. (It is a variation on the Innovator’s Dilemma problem.) And customers would have to adjust to new, mixed messages that declare “the previously universally brilliant IP core is actually only good for a subset of things – but you’re not getting a royalty discount.”

All of the legacy companies chose the same path: bolt a matrix accelerator onto the cash-cow processor and declare that the legacy core still reigns supreme. Three years later, staring at the reality of transformers, they declared the first-generation accelerator obsolete and invented a second one that repeated the same shortcomings as the first. And now, with the second-generation hardwired accelerator also rendered obsolete by the continuing evolution of operators (self-attention, multi-headed self-attention, masked self-attention, and more new ones appearing daily), they must either double down yet again and convince internal and external stakeholders that this third fixed-function accelerator will solve all problems forever, or admit that they need to break out of the confining walls they’ve built for themselves and instead build a truly programmable, purpose-built AI processor.

Quadric Did Something Very Different

At Quadric we do a lot of first-time introductory visits with prospective new customers. As a rapidly expanding processor IP licensing company that is starting to get noticed (even winning IP Product of the Year!), such meetings are part of the territory. That means we hear a lot of similar-sounding questions from appropriately skeptical listeners hearing our story for the very first time. The question most often asked in those meetings sounds something like this:

“Chimera GPNPU sounds like the kind of breakthrough I’ve been looking for. But tell me, why is Quadric the only company building a completely new processor architecture for AI inference? It seems such an obvious benefit to tightly integrate the matrix compute with general-purpose compute, instead of welding together two different engines across a bus and then partitioning the algorithms. Why don’t some of the bigger, more established IP vendors do something similar?”

The answer I always give: “They can’t, because they are trapped by their own legacies of success!”

The legacy companies might struggle to decide to try something new. But the architect building a new SoC doesn’t have to wait for a legacy supplier to pivot. A truly programmable, high-performance AI solution already exists today.

The Fully Programmable Chimera Architecture

The Chimera GPNPU from Quadric runs all AI/ML graph structures. The revolutionary Chimera GPNPU processor integrates fully programmable 32-bit ALUs with systolic-array-style matrix engines in a fine-grained architecture: up to 1,024 ALUs in a single core, with only one instruction fetch and one AXI data port. That’s over 32,000 bits of parallel, fully programmable performance.

The flexibility of a processor with the efficiency of a matrix accelerator. Scalable up to 864 TOPS for bleeding-edge applications, Chimera GPNPUs have matched and balanced compute throughput for both MAC and ALU operations, so no matter what type of network you choose to run, it runs fast, at low power and with high parallelism. When a new AI breakthrough comes along in five years, the Chimera processor of today will run it – no hardware changes, just application software code.

The biggest mistake a chip design team can make in evaluating AI acceleration options for a new SoC is to rely entirely upon spreadsheets of performance numbers from the NPU vendor without going through the exercise of porting one or more new machine learning networks themselves using the vendor toolsets.

Why is this a huge red flag? Most NPU vendors tell prospective customers that (1) the vendor has already optimized most of the common reference benchmarks, and (2) the vendor stands ready and willing to port and optimize new networks in the future. It is an alluring idea – but it’s a trap that won’t spring until years later. Unless you know today that an average user can port their own network, you may find yourself trapped in the years to come!

Rely on the NPU Vendor at Your Customers’ Customers’ Expense!

To the chip integrator team without a data science cohort on staff, the thought of porting and tuning a complex AI graph for a novel NPU accelerator is daunting. The idea of doing it for two or three leading vendors during an evaluation is simply a non-starter! Implicit in that idea is the assumption that the toolsets from NPU vendors are arcane, and that the multicore architectures they are selling are difficult to program. That happens to be true for most “accelerators”, where the full algorithm must be ripped apart and mapped onto a cluster of scalar compute, vector compute and matrix compute engines. Truly, it is better to leave that type of brain surgery to the trained doctors!

But what happens after you’ve selected an AI acceleration solution?  After your team builds a complex SoC containing that IP core?  After that SoC wins design sockets in the systems of OEMs?  What happens when those systems are put to the test by buyers or users of the boxes containing your leading-edge SoC? 

Who Ports and Optimizes the New AI Network in 2028?

The AI acceleration IP selection decision made today – in 2024 – will result in end users in 2028 or 2030 needing to port the latest and greatest breakthrough machine learning models to that vintage 2024 IP block.  Relying solely on the IP vendor to do that porting adds unacceptable business risks!

Layers Of Risk

Business Relationship Risk: Will the End User – likely with a proprietary trained model – feel comfortable relying on a chip vendor or an IP vendor they’ve never met to port and optimize their critical AI model? And what about the training dataset? If model porting requires substituting operators to fit the limited set of hardwired operators in the NPU accelerator, the network will need to be retrained and requantized – which means getting the training dataset into the hands of the IP vendor as well.

Legal Process Risk:  How many NDAs will be needed – with layer upon layer of lawyer time and data security procedures – to get the new model into the hands of the IP vendor for creation of new operators or tuning of performance?  

Porting Capacity Risk:  Will the IP vendor have the capacity to port and optimize all the models needed?  Or will the size of the service team inside the IP vendor become the limiting factor in the growth of the business of the system OEM, and hence the sales of the chips bearing the IP?

Financial Risk: IP vendor porting capacity can never be infinite, no matter how eager they are to take on that kind of work – so how will they prioritize porting requests? Will the IP vendor auction off porting priority to the “highest bidder”? In a game of ranking customers by importance or willingness to write service-fee checks, someone gets shoved to the end of the list and goes home crying.

Survivor Risk: The NPU IP segment is undergoing a shakeout. The herd of more than twenty would-be IP vendors will be thinned considerably over the next four years. Can you count on the IP vendor of 2024 to still have a team of dedicated porting engineers in 2030?

In What Other Processor Category Does the IP Vendor Write All the End User Code?

Here’s another way to look at the situation: can you name any other category of processor – CPU, DSP, GPU – where the chip design team expects the IP vendor to “write all the software” on behalf of the eventual end user? Of course not! A CPU vendor delivers world-class compilers along with the CPU core, and years later the mobile phone OS or the viral mobile app software products are written by the downstream users and integrators. Neither the core developer nor the chip developer gets involved in writing a new mobile phone app! The same needs to be true for AI processors – toolchains are needed that empower the average user – not just the super-user core developer – to easily write or port new models to the AI platform.

A Better Way – Leading Tools that Empower End Users to Port & Optimize

Quadric’s fully programmable GPNPU – general purpose NPU – is a C++ programmable processor combining the power-performance characteristics of a systolic-array “accelerator” with the flexibility of a C++ programmable DSP. The toolchain for Quadric’s Chimera GPNPU combines a Graph Compiler with an LLVM C++ compiler – a toolchain that Quadric delivers to the chip design team and that can be passed all the way down the supply chain to the end user. End users six years from now will be able to compile brand-new AI algorithms from the native PyTorch graph into C++ and then into a binary running on the Chimera processor. End users do not need to rely on Quadric to be their data science team or their model porting team. But they do rely on the world-class compiler toolchain from Quadric to empower them to rapidly port their own applications to the Chimera GPNPU core.
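As a rough illustration of where such a flow begins, the short sketch below shows a standard PyTorch-to-ONNX export – the kind of portable graph hand-off that downstream graph compilers commonly ingest. The model, file name and opset version are placeholders chosen for illustration; the actual Chimera toolchain invocation and file formats are not shown here.

```python
# Illustration only: export a trained PyTorch model to a portable ONNX graph,
# the sort of artifact a vendor graph compiler typically consumes as input.
# The model, file name, and opset version are placeholders, not a
# Quadric-specific workflow.
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()
example_input = torch.randn(1, 3, 224, 224)   # dummy NCHW image tensor

torch.onnx.export(model, example_input, "my_model.onnx", opset_version=17)
```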

