KANs Upend the AL/ML Scene, and We’re Ready

July 3, 2024

What’s the biggest challenge for AI/ML? Power consumption. How are we going to meet it? In late April 2024, a novel AI research paper was published by researchers from MIT and CalTech proposing a fundamentally new approach to machine learning networks – the Kolmogorov Arnold Network – or KAN. In the two months since its publication, the AI research field is ablaze with excitement and speculation that KANs might be a breakthrough that dramatically alters the trajectory of AI models for the better – dramatically smaller model sizes delivering similar accuracy at orders of magnitude lower power consumption – both in training and inference.

However, most every semiconductor built for AI/ML can’t efficiently run KANs. But we Quadric can.

First, some background.

The Power Challenge to Train and Infer

Generative AI models for language and image generation have created a huge buzz. Business publications and conferences speculate on the disruptions to economies and the hoped-for benefits to society. But also, the enormous computational costs to training then run ever larger model has policy makers worried. Various forecasts suggest that LLMs alone may consume greater than 10% of the world’s electricity in just a few short years with no end in sight. No end, that is, until the idea of KANs emerged! Early analysis suggests KANs can be 1/10^th to 1/20^th the size of conventional MLP-based models while delivering equal results.

Note that it is not just the datacenter builder grappling with enormous compute and energy requirements for today’s SOTA generative AI models. Device makers seeking to run GenAI in device are also grappling with compute and storage demands that often exceed the price points their products can support. For the business executive trying to decide how to squeeze 32 GB of expensive DDR memory on a low-cost mobile phone in order to be able to run a 20B parameter LLM, the idea of a 1B parameter model that neatly fits into the existing platform with only 4 GB of DDR is a lifesaver!

Built Upon a Different Mathematical Foundation

Today’s AI/ML solutions are built to efficiently execute matrix multiplication. This author is more aspiring comedian than aspiring mathematician, so I won’t attempt to explain the underlying mathematical principles of KANs and how they differ from conventional CNNs and Transformers – and there are already several good, high-level explanations published for the technically literate, such this and this. But the key takeaway for the business-minded decision maker in the semiconductor world is this: KANs are not built upon the matrix-multiplication building block. Instead, executing a KAN inference consists of computing a vast number of univariate functions (think, polynomials such as: [ 8 X³ – 3 X² + 0.5 X ] and then ADDING the results. Very few MATMULs.

Toss Out Your Old NPU and Respin Silicon?

No Matrix multiplication in KANs? What about the large silicon area you just devoted to a fixed-function NPU accelerator in your new SoC design? Most NPUs have hard-wired state machines for executing matrix multiplication – the heart of the convolution operation in current ML models. And those NPUs have hardwired state machines to implement common activation (such as ReLu and GeLu) and pooling functions. None of that matters in the world of KANs where solving polynomials and doing frequent high-precision ADDs is the name of the game.

GPNPUs to the Rescue

The semiconductor exec shocked to see KANs might be imagined to exclaim, “If only there was a general-purpose machine learning processor that had massively-parallel, general-purpose compute that could perform the ALU operations demanded by KANs!”

But there is such a processor! Quadric’s Chimera GPNPU uniquely blends both the matrix-multiplication hardware needed to efficiently run conventional neural networks with a massively parallel array of general-purpose, C++ programmable ALUs capable of running any and all machine learning models. Quadric’s Chimera QB16 processor, for instance, pairs 8192 MACs with a whopping 1024 full 32-bit fixed point ALUs, giving the user a massive 32,768 bits of parallelism that stands ready to run KAN networks - if they live up to the current hype – or whatever the next exciting breakthrough invention turns out to be in 2027. Future proof your next SoC design – choose Quadric!