Is the Transformer Era Over?

January 30, 2024

Wait! Didn’t That Era Just Begin?

Transformer networks have existed since the seminal "Attention Is All You Need" paper, published by Google researchers in June 2017. And while transformers quickly gained traction within the ML research community – notably delivering superlative results in vision applications (the ViT paper) – they were definitely not a topic of trendy conversation around the family holiday dinner table. Until late 2022, that is.

On November 30, 2022, OpenAI released ChatGPT, and within weeks millions of users were experimenting with it. Soon thereafter, the popular business press and even TV newscasts were marveling at the results while publishing overwrought doomsday predictions of societal upheaval and chaos.

ChatGPT thrust transformer networks to the forefront of the machine learning / AI world.

During the first few months of Transformer Fever, the large language models (LLMs) built on transformer techniques were the exclusive province of cloud compute centers, because model sizes were far too large to contemplate running on a mobile phone, a wearable device, or an embedded appliance. But by midsummer 2023 the narrative had shifted, and in the second half of 2023 every silicon vendor and NPU IP vendor was talking about chip and IP core changes that would support LLMs on future devices.

Change is Constant

Just six weeks after the release of the Llama 2 LLM, in September 2023, Quadric was the first to demonstrate Llama 2 running on an existing IP core. But we were quick to highlight that Llama 2 wasn't the end of ML model evolution, and that hardware solutions needed to be fully programmable to keep pace with the massive rate of change in data science. No sooner had that news hit the street – and our target customers in the semiconductor business started ringing our phone off the hook – than the same media sources that hyped transformers and LLMs in early 2023 began predicting the end of the transformer lifecycle!

Too Early to Predict the End of Transformers?

In September 2023, Forbes was first out of the gate, predicting that other ML network topologies would supplant attention-based transformers. Keep in mind, this is a general business publication aimed at the investor class, not a deep-tech journal. More niche-focused, ML-centric publications such as Towards Data Science have piled on as well, with December 2023 headlines such as "A Requiem For the Transformer."

Our advice: just as the doomsday predictions about transformers were too hyperbolic, so too are the predictions of their imminent demise. But make no mistake, the bright minds of data science are hard at work today inventing the Next New Thing – one that will surely capture the world's attention in 2024, 2025, or 2026, and might one day indeed supplant today's state of the art.

What Should Silicon Designers Do Today?

Just as today's ViT and LLM models didn't send last decade's ResNet models to the junk heap, tomorrow's new hero model won't eliminate the LLMs that are consuming hundreds of billions of dollars of venture capital right now. SoC architects need compute solutions that can run last year's benchmarks (ResNet, etc.), this year's hot flavor of the month (LLMs), and the unknown future champion of 2026 that hasn't been invented yet.

Should today's chip designer choose a legacy hardwired NPU optimized for convolutions? Terrible idea – you already know that first-generation accelerator is broken. Should they instead build a second-generation hardwired accelerator evolved to support both CNNs and transformers? No way – it's still not fully programmable, and the smart architect won't fall for that trap a second time.

Quadric offers a better way – the Chimera GPNPU. "GP" because it is general purpose: fully C++ programmable to support whatever innovation in machine learning comes along next. "NPU" because it is massively parallel and matrix-optimized, offering the efficiency and throughput of hardwired "accelerators" combined with ultimate flexibility. See for yourself at www.quadric.io.
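To make the programmability argument concrete, consider what "fully C++ programmable" buys you. The sketch below is plain, generic C++ – it is not Quadric's Chimera SDK or any vendor API, and every name in it is illustrative – showing a naive scaled dot-product attention kernel, the core operator of transformers. A hardwired convolution accelerator simply cannot execute an operator it was never built for; a C++-programmable core can express it as ordinary code.

```cpp
// Illustrative sketch only: naive scaled dot-product attention in plain C++.
// Not tied to any vendor SDK; shows the kind of operator a programmable
// core can run that a fixed-function convolution engine cannot.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Q, K, V are row-major [seq_len x dim]; returns output of the same shape.
std::vector<float> attention(const std::vector<float>& Q,
                             const std::vector<float>& K,
                             const std::vector<float>& V,
                             std::size_t seq_len, std::size_t dim) {
    const float scale = 1.0f / std::sqrt(static_cast<float>(dim));
    std::vector<float> out(seq_len * dim, 0.0f);
    std::vector<float> scores(seq_len);

    for (std::size_t i = 0; i < seq_len; ++i) {
        // scores[j] = Q[i] . K[j] / sqrt(dim)
        float max_s = -INFINITY;
        for (std::size_t j = 0; j < seq_len; ++j) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < dim; ++d)
                dot += Q[i * dim + d] * K[j * dim + d];
            scores[j] = dot * scale;
            max_s = std::max(max_s, scores[j]);
        }
        // Numerically stable softmax over the scores.
        float sum = 0.0f;
        for (std::size_t j = 0; j < seq_len; ++j) {
            scores[j] = std::exp(scores[j] - max_s);
            sum += scores[j];
        }
        // out[i] = sum_j softmax(scores)[j] * V[j]
        for (std::size_t j = 0; j < seq_len; ++j) {
            const float w = scores[j] / sum;
            for (std::size_t d = 0; d < dim; ++d)
                out[i * dim + d] += w * V[j * dim + d];
        }
    }
    return out;
}
```

The point is not this particular kernel but the workflow it implies: if tomorrow's hero model replaces softmax attention with something else entirely, supporting it on a programmable core is a software rewrite, not a silicon respin.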

