By: Thomas Stahura
I remember getting GPT-2 to write bad poetry on my Asus Zenbook Duo gaming laptop. The fan spun up like a tiny jet engine, the 8GB of GPU memory maxed out, and after a few minutes, it produced something that vaguely rhymed. It felt magical. But that was five years ago. Today, running a frontier model requires data centers – or so it seems.
The story of how we got from there to here is really about the constant co-evolution of silicon and software, especially during the Transformer era.
Which brings me back to January 23, 2020: the day the Scaling Laws paper proclaimed that models with more parameters, trained on more data, produce better results.
That kicked off AI's brute-force era, where dense models ruled and parameter counts climbed, roughly 500-fold in the case of GPT-3's 175 billion parameters versus GPT-2 Medium's 355 million. And since these are dense models, every single one of those parameters is activated for each generated token, meaning astronomical compute costs.
Those design choices set the hardware agenda. The game was VRAM capacity: the challenge was simply fitting the model's weights into memory. That left the system profoundly memory-bandwidth-bound. Arithmetic intensity, the ratio of compute to memory work, was already low, meaning the model spent more time waiting on high-bandwidth memory (HBM) than computing.
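To make "memory-bandwidth-bound" concrete, here's a rough back-of-envelope sketch (the helper function and figures are illustrative assumptions of mine, not measurements): at batch size 1 and FP16, a dense decoder streams every weight from HBM once per token but does only about two FLOPs per weight, so arithmetic intensity lands around one FLOP per byte, far below what an accelerator needs to stay compute-bound.

```python
# Back-of-envelope: why dense, batch-1 decoding is memory-bound.
# All figures below are rough assumptions for illustration, not measurements.

def decode_step(params, bytes_per_param=2, batch_size=1):
    """Returns (FLOPs, bytes of weight traffic) for one decode step."""
    flops = 2 * params * batch_size         # ~2 FLOPs (multiply + add) per weight per token
    bytes_moved = params * bytes_per_param  # every FP16 weight read from HBM once
    return flops, bytes_moved

for name, p in [("GPT-2 Medium", 355e6), ("GPT-3", 175e9)]:
    flops, bytes_moved = decode_step(p)
    print(f"{name}: ~{flops / 1e9:.1f} GFLOPs/token, "
          f"arithmetic intensity ≈ {flops / bytes_moved:.1f} FLOPs/byte")

# An A100-class GPU offers roughly 300e12 FP16 FLOPs/s against ~2e12 bytes/s of HBM,
# i.e. it "wants" ~150 FLOPs/byte. At ~1 FLOP/byte, the tensor cores mostly sit idle.
```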
So models kept evolving: FlashAttention reduced off-chip reads and writes. Multi-query and grouped-query attention cut key-value (KV) duplication. And quantization shrank weight footprints to 8, 4, and even 1.58 bits without crushing quality too much. Still, the physics didn't change. At full precision, dense models demanded big HBM, big bandwidth, and strong tensor cores mostly to hide memory stalls.
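A quick way to see why those tricks matter, using my own rough arithmetic and made-up GPT-3-ish shapes rather than any model's published config: weight memory scales linearly with bit width, and the KV cache scales with the number of KV heads, which is exactly what multi-query and grouped-query attention cut.

```python
# Rough footprint math for quantization and grouped-query attention (GQA).
# Shapes below are illustrative assumptions, not any model's real config.

def weight_gib(params, bits):
    return params * bits / 8 / 2**30

params = 175e9
for bits in (16, 8, 4, 1.58):
    print(f"{bits:>5} bits: ~{weight_gib(params, bits):.0f} GiB of weights")

def kv_cache_gib(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values, per position, per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Multi-head attention (every head keeps its own KV) vs. grouped-query attention.
mha = kv_cache_gib(layers=96, kv_heads=96, head_dim=128, seq_len=8192)
gqa = kv_cache_gib(layers=96, kv_heads=8,  head_dim=128, seq_len=8192)
print(f"KV cache @ 8k ctx: MHA ~{mha:.1f} GiB vs GQA ~{gqa:.1f} GiB")
```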
Enter Mixture of Experts (MoE), which decouples total parameters from active parameters. Instead of activating all of them, a router network selects a specialized subset of the parameters (experts) for each token.
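To make the routing concrete, here's a minimal top-k router in plain NumPy (purely an illustrative sketch with shapes and names I made up, not any production MoE implementation): a linear router scores the experts, the top-k process the token, and their outputs are blended by renormalized gate weights.

```python
import numpy as np

# Minimal top-k Mixture-of-Experts layer (illustrative sketch, not a real implementation).
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.normal(scale=0.02, size=(d_model, n_experts))
# Each "expert" here is just a single linear layer for brevity.
experts = [rng.normal(scale=0.02, size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):                       # x: (d_model,), one token's hidden state
    logits = x @ router_w                 # router score per expert
    chosen = np.argsort(logits)[-top_k:]  # indices of the top-k experts
    gate = np.exp(logits[chosen])
    gate /= gate.sum()                    # renormalize gates over the chosen experts
    # Only the chosen experts' weights are touched -> active params << total params.
    return sum(g * (x @ experts[i]) for g, i in zip(gate, chosen))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)  # (64,)
```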
As a result, memory bandwidth per token drops, since only the active experts' weights are read during inference. The new tax is communication. Experts are sharded across GPUs, sometimes across nodes, so tokens get scattered to the right experts, processed, then gathered back at every layer.
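How steep is that communication tax? A crude upper bound (my own back-of-envelope, with invented layer counts and hidden sizes, assuming every routed token lands on a remote GPU): each MoE layer scatters a token's hidden state to its k experts and gathers k results back, so fabric traffic scales with layers × k × hidden size regardless of how few parameters are active.

```python
# Back-of-envelope all-to-all traffic for expert-parallel MoE inference.
# Worst case: every routed token lands on a remote GPU. All numbers are assumptions.

def fabric_gb_per_token(moe_layers, top_k, d_model, bytes_per_elem=2):
    # Scatter the hidden state to k experts, gather k expert outputs back, per layer.
    return 2 * moe_layers * top_k * d_model * bytes_per_elem / 1e9

per_tok = fabric_gb_per_token(moe_layers=60, top_k=2, d_model=8192)
print(f"~{per_tok * 1e3:.1f} MB of fabric traffic per generated token")
print(f"at 10k tokens/s that's ~{per_tok * 10_000:.0f} GB/s across the cluster")
```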
But even with MoE lightening the bandwidth load, optimization work has shifted to the decode loop, the part of inference that runs over and over, token by token.
Speculative decoding improves that loop: A smaller draft model proposes several tokens ahead, and the larger target model verifies those tokens in a single teacher-forced pass, accepting the longest matching prefix and reusing the computed KV state for the next step.
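In code, the accept/verify loop is short. Below is an illustrative greedy-acceptance sketch; draft_model and target_model are placeholder callables I'm inventing for the example, and real runtimes add sampling, rejection sampling to preserve the target distribution, and the KV reuse mentioned above.

```python
# Illustrative greedy speculative-decoding step (placeholder models, not a real runtime).

def speculative_step(prefix, draft_model, target_model, k=4):
    # 1) The small draft model proposes k tokens autoregressively (cheap).
    draft = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft_model(draft)
        proposed.append(t)
        draft.append(t)

    # 2) The target model scores all k positions in ONE teacher-forced pass.
    #    Here target_model(prefix, proposed) is assumed to return its greedy pick
    #    at each of the k positions, plus one bonus position at the end.
    target_picks = target_model(prefix, proposed)

    # 3) Accept the longest prefix where draft and target agree.
    accepted = []
    for drafted, verified in zip(proposed, target_picks):
        if drafted == verified:
            accepted.append(drafted)
        else:
            accepted.append(verified)     # take the target's correction, then stop
            break
    else:
        accepted.append(target_picks[k])  # all accepted: keep the bonus token

    return list(prefix) + accepted        # one step emits 1..k+1 tokens

# Toy usage: "draft" and "target" both just count upward, so everything is accepted.
draft = lambda seq: len(seq)                                    # toy next-token rule
target = lambda prefix, prop: [len(prefix) + i for i in range(len(prop) + 1)]
print(speculative_step([0, 1, 2], draft, target))               # -> [0, 1, 2, 3, 4, 5, 6, 7]
```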
That swap turns many tiny, memory-bound decode steps into one larger pass per layer, which improves weight reuse, raises arithmetic intensity, and trims per-token HBM traffic. The result is faster inference without trading away the target model's quality, since every accepted token is one the target model verified.
The payoff depends on acceptance rate (and temperature). As such, modern runtimes co-schedule draft and target on the same node, keep KV in HBM, and pipeline the accept-reject loop with persistent kernels to avoid stalls. This is co-evolution in practice, shifting the bottleneck from bandwidth and fabric back toward on-chip compute when it helps, and back again when routing dominates.
It’s easy to imagine a future where these optimizations fade into the background, just as few of us think about TCP packets when we stream Netflix. In that world, AI will feel weightless and who knows, maybe even light enough to run on laptops again.
So stay tuned!