Token Talk 33: Survival of the Fittest Architecture

October 9, 2025

By: Thomas Stahura

I remember getting GPT-2 to write bad poetry on my Asus Zenbook Duo gaming laptop. The fan spun up like a tiny jet engine, the 8GB of GPU memory maxed out, and after a few minutes, it produced something that vaguely rhymed. It felt magical. But that was five years ago. Today, running a frontier model requires data centers – or so it seems.

The story of how we got from there to here is really about the constant co-evolution of silicon and software, especially during the Transformer era.

Which brings me back to January 23, 2020, the day the Scaling Laws paper proclaimed that models with more parameters, trained on more data, produce better results.

That kicked off AI’s brute-force era, where dense models ruled and parameter counts climbed, roughly 500-fold in the case of GPT-3’s 175 billion parameters versus GPT-2 Medium’s 355 million. And since these are dense models, every single one of those parameters is activated for each token generated, which means astronomical compute costs.

Those design choices then set the hardware agenda. The game was VRAM capacity: the first challenge was simply fitting the model’s weights into memory. And because every weight has to be streamed from memory for every generated token, the system was profoundly memory-bandwidth-bound. Arithmetic intensity, the ratio of compute to memory traffic, was low, which means the model spends more time waiting on high-bandwidth memory (HBM) than computing.
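To put rough numbers on that, here is a back-of-envelope sketch in Python. The parameter count, precision, and accelerator figures are illustrative assumptions rather than measurements; the point is that single-stream dense decode lands near 2 FLOPs per byte of weights read, far below what a modern GPU needs to be compute-bound.

# Back-of-envelope: why single-stream dense decode is memory-bandwidth-bound.
# All numbers (parameter count, bytes per weight, GPU specs) are illustrative
# assumptions, not measurements.

params = 175e9          # dense parameter count (GPT-3 scale)
bytes_per_weight = 2    # fp16 / bf16

flops_per_token = 2 * params                  # ~1 multiply-add per weight
bytes_per_token = params * bytes_per_weight   # every weight streamed from HBM per token

arithmetic_intensity = flops_per_token / bytes_per_token
print(f"arithmetic intensity ~ {arithmetic_intensity:.1f} FLOP/byte")   # ~1.0

# Roofline check against a hypothetical accelerator:
peak_tflops = 300          # dense fp16 TFLOP/s (assumed)
hbm_bandwidth_tbs = 2.0    # HBM bandwidth in TB/s (assumed)
balance_point = peak_tflops / hbm_bandwidth_tbs   # FLOP/byte needed to be compute-bound
print(f"compute-bound above ~{balance_point:.0f} FLOP/byte")            # ~150

# Bandwidth-limited ceiling on single-stream decode speed:
tokens_per_sec = hbm_bandwidth_tbs * 1e12 / bytes_per_token
print(f"upper bound ~ {tokens_per_sec:.1f} tokens/sec per stream")      # ~5.7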

So models kept evolving: FlashAttention reduced off-chip reads and writes. Multi-query and grouped-query attention cut key-value (KV) duplication. Quantization shrank weight footprints to 8, 4, and even 1.58 bits without crushing quality too much. Still, the physics didn’t change: at full precision, dense models demanded big HBM, big bandwidth, and strong tensor cores, mostly to hide memory stalls.
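For a sense of what those bit widths buy, here is a quick footprint calculation. The 70-billion-parameter figure is an assumption picked for illustration, and real deployments also carry the KV cache, activations, and quantization scales on top of the weights.

# Rough weight-memory footprints at the bit widths mentioned above.
# The parameter count is an illustrative assumption.

params = 70e9  # e.g. a 70B dense model (assumed)

for bits in (16, 8, 4, 1.58):
    gigabytes = params * bits / 8 / 1e9
    print(f"{bits:>5} bits -> ~{gigabytes:,.0f} GB of weights")

# 16 bits -> ~140 GB, 8 -> ~70 GB, 4 -> ~35 GB, 1.58 -> ~14 GB.
# Lower precision shrinks both the VRAM needed to hold the weights and the
# HBM traffic paid for every decoded token.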

Enter Mixture of Experts (MoE), which decouples total parameters from active parameters: instead of activating every parameter, a router network selects a specialized subset of the parameters (experts) for each token.

As a result, memory bandwidth per token drops since fewer parameters are needed during inference. The new tax becomes communication. Experts are sharded across GPUs, sometimes across nodes. Tokens get scattered to the right experts, computed, then gathered back every layer.
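A minimal sketch of that routing step, in plain numpy with toy sizes; the expert count, dimensions, and gating details are assumptions for illustration, not any particular model's implementation.

import numpy as np

# Toy top-k expert routing for one MoE layer (illustrative sizes).
num_experts, top_k = 8, 2
d_model, d_ff = 64, 256
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, num_experts)) * 0.02
experts_w1 = rng.standard_normal((num_experts, d_model, d_ff)) * 0.02
experts_w2 = rng.standard_normal((num_experts, d_ff, d_model)) * 0.02

def moe_layer(x):  # x: (tokens, d_model)
    logits = x @ router_w                            # (tokens, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # experts chosen per token
    sel = np.take_along_axis(logits, top, axis=-1)   # softmax over selected only
    gates = np.exp(sel - sel.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)

    out = np.zeros_like(x)
    for t in range(x.shape[0]):                      # scatter, compute, gather
        for slot in range(top_k):
            e = top[t, slot]
            h = np.maximum(x[t] @ experts_w1[e], 0)  # expert FFN (ReLU)
            out[t] += gates[t, slot] * (h @ experts_w2[e])
    return out

y = moe_layer(rng.standard_normal((4, d_model)))
print(y.shape)  # (4, 64)

# Only top_k of num_experts expert FFNs run per token, so the expert weight
# bytes read per token are roughly top_k / num_experts of the total; in a real
# deployment the per-token loop becomes an all-to-all exchange across GPUs.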

But even with MoE lightening the bandwidth load, the optimizations continued into the decode loop, the part of inference that runs over and over, token by token.

Speculative decoding improves that loop: A smaller draft model proposes several tokens ahead, and the larger target model verifies those tokens in a single teacher-forced pass, accepting the longest matching prefix and reusing the computed KV state for the next step. 

That swap turns many tiny, memory-bound decode steps into one larger verification pass, which improves weight reuse, raises arithmetic intensity, and trims per-token HBM traffic. The result is faster inference with output that still matches what the target model would have produced on its own, since only tokens the target verifies get accepted.
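Here is a minimal sketch of that loop with greedy verification, as described above. The draft_next and target_argmax callables are hypothetical stand-ins for real model calls, and production runtimes add batched KV-cache reuse and probabilistic rejection sampling rather than pure argmax matching.

# Greedy speculative decoding sketch: the draft proposes k tokens, the target
# verifies them in one teacher-forced pass, and the longest agreeing prefix is
# kept. `draft_next` and `target_argmax` are hypothetical model interfaces.

def speculative_decode(prompt, draft_next, target_argmax, k=4, max_new=64):
    """prompt: list of token ids.
    draft_next(seq) -> next token id from the small draft model.
    target_argmax(seq) -> the large model's argmax next-token prediction at
                          every position of seq (one teacher-forced pass)."""
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. Draft proposes k tokens autoregressively (cheap, memory-light).
        proposed = []
        for _ in range(k):
            proposed.append(draft_next(seq + proposed))

        # 2. Target scores prompt + proposals in a single forward pass.
        preds = target_argmax(seq + proposed)

        # 3. Accept the longest prefix where the target agrees with the draft.
        accepted = []
        for i, tok in enumerate(proposed):
            if preds[len(seq) - 1 + i] == tok:
                accepted.append(tok)
            else:
                break

        # 4. The target also supplies one "free" token at the first mismatch
        #    (or after the last accepted token), so every pass makes progress.
        bonus = preds[len(seq) - 1 + len(accepted)]
        seq += accepted + [bonus]
    return seq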

The payoff depends on the acceptance rate (and temperature). As such, modern runtimes co-schedule the draft and target models on the same node, keep the KV cache in HBM, and pipeline the accept-reject loop with persistent kernels to avoid stalls. This is co-evolution in practice, shifting the bottleneck from bandwidth and fabric back toward on-chip compute when it helps, and back again when routing dominates.
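Under the simple assumption that each drafted token is accepted independently with probability alpha (the i.i.d. model used in the standard speculative-decoding analysis), the expected tokens per target pass works out to (1 - alpha^(k+1)) / (1 - alpha) for a draft length of k, which makes the sensitivity to acceptance rate easy to see.

# Expected tokens produced per target-model pass, assuming i.i.d. acceptance
# with probability alpha per drafted token. Values are illustrative.

def expected_tokens_per_pass(alpha, k):
    # sum_{i=0..k} alpha^i; the i = 0 term is the bonus token the target
    # supplies on every pass, so each pass yields at least one token.
    if alpha == 1.0:
        return k + 1
    return (1 - alpha ** (k + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
# ~2.3, ~3.4, ~4.1 tokens per pass at k=4: a better-matched draft (or lower
# temperature) means fewer expensive target passes per generated token.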

It’s easy to imagine a future where these optimizations fade into the background, just as few of us think about TCP packets when we stream Netflix. In that world, AI will feel weightless and who knows, maybe even light enough to run on laptops again.

So stay tuned!

Tags: Token Talk, MoE
