Ascend.vc

Token Talk 9: Who's really paying the cloud bill?

March 5, 2025

By: Thomas Stahura

My AWS bill last week was $257. I have yet to be charged by Amazon.

In fact, I have never been charged for any of my token consumption. Thanks to hackathons and their generous sponsors, I’ve managed to accumulate a bunch of credits. Granted, they expire in 2026. I’ll probably run out sooner rather than later.

With the rise of open source, closed-source incumbents have been branding their models as “premium” and pricing them accordingly. Claude 3.7 Sonnet runs around $6 per million tokens, o1 around $26 per million, and gpt-4.5 a steep $93 per million (averaging input and output token pricing).

I'm no startup — simply an AI enthusiast and tinkerer — but all these new premium AI models have me wondering: how can startups afford their AI consumption?

Take Cursor, the AI IDE pioneer. It charges $20 per month for 500 premium model requests. That sounds reasonable until you realize that coding with AI is very context heavy. Every request is jam-packed with multiple scripts, folders, and logs, easily filling Claude’s 200k-token context window. A single long (20-request) conversation with Claude 3.7 in Cline costs me about $20, to say nothing of the other 480 requests.
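For a rough sense of the math, here’s a back-of-envelope sketch. The per-token prices and token counts below are my assumptions (roughly Anthropic’s published $3/$15 per million input/output tokens, which averages near the $6 figure above), not Cursor’s actual numbers:

```python
# Back-of-envelope cost of one long, context-heavy AI coding conversation.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token (assumption)
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (assumption)

requests = 20              # one long conversation
context_tokens = 200_000   # each request resends a near-full context window
output_tokens = 1_500      # modest completion per request (assumption)

cost = requests * (context_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE)
print(f"~${cost:.2f} for a single 20-request conversation")  # ~$12.45
```

Even under these charitable assumptions, a couple of heavy conversations burn through the entire $20 subscription.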

To break even, by my calculations, Cursor would have to charge at least 15 to 20 times more per month. I highly doubt it will do that anytime soon. 

The AI industry continues to be in its subsidized growth phase. Claude 3.7 is free on GitHub Copilot. Other AI IDEs like Windsurf and Pear AI are $15 per month. The name of the game is growth at any cost. It was the same for Uber and Airbnb during the sharing economy and for Facebook and Snapchat during Web 2.0; the AI era is no different.

Or is it?

It all comes down to who is subsidizing and how that subsidy is being accounted for. 

During previous eras, VCs were the main culprits, funding companies that spent millions acquiring customers through artificially low prices. Much of that still applies today: Anysphere (which develops Cursor) has raised at least $165 million, and beyond salaries, it’s reasonable to theorize that most of that money goes to the cloud, given AI’s unique computational demands. But Big Tech has far more power this time around, funding these startups and labs with billions in cloud credits.

OpenAI sold 49% of its shares to Microsoft largely in exchange for cloud credits, credits that OpenAI ultimately spent on Azure. Anthropic and Amazon have a similar story, though Amazon invested $8 billion in Anthropic instead of granting credits. But as a condition of the deal, Anthropic agreed to use AWS as its primary cloud provider, so that money is destined to return to Amazon eventually.

Take my $257 AWS bill from last week: technically, I haven't been charged because I'm using credits. However, this arrangement lets Amazon, Microsoft, and other cloud providers forecast stronger future cloud revenue to shareholders, based in part on the bet of continued growth by AI startups. (Credits given to startups expire, so it’s use ‘em or lose ‘em before they inevitably convert to paid usage.)

Since 2022, the top three cloud providers, AWS, Azure, and Google, have grown their cloud revenue by 20%, 31%, and 33% each year, respectively. That rapid growth needs to continue to justify their share prices — and it’s no secret they are using AI to sustain that momentum. 

The real question is: when will it end? Global demand for compute is set to skyrocket, so perhaps never. Or maybe distilling large closed-source models into smaller, local models will pull people off the cloud. Or Jevons paradox holds true and even more demand is unlocked.

Only time will tell. Stay tuned!

P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc

Tags Token Talk, Cloud


Token Talk 8: The Robot Revolution Has Nowhere Left to Hide

February 26, 2025

By: Thomas Stahura

Escaping a rogue self-driving Tesla is simple: climb a flight of stairs.

While a Model Y can’t climb stairs, Tesla’s new humanoid surely can. If Elon Musk and the Tesla bulls have their way, humanoids could outnumber humans by 2040. That means there’s quite literally nowhere left to hide — the robot revolution is upon us. 

Of course, Musk isn’t alone in building humanoids. Boston Dynamics has spent decades stunning the internet with robot acrobatics and dancing. For $74,500, you can own Spot, its robot dog. Agility Robotics in Oregon and Sanctuary AI in British Columbia are designing humanoids for industrial labor, not the home. China’s Unitree Robotics is selling a $16,000 humanoid today.

These machines may feel like a sudden leap into the future, but the idea of humanoid robots has been with us for centuries. Long before LLMs and other abstract technologies, robots were ingrained in culture, mythology, and our collective engineering dreams.

Around 1200 BCE, the ancient Greeks told stories of Talos, a towering bronze guardian patrolling Crete. During the Renaissance, Leonardo da Vinci sketched his mechanical knight. The word “robot” itself arrived in 1920 with Karel Čapek’s play R.U.R. (Rossum’s Universal Robots). By 1962, The Jetsons brought Rosie the Robot into American homes. And in 1973, Japan’s Waseda University introduced WABOT-1, the first full-scale — if clunky — humanoid robot.

Before the advent of LLMs, the vision was to create machines that mirror the form and function of a human being. Now it seems the consensus is to build a body for these models. Or rather, to build models for these bodies.

They're calling it a vision-language-action (VLA) model, a new architecture purpose-built for general robot control. Currently, two model architectures dominate the market: transformers and diffusion. Transformer models process and predict sequential data (think text generation), while diffusion models generate continuous data through an iterative denoising process (think image generation).

VLA models (like π0) combine elements of both approaches to address the challenges of robotic control in the real world. These hybrid architectures let robots translate visual observations (from cameras) and language instructions (the robot’s given task) into precise physical actions, using the sequential reasoning of transformers and the continuous outputs of diffusion models. Other frontier VLA model startups include Skild (reportedly in talks to raise $500 million at a $4 billion valuation), Hillbot, and Covariant.

A new architecture means a new training paradigm. Lucky Robots (an Ascend portfolio company) is pioneering synthetic data generation for VLA models by having robots learn inside a physics simulation, enabling developers to play with these models without needing a real robot. Nvidia is cooking up something similar with its Omniverse platform.

Some believe that more data and better models will lead to an inflection point in robotics, similar to what happened with large language models. However, unlike text and images, physical robotics data cannot be scraped from the web and must either be collected by an actual robot, or synthesized in a simulation. Regardless of how the model is trained, a real robot is needed to act upon the world.

At the very least, it’s far from a solved problem. Since a robot can have any permutation of cameras, joints, and motors, making a single unified model that can inhabit every robot is extremely challenging. Figure AI (valued at $2.6 billion, with OpenAI among its investors) recently dropped OpenAI’s models in favor of in-house models. It’s not alone. So many VLA models are being uploaded to Hugging Face that the platform had to add a new model category just to keep up.

The step from concept to reality has been a long one for humanoid robots, but the pace of progress suggests we're just getting started. 

P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc

Tags Token Talk, VLA

Token Talk 7: AI's walls, moats, and bottlenecks

February 18, 2025

By: Thomas Stahura

Is Grok SOTA?

If that phrase comes across as gibberish, allow me to explain.

On Monday, xAI (Elon’s AI company) launched Grok 3, claiming state-of-the-art (SOTA) performance. SOTA has become a sort of catch-all term for crowning AI models. Grok 3’s benchmarks are impressive: 93, 85, and 79 on AIME (math), GPQA (science), and LCB (coding), respectively. Those marks outperform the likes of o3-mini-high, o1, DeepSeek R1, sonnet-3.5, and gemini 2.0 flash. Essentially, Grok 3 outperforms every model except the yet-to-be-released o3. An impressive feat for a 17-month-old company!

I could mention that Grok used 100k+ GPUs during training, or that xAI built an entire data center in a matter of months. But much has been documented there. So given all that's happened this year with open source, distillation, and a number of tiny companies achieving SOTA performance, it’s much more useful to discuss the walls, moats, and bottlenecks of the AI industry.

Walls

The question about a “wall” in AI is really a question of where, when, or if AI researchers will reach a point where model improvements stall. Some say we will run out of viable high-quality data and hit the “data wall.” Others claim more compute during training will cause models to hit a “training wall.” Regardless of the panic, AI has yet to hit the brakes on improvement. Synthetic data (via reinforcement learning) seems to be working, and more compute, as demonstrated by Grok 3, continues to yield better performance.

So where is this “Wall”?


The scaling laws in AI suggest that while there isn't a hard "wall" per se, there is a fundamental relationship between compute, model size, and performance that follows a power law distribution. This relationship, often expressed as L ∝ C^(-α) where L is the loss (lower is better) and C is compute, shows that achieving each incremental improvement requires exponentially more resources. For instance, if we want to reduce the loss by half, we might need to increase compute by a factor of 10 or more, depending on where we are on the scaling curve. This doesn't mean we hit an absolute wall, but rather face increasingly diminishing returns that create economic and practical limitations — essentially there exists a "soft wall" where the cost-benefit ratio becomes prohibitively expensive. So how then have multiple small AI labs reached SOTA so quickly?
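To make those diminishing returns concrete, here’s a quick sketch. The exponents below are illustrative only; real-world values depend on the regime and on what is being scaled:

```python
# L ∝ C^(-alpha): to cut the loss from L1 to L2, compute must grow by
# (L1/L2)^(1/alpha). Smaller alpha means steeper costs per improvement.
def compute_multiplier(loss_ratio: float, alpha: float) -> float:
    """Compute multiplier needed to divide the loss by `loss_ratio`."""
    return loss_ratio ** (1 / alpha)

for alpha in (0.3, 0.1, 0.05):  # illustrative exponents only
    print(f"alpha={alpha}: halving loss takes "
          f"~{compute_multiplier(2, alpha):,.0f}x compute")
# alpha=0.3 -> ~10x; alpha=0.1 -> ~1,024x; alpha=0.05 -> ~1,048,576x
```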

Moats

When OpenAI debuted ChatGPT in November 2022, the consensus was it would take years for competitors to develop their own models and catch up. Ten months later, Mistral, a previously unknown AI lab out of France, launched Mistral 7b, a first-of-its-kind open-source small language model. Turns out that training a model, while still extremely expensive, costs less than a single Boeing 747.

The power law relationship can also help us understand how smaller AI firms catch up so quickly. The lower you are on the curve, the steeper the improvements are for each unit of compute invested, allowing smaller players to achieve significant gains with relatively modest resources. This "low-hanging fruit" phenomenon means that while industry leaders might need to spend billions to achieve marginal improvements at the frontier, newer entrants can leverage existing research, open-source implementations, and more efficient architectures to rapidly climb the steeper part of the curve. (At Ascend, we define this as AI’s “fast followers”.) 

Costs have only gone down since 2022, thanks to new techniques like model distillation and synthetic data generation, the very techniques DeepSeek used to build R1 for a reported $6 million.

The perceived "moat" of computational resources isn't as defensible as initially thought. It seems the application layer is the most defensible part of the AI stack. But what is holding up mass adoption?

Bottlenecks

Agents, as I mentioned last week, are the main AI application. And agents, in their ultimate form, are autonomous systems tasked with accomplishing a goal in the digital environment. These systems need to be consistently reliable if they are to be of value. Agent reliability is mainly affected by two things: prompting and pointing.

Since an agent stays in a reasoning loop until its given goal is achieved, the prompt used to set up and maintain that loop is crucial. The loop prompt runs on every step and should reintroduce the task, tools, feedback, and response schema to the LLM. Ultimately, these AI systems are probabilistic, so the loop prompt should be worded to maximize the probability of a correct response. Much easier said than done.
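A minimal loop prompt might look something like this. The schema and field names are illustrative, not from any particular framework:

```python
# A loop prompt is re-sent on every step: task, tools, latest feedback,
# and a strict response schema to keep the probabilistic model on the rails.
LOOP_PROMPT = """You are an autonomous agent working toward a goal.

Goal: {task}
Available tools: {tools}
Feedback from your last action: {feedback}

Respond ONLY with JSON matching this schema:
{{"thought": "<your reasoning>", "tool": "<tool name>", "args": {{}}}}
If the goal is complete, set "tool" to "done"."""

step_prompt = LOOP_PROMPT.format(
    task="Open Firefox and check the weather",
    tools="move_mouse, click, type, scroll, screenshot",
    feedback="Desktop visible; Firefox icon at unknown coordinates",
)
```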

Vision is another bottleneck. For example, if an agent decides it needs to open the Firefox browser to get online, it first needs to move the mouse to the Firefox icon, which means it needs to see and understand the user interface (UI). 

Thankfully, we have vision language models (VLMs) for this! The thing is, while these VLMs can caption an image, they don’t understand precise icon locations well enough to provide pixel-perfect x and y coordinates. At least not yet to any reliable degree.

To prove this point, I conducted a VLM pointing competition wherein I had gpt-4o, sonnet-3.5, moondream 2, llama 3.3 70b, and molmo 7b (running on replicate) point at various icons on my Linux server. 

[Screenshots: three rounds of the “Point to the date” test, one attempt per model]
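One round of that competition boils down to something like this. The client call and prompt here are illustrative; each model’s actual API differs:

```python
# Ask a VLM for pixel coordinates of a UI element, then check whether the
# "click" lands inside the icon's ground-truth bounding box.
import json

def ask_for_point(vlm, screenshot: bytes, target: str) -> tuple[int, int]:
    prompt = (f'Return ONLY JSON like {{"x": 0, "y": 0}} giving the pixel '
              f"coordinates of the {target} in this screenshot.")
    reply = vlm.generate(prompt, image=screenshot)  # hypothetical client call
    coords = json.loads(reply)
    return coords["x"], coords["y"]

def is_hit(point: tuple[int, int], box: tuple[int, int, int, int]) -> bool:
    x, y = point
    x0, y0, x1, y1 = box  # ground-truth icon bounding box
    return x0 <= x <= x1 and y0 <= y <= y1
```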

Our perception of icons and logos is second nature to us humans, especially those of us who grew up in the information age. It boggles the mind that these models, which are now as smart as a graduate student, can’t do this simple task ten times in a row. In my opinion, agents will be viable only when they can execute hundreds or even thousands of correct clicks. So maybe in a few months… Or you can tune in next week for Token Talk 8!

P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc

Tags Token Talk, VLMs

Matthew McConaughey stars in Salesforce’s Super Bowl commercial promoting Agentforce.

Token Talk 6: Everyone's got something to say about agents

February 11, 2025

By: Thomas Stahura

AGENTS! AGENTS!! AGENTS!!! 

Big tech can’t get enough of them! Google’s got Mariner. Microsoft’s got Copilot. Salesforce rolled out Agentforce. OpenAI’s cooking up Operator. And Anthropic has Computer Use. (Naming is hard.)

You’ve heard the hype. Maybe you’re already sick of it. They even got Matthew McConaughey to say it during the Super Bowl — America's most sacred Sunday ritual.

But have you actually used one? Probably not. And funny enough, most of the “agents” I just listed aren’t even real agents.

So what is an agent, anyway?

An agent, put simply, is a Large Language Model (LLM) in a reasoning loop that has access to tools (like a browser, code interpreter, or calculator). The LLM is prompted to break down tasks into steps and to use tools to autonomously accomplish its given goal. The tools then provide feedback from the digital environment and the LLM continues to its next step until the task is complete.

A browser agent is given a task: “Book a flight from San Francisco to Seattle.” First, it runs an “open browser” command, and the browser confirms: “Browser is open,” with a screenshot. Next, it types “San Francisco to Seattle flights” into the search bar, hits enter, and waits for results. It scans the listings, picks a booking site, clicks through, and follows the prompts, step by step. Each action generates feedback to keep it on track until the task is complete.
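In code, that whole pattern reduces to a short loop. This is a sketch, not any particular framework’s API; `llm.decide` and the tool functions are stand-ins:

```python
# The agent loop: ask the LLM for the next action, execute the matching tool,
# feed the result back in, and repeat until the LLM declares the goal done.
def run_agent(llm, tools: dict, goal: str, max_steps: int = 30):
    feedback = "No actions taken yet."
    for _ in range(max_steps):
        action = llm.decide(goal=goal, tools=list(tools), feedback=feedback)
        if action.name == "done":
            return feedback  # goal reached; last feedback is the result
        feedback = tools[action.name](**action.args)  # e.g. "Browser is open"
    raise TimeoutError("Goal not reached within the step budget")
```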

Most agents have a litany of specific tools, but all you really need is to move the mouse, click, type, and scroll. After all, that's all humans need to use a computer.

So what, then, makes me say that most agents out there aren't actually agents? For starters, Mariner is on a waitlist, Copilot doesn't have access to any tools, and Agentforce only has access to Salesforce-specific tools. OpenAI’s Operator and Anthropic’s Computer Use are what I’d call actual agents. But Operator is $200/month and Computer Use is in beta.

Open source is not far behind. Browser-use (YC W25) exploded onto the scene about a month ago and already has 27k GitHub stars. I used browser-use for my AI bias hackathon project; it works with any LLM in only 15 lines of code. Totally free.

Autogen, a Microsoft agent framework, is also open source with 39k stars, along with Skyvern (12k stars, YC S23) and Stagehand (7.5k stars). And these are just browser agents! There are also coding agents that live within an integrated development environment (IDE): the closed-source Replit, GitHub Copilot, and Cursor, and the open-source Cline (28k stars), Continue.dev (23k stars), and Void (10k stars, YC S24).

Agents, at the end of the day, are about autonomous control. Whether it's a browser or a calculator, the more tools, control, and thus access you give an LLM, the more it can do on your behalf. In that respect, not all agents are created equal.

When I use my computer, I don't just use the browser or IDE. Sure, I spend a bunch of time online (who doesn't?), and coding (so much), but I control my computer on the OS level. I’m able to jump between different applications and navigate my file system with my keyboard and mouse, so shouldn't my agent, too?

Many thought an OS-level agent was impossible a few months ago. Now it seems inevitable. Imagine a future where we interact with our devices in the same way Tony Stark interacts with Jarvis in Iron Man (2008). This is an entirely new human-computer interaction paradigm that is set to completely change the industry.

Big tech knows this. Apple has enabled developers to write custom tools for Apple Intelligence to interact with. And Microsoft’s Copilot Recall automatically records your screen to automate tasks (that is, before it was recalled over privacy issues).

In the open community, Open Interpreter (58k stars) is an OS-level agent that can write and execute commands in the command line. It has limitations (no vision capabilities) but is impressive and the first of its kind. Other models such as OS-Atlas and UI-TARS exist but are not nearly as popular as browser or IDE agents. (We invested in Moondream, a startup building vision “pointing” capabilities for agent developers.)

The OS agent wars are existential for big tech. Any agent that exists within Windows or MacOS will get hamstrung by permissions requirements enshittifying the experience of alternatives while Microsoft and Apple keep their control over the industry. If these companies own and control the software that controls your computer, is it really your computer? I think not.

Regardless, agents still have a long way to go. Reliability remains a large issue along with handling authentication (to email, social media, and other sites). These, however, are solvable problems. Meta has already set up GAIA, a general AI assistant benchmark, that if solved “would represent a milestone in AI research.” And Okta, owners of Auth0, invested in Browserbase to help the agent company manage web authentication. 

It's only a matter of time at this point.

P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc

Tags Token Talk

Token Talk 5: Big Models Teach, Small Models Catch Up.

February 5, 2025

By: Thomas Stahura

O3-mini is amazing and totally free. OpenAI achieved this through distillation from the yet-to-be-released, larger o3 model.

Right now, the model ranks second globally — beating DeepSeek R1 but trailing the massive o1. Estimates put o1 at 200-300 billion parameters, DeepSeek R1 at 671 billion, and o3-mini at just 3-30 billion. (These are the only reasoning models topping the benchmarks this week.)

What’s remarkable is that o3-mini achieves intelligence close to o1 while being just one-hundredth its size, thanks to distillation.

There are a variety of distillation techniques; but, at a high level, distillation involves using a larger teacher model to teach a smaller student model.

For example, GPT-4 (reportedly a 1.4-trillion-parameter model) was trained on a million GBs of public internet data (one petabyte). GPT-4 was trained to represent that data, to represent the internet.

The resulting 1.4 trillion parameter model, if downloaded, would occupy 5,600 GB, or 5.6 terabytes of space. In a sense, you can think of GPT-4 (or any LLM) as a highly compressed queryable representation of the training set, in this case the internet. After all, going from 1 petabyte to 5.6 terabytes is a 99.45% reduction.

So, how does this apply to distillation? If you think of a model as a compression of its training dataset, you can “uncompress” that dataset by querying the larger teacher model, in this case GPT-4, until you’ve generated a petabyte or so of synthetic data. You then use that synthetic dataset to train or fine-tune a smaller student model (3-10 billion parameters) to mimic the larger teacher model in performance.
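In pseudo-Python, the whole idea fits in a few lines. Every name here is a placeholder, and real distillation pipelines are far more involved:

```python
# Distillation as "uncompression": sample synthetic data from a teacher,
# then fine-tune a smaller student on it until it mimics the teacher.
def distill(teacher, student, prompts, target_tokens: int = 10**9):
    synthetic = []
    generated = 0
    for prompt in prompts:
        completion = teacher.generate(prompt)   # query ("uncompress") teacher
        synthetic.append((prompt, completion))
        generated += len(completion.split())    # crude token count
        if generated >= target_tokens:
            break
    student.finetune(synthetic)                 # student learns to mimic
    return student
```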

This remains an active area of research today.

Of course, distilling from a closed-source model is strictly against OpenAI’s terms of service. Though, that didn’t stop DeepSeek, which is currently being probed by Microsoft over synthetic data training allegations.

The cat’s out of the bag. OpenAI itself distilled o3-mini from o3, and Microsoft distilled phi-3.5-mini-instruct from phi-3.5. It seems that from now on, whichever model performs best will become the “teacher” for all the “student” models, which will be fine-tuned to quickly catch up to it in performance. This new paradigm has shifted the AI industry's focus from LLMs to AI applications, the main one being agents.

OpenAI (in addition to launching o3-mini) debuted a new web agent called deep research (only available at the $200/month tier). I’ve used many web agents and agent tools like Browserbase, browser-use, and Computer Use. I have buddies building CopyCat (YC W25), and I’ve even built my own browser agent. All this to say: the AI application space is heating up!

Stay tuned because I’ll talk more about agents next week!

P.S. If you have any questions or just want to talk about AI, email me: thomas @ ascend dot vc

Tags Token Talk

Token Talk 4: Open source won the AI race

February 5, 2025

By: Thomas Stahura

If it wasn’t clear already, open source won the AI race. 

To recap: DeepSeek R1 is an open-source reasoning model that was quietly launched during the 14 hours TikTok was banned. The reasoning version of DeepSeek V3, R1 performs at o1 levels on most benchmarks. Very impressive, and it was reportedly trained for just $6 million, though many are skeptical of those numbers.

By Monday, a week after R1 launched, the model caused a massive market selloff. Nvidia lost $500 billion in value (-17%), the biggest one-day loss in US market history, as the market adjusted to our new open-source reality.

 So, what does this mean? 

For starters, models have been commoditized: well-performing open-source models are available at every scale. But that’s beside the point. DeepSeek was trained on synthetic data generated by ChatGPT, effectively distilling a closed model’s knowledge and open-sourcing it. This erodes the moats of OpenAI, Anthropic, and the other closed-source AI labs.

What perplexes me is why Nvidia got hit the hardest. The takes I’ve heard suggest it’s the low cost of training DeepSeek that spooked the market. The thinking goes: LLMs become cheaper to train, so hyperscalers need fewer GPUs.

The bulls, on the other hand, cite Jevons paradox: the cheaper a valuable resource becomes, the more of it gets used.

I seem to be somewhere in the middle. Lower costs are great for developers! But I have yet to see a useful token-heavy application. Well, maybe web agents… I’ll cover those in another edition!

I suspect the simple fact that the model came out of China is what caused it to blow up. After all, there seems to be real moral panic over the implications for US AI sovereignty. And for good reason.

Over the weekend, I attended a hackathon hosted by Menlo where I built a browser agent. I had different LLMs take the Pew Research Center political typology quiz.

Anthropic’s claude-sonnet-3.5, gpt-4o, o1, and llama all landed on Outsider Left. DeepSeek R1 and V3 got Establishment Liberals. Notably, R1 answered that “It would be acceptable if another country became as militarily powerful as the U.S.”

During my testing, I found that DeepSeek’s models would refuse to answer questions about Taiwan or Tiananmen Square. In all fairness, most American models won’t answer questions about Palestine. Still, as these models are open and widely used by developers, there is fear that these biases will leak into AI products and services.

I’d like to think that this problem is solvable with fine-tuning. I suppose developers are playing with Deepseek’s weights as we speak! We’ll just have to find out in the next few weeks…

Tags Token Talk

Token Talk 3: Decentralizing AI Compute for Scalable Intelligence

February 5, 2025

By: Thomas Stahura

Compute is king in the age of AI. At least, that's what big tech wants you to believe. The truth is a little more complicated.

When you boil it down, AI inference is simply a very large set of multiplications. All computers do this kind of math all the time, so why can't any computer run an LLM or diffusion model?

It's all about scale. Model scale is the number of parameters (tunable weights) in a model. Thanks to platforms like Hugging Face, developers now have access to well-performing open-source models at every scale: small models like moondream2 (1.93b) and llama 3.2 (3b), midrange ones like phi-4 (14b), and the largest models like bloom (176b). These models can run on anything from a Raspberry Pi to an A100 GPU server.

Sure, the smaller models take a performance hit, but only by 10-20% on most benchmarks. I got llama 3.2 (1b) to flawlessly generate and run a Snake game in Python. So why, then, do most developers rely on big tech to generate their tokens? The short answer is speed and performance.
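Running one of these small models yourself takes only a few lines with Hugging Face’s transformers library. The model ID below is one example (the Llama repos are gated behind a license acceptance):

```python
# Run a small open-weights model locally via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # example; requires license access
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tok("Write a snake game in Python.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=200)
print(tok.decode(out[0], skip_special_tokens=True))
```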

Models at the largest scale (100b+, like gpt-4o) perform best and cost the most. That will probably remain true for a long time, but maybe not forever. In my opinion, it would be good if everyone could contribute their compute to collectively run models at the largest scale.

I am by no means the first person to have this idea.

Folding@home, launched in October 2000, was a first-of-its-kind distributed computing project aimed at simulating protein folding. The project peaked in 2020 during the pandemic, reaching 2.43 exaflops of compute by April of that year, making it the first exaflop computing system ever.

This also exists in the generative AI community. Petals, a project by BigScience (the same team behind bloom 176b), enables developers to run and fine-tune large models in a distributed fashion. (Check out the live network here.) Nous Research has its DisTrO system (distributed training over the internet). (Check its status here.) And there are plenty of others, like Hivemind and exo.

While there are many examples of distributed compute systems, none have taken off, largely because joining the network is too difficult.

I’ve done some experimenting, and I think a solution could be using the browser to join the network, running inference with WebLLM in pure JavaScript. I will write more about my findings, so stay tuned.

If you are interested in this topic, email me! Thomas @ ascend dot vc

Tags Token Talk

OpenAI’s o3 model performs well on benchmarks. But it’s still unclear how it all works.

Token Talk 2: The Rise in Test Time Compute and Its Hidden Costs

February 5, 2025

By: Thomas Stahura

Reasoning models are branded as the next evolution of large language models (LLMs). And for good reason.

These models, like OpenAI’s o3 and DeepSeek’s R1, rely on test-time compute. Essentially, they think before speaking, writing out a train of thought before producing a final answer.

Reasoning models are showing terrific benchmark improvements! AI researchers (and the public at large) demand better-performing models, and there are five levers for that: data, training, scale, architecture, and inference. At this point, almost all public internet data has been exhausted, models have been trained at every size and scale, and transformers have dominated most architectures since 2017. That leaves inference, which, for the time being, seems to be improving AI test scores.

OpenAI’s o3 nails 87% on GPQA-D and achieves 75.5% on the ARC Prize (under a $10,000 compute limit). However, the true costs remain (as of January 2025) a topic of much discussion and speculation. Discussion on OpenAI’s dev forum suggests roughly $60 per query for o3-mini and $600 for o3. Whatever the costs are at the moment, OpenAI’s methods will likely be replicated, fueling competition and eventually lowering costs for all.

One question still lingers: How exactly did OpenAI make o3?

There exists no dataset on the internet of questions, logically sound steps, and correct answers. (OK, maybe Chegg, but they might be going out of business.) Anyway, much of the data is theorized to be synthetic.


StaR (Self-Taught Reasoner) is a research paper that proposes a technique for turning a regular LLM into a reasoning model. The paper calls for using an LLM to generate a dataset of rationales, then using that dataset to fine-tune the same LLM into a reasoning model. StaR relies on a simple loop to build the dataset: generate rationales to answer many questions; if a generated answer is wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; and repeat.
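Here’s a sketch of one StaR iteration, following the paper’s description. The `llm` object and its methods are placeholders, not a real API:

```python
# One StaR iteration: generate rationales, "rationalize" failures by hinting
# the correct answer, then fine-tune only on rationales that reached it.
def star_iteration(llm, dataset):
    finetune_set = []
    for question, correct_answer in dataset:
        rationale, answer = llm.reason(question)
        if answer != correct_answer:  # wrong? retry with the answer revealed
            rationale, answer = llm.reason(question, hint=correct_answer)
        if answer == correct_answer:
            finetune_set.append((question, rationale, correct_answer))
    llm.finetune(finetune_set)
    return llm  # repeat this whole loop several times
```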

It's now 2025 and the AI world moves FAST. Many in the research community believe the future lies in models that can think outside of language. This is cutting-edge research as of today.

I plan to cover more as these papers progress, so stay tuned!

Tags Test Time Compute

Token Talk 1: DeepSeek and the ways to evaluate new models

January 8, 2025

By: Thomas Stahura

DeepSeek V3 debuted to a lot of hubbub.

The open-weight large language model (LLM), developed by Chinese quantitative trading firm High-Flyer Capital Management, outperformed benchmarks set by leading American companies like OpenAI, all while operating on a reported budget of just $6 million. (I anticipate that Meta’s next Llama release will surpass DeepSeek as the top-performing open-source LLM.)

Here’s how DeepSeek performed on leading benchmarks: 76% on MMLU, 56% on GPQA-D, and 85% on MATH 500.

As more and more AI competition hits the internet, the question of how we evaluate these models becomes all the more pressing. Although various benchmarks exist, for simplicity, let’s focus on the three mentioned above: MMLU, GPQA-D, and MATH 500.

MMLU 

MMLU, which stands for Massive Multitask Language Understanding, is essentially a large-scale, ACT-style multiple-choice exam. It spans 57 subjects, ranging from abstract algebra to world religions, testing a model’s ability to handle diverse and complex topics.

Question: Compute the product (12)(16) in Z_24.

Choices: 

A) 0
B) 1
C) 4
D) 6

Answer: A) 0

Question: In his final work, Laws, Plato shifted from cosmology to which of the following issues?

Choices: 

A) Epistemology
B) Morality
C) Religion
D) Aesthetics

Answer: B) Morality

An AI is prompted to select the correct option given a question and a list of choices. If the model’s answer matches the correct choice, it gets a point; otherwise, no points. The final score is typically the unweighted average across all 57 subjects.
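Scoring is simple enough to sketch in a few lines. This toy example uses two subjects standing in for all 57:

```python
# MMLU-style scoring: exact-match accuracy per subject, then an unweighted
# (macro) average across subjects.
def mmlu_score(results: dict[str, list[tuple[str, str]]]) -> float:
    """results maps subject -> list of (model_answer, correct_answer) pairs."""
    per_subject = [sum(a == c for a, c in pairs) / len(pairs)
                   for pairs in results.values()]
    return sum(per_subject) / len(per_subject)

print(mmlu_score({
    "abstract_algebra": [("A", "A"), ("B", "C")],  # 50% on this subject
    "world_religions": [("B", "B")],               # 100% on this one
}))  # -> 0.75
```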

GPQA-D

GPQA-D is a little more complicated. True to its name, it’s designed to be a Google-proof dataset of 448 multiple-choice questions written by “domain experts,” wherein “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.”

Question: Identify the correct sequence of reagents for the synthesis of [1,1'-bi(cyclopentylidene)]-2-one starting from 1,5-dichloropentane.

Answer: 

1. Zn, ether 

2. Cl2/hv 

3. Aq. KOH 

4. Pyridine + CrO3 + HCl 

5. Aq. NaOH

Question: While solving higher dimensional heat equations subject to suitable initial and boundary conditions through higher order finite difference approximations and parallel splitting, the matrix exponential function is approximated by a fractional approximation. The key factor of converting a sequential algorithm into a parallel algorithm is…

Answer: …linear partial fraction of fractional approximation.

A grade is calculated using string similarity (for free-form text), exact match as in MMLU (for multiple choice), or manual validation (where humans mark answers correct or incorrect).
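For the free-form case, a basic string-similarity grader can be built with Python’s standard library. This is a toy sketch; the 0.8 threshold is arbitrary, and real grading setups are more sophisticated:

```python
# Grade a free-form answer by fuzzy string match against the reference.
from difflib import SequenceMatcher

def grade_freeform(model_answer: str, reference: str,
                   threshold: float = 0.8) -> bool:
    ratio = SequenceMatcher(None, model_answer.lower().strip(),
                            reference.lower().strip()).ratio()
    return ratio >= threshold

print(grade_freeform("linear partial fraction of fractional approximation",
                     "Linear partial fraction of fractional approximation."))
# -> True (near-identical strings score ~0.99)
```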

MATH 500

MATH 500 is self-explanatory as it is a dataset of 500 math questions:

Question: Simplify (−k + 4) + (−2 + 3k).

Answer: 2k + 2

Question: The polynomial x^3 − 3x^2 + 4x − 1 is a factor of x^9 + px^6 + qx^3 + r. Find the ordered triple (p, q, r).

Answer: (6, 31, −1)


Now I feel we can fully appreciate DeepSeek. Its scores are impressive, but OpenAI’s o1 is close: it scores in the nineties on MMLU, 67% on MATH 500, and 67% on GPQA-D. This is considered “grad-level” reasoning. OpenAI’s next release, o3, reportedly achieves 87.7% on GPQA-D. That would put it in the PhD range…

For further reading, check out these benchmark datasets from Hugging Face. Maybe try to solve a few!

Chinese start-up DeepSeek threatens American AI dominance

cais/mmlu · Datasets at Hugging Face 🤗

Idavidrein/gpqa · Datasets at Hugging Face 🤗

HuggingFaceH4/MATH-500 · Datasets at Hugging Face 🤗

Learning to Reason with LLMs | OpenAI

AI Model & API Providers Analysis | Artificial Analysis

Tags Token Talk