Ascend.vc

OpenAI’s o3 model performs well on benchmarks. But it’s still unclear how it all works.

Token Talk 2: The Rise in Test Time Compute and Its Hidden Costs

February 5, 2025

By: Thomas Stahura

Reasoning models are branded as the next evolution of large language models (LLMs). And for good reason.

These models, like OpenAI’s o3 and DeepSeek’s R1, rely on test-time compute. Essentially, they think before speaking, writing out a train of thought before producing a final answer. (This type of LLM is called a “reasoning model.”)

Reasoning models are showing terrific benchmark improvements! AI researchers (and the public at large) demand better-performing models, and there are five levers to pull: data, training, scale, architecture, and inference. At this point, almost all public internet data has been exhausted, models have been trained at every size and scale, and transformers have dominated architectures since 2017. That leaves inference, which, for the time being, is steadily improving AI test scores.
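One of the simplest ways to spend extra inference-time compute is self-consistency: sample several chains of thought and majority-vote on the final answer. A minimal sketch, assuming a hypothetical `sample_answer` function standing in for an LLM call:

```python
import collections

def self_consistency(sample_answer, question, n=16):
    """Sample n answers and return the most common one (majority vote).

    sample_answer: a stand-in for one stochastic LLM call that returns
    a final answer string. More samples = more test-time compute.
    """
    votes = collections.Counter(sample_answer(question) for _ in range(n))
    answer, _ = votes.most_common(1)[0]
    return answer
```

This is not how o3 works internally (that remains undisclosed, as the post notes); it just illustrates the trade: each extra sample buys accuracy at linear cost in compute.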

OpenAI’s o3 nails an 87% on GPQA-D and achieves 75.5% on the ARC Prize (at a $10,000 compute limit). However, the true costs remain (as of January 2025) a topic of much discussion and speculation. Threads on OpenAI’s developer forum suggest roughly $60 per query for o3-mini and $600 for o3. Seems fair; however, whatever the costs are at the moment, OpenAI’s research will likely be replicated, fueling competition and eventually lowering costs for all.

One question still lingers: How exactly did OpenAI make o3?

No dataset of questions, logically sound intermediate steps, and correct answers exists on the internet. (OK, maybe Chegg’s, but they might be going out of business.) Anyway, much of the data is theorized to be synthetic.


StaR (Self-Taught Reasoner) is a research paper that proposes a technique for turning a regular LLM into a reasoning model: use an LLM to generate a dataset of rationales, then fine-tune the same LLM on that dataset. StaR relies on a simple loop to build the dataset: generate rationales to answer many questions; if a generated answer is wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; and repeat.
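The loop above can be sketched in a few lines. This is a schematic, not the paper’s implementation: `generate` (an LLM sampling call, optionally with the correct answer as a hint) and `finetune` are hypothetical stand-ins.

```python
def star_iteration(model, dataset, generate, finetune):
    """One round of the StaR loop over (question, answer) pairs."""
    keep = []
    for question, answer in dataset:
        # 1. Ask the model for a rationale and a final answer.
        rationale, predicted = generate(model, question)
        if predicted == answer:
            keep.append((question, rationale, answer))
        else:
            # 2. "Rationalization": retry with the correct answer as a hint,
            #    keeping the rationale only if it now reaches that answer.
            rationale, predicted = generate(model, question, hint=answer)
            if predicted == answer:
                keep.append((question, rationale, answer))
    # 3. Fine-tune on rationales that yielded correct answers; the caller
    #    repeats this whole function with the improved model.
    return finetune(model, keep)
```

The key trick is step 2: wrong answers aren’t wasted, they’re converted into training data by conditioning on the known answer.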

It’s now 2025, and the AI world moves FAST. Many in the research community believe the future lies in models that can think outside of language. This is cutting-edge research as of today.

I plan to cover more as these papers progress, so stay tuned!

Tags: Test Time Compute

Token Talk 1: DeepSeek and the ways to evaluate new models

January 8, 2025

By: Thomas Stahura

DeepSeek-V3 debuted to a lot of hubbub.

The open-weight large language model (LLM), developed by a lab backed by Chinese quantitative trading firm High-Flyer Capital Management, matched or beat models from leading American companies like OpenAI on key benchmarks, all while training on a reported budget of just $6 million. (I anticipate Meta’s next Llama release will surpass DeepSeek as the top-performing open-weight LLM.)

Here’s how DeepSeek performed on leading benchmarks: 76% on MMLU, 56% on GPQA-D, and 85% on MATH 500.

As more and more AI competition hits the internet, the question of how we evaluate these models becomes all the more pressing. Although various benchmarks exist, for simplicity, let’s focus on the three mentioned above: MMLU, GPQA-D, and MATH 500.

MMLU 

MMLU, which stands for Massive Multitask Language Understanding, is essentially a large-scale, ACT-style multiple-choice exam. It spans 57 subjects, ranging from abstract algebra to world religions, testing a model’s ability to handle diverse and complex topics.

Question: Compute the product (12)(16) in Z sub 24.

Choices: 

A) 0
B) 1
C) 4
D) 6

Answer: A) 0

Question: In his final work, Laws, Plato shifted from cosmology to which of the following issues?

Choices: 

A) Epistemology
B) Morality
C) Religion
D) Aesthetics

Answer: B) Morality

An AI is prompted to select the correct option given a question and a list of choices. If the model’s answer matches the correct choice, it gets a point for that question; otherwise, no points. The final score is typically calculated as the equally weighted average across all 57 subjects.
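The scoring itself is trivially simple: exact-match accuracy over letter choices. A toy sketch, where `ask_model` is a hypothetical stand-in for prompting an LLM:

```python
def mmlu_score(questions, ask_model):
    """Exact-match accuracy over (prompt, choices, answer_letter) items."""
    correct = sum(
        1 for prompt, choices, answer in questions
        if ask_model(prompt, choices) == answer
    )
    return correct / len(questions)

# The abstract-algebra question above: (12)(16) = 192, and 192 mod 24 = 0,
# so the correct letter in Z_24 is "A".
sample = [
    ("Compute the product (12)(16) in Z sub 24.",
     ["A) 0", "B) 1", "C) 4", "D) 6"],
     "A"),
]
```

The reported MMLU number is then the average of 57 such per-subject accuracies.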

GPQA-D

GPQA-D is a little more complicated. GPQA (Graduate-Level Google-Proof Q&A) is a dataset of 448 multiple-choice questions written by domain experts and designed to be “Google-proof”: “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.” GPQA-D refers to its hardest subset, GPQA Diamond.

Question: Identify the correct sequence of reagents for the synthesis of [1,1'-bi(cyclopentylidene)]-2-one starting from 1,5-dichloropentane.

Answer:

1. Zn, ether
2. Cl2/hv
3. Aq. KOH
4. Pyridine + CrO3 + HCl
5. Aq. NaOH

Question: While solving higher dimensional heat equations subject to suitable initial and boundary conditions through higher order finite difference approximations and parallel splitting, the matrix exponential function is approximated by a fractional approximation. The key factor of converting a sequential algorithm into a parallel algorithm is…

Answer: …linear partial fraction of fractional approximation.

A grade is calculated using string similarity (for free-form text), exact match as in MMLU (for multiple choice), or manual validation (where humans mark answers correct or incorrect).
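The first two grading modes are easy to sketch; only the third needs a human. The 0.8 similarity threshold below is an illustrative assumption, not the benchmark’s actual cutoff:

```python
import difflib

def grade(predicted, reference, mode="exact"):
    """Grade an answer by exact match or fuzzy string similarity."""
    if mode == "exact":
        # Multiple choice, as in MMLU: normalize and compare.
        return predicted.strip().lower() == reference.strip().lower()
    if mode == "similarity":
        # Free-form text: ratio of matching characters in [0, 1].
        ratio = difflib.SequenceMatcher(
            None, predicted.lower(), reference.lower()
        ).ratio()
        return ratio >= 0.8  # assumed threshold
    raise ValueError("manual grading requires a human validator")
```

Fuzzy grading is forgiving of small wording differences (“partial fraction” vs. “partial fractions”) but can still miss answers that are correct yet phrased very differently, which is why hard free-form benchmarks keep humans in the loop.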

MATH 500

MATH 500 is self-explanatory as it is a dataset of 500 math questions:

Question: Simplify (-k + 4) + (-2 + 3k).

Answer: 2k+2

Question: The polynomial x^3 - 3x^2 + 4x - 1 is a factor of x^9 + px^6 + qx^3 + r. Find the ordered triple (p, q, r).

Answer: (6,31,-1)
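Both answers above can be checked with elementary polynomial arithmetic, no math libraries required:

```python
# First question: (-k + 4) + (-2 + 3k) = 2k + 2.
# Collect coefficients: k terms give -1 + 3 = 2, constants give 4 - 2 = 2.

def poly_divmod_remainder(dividend, divisor):
    """Remainder of polynomial long division.

    Coefficient lists are highest degree first,
    e.g. [1, -3, 4, -1] is x^3 - 3x^2 + 4x - 1.
    """
    out = [float(c) for c in dividend]
    n = len(dividend) - len(divisor) + 1  # number of quotient terms
    for i in range(n):
        coef = out[i] / divisor[0]
        for j, d in enumerate(divisor):
            out[i + j] -= coef * d
    return out[n:]  # what's left after the quotient terms

# Second question: x^9 + 6x^6 + 31x^3 - 1 divided by x^3 - 3x^2 + 4x - 1
# should leave a zero remainder if (p, q, r) = (6, 31, -1) is right.
remainder = poly_divmod_remainder(
    [1, 0, 0, 6, 0, 0, 31, 0, 0, -1],  # x^9 + 6x^6 + 31x^3 - 1
    [1, -3, 4, -1],                    # x^3 - 3x^2 + 4x - 1
)
```

A zero remainder confirms the factorization, and hence the ordered triple.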


Now I feel we can fully appreciate DeepSeek. Its scores are impressive, but OpenAI’s o1 is close: it scores in the nineties on MMLU, 67% on MATH 500, and 67% on GPQA-D. This is considered “grad-level” reasoning. OpenAI’s next release, o3, reportedly achieves 87.7% on GPQA-D. That would put it in the PhD range…

For further reading, check out these benchmark datasets from Hugging Face. Maybe try to solve a few!

Chinese start-up DeepSeek threatens American AI dominance

cais/mmlu · Datasets at Hugging Face 🤗

Idavidrein/gpqa · Datasets at Hugging Face 🤗

HuggingFaceH4/MATH-500 · Datasets at Hugging Face 🤗

Learning to Reason with LLMs | OpenAI

AI Model & API Providers Analysis | Artificial Analysis

Tags: Token Talk
