Token Talk 1: DeepSeek and the ways to evaluate new models

January 8, 2025

By: Thomas Stahura

DeepSeek-V3 debuted to a lot of hubbub.

The open-weight large language model (LLM), built by DeepSeek, an AI lab backed by Chinese quantitative trading firm High-Flyer Capital Management, matched or beat models from leading American companies like OpenAI on several benchmarks, all on a reported training budget of just $6 million. (I anticipate Meta’s next Llama release will surpass DeepSeek as the top-performing open-source LLM.)

Here’s how DeepSeek performed on leading benchmarks: 76% on MMLU, 56% on GPQA-D, and 85% on MATH 500.

As more and more AI competition hits the internet, the question of how we evaluate these models becomes all the more pressing. Although various benchmarks exist, for simplicity, let’s focus on the three mentioned above: MMLU, GPQA-D, and MATH 500.

MMLU 

MMLU, which stands for Massive Multitask Language Understanding, is essentially a large-scale, ACT-style multiple-choice exam. It spans 57 subjects, ranging from abstract algebra to world religions, testing a model’s ability to handle diverse and complex topics.
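
If you want to poke at the questions yourself, here’s a minimal sketch of loading MMLU with the Hugging Face datasets library (the cais/mmlu dataset is linked at the bottom of this post; field names follow its dataset card).

# Minimal sketch: peek at MMLU questions via the Hugging Face datasets library.
# Assumes `pip install datasets`; "abstract_algebra" is one of the 57 subject configs.
from datasets import load_dataset

mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

sample = mmlu[0]
print(sample["question"])  # question text
print(sample["choices"])   # list of four answer options
print(sample["answer"])    # index (0-3) of the correct choice

Here are two sample questions: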

Question: Compute the product (12)(16) in Z_24.

Choices: 

A) 0
B) 1
C) 4
D) 6

Answer: A) 0
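(Indeed, 12 × 16 = 192 = 8 × 24, so the product is 0 in Z_24.)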

Question: In his final work, Laws, Plato shifted from cosmology to which of the following issues?

Choices: 

A) Epistemology
B) Morality
C) Religion
D) Aesthetics

Answer: B) Morality

An AI is prompted to select the correct option given a question and a list of choices. If the model’s answer matches the correct choice, it gets a point for that question; otherwise, no points. The final score is typically calculated as the equal-weight average across all 57 subjects.
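
In code, the scoring boils down to something like this rough sketch (not any lab’s actual evaluation harness): exact match per question, then a macro-average over subjects.

# Rough sketch of MMLU-style scoring: exact match per question,
# then an equal-weight (macro) average across the 57 subjects.
from collections import defaultdict

def mmlu_score(results):
    # results: list of (subject, predicted_choice, correct_choice) tuples
    per_subject = defaultdict(list)
    for subject, predicted, correct in results:
        per_subject[subject].append(1.0 if predicted == correct else 0.0)
    subject_accuracies = [sum(scores) / len(scores) for scores in per_subject.values()]
    return sum(subject_accuracies) / len(subject_accuracies)

# Example: perfect on philosophy, 1-of-2 on abstract algebra -> 0.75
print(mmlu_score([
    ("abstract_algebra", "A", "A"),
    ("abstract_algebra", "C", "B"),
    ("philosophy", "B", "B"),
]))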

GPQA-D

GPQA-D is a little more complicated. Short for Graduate-Level Google-Proof Q&A (Diamond), it’s a set of multiple-choice questions written by “domain experts” and designed to be Google-proof: “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.” The full GPQA dataset contains 448 questions; the Diamond subset used for benchmarking is the hardest 198 of them.

Question: Identify the correct sequence of reagents for the synthesis of [1,1'-bi(cyclopentylidene)]-2-one starting from 1,5-dichloropentane.

Answer:

1. Zn, ether
2. Cl2/hv
3. Aq. KOH
4. Pyridine + CrO3 + HCl
5. Aq. NaOH

Question: While solving higher dimensional heat equations subject to suitable initial and boundary conditions through higher order finite difference approximations and parallel splitting, the matrix exponential function is approximated by a fractional approximation. The key factor of converting a sequential algorithm into a parallel algorithm is…

Answer: …linear partial fraction of fractional approximation.

A grade is calculated using string similarity (for free-form text), exact match as in MMLU (for multiple choice), or manual validation (where humans mark answers correct or incorrect).
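
Here’s a hedged sketch of what string-similarity grading might look like, using Python’s built-in difflib; real harnesses differ from lab to lab, and the 0.8 threshold is purely illustrative.

# Sketch of free-form answer grading via string similarity (Python stdlib difflib).
# The threshold is illustrative; actual benchmark graders vary.
from difflib import SequenceMatcher

def similarity_grade(model_answer, reference_answer, threshold=0.8):
    ratio = SequenceMatcher(None,
                            model_answer.strip().lower(),
                            reference_answer.strip().lower()).ratio()
    return ratio >= threshold

# The near-match below scores well above the threshold and is marked correct.
print(similarity_grade("linear partial fractions of the fractional approximation",
                       "linear partial fraction of fractional approximation"))  # True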

MATH 500

MATH 500 is self-explanatory as it is a dataset of 500 math questions:

Question: Simplify (-k + 4) + (-2 + 3k).

Answer: 2k + 2

Question: The polynomial x^3 - 3x^2 + 4x - 1 is a factor of x^9 + px^6 + qx^3 + r. Find the ordered triple (p, q, r).

Answer: (6, 31, -1)
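
You can sanity-check that second answer in a few lines of sympy (assuming it’s installed): dividing x^9 + 6x^6 + 31x^3 - 1 by x^3 - 3x^2 + 4x - 1 should leave zero remainder.

# Verify the MATH 500 example: with (p, q, r) = (6, 31, -1), the division
# should be exact, i.e. the remainder should be 0. Assumes sympy is installed.
from sympy import symbols, div

x = symbols("x")
quotient, remainder = div(x**9 + 6*x**6 + 31*x**3 - 1,
                          x**3 - 3*x**2 + 4*x - 1, x)
print(remainder)  # 0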


Now I feel we can fully appreciate DeepSeek. Its scores are impressive, but OpenAI’s o1 is close: it scores in the nineties on MMLU, 67% on MATH 500, and 67% on GPQA-D. This is considered “grad-level” reasoning. OpenAI’s next release, o3, reportedly achieves 87.7% on GPQA-D. That would put it in the PhD range…

For further reading, check out the links below, including the benchmark datasets on Hugging Face. Maybe try to solve a few questions yourself!

Chinese start-up DeepSeek threatens American AI dominance

cais/mmlu · Datasets at Hugging Face 🤗

Idavidrein/gpqa · Datasets at Hugging Face 🤗

HuggingFaceH4/MATH-500 · Datasets at Hugging Face 🤗

Learning to Reason with LLMs | OpenAI

AI Model & API Providers Analysis | Artificial Analysis
