Ascend.vc

OpenAI’s o3 model performs well on benchmarks. But it’s still unclear how it all works.

Token Talk 2: The Rise in Test Time Compute and Its Hidden Costs

February 5, 2025

By: Thomas Stahura

Reasoning models are branded as the next evolution of large language models (LLMs). And for good reason.

These models, like OpenAI’s o3 and DeepSeek’s R1, rely on test-time compute. Essentially, they think before speaking, writing out a train of thought before producing a final answer. (This type of LLM is called a “reasoning model.”)

Reasoning models are showing terrific benchmark improvements! AI researchers (and the public at large) demand better-performing models, and there are five levers to pull: data, training, scale, architecture, and inference. At this point, almost all public internet data has been exhausted, models have been trained at every size and scale, and transformers have dominated architectures since 2017. That leaves inference, which, for the time being, is steadily improving AI test scores.
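One of the simplest ways to spend extra inference-time compute is self-consistency: sample several chains of thought and majority-vote on the final answer. A minimal sketch, assuming a hypothetical `sample_answer` function standing in for an LLM call:

```python
import collections

def self_consistency(sample_answer, question, n=16):
    """Sample n answers and return the most common one (majority vote).

    sample_answer: a stand-in for one stochastic LLM call that returns
    a final answer string. More samples = more test-time compute.
    """
    votes = collections.Counter(sample_answer(question) for _ in range(n))
    answer, _ = votes.most_common(1)[0]
    return answer
```

This is not how o3 works internally (that remains undisclosed, as the post notes); it just illustrates the trade: each extra sample buys accuracy at linear cost in compute.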

OpenAI’s o3 nails an 87% on GPQA-D and achieves 75.5% on the ARC Prize (at a $10,000 compute limit). However, the true costs remain (as of January 2025) a topic of much discussion and speculation. Threads on OpenAI’s developer forum suggest roughly $60 per query for o3-mini and $600 for o3. Seems fair; however, whatever the costs are at the moment, OpenAI’s research will likely be replicated, fueling competition and eventually lowering costs for all.

One question still lingers: How exactly did OpenAI make o3?

No dataset of questions, logically sound intermediate steps, and correct answers exists on the internet. (OK, maybe Chegg’s, but they might be going out of business.) Anyway, much of the data is theorized to be synthetic.


StaR (Self-Taught Reasoner) is a research paper that proposes a technique for turning a regular LLM into a reasoning model: use an LLM to generate a dataset of rationales, then fine-tune the same LLM on that dataset. StaR relies on a simple loop to build the dataset: generate rationales to answer many questions; if a generated answer is wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; and repeat.
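The loop above can be sketched in a few lines. This is a schematic, not the paper’s implementation: `generate` (an LLM sampling call, optionally with the correct answer as a hint) and `finetune` are hypothetical stand-ins.

```python
def star_iteration(model, dataset, generate, finetune):
    """One round of the StaR loop over (question, answer) pairs."""
    keep = []
    for question, answer in dataset:
        # 1. Ask the model for a rationale and a final answer.
        rationale, predicted = generate(model, question)
        if predicted == answer:
            keep.append((question, rationale, answer))
        else:
            # 2. "Rationalization": retry with the correct answer as a hint,
            #    keeping the rationale only if it now reaches that answer.
            rationale, predicted = generate(model, question, hint=answer)
            if predicted == answer:
                keep.append((question, rationale, answer))
    # 3. Fine-tune on rationales that yielded correct answers; the caller
    #    repeats this whole function with the improved model.
    return finetune(model, keep)
```

The key trick is step 2: wrong answers aren’t wasted, they’re converted into training data by conditioning on the known answer.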

It’s now 2025, and the AI world moves FAST. Many in the research community believe the future lies in models that can think outside of language. This is cutting-edge research as of today.

I plan to cover more as these papers progress, so stay tuned!

Tags: Test Time Compute

Token Talk 1: DeepSeek and the ways to evaluate new models

January 8, 2025

By: Thomas Stahura

DeepSeek-V3 debuted to a lot of hubbub.

The open-weight large language model (LLM), developed by a lab backed by Chinese quantitative trading firm High-Flyer Capital Management, matched or beat models from leading American companies like OpenAI on key benchmarks, all while training on a reported budget of just $6 million. (I anticipate Meta’s next Llama release will surpass DeepSeek as the top-performing open-weight LLM.)

Here’s how DeepSeek performed on leading benchmarks: 76% on MMLU, 56% on GPQA-D, and 85% on MATH 500.

As more and more AI competition hits the internet, the question of how we evaluate these models becomes all the more pressing. Although various benchmarks exist, for simplicity, let’s focus on the three mentioned above: MMLU, GPQA-D, and MATH 500.

MMLU 

MMLU, which stands for Massive Multitask Language Understanding, is essentially a large-scale, ACT-style multiple-choice exam. It spans 57 subjects, ranging from abstract algebra to world religions, testing a model’s ability to handle diverse and complex topics.

Question: Compute the product (12)(16) in Z sub 24.

Choices: 

A) 0
B) 1
C) 4
D) 6

Answer: A) 0

Question: In his final work, Laws, Plato shifted from cosmology to which of the following issues?

Choices: 

A) Epistemology
B) Morality
C) Religion
D) Aesthetics

Answer: B) Morality

An AI is prompted to select the correct option given a question and a list of choices. If the model’s answer matches the correct choice, it gets a point for that question; otherwise, no points. The final score is typically calculated as the equally weighted average across all 57 subjects.
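The scoring itself is trivially simple: exact-match accuracy over letter choices. A toy sketch, where `ask_model` is a hypothetical stand-in for prompting an LLM:

```python
def mmlu_score(questions, ask_model):
    """Exact-match accuracy over (prompt, choices, answer_letter) items."""
    correct = sum(
        1 for prompt, choices, answer in questions
        if ask_model(prompt, choices) == answer
    )
    return correct / len(questions)

# The abstract-algebra question above: (12)(16) = 192, and 192 mod 24 = 0,
# so the correct letter in Z_24 is "A".
sample = [
    ("Compute the product (12)(16) in Z sub 24.",
     ["A) 0", "B) 1", "C) 4", "D) 6"],
     "A"),
]
```

The reported MMLU number is then the average of 57 such per-subject accuracies.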

GPQA-D

GPQA-D is a little more complicated. GPQA (Graduate-Level Google-Proof Q&A) is a dataset of 448 multiple-choice questions written by domain experts and designed to be “Google-proof”: “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.” GPQA-D refers to its hardest subset, GPQA Diamond.

Question: Identify the correct sequence of reagents for the synthesis of [1,1'-bi(cyclopentylidene)]-2-one starting from 1,5-dichloropentane.

Answer:

1. Zn, ether
2. Cl2/hv
3. Aq. KOH
4. Pyridine + CrO3 + HCl
5. Aq. NaOH

Question: While solving higher dimensional heat equations subject to suitable initial and boundary conditions through higher order finite difference approximations and parallel splitting, the matrix exponential function is approximated by a fractional approximation. The key factor of converting a sequential algorithm into a parallel algorithm is…

Answer: …linear partial fraction of fractional approximation.

A grade is calculated using string similarity (for free-form text), exact match as in MMLU (for multiple choice), or manual validation (where humans mark answers correct or incorrect).
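The first two grading modes are easy to sketch; only the third needs a human. The 0.8 similarity threshold below is an illustrative assumption, not the benchmark’s actual cutoff:

```python
import difflib

def grade(predicted, reference, mode="exact"):
    """Grade an answer by exact match or fuzzy string similarity."""
    if mode == "exact":
        # Multiple choice, as in MMLU: normalize and compare.
        return predicted.strip().lower() == reference.strip().lower()
    if mode == "similarity":
        # Free-form text: ratio of matching characters in [0, 1].
        ratio = difflib.SequenceMatcher(
            None, predicted.lower(), reference.lower()
        ).ratio()
        return ratio >= 0.8  # assumed threshold
    raise ValueError("manual grading requires a human validator")
```

Fuzzy grading is forgiving of small wording differences (“partial fraction” vs. “partial fractions”) but can still miss answers that are correct yet phrased very differently, which is why hard free-form benchmarks keep humans in the loop.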

MATH 500

MATH 500 is self-explanatory as it is a dataset of 500 math questions:

Question: Simplify (-k + 4) + (-2 + 3k).

Answer: 2k+2

Question: The polynomial x^3 - 3x^2 + 4x - 1 is a factor of x^9 + px^6 + qx^3 + r. Find the ordered triple (p, q, r).

Answer: (6,31,-1)
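Both answers above can be checked with elementary polynomial arithmetic, no math libraries required:

```python
# First question: (-k + 4) + (-2 + 3k) = 2k + 2.
# Collect coefficients: k terms give -1 + 3 = 2, constants give 4 - 2 = 2.

def poly_divmod_remainder(dividend, divisor):
    """Remainder of polynomial long division.

    Coefficient lists are highest degree first,
    e.g. [1, -3, 4, -1] is x^3 - 3x^2 + 4x - 1.
    """
    out = [float(c) for c in dividend]
    n = len(dividend) - len(divisor) + 1  # number of quotient terms
    for i in range(n):
        coef = out[i] / divisor[0]
        for j, d in enumerate(divisor):
            out[i + j] -= coef * d
    return out[n:]  # what's left after the quotient terms

# Second question: x^9 + 6x^6 + 31x^3 - 1 divided by x^3 - 3x^2 + 4x - 1
# should leave a zero remainder if (p, q, r) = (6, 31, -1) is right.
remainder = poly_divmod_remainder(
    [1, 0, 0, 6, 0, 0, 31, 0, 0, -1],  # x^9 + 6x^6 + 31x^3 - 1
    [1, -3, 4, -1],                    # x^3 - 3x^2 + 4x - 1
)
```

A zero remainder confirms the factorization, and hence the ordered triple.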


Now I feel we can fully appreciate DeepSeek. Its scores are impressive, but OpenAI’s o1 is close: it scores in the nineties on MMLU, 67% on MATH 500, and 67% on GPQA-D. This is considered “grad-level” reasoning. OpenAI’s next release, o3, reportedly achieves 87.7% on GPQA-D. That would put it in the PhD range…

For further reading, check out these benchmark datasets from Hugging Face. Maybe try to solve a few!

Chinese start-up DeepSeek threatens American AI dominance

cais/mmlu · Datasets at Hugging Face 🤗

Idavidrein/gpqa · Datasets at Hugging Face 🤗

HuggingFaceH4/MATH-500 · Datasets at Hugging Face 🤗

Learning to Reason with LLMs | OpenAI

AI Model & API Providers Analysis | Artificial Analysis

Tags: Token Talk
