Matthew McConaughey stars in Salesforce's Super Bowl commercial promoting Agentforce.

Token Talk 6: Everyone's got something to say about agents

February 11, 2025

By: Thomas Stahura

AGENTS! AGENTS!! AGENTS!!! 

Big tech can’t get enough of them! Google’s got Mariner. Microsoft’s got Copilot. Salesforce rolled out Agentforce. OpenAI’s cooking up Operator. And Anthropic has Computer Use. (Naming is hard.)

You’ve heard the hype. Maybe you’re already sick of it. They even got Matthew McConaughey to say it during the Super Bowl — America's most sacred Sunday ritual.

But have you actually used one? Probably not. And funny enough, most of the “agents” I just listed aren’t even real agents.

So what is an agent, anyway?

An agent, put simply, is a Large Language Model (LLM) in a reasoning loop that has access to tools (like a browser, code interpreter, or calculator). The LLM is prompted to break down tasks into steps and to use tools to autonomously accomplish its given goal. The tools then provide feedback from the digital environment and the LLM continues to its next step until the task is complete.

A browser agent is given a task: “Book a flight from San Francisco to Seattle.” First, it runs an “open browser” command, and the browser confirms: “Browser is open,” with a screenshot. Next, it types “San Francisco to Seattle flights” into the search bar, hits enter, and waits for results. It scans the listings, picks a booking site, clicks through, and follows the prompts, step by step. Each action generates feedback to keep it on track until the task is complete.

Most agents have a litany of specific tools, but all you really need is to move the mouse, click, type, and scroll. After all, that's all humans need to use a computer.
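To make that loop concrete, here's a minimal sketch in Python. Everything in it (the `llm_next_step` call, the tool functions you'd pass in) is a hypothetical placeholder rather than any vendor's API; the point is the shape of the cycle: prompt, act, observe, repeat.

```python
# Minimal agent-loop sketch. `llm_next_step` and the tool functions are
# hypothetical placeholders, not a real vendor API; they illustrate the
# prompt -> act -> observe cycle described above.

def run_agent(goal, tools, llm_next_step, max_steps=20):
    """tools: dict of name -> callable (e.g. {"click": click, "type": type_text}).
    llm_next_step: callable that returns an action dict given the history so far."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        # Ask the LLM to pick the next step given the goal and feedback so far.
        action = llm_next_step(history, list(tools))
        if action.get("done"):
            return action.get("answer")
        # Execute the chosen tool and feed its result (text, screenshot, etc.) back in.
        observation = tools[action["tool"]](**action.get("args", {}))
        history.append(f"Action: {action} -> Observation: {observation}")
    return "Stopped after max_steps without finishing"
```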

So what, then, makes me say that most agents out there aren't actually agents? For starters, Mariner is on a waitlist, Copilot doesn't have access to any tools, and Agentforce only has access to Salesforce-specific tools. OpenAI’s Operator and Anthropic’s Computer Use are what I’d call actual agents. But Operator is $200/month and Computer Use is in beta.

Open source is not far behind. Browser-use (YC W25) exploded onto the scene about a month ago and already has 27k GitHub stars. I used it for my AI bias hackathon project; it works with any LLM in only about 15 lines of code, and it's totally free.
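For reference, a browser-use script looked roughly like this as of early 2025; the exact import paths and model wrapper may differ by version, so treat it as a sketch rather than gospel:

```python
import asyncio
from langchain_openai import ChatOpenAI   # any LangChain-compatible LLM works
from browser_use import Agent

async def main():
    agent = Agent(
        task="Book a flight from San Francisco to Seattle",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()   # the agent drives a real browser, step by step

asyncio.run(main())
```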

AutoGen, a Microsoft agent framework, is also open source with 39k stars. Along with Skyvern (12k stars, YC S23) and Stagehand (7.5k stars). And these are just browser agents! There are also coding agents that live within an integrated development environment (IDE) like the closed-source Replit, GitHub Copilot, and Cursor, and the open-source Cline (28k stars), Continue.dev (23k stars), and Void (10k stars, YC S24).

Agents, at the end of the day, are about autonomous control. Whether it's a browser or a calculator, the more tools, control, and thus access you give an LLM, the more it can do on your behalf. In that respect, not all agents are created equal.

When I use my computer, I don't just use the browser or IDE. Sure, I spend a bunch of time online (who doesn't?), and coding (so much), but I control my computer on the OS level. I’m able to jump between different applications and navigate my file system with my keyboard and mouse, so shouldn't my agent, too?

Many thought an OS-level agent was impossible a few months ago. Now it seems inevitable. Imagine a future where we interact with our devices in the same way Tony Stark interacts with Jarvis in Iron Man (2008). This is an entirely new human-computer interaction paradigm that is set to completely change the industry.

Big tech knows this. Apple has enabled developers to write custom tools for Apple Intelligence to interact with. And Microsoft's Recall feature automatically records your screen to help automate tasks (that is, before it was recalled over privacy issues).

In the open community, Open Interpreter (58k stars) is an OS-level agent that can write and execute commands in the command line. It has limitations (no vision capabilities) but is impressive and the first of its kind. Other models such as OS-Atlas and UI-TARS exist but are not nearly as popular as browser or IDE agents. (We invested in Moondream, a startup building vision “pointing” capabilities for agent developers.)

The OS agent wars are existential for big tech. Any third-party agent living inside Windows or macOS can be hamstrung by permission requirements, enshittifying the experience of alternatives while Microsoft and Apple keep their control over the industry. If these companies own and control the software that controls your computer, is it really your computer? I think not.

Regardless, agents still have a long way to go. Reliability remains a large issue, along with handling authentication (to email, social media, and other sites). These, however, are solvable problems. Meta has already set up GAIA, a general AI assistant benchmark that, if solved, “would represent a milestone in AI research.” And Okta, owner of Auth0, invested in Browserbase to help the agent company manage web authentication.

It's only a matter of time at this point.

P.S. If you have any questions or just want to talk about AI, email me! thomas@ascend.vc

Tags Token Talk

Token Talk 5: Big Models Teach, Small Models Catch Up.

February 5, 2025

By: Thomas Stahura

o3-mini is amazing and totally free. OpenAI achieved this through distillation from the larger, not-yet-released o3 model.

Right now, the model ranks second globally — beating DeepSeek R1 but trailing the massive o1. Estimates put o1 at 200-300 billion parameters, DeepSeek at 671 billion, and o3-mini at just 3-30 billion. (These are the only reasoning models topping the benchmarks this week.)

What’s remarkable is that o3-mini achieves intelligence close to o1 while being just one-hundredth its size, thanks to distillation.

There are a variety of distillation techniques, but at a high level, distillation involves using a larger teacher model to teach a smaller student model.

For example, GPT-4 (a reported 1.4 trillion parameter model) was trained on roughly a million gigabytes of public internet data (one petabyte). GPT-4 was trained to represent that data, to represent the internet.

The resulting 1.4 trillion parameter model, if downloaded, would occupy 5,600 GB, or 5.6 terabytes of space. In a sense, you can think of GPT-4 (or any LLM) as a highly compressed queryable representation of the training set, in this case the internet. After all, going from 1 petabyte to 5.6 terabytes is a 99.45% reduction.
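The back-of-the-envelope arithmetic, using the figures above (a sketch; the parameter count and fp32 assumption are the article's, not confirmed):

```python
params = 1.4e12               # reported GPT-4 parameter count
bytes_per_param = 4           # assuming 32-bit floating-point weights
model_tb = params * bytes_per_param / 1e12   # ~5.6 TB on disk
training_tb = 1_000           # 1 petabyte of training data, ~1,000 TB
print(f"{1 - model_tb / training_tb:.2%}")   # ~99.44% smaller than the training set
```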

So, how does this apply to distillation? If you think about a model as a compression of its training dataset, then you can “uncompress” that dataset by querying the larger teacher model, in this case GPT-4, until you've generated something approaching that petabyte of synthetic data. You then use that synthetic dataset to train or fine-tune a smaller student model (3-10 billion parameters) to mimic the teacher's performance.
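A pipeline along those lines might look like the sketch below. The model names, prompts, and trainer choice are illustrative assumptions on my part, not any lab's actual recipe:

```python
# Hypothetical distillation-by-synthetic-data pipeline (illustrative only).
from openai import OpenAI
from datasets import Dataset

client = OpenAI()  # teacher accessed through an API

def synthetic_example(prompt):
    # Query the teacher and keep the (prompt, completion) pair as training data.
    resp = client.chat.completions.create(
        model="gpt-4o",  # stand-in for the teacher model
        messages=[{"role": "user", "content": prompt}],
    )
    return {"prompt": prompt, "completion": resp.choices[0].message.content}

prompts = ["Explain photosynthesis simply.", "Summarize the French Revolution."]  # in practice: millions
teacher_outputs = Dataset.from_list([synthetic_example(p) for p in prompts])
teacher_outputs.save_to_disk("teacher_outputs")

# A 3-10B parameter open model would then be fine-tuned on this dataset with any
# standard supervised fine-tuning setup (e.g. Hugging Face TRL's SFTTrainer).
```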

This remains an active area of research today.

Of course, distilling from a closed-source model is strictly against OpenAI’s terms of service. Though that didn’t stop DeepSeek, which is currently being probed by Microsoft over synthetic data training allegations.

The cat's out of the bag. OpenAI itself distilled o3-mini from o3, and Microsoft distilled phi-3.5-mini-instruct from phi-3.5. It seems like, from now on, whatever model performs best will become the “teacher” for all the “student” models, which will be fine-tuned to quickly catch up to it in performance. This new paradigm has shifted the AI industry's focus from LLMs to AI applications, the main one being agents.

OpenAI (in addition to launching o3-mini) debuted a new web agent called Deep Research (only available at the $200/month tier). I’ve used many web agents and browser tools like Browserbase, browser-use, and Computer Use. I have buddies who are building CopyCat (YC W25), and I’ve even built my own browser agent. All this to say: the AI application space is heating up!

Stay tuned because I’ll talk more about agents next week!

P.S. If you have any questions or just want to talk about AI, email me: thomas @ ascend dot vc

Tags Token Talk

Token Talk 4: Open source won the AI race

February 5, 2025

By: Thomas Stahura

If it wasn’t clear already, open source won the AI race. 

To recap: DeepSeek R1 is an open-source reasoning model that was quietly launched during the 14 hours TikTok was banned. The reasoning version of DeepSeek V3, R1 performs at o1 levels on most benchmarks. Very impressive, and it was reportedly trained for just $6 million, though many are skeptical of those numbers.

By Monday, a week after R1 launched, the model caused a massive market selloff. Nvidia lost $500 billion in value (-17%), the biggest one-day loss of market value in US history, as the market adjusted to our new open-source reality.

 So, what does this mean? 

For starters, models have been commoditized. Well-performing open-source models are available at every scale. But that’s beside the point. DeepSeek is trained on synthetic data generated by ChatGPT, essentially extracting the capabilities of a closed model and open-sourcing them. This eliminates the moats of OpenAI, Anthropic, and the other closed-source AI labs.

What perplexes me is why Nvidia got hit the hardest. The takes I’ve heard suggest it was DeepSeek's lower training cost that spooked the market. The thinking goes: LLMs become cheaper to train, so hyperscalers need fewer GPUs.

The bulls, on the other hand, cite Jevons paradox: the cheaper a valuable commodity becomes, the more of it gets used.

I seem to be somewhere in the middle. Lower costs are great for developers! But I have yet to see a useful token-heavy application. Well, maybe web agents… I’ll cover those in another edition!

I suspect the simple fact that the model came out of China is what caused it to blow up. After all, there seems to be such moral panic over the implications for US AI sovereignty. And for good reason.

Over the weekend, I attended a hackathon hosted by Menlo where I built a browser agent. I had different LLMs take the Pew Research Center political typology quiz.

Anthropic’s claude-3.5-sonnet, gpt-4o, o1, and llama got Outsider Left. DeepSeek R1 and V3 got Establishment Liberals. Notably, R1 answered, “It would be acceptable if another country became as militarily powerful as the U.S.”

During my testing, I found that DeepSeek’s models would refuse to answer questions about Taiwan or Tiananmen Square. In all fairness, most American models won’t answer questions about Palestine. Still, as these models are open and widely used by developers, there is fear that these biases will leak into AI products and services.

I’d like to think that this problem is solvable with fine-tuning. I suppose developers are playing with Deepseek’s weights as we speak! We’ll just have to find out in the next few weeks…

Tags Token Talk

Token Talk 3: Decentralizing AI Compute for Scalable Intelligence

February 5, 2025

By: Thomas Stahura

Compute is king in the age of AI. At least, that's what big tech wants you to believe. The truth is a little more complicated.

When you boil it down, AI inference is simply a very large set of multiplications. All computers do this kind of math all the time, so why can't any computer run an LLM or diffusion model?

It's all about scale. Model scale is the number of parameters (tunable neurons) in a model. Thanks to platforms like Hugging Face, developers now have access to well-performing open-source models at every scale: from small models like moondream2 (1.93b) and llama 3.2 (3b), to mid-range ones like phi-4 (14b), to the largest models like bloom (176b). These models can run on anything from a Raspberry Pi to an A100 GPU server.

Sure, the smaller models take a performance hit, but only by 10-20% on most benchmarks. I got llama 3.2 (1b) to flawlessly generate and run a snake game in Python. So why, then, do most developers rely on big tech to generate their tokens? The short answer is speed and performance.
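Running one of these small models locally really is just a few lines with Hugging Face transformers. The model ID and generation settings below are just illustrative (and the Llama weights require accepting Meta's license on Hugging Face):

```python
from transformers import pipeline

# Any small open model works; llama 3.2 1B is just the example here.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",  # CPU, a single GPU, or Apple Silicon, whatever is available
)

prompt = "Write a complete snake game in Python using the curses module."
result = generator(prompt, max_new_tokens=1024, do_sample=False)
print(result[0]["generated_text"])
```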

Models at the largest scale (100b+, like GPT-4o and the such) perform best and cost the most. That will probably be true for a long time, but maybe not forever. In my opinion, it would be good if everyone could contribute their compute to collectively run models at the largest scale.

I am by no means the first person to have this idea.

Folding@home launched in October 2000 as a first-of-its-kind distributed computing project aimed at simulating protein folding. The project reached its peak in 2020 during the pandemic, achieving 2.43 exaflops of compute by April of that year. That made it the first exaflop computing system ever.

This also exists in the generative AI community. Petals, a project made by BigScience (the same team behind bloom 176b), enables developers to run and fine-tune large models in a distributed fashion. (Check out the live network here.) Nous Research has its DisTrO system (distributed training over the internet). (Check its status here.) And there are plenty of others, like hivemind and exo.

While there are many examples of distributed compute systems, none have taken off, largely because it's too difficult to join the network.

I’ve done some experimenting, and I think a solution could be using the browser to join the network and running inference with WebLLM in pure JavaScript. I will write more about my findings, so stay tuned.

If you are interested in this topic, email me! Thomas @ ascend dot vc

Tags Token Talk

OpenAI’s o3 model performs well on benchmarks. But it’s still unclear how it all works.

Token Talk 2: The Rise in Test Time Compute and Its Hidden Costs

February 5, 2025

By: Thomas Stahura

Reasoning models are branded as the next evolution of large language models (LLMs). And for good reason.

These models, like OpenAI’s o3 and High-Flyer’s DeepSeek, rely on test-time compute. Essentially, they think before speaking by writing their train of thought before producing a final answer. (This type of LLM is called a “reasoning model.”)

Reasoning models are showing terrific benchmark improvements! AI researchers (and the public at large) demand better performing models, and there are five ways to do so: data, training, scale, architecture, and inference. At this point, almost all public internet data is exhausted, models are trained at every size and scale, and transformers have dominated most architectures since 2017. This leaves inference, which, for the time being, seems to be improving AI test scores. 

OpenAI’s o3 nails 87% on GPQA-D and achieves 75.5% on the ARC Prize (at a $10,000 compute limit). However, the true costs remain (as of January 2025) a topic of much discussion and speculation. Discussion on OpenAI's Dev Forum suggests roughly $60 per query for o3-mini and $600 for o3. Seems fair; in any case, whatever the costs are at the moment, OpenAI's research will likely be revealed, fueling competition and eventually lowering costs for all.

One question still lingers: How exactly did OpenAI make o3?

There exists no dataset on the internet of questions, logically sound steps, and correct answers. (OK, maybe Chegg, but they might be going out of business.) Anyway, much of the data is theorized to be synthetic.


StaR (Self-Taught Reasoner) is a research paper that suggests a technique for turning a regular LLM into a reasoning model. The paper calls for using an LLM to generate a dataset of rationales, then using that dataset to fine-tune the same LLM into a reasoning model. StaR relies on a simple loop to make the dataset: generate rationales to answer many questions; if a generated answer is wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; and repeat.
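In rough Python, one iteration of that loop looks like the sketch below. The `generate` and `fine_tune` helpers are placeholders for whatever LLM and training stack you use, so this is the shape of the algorithm rather than the paper's exact code:

```python
# One StaR-style iteration, roughly as the paper describes it.
# `generate` and `fine_tune` are placeholder hooks, not real library calls.

def star_iteration(base_model, problems, generate, fine_tune):
    keep = []
    for p in problems:  # each p is {"question": ..., "answer": ...}
        # Step 1: let the model produce a rationale and an answer.
        rationale, answer = generate(base_model, p["question"])
        if answer != p["answer"]:
            # Step 2 ("rationalization"): retry with the correct answer as a hint,
            # so the model can produce a rationale that actually reaches it.
            hinted = f'{p["question"]}\n(Hint: the answer is {p["answer"]})'
            rationale, answer = generate(base_model, hinted)
        if answer == p["answer"]:
            # Step 3: keep only rationales that end in the right answer.
            keep.append({"question": p["question"], "rationale": rationale, "answer": answer})
    # Step 4: fine-tune on the kept rationales (the paper restarts from the
    # original model each round), then repeat the loop with the improved model.
    return fine_tune(base_model, keep)
```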

It's now 2025 and the AI world moves FAST. Many in the research community believe the future is models that can think outside of language. This is cutting-edge research as of today.

I plan to cover more as these papers progress, so stay tuned!

Tags Test Time Compute

Token Talk 1: DeepSeek and the ways to evaluate new models

January 8, 2025

By: Thomas Stahura

DeepSeek 3.0 debuted to a lot of hubbub. 

The open-weight large language model (LLM) developed by Chinese quantitative trading firm High-Flyer Capital Management outperformed benchmarks set by leading American companies like OpenAI, all while operating on a reported budget of just $6 million. (I anticipate Meta’s next Llama release to surpass DeepSeek as the top-performing open-source LLM.)

Here’s how DeepSeek performed on leading benchmarks: 76% on MMLU, 56% on GPQA-D, and 85% on MATH 500.

As more and more AI competition hits the internet, the question of how we evaluate these models becomes all the more pressing. Although various benchmarks exist, for simplicity, let’s focus on the three mentioned above: MMLU, GPQA-D, and MATH 500.

MMLU 

MMLU, which stands for Massive Multitask Language Understanding, is essentially a large-scale, ACT-style multiple-choice exam. It spans 57 subjects, ranging from abstract algebra to world religions, testing a model’s ability to handle diverse and complex topics.

Question: Compute the product (12)(16) in Z₂₄.

Choices: 

A) 0
B) 1
C) 4
D) 6

Answer: A) 0

Question: In his final work, Laws, Plato shifted from cosmology to which of the following issues?

Choices: 

A) Epistemology
B) Morality
C) Religion
D) Aesthetics

Answer: B) Morality

An AI is prompted to select the correct option given a question and a list of choices. If the model’s answer matches the correct choice, it gets a point for that question; otherwise, no points. The final score is typically calculated as the equal-weight average across all 57 subjects.
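Scoring really is that simple. Here's a minimal sketch of the equal-weight (macro) average, assuming a hypothetical results layout:

```python
from collections import defaultdict

def mmlu_score(results):
    """results: list of dicts like {"subject": "world_religions", "predicted": "A", "correct": "A"}.
    Returns the macro (equal-weight) average accuracy across subjects."""
    per_subject = defaultdict(lambda: [0, 0])  # subject -> [correct, total]
    for r in results:
        per_subject[r["subject"]][0] += int(r["predicted"] == r["correct"])
        per_subject[r["subject"]][1] += 1
    # Every subject counts the same, regardless of how many questions it has.
    return sum(c / t for c, t in per_subject.values()) / len(per_subject)
```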

GPQA-D

GPQA-D is a little more complicated. It’s designed to be, as the name suggests, a Google-proof dataset of 448 multiple-choice questions written by “domain experts,” wherein “highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web.”

Question: Identify the correct sequence of reagents for the synthesis of [1,1'-bi(cyclopentylidene)]-2-one starting from 1,5-dichloropentane.

Answer: 

1. Zn, ether 

2. Cl2/hv 

3. Aq. KOH 

4. Pyridine + CrO3 + HCl 

5. Aq. NaOH

Question: While solving higher dimensional heat equations subject to suitable initial and boundary conditions through higher order finite difference approximations and parallel splitting, the matrix exponential function is approximated by a fractional approximation. The key factor of converting a sequential algorithm into a parallel algorithm is…

Answer: …linear partial fraction of fractional approximation.

A grade is calculated using string similarity (for free-form text), exact match as in MMLU (for multiple choice), or manual validation (where humans mark answers correct or incorrect).

 MATH 500 

MATH 500 is self-explanatory as it is a dataset of 500 math questions:

Question: Simplify (−k + 4) + (−2 + 3k).

Answer: 2k+2

Question: The polynomial x³ − 3x² + 4x − 1 is a factor of x⁹ + px⁶ + qx³ + r. Find the ordered triple (p, q, r).

Answer: (6,31,-1)


Now I feel we can fully appreciate DeepSeek. Its scores are impressive, but OpenAI’s o1 is close. It scores in the nineties on MMLU, 67% on MATH 500, and 67% on GPQA-D. This is considered “grad-level” reasoning. OpenAI’s next release, o3, reportedly achieves 87.7% on GPQA-D. That would put it in the PhD range…

For further reading, check out these benchmark datasets from Hugging Face. Maybe try to solve a few!

Chinese start-up DeepSeek threatens American AI dominance

cais/mmlu · Datasets at Hugging Face 🤗

Idavidrein/gpqa · Datasets at Hugging Face 🤗

HuggingFaceH4/MATH-500 · Datasets at Hugging Face 🤗

Learning to Reason with LLMs | OpenAI

AI Model & API Providers Analysis | Artificial Analysis

Tags Token Talk