Ben Heidemann Limited

OpenAI Just Demonstrated What's Not Possible By Going Beyond Diminishing Returns

Author | Ben Heidemann

This week OpenAI announced their O3 model, which posts some impressive numbers on the ARC-AGI benchmark. However, these results come with a sizeable price tag. We discuss what this means for the current generation of AI agents and the goal of achieving human-level performance on tasks requiring complex reasoning.


Introduction

This week OpenAI announced their O3 model, which boasts a new high score on the ARC-AGI benchmark. With a score of 75.7%, OpenAI have dethroned the previous record holder, Jeremy Berman, who posted a score of 53.6% on the semi-private eval using an architecture based on Sonnet 3.5.

In fact, their high-end O3 model clocks in even higher, with a score of 87.5%. Although undeniably impressive, this is still significantly short of the 98% posted by human STEM graduates. The record-breaking score also comes at a record-breaking price of $3,300 per task.

Cost per Task vs. Score for OpenAI Models

Logarithmic Scaling and Diminishing Returns

Previous research into large language model scaling has indicated a power law relationship between compute and cross-entropy loss. Fitting a power law to the OpenAI ARC-AGI benchmark results yields a best-fit R^2 value of 0.729.

However, if we exclude the high-end O3 result from our analysis, we get a much better R^2 value of 0.983. In addition, excluding this data point yields significantly more favourable results for OpenAI. As such, we’ll exclude the data point as an outlier until more data becomes available.
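The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis used in this article: it assumes the power law has the form y = a * x^b and is fitted by ordinary least squares on log-log axes, with R^2 computed in log space. The function name `fit_power_law` and the data points are hypothetical. The points are synthetic and chosen only to show how a single off-curve point lowers R^2.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by least squares on (log x, log y).

    Returns (a, b, r_squared), with r_squared measured in log-log space.
    All inputs must be strictly positive.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx                      # slope in log space = exponent
    log_a = mean_y - b * mean_x        # intercept in log space = log(a)
    ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(lx, ly))
    ss_tot = sum((y - mean_y) ** 2 for y in ly)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


# Synthetic, made-up points (NOT the ARC-AGI results). The first four lie
# exactly on cost = 0.5 * score**1.5; the last is pushed far off the curve.
scores = [10.0, 20.0, 40.0, 80.0, 90.0]
costs = [0.5 * s ** 1.5 for s in scores[:-1]] + [5000.0]

a_all, b_all, r2_all = fit_power_law(scores, costs)
a_trim, b_trim, r2_trim = fit_power_law(scores[:-1], costs[:-1])

print(f"all points:       R^2 = {r2_all:.3f}")
print(f"outlier excluded: R^2 = {r2_trim:.3f}")
```

Fitting in log space keeps the sketch dependency-free. A nonlinear fit in the original units would weight the expensive points more heavily and could give a different R^2.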

Cost per Task vs. Score for OpenAI Models

Extrapolating from this fit, we can project that a score of 98% would cost about $27.40 per task, approximately 3 times higher than the cost per task of a STEM graduate. Note that if we include the high-end O3 data point, the projected cost would be an order of magnitude greater.
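As a rough sketch of how such a projection is computed, the snippet below evaluates a fitted power law at a target score and compares the result with a human baseline. The functional form cost = a * score^b is an assumption. The helper name `project_cost`, the parameter values and the human cost figure are hypothetical placeholders, not the values behind the $27.40 estimate.

```python
def project_cost(a, b, target_score):
    """Evaluate the power law cost = a * score**b at target_score."""
    if target_score <= 0:
        raise ValueError("target_score must be positive")
    return a * target_score ** b


# Hypothetical fit parameters and baseline, for illustration only.
a, b = 0.5, 1.5
human_cost = 10.0   # assumed human cost per task, in dollars

projected = project_cost(a, b, 98.0)
ratio = projected / human_cost
print(f"projected cost at 98%: ${projected:.2f} ({ratio:.1f}x the human baseline)")
```

Because 98% lies beyond the highest score used in the fit, the result is an extrapolation and is sensitive to which points are included, as the note on the high-end data point indicates.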

Conclusion

The results reported by OpenAI are consistent with the findings of previous research, which showed a power law relationship between compute and cross-entropy loss. This implies that state-of-the-art large language models, such as the new O3 models, are not cost-effective for tasks requiring complex reasoning.

In order to develop AI agents capable of achieving cost effectiveness on complex reasoning tasks, further breakthroughs will be required. These may involve reducing the cost of running language models, or improving the efficiency of AI agents on such tasks.

In the latter case, current machine learning techniques may still underpin future AI agents, but it seems likely that additional techniques or more complex architectures will need to be developed to achieve real-world progress on reasoning tasks.

References

  1. O3 ARC-AGI benchmark results
  2. ARC Prize 2024 Results
  3. How I came in first on ARC-AGI-Pub using Sonnet 3.5 with Evolutionary Test-time Compute
  4. Scaling Laws for Neural Language Models
