Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64, you might ask? It is short for "consensus@64": a model gets 64 tries at each problem in a benchmark, and the answer it generates most often is taken as its final answer. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a chart can make it appear as though one model surpasses another when in reality that is not the case.
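To make the difference concrete, here is a minimal Python sketch of how a consensus score compares with a score based only on a model's first attempt. The function names, the three problems, and the sampled answers are all hypothetical, invented for illustration; neither xAI nor OpenAI has published its scoring code, and the example uses 4 samples per problem rather than 64 to stay readable.

```python
from collections import Counter


def consensus_answer(samples):
    """Return the answer that appears most often among the sampled attempts."""
    return Counter(samples).most_common(1)[0][0]


def score_at_1(attempts, answer_key):
    """Fraction of problems where the first sampled answer is correct."""
    correct = sum(
        1 for problem, samples in attempts.items()
        if samples[0] == answer_key[problem]
    )
    return correct / len(answer_key)


def score_cons_at_k(attempts, answer_key, k):
    """Fraction of problems where the majority answer over k samples is correct."""
    correct = sum(
        1 for problem, samples in attempts.items()
        if consensus_answer(samples[:k]) == answer_key[problem]
    )
    return correct / len(answer_key)


# Hypothetical answer key and sampled model outputs (not real AIME data).
answer_key = {"p1": "204", "p2": "113", "p3": "70"}
attempts = {
    "p1": ["204", "204", "210", "204"],  # right on the first try
    "p2": ["112", "113", "113", "113"],  # wrong first, right by majority
    "p3": ["65", "70", "65", "65"],      # wrong first, wrong by majority
}

first_try = score_at_1(attempts, answer_key)
consensus = score_cons_at_k(attempts, answer_key, k=4)
print(f"@1: {first_try:.2f}, cons@4: {consensus:.2f}")
```

On this made-up data, the same model looks markedly stronger under consensus scoring than under first-attempt scoring, which is why a chart that mixes the two conventions can mislead.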
Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores for AIME 2025 at "@1", meaning the first score the models got on the benchmark, fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Funny how some people see my plot as an attack on OpenAI and others as an attack on Grok, while in reality [...]
(I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@ [...]) https://t.co/djqljpcjh8
Teortaxes, February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.