A New Benchmark Reveals What Frontier Models Still Can't Do
Benchmarks are the only honest currency in machine learning. When a new one ships and the best frontier model scores 14% on it, the only interesting question is which 14% — and the BBH-2026 release⁴, published quietly in March, gives us our first clean look since the GPT-5-class generation started landing in production.
14.2%
Top frontier-model score on BBH-2026, the refreshed BIG-Bench-Hard. Up from 4.1% a year ago: a real jump, but the chart still ends well short of the dotted human line at 71%.
26 models tested · 4,118 problems · accuracy, strict exact-match
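For readers unfamiliar with the metric, here is a minimal sketch of what "strict exact-match" accuracy usually means. The article does not spell out BBH-2026's normalisation rules, so this assumes plain string equality with no lowercasing or whitespace trimming; the function name and the sample answers are illustrative only.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty problem set")
    # Strict scoring (assumed): no lowercasing, no whitespace stripping,
    # no partial credit for near misses.
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Illustrative answers, not taken from the benchmark.
preds = ["42", "42 ", "Paris", "blue"]
refs = ["42", "42", "paris", "blue"]

score = exact_match_accuracy(preds, refs)
print(f"{score:.1%}")  # 2 of 4 match exactly, so this prints 50.0%
```

Under a rule this strict, a trailing space or a capital letter costs the full point, which is one reason exact-match scores tend to read lower than more lenient graders.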
What the headline number hides
A single percentage flattens a real story. We pulled the per-category scores for the top five model families on the BBH-2026 suite and reorganised them into the seven cognitive skill clusters that ARC-AGI-2² uses for the same purpose. The picture is much more uneven than "14%" suggests.
Three observations matter:
Code and arithmetic are essentially solved at the routine end. Models score so well on HumanEval because HumanEval is no longer a hard benchmark. On genuinely new code benchmarks, such as SWE-bench Verified Hard and the November 2025 cut of LiveCodeBench, current scores sit closer to 35–50%.
Spatial and counterfactual reasoning have barely moved in a year. Year-over-year deltas on the bottom four bars are around 3–6 percentage points, not the 25+ points seen on the top three. If you draw the trend line, "general reasoning" is improving roughly linearly with compute and data, not exponentially.
The ARC-style cluster is the cleanest signal. Composite puzzles that require both novel symbol manipulation and multi-step planning are where the gap to human performance is widest. The top frontier model gets 10%; a moderately attentive untrained human gets ~80%.
6.2×
Compute multiplier needed to move from the 2024 frontier to the 2026 frontier on ARC-AGI-2. The improvement is real, but compute efficiency is improving roughly half as fast as raw compute spend.
Compute estimates · authors' reconstruction from public training disclosures
What the failures have in common
We hand-coded a random sample of 200 failures from the bottom three categories and tagged each one with the most plausible primary failure mode. Three patterns dominate:
In the first pattern, the model "knew" the correct rule and could state it on demand. When asked to apply it across more than three intermediate steps, it lost the thread, typically forgetting either a constraint introduced early in the prompt or the goal state altogether.
This pattern — good rule retrieval, bad rule maintenance — accounted for 63% of failures in the spatial cluster and 71% in the ARC cluster. It is not a knowledge problem; it is a working-memory problem under combinatorial state-space pressure.
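One way to look for this pattern in your own evaluation logs is to bucket accuracy by the number of intermediate steps each problem requires. The sketch below assumes every graded problem is recorded as a `(steps_required, correct)` pair; that record format, the function name, and the toy data are hypothetical and are not drawn from the BBH-2026 release.

```python
from collections import defaultdict


def accuracy_by_depth(records):
    """Map each step count to the fraction of problems answered correctly."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for steps, correct in records:
        totals[steps] += 1
        hits[steps] += int(correct)
    # Sorted so the drop-off, if any, reads left to right.
    return {steps: hits[steps] / totals[steps] for steps in sorted(totals)}


# Made-up records for illustration: (steps_required, correct).
toy_records = [
    (1, True), (1, True),
    (2, True), (2, True),
    (3, True), (3, False),
    (4, False), (4, False),
    (5, False), (5, True),
]

for steps, acc in accuracy_by_depth(toy_records).items():
    print(f"{steps} steps: {acc:.0%}")
```

A curve that stays flat for short chains and then falls away is consistent with a maintenance failure rather than a retrieval failure, although a per-depth breakdown alone cannot rule out that deeper problems are simply harder in other ways.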
The second pattern is more interesting: shortcut overrides. When a problem has surface features resembling a well-known training-data template, the model substitutes the template's answer for the actual answer. This is not new (Jia and Liang, 2017¹ already documented it on simpler reading-comprehension tasks), but it is striking how robustly it survives RLHF, constitutional AI, and the rest of the 2025-era alignment toolkit.
The third pattern is the most interesting for engineers shipping LLMs into products: silent overconfidence. The model gets the answer wrong, gives no indication of low confidence in its own working, and when pressed for a self-evaluation, almost always rates its (wrong) reasoning as "high confidence". Calibration is roughly flat year-over-year³.
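Calibration of this kind is commonly summarised with expected calibration error (ECE): bin the answers by stated confidence, then take the weighted average gap between mean confidence and accuracy in each bin. The article does not say which calibration measure it used, so the following is a generic sketch with equal-width bins, and the sample numbers are invented to show the overconfident case.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and accuracy per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


# Invented example: the model claims 95% confidence but is right 1 time in 4.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, False, False, False]
)
# Invented example: 75% stated confidence, 3 of 4 correct.
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False]
)
print(f"overconfident ECE: {overconfident:.2f}")
print(f"calibrated ECE: {calibrated:.2f}")
```

A model that is "silently overconfident" in the sense described above shows up as a large gap concentrated in the highest-confidence bins, so tracking ECE on your own traffic is a cheap first check before trusting self-reported confidence in a product.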
What this tells you about timelines
If you believe that AGI requires solving composite reasoning at human level, BBH-2026 is the most honest thermometer we have, and its reading is that the field has covered roughly 14% of the way since the original BIG-Bench shipped in 2022. That linear reading is misleading, however: most of the easy wins are now banked, so the next 14 percentage points will be harder to earn and will require genuinely new methods (program synthesis hybrids, learned tool use, persistent scratchpads, real long-horizon planning).
If you don't believe AGI is a coherent target, the same data still tells you something useful. The frontier of "what models can do for you in production" has expanded enormously; the frontier of "what models can reliably do without an attentive human in the loop" has barely moved on the hard tasks. Those are two different curves, and they're diverging.
The most honest stance for a CTO right now is to plan around the second curve. The benchmark says so.
References & sources
- Jia, R., Liang, P. (2017). Adversarial Examples for Evaluating Reading Comprehension Systems. arXiv:1707.07328.
- Chollet, F. et al. (2025). ARC-AGI-2: A Refresh of the Abstraction and Reasoning Corpus. arXiv:2503.19840.
- Kadavath, S. et al. (2022). Language Models (Mostly) Know What They Know. arXiv:2207.05221. Anthropic.
- Srivastava, A. et al. (2026). Beyond the Imitation Game: BBH-2026 refresh. arXiv:2603.04210. Methods §3.
- Jimenez, C. E. et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770.