What benchmarks miss about LLMs

Benchmark scores for new models keep going up, but that does not always match what I feel when I am actually working with them.

Benchmarks measure something useful. They are not the same as real usage. A model might score well on coding tasks because it memorized a test set rather than because it can reason through novel problems, and it might crush a math benchmark while still failing to follow a simple instruction in my editor.

The gap between benchmark scores and real experience comes from all the stuff around the model. Your system prompt changes the outcome a lot, and so do the agents you run, the MCP servers you plug in, the skills you load, and most of all, how you prompt. Two people can use the same model and walk away with completely different experiences.

GPT 5.5 is a good example. I have been testing it recently, and it fits my workflow well, though it is quite sticky. Once you prompt it on something, it has a hard time letting go. Tell it to commit a change, and it will keep committing every change after that. That behavior might be perfect for some people and deeply frustrating for others because it depends entirely on how you prompt and what you expect.

Kimi K2.6 performs similarly in my actual work, but I have not found a benchmark or leaderboard that captures that similarity or explains why one feels better than the other on a given day.

Opus 4.7 is supposed to be top tier. When I tried it, I found it got things confidently wrong, which is one of the worst sins a model can commit in my eyes. I would 100% rather have a model say it does not know than hallucinate the wrong answer with confidence. It performs well on some tasks, but I do not want that tradeoff.

There is another problem: models are sometimes trained on benchmark data itself. One large-scale study measured contamination across 17 frontier models and 18 public benchmarks and found a 57.3% overall contamination rate. An LMSYS blog post described a 13B-parameter model trained on contaminated data that matched GPT-4 on the affected benchmarks purely through memorization. OpenAI has also acknowledged that portions of BIG-bench were inadvertently mixed into GPT-4's training set.
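To make "contamination" concrete, here is a rough sketch of the simplest kind of check: how many of a benchmark item's word n-grams also show up verbatim in some training text. This is not the method used by the study or the blog post mentioned above. The function names and the 8-word window are my own assumptions, and an exact-match check like this will miss paraphrased or translated test items.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that appear verbatim in the corpus.

    Hypothetical helper for illustration: 1.0 means every n-gram of the item
    was found in the corpus, 0.0 means none were (or the item is shorter than n).
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A high ratio on many items is a hint that the test set leaked into the training data. A low ratio proves very little, since rewording a question is enough to defeat it.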

The benchmark I trust is my own work. Run the model on your tasks with your setup, and pay attention to what happens. The numbers are a starting point, but for now they are not much more than that.
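If you want something slightly more repeatable than vibes, a personal benchmark can be tiny. The sketch below is one possible shape, not a recommended framework: `Task`, `run_suite`, and `fake_model` are names I made up, and `fake_model` is a stand-in you would replace with a call to whatever model and setup you actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # returns True if the output is acceptable


def run_suite(model: Callable[[str], str], tasks: list[Task]) -> dict[str, bool]:
    """Run every task through the model and record pass/fail per task."""
    results: dict[str, bool] = {}
    for task in tasks:
        try:
            results[task.name] = task.check(model(task.prompt))
        except Exception:
            # A crash in the model call or the check counts as a failure.
            results[task.name] = False
    return results


def fake_model(prompt: str) -> str:
    """Placeholder model (assumption): swap in a real client call here."""
    return "I don't know" if "capital of Atlantis" in prompt else "4"


tasks = [
    Task("arithmetic", "What is 2 + 2? Answer with a number only.",
         lambda out: out.strip() == "4"),
    Task("admits_uncertainty", "What is the capital of Atlantis?",
         lambda out: "don't know" in out.lower()),
]

results = run_suite(fake_model, tasks)
print(results)
```

The checks are the part that matters. Write them around the things that annoy you in real use, like a model answering confidently when it should say it does not know.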