A new paper is forcing AI developers to rethink how they evaluate the cost of reasoning models — and the finding is counterintuitive enough that it deserves wider attention.
Researchers published "Price Reversal Phenomenon," a study showing that AI reasoning models with lower per-token costs can actually be more expensive in practice. The reason: cheaper models often need significantly more tokens to reach the same quality of answer as their pricier counterparts.
How Reasoning Models Are Priced
Most AI API pricing works per token — you pay for every piece of text the model processes or generates. A model priced at $1 per million tokens sounds like a bargain compared to one at $5 per million tokens.
But reasoning models are different. These models "think" before answering — they generate extended internal reasoning chains, weighing options, working through problems step by step, before producing a final response. That thinking costs tokens.
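To make the billing mechanics concrete, here is a minimal sketch of how a single call gets priced when hidden reasoning tokens are included. The function name and prices are illustrative, not from any provider's actual API:

```python
def request_cost(reasoning_tokens, output_tokens, price_per_million):
    """Total cost of one call: internal reasoning ("thinking") tokens
    are billed like any other generated tokens, even though the caller
    may never see them."""
    return (reasoning_tokens + output_tokens) * price_per_million / 1_000_000

# A $1/million model that thinks for 3,000 tokens before a 200-token answer:
print(request_cost(3_000, 200, 1.0))  # 0.0032 -- most of the bill is thinking
```

The point of the sketch: the visible answer (200 tokens) is a small fraction of what you pay for.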
The Reversal
Here's the problem the paper identifies: a cheaper reasoning model might need 3,000 tokens of internal reasoning to answer a hard question correctly. A more expensive model might answer the same question correctly in 800 tokens.
At $1/million vs. $5/million, the choice looks obvious. But 3,000 tokens × $1/million = $0.003, while 800 tokens × $5/million = $0.004. The "expensive" model still costs slightly more per question in this example, yet the nominal 5× price gap has already collapsed to about 1.33×. From there it takes only a modestly longer reasoning chain (anything past 4,000 tokens at $1/million) or a few failed attempts that need retries for the ordering to flip outright.
The paper demonstrates cases where this reversal is dramatic, not marginal.
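The arithmetic above can be replayed in a few lines. The prices and token counts come from the worked example; the 4,500-token line is a hypothetical extension showing where the flip happens:

```python
PRICE_CHEAP = 1.0    # $ per million tokens, the "bargain" model
PRICE_PRICEY = 5.0   # $ per million tokens, the "expensive" model

def cost(tokens, price_per_million):
    return tokens * price_per_million / 1_000_000

cheap = cost(3_000, PRICE_CHEAP)     # $0.003 per question
pricey = cost(800, PRICE_PRICEY)     # $0.004 per question
# The 5x sticker-price gap has collapsed to pricey / cheap ~= 1.33x.
# If the cheap model's reasoning chains grow past 4,000 tokens,
# the ordering flips and the "cheap" model costs more per question:
flipped = cost(4_500, PRICE_CHEAP)   # $0.0045 > $0.004
```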
Why This Matters for Real Deployments
Developers building on reasoning models typically see per-token pricing when they sign up and budget accordingly. The Price Reversal paper shows that budget estimation based on per-token pricing alone can be significantly wrong — sometimes by multiples.
The more relevant metric is cost per correct answer (or cost per task completion), not cost per token. That requires benchmarking your specific workload against candidate models before committing to one.
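The metric is simple to compute. A minimal sketch, with accuracy numbers that are illustrative rather than taken from the paper:

```python
def cost_per_correct(total_cost_usd, num_correct):
    """Cost per correct answer: the metric that matters for deployment,
    rather than raw per-token price."""
    if num_correct == 0:
        return float("inf")
    return total_cost_usd / num_correct

# Hypothetical run of 1,000 questions through each model:
# cheap model: $3 total, but only 600 answered correctly
# pricier model: $4 total, 950 answered correctly
print(cost_per_correct(3.0, 600))  # 0.005 per correct answer
print(cost_per_correct(4.0, 950))  # ~0.0042 -- the reversal in full
```

Once failed answers are counted against the bill, the model with the higher sticker price can win on the only number that matters.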
The Actionable Takeaway
Before selecting a reasoning model for production use, run your actual tasks — not synthetic benchmarks — and measure total token consumption, not just per-token price. The model that looks cheapest on the pricing page may not be cheapest in your specific use case.
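That benchmarking step can be a short harness. This is a sketch, not the paper's code: `call_model` is a hypothetical adapter you would write around your provider's API, returning the answer plus the total token count (including reasoning tokens) that the API reports:

```python
def benchmark(tasks, call_model, price_per_million):
    """Run real workload tasks against one candidate model and report
    total tokens, total cost, and cost per correct answer.
    `tasks` is a list of (prompt, expected_answer) pairs;
    `call_model(prompt)` returns (answer, total_tokens_billed)."""
    total_tokens = 0
    correct = 0
    for prompt, expected in tasks:
        answer, tokens = call_model(prompt)
        total_tokens += tokens
        correct += (answer == expected)
    total_cost = total_tokens * price_per_million / 1_000_000
    per_correct = total_cost / correct if correct else float("inf")
    return {"tokens": total_tokens,
            "cost": total_cost,
            "cost_per_correct": per_correct}
```

Run it once per candidate model on the same task list, and compare `cost_per_correct` rather than the pricing page.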
The paper provides code and an interactive demo to help teams calculate this for their own workloads.
It's a small methodological shift, but one that could save significant money at scale — which is exactly the kind of thing that tends to get ignored until someone publishes a paper about it.