Intelligence on Tap: The economics, architecture, and future of AI

4. Tokens, thinking, and speed

The reasoning paradox

The clearest illustration of the pricing problem is the reasoning model.

‍

Until 2024, AI models generated output autoregressively: one token at a time, as fast as the hardware allowed. Thinking and generating were the same process. The result was fast responses that could be confidently wrong on problems requiring multi-step logic.

‍

The reasoning models introduced by OpenAI (o1, o3), Anthropic (extended thinking in Claude), and Google (Gemini with deep research) work differently. Before generating a visible response, they produce internal 'thinking' tokens — chains of reasoning that explore the problem, check assumptions, and identify errors before anything appears in the output. The response is shorter and more accurate because it is preceded by reasoning the user never sees, but the hardware still fully processes.

‍

The results are striking. On mathematical benchmarks, complex coding problems, and multi-step reasoning tasks, reasoning models substantially outperform their non-reasoning counterparts. The cost profile is equally striking. A simple factual query might consume 500 tokens. A complex reasoning task on the same model can consume 50,000 — a 100x difference in compute for a single interaction. At frontier reasoning model pricing, that one interaction costs several dollars. The same query routed to a fast, cheap model costs a fraction of a cent.

‍

This is not an anomaly to be corrected. It is a signal about how different AI tasks actually are from each other, and why treating all tokens as equivalent units has become the defining pricing error of the current AI era.

‍

The tier explosion

The foundation labs have recognized the heterogeneity problem and responded with a proliferation of model tiers designed to match capability to cost. The market now has at least four distinct levels, each with its own price point and appropriate use case.

*Sources: OpenAI, Anthropic, Google API pricing pages; blended estimates based on 80/20 input/output split*

The tier structure is the right strategic response to query heterogeneity. It is also, in practice, applied clumsily. Most applications today route every query to one model — the one the developer originally integrated — rather than selecting dynamically by query type. A customer service chatbot handles simple returns questions and escalated billing disputes through the same model, overpaying for simple interactions and potentially underserving complex ones. The tiers exist. Systematic use of them does not.

‍

The tier structure addresses heterogeneity on the supply side. The demand side has tiers of its own, and they are uncorrelated with the model tiers above. A consumer trivia app and a clinical decision support tool can route the same query to the same model and the lab will receive identical revenue from both. The customer captures wildly different values. The lab captures none of the differences.

The routing problem

Matching each query to the right model tier in real time is what the AI industry calls the routing problem. It is one of the highest-leverage unsolved problems in the entire stack.

‍

Done well, intelligent routing reduces effective inference cost by 5 to 10x without any reduction in output quality, by directing simple queries to cheap models and complex ones to capable models. The efficiency gain follows directly from the pricing differential between tiers. An application that currently routes all queries through Claude Sonnet at $3 per million tokens, and could route 80% of its volume through Claude Haiku at $0.25 per million tokens without quality loss, is paying roughly 3x more than it needs to. At enterprise query volumes, that gap is measured in millions of dollars annually.

‍

Martian, one of a small number of startups working on the routing layer, has built infrastructure designed to classify query complexity in real time and dispatch to the appropriate model. The approach requires solving a hard sub-problem: assessing the difficulty of a query before routing it, using a classifier cheap enough that its cost does not eliminate the savings. Unify.ai is pursuing the same space from a slightly different angle, abstracting across model providers and optimizing for cost, quality, and latency simultaneously.

‍

The challenge is not unlike the one Google solved with PageRank — the value lies in the ranking and routing function, not in the content being served. The company that makes intelligent routing reliable and deployable at scale will have built one of the most structurally important positions in the supply side of the AI stack: a layer that captures margin from every AI interaction regardless of which model answers it. Routing optimizes what the lab earns per query. It does not determine what the customer can be charged. That problem sits one layer up.

The efficiency levers beyond routing

Model routing is the highest-leverage lever, but not the only one. A set of inference optimization techniques has demonstrated efficiency gains of 90% or more for specific tasks, and they are increasingly composable.

‍

Distillation trains a smaller, cheaper model to replicate the outputs of a larger frontier model for a defined task. The result is a model that costs a fraction of the original to serve but performs comparably on the specific distribution of queries it was trained on — a technique that works best when the task is narrow and the query distribution predictable, conditions that apply to most enterprise AI deployments.

‍

Quantization reduces the numerical precision of model weights, typically from 16-bit to 8-bit or 4-bit representations. Quality loss is minimal for most tasks while memory requirements fall by 50-75%, enabling larger models to run on cheaper hardware or smaller models to run on edge devices entirely.

‍

Speculative decoding uses a small, fast model to generate draft responses that a large, accurate model then verifies and corrects. Because most draft tokens require only minor correction, the large model does a fraction of the work it would if generating from scratch — delivering near-frontier output quality at substantially lower latency and compute cost.

‍

Caching and retrieval-augmented generation avoid redundant computation entirely. Caching stores computed outputs of repeated or similar queries. RAG retrieves relevant context from external knowledge bases at query time rather than encoding all knowledge in model weights, enabling smaller and cheaper models to answer questions that would otherwise require much larger ones.

‍

None of these techniques is exotic. All are in production use at leading AI companies. What is not yet standard is applying them systematically across an application's full query distribution, which requires exactly the kind of query classification and routing infrastructure that the market is still building.

Two pricing models, two wrong answers

The efficiency tools exist. The tier structure exists. The routing problem is identified. And yet the two dominant pricing models in the AI market create systematic incentives against using any of them well.

‍

Flat-rate subscriptions create what OpenAI has acknowledged publicly: gym membership economics. The pricing model assumes average usage. Heavy users consume 10 to 100 times average compute. OpenAI has confirmed it loses money on ChatGPT Pro at $200 per month for its most capable models. The subscription model pits user incentives (consume as much as possible) directly against lab economics (cost is variable and scales with usage). One side of that equation wins at the expense of the other.

‍

Per-token API pricing solves the margin problem but creates a different misalignment. A developer building a medical diagnosis tool pays the same per-token rate as one building a trivia chatbot. The pricing model cannot distinguish between a token that helped a physician catch a drug interaction and a token that answered a question about celebrity birthdays. Both are priced identically. The high-value use case captures no premium and subsidizes nothing.

The token is not the unit of value. The task is closer. The customer and the moment are closer still. AI is pricing the unit the hardware counts, not the unit the buyer experiences.

The direction the market must move is toward outcome-aligned pricing: tiered by the capability deployed, metered by the complexity of the interaction, and ultimately calibrated against the value of the task completed. This is how professional services are priced. It is how usage-based SaaS has evolved. It is the only model that can simultaneously sustain the economics of frontier AI development and price access broadly enough to support diffusion into healthcare, legal, education, and financial services — the industries where AI's most durable value will be created.

The segmentation layer

Routing addresses the supply side of the pricing problem. Segmentation addresses the demand side. They are different problems, and the AI industry has barely begun to recognize the second.

‍

Every prior wave of enterprise software has been priced this way. Bloomberg charges roughly $30,000 per terminal per year for what is, at the technical layer, a data feed and a chat interface. The compute cost is trivial. The pricing reflects a single fact: the buyer is a trader, and a thirty-second information edge is worth a fortune. Veeva charges life sciences companies a multiple of generic CRM pricing for substantially the same primitives, because the segment values FDA-compliant workflow differently than a real estate brokerage does. ServiceTitan does the same for the trades. The capability underneath is commoditizing. The segment redefines the price.

‍

The same pattern is now emerging in AI, and almost none of it lives at the model layer or the routing layer. Harvey prices by the legal seat, calibrated against the partner-hour saved. Abridge prices by clinical encounter, calibrated to documentation throughput. Hippocratic prices by nursing role and shift coverage. None of these companies meters tokens to its customers. They meter the unit the customer actually buys — an hour, an encounter, a shift, a covered patient. The token cost paid upstream to the lab is an input cost, not a customer-facing price. The margin between what is paid per token and what is charged per outcome is where the segmentation surplus lives.

Routing optimizes which capability gets used. Segmentation determines what that capability is worth.

This is the layer missing from most discussions of AI economics. The lab sells capability by the token. The router optimizes which capability gets used. Neither captures what the same capability is worth in the hands of a cardiologist versus a content marketer. That difference is the entire commercial opportunity, and it is being captured today by the companies that own a customer segment and wrap the underlying model with the workflow, accountability, and trust the segment requires. Routing is leverage. Segmentation is destiny.

‍

The token trap

The AI industry is caught in a trap of its own construction. It built pricing around the token because the token is what the hardware counts. It is a convenient unit of measurement that has almost no relationship to the unit of value — the task completed, the decision supported, the outcome improved.

‍

The path out runs through both layers. Routing solves the supply side, matching each query to the cheapest model that can answer it. Segmentation solves the demand side, matching each customer to a price that reflects what the answer is worth. Routing earns its margin on volume. Segmentation earns its margin on value. Neither alone closes the gap between what the hardware counts and what the buyer experiences. Together they do.

‍

Two structural positions emerge from this. The first is the router. The company that classifies queries by complexity and dispatches them to the right model at scale will sit between every user and every model, on every interaction, permanently. PageRank did not create web content; it organized the surplus. Routing does not create AI capability; it organizes the surplus of deploying it.

‍

The second position is the segment owner. In every prior platform shift, the vertical application that owned a customer segment captured more economic value than the horizontal infrastructure beneath it. The application that owns the surgeon, the lawyer, or the underwriter will likely capture more of AI's surplus than any router ever does. The router is leverage. The segment is the customer.

‍

The token is cheap and getting cheaper. The task is closer to value, but still not the unit. The customer is. The infrastructure that prices intelligence by what it is worth in the hands of the buyer, not by what the hardware counts, is the most valuable thing in AI that has not yet been built. Two layers, not one. The routing layer is being built now. The segmentation layer is where the next decade of AI value will be captured.

‍

Sources and data notes

‍

Model pricing: OpenAI, Anthropic, and Google API pricing pages; blended estimates follow Andrew Ng methodology (80% input / 20% output).

‍

Reasoning model token consumption: OpenAI o1 and o3 technical documentation; Anthropic extended thinking model card; community benchmarks on reasoning token budgets.

‍

ChatGPT Pro economics: Sam Altman public statements on ChatGPT Pro margin; The Information reporting on OpenAI 2024-2025 financials.

‍

Routing efficiency gains: Martian and Unify.ai technical documentation; academic literature on LLM routing and cascade inference.

‍

Inference optimization techniques: Hugging Face documentation on quantization and speculative decoding; academic papers on RAG and knowledge retrieval; industry benchmarks on distillation efficiency.

Download the full report

The economics, architecture, and future of AI, and what must change for ubiquitous, on-demand intelligence to become a sustainable, long-term reality.