Uber's CTO told The Information in April 2026 that AI coding tools had already exhausted the company's entire 2026 AI budget just months into the year. Uber is not a startup miscalculating a line item. Uber is one of the most sophisticated engineering organizations on the planet, and they still didn't see it coming.

This is not an isolated incident. It is the leading edge of a wave. Enterprise after enterprise is discovering the same thing: AI inference costs are scaling faster than value, and no one has a clean explanation for why.

I want to offer one.

The problem is not that AI is expensive. The problem is that it's stateless.

When you send a query to an LLM, the model has no memory of what it answered five minutes ago — or five seconds ago for a different user on the same team. Every request is billed as a fresh inference. Every token in your system prompt, your document context, and your question is counted as if it's the first time it's ever been processed.

In most consumer applications, this is fine. Questions are genuinely novel. Users explore different territory. The statelessness of LLM inference is an acceptable tradeoff for simplicity.

Enterprise workflows are different.

In document-heavy work — contract review, lease abstraction, due diligence, research synthesis — the same questions appear constantly across teams, sessions, and documents.

Legal teams ask the same questions about every contract they review. CRE teams extract the same fields from every lease. Insurance underwriters ask the same risk questions about every policy. Financial analysts run the same queries across every earnings document in their coverage universe.

Without a memory layer, every one of those queries triggers full inference. At full cost. Every time.

The market has already priced this problem. The providers know.

Here is something that does not get discussed enough: the major model providers have already acknowledged this problem — by pricing around it.

In 2024, Anthropic introduced prompt caching. Repeated long prompts can be served at 10% of the base input token cost — a 90% discount. OpenAI followed with automatic caching enabled by default across production API traffic at a 50% discount on repeated inputs.

These are not minor pricing adjustments. These are structural discounts that only make economic sense if a significant share of production AI traffic is repeated context. The providers are telling you, through their pricing, that repeated queries are a real and measurable phenomenon in enterprise AI workloads.
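A rough sketch of that pricing structure, as arithmetic. The $3.00-per-million-token base price and the 60% repeat share below are illustrative assumptions, not provider quotes:

```python
# Illustrative only: BASE is an assumed input-token price, not a provider quote.
BASE = 3.00            # $ per million input tokens (assumed)
CACHE_DISCOUNT = 0.90  # Anthropic-style discount on cached reads


def monthly_cost(m_tokens, repeat_share, discount=CACHE_DISCOUNT):
    """Blend fresh and cached input-token spend for one month.

    m_tokens:     total input volume, in millions of tokens
    repeat_share: fraction of that volume served as cached reads
    """
    fresh = m_tokens * (1 - repeat_share) * BASE
    cached = m_tokens * repeat_share * BASE * (1 - discount)
    return fresh + cached
```

Under these assumed numbers, 1,000M input tokens a month with no reuse costs $3,000; if 60% of that volume is served as cached reads, the bill drops to about $1,380. The discount only moves the needle if the repeat share is large, which is exactly why the providers offering it is informative.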

- 90%: cost reduction from Anthropic prompt caching on repeated long prompts.
- 50%: savings from OpenAI automatic caching, enabled by default on all API traffic.
- 31%: share of LLM queries that exhibit semantic similarity to previous requests.

Provider caching helps, but it only goes so far. It catches exact or near-exact repeated prompts. It does not catch semantic equivalence — the same question asked in different words by different people on different days. For that, you need a memory routing layer.

What memory routing actually is.

Memory routing is not a cache in the traditional sense. A cache stores exact outputs for exact inputs. Memory routing stores the meaning of an answer and retrieves it when a semantically equivalent question is asked — even if the wording is completely different.

The routing decision happens in three stages:

1. Semantic match check. The incoming query is scored against stored memory using similarity matching. If a high-confidence match is found, the stored answer is returned instantly. Zero tokens consumed. Zero API cost.

2. Confirmation for uncertain matches. If the match score falls in an uncertainty band, a lightweight confirmation call (approximately 20 tokens) verifies semantic equivalence before serving the stored answer.

3. Full inference with memory storage. If no match exists, the query escalates to the model. The answer is stored for future reuse. Every future similar query becomes a free memory hit.
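The three stages can be sketched in a few lines of Python. Everything here is illustrative: the thresholds (0.90 and 0.75), the in-memory list, and the `confirm` and `infer` callables stand in for a real embedding store, a confirmation call, and a full model call.

```python
from dataclasses import dataclass

HIGH = 0.90  # at or above: serve the stored answer directly (illustrative)
LOW = 0.75   # below: treat the query as genuinely new (illustrative)


@dataclass
class MemoryEntry:
    embedding: list  # query embedding, from whatever model you use
    answer: str      # the stored answer


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)


def route(query_emb, memory, confirm, infer):
    """Three-stage routing: memory hit, confirmation, or full inference."""
    best = max(memory, key=lambda m: cosine(query_emb, m.embedding), default=None)
    score = cosine(query_emb, best.embedding) if best else 0.0
    if score >= HIGH:                    # stage 1: high-confidence hit, zero cost
        return best.answer, "hit"
    if score >= LOW and confirm(best):   # stage 2: ~20-token confirmation
        return best.answer, "confirmed"
    answer = infer()                     # stage 3: full inference
    memory.append(MemoryEntry(query_emb, answer))  # store for future reuse
    return answer, "miss"
```

The interesting design decision is the uncertainty band: too wide and you pay for confirmations constantly, too narrow and you either miss reuse opportunities or serve wrong answers. The accuracy figures cited below depend on calibrating exactly this band.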

The effect compounds over time. Early in a workflow, most queries are genuinely new. As the system builds memory, more and more queries can be served from stored answers. The cost curve bends — not because the model gets cheaper, but because you call it less.
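The compounding is easy to see in a toy simulation. Assume a workload that draws queries from a fixed pool of distinct questions; the pool size, query count, and uniform draw below are arbitrary assumptions, not a claim about any real workload.

```python
import random


def paid_calls(num_queries, distinct_questions, seed=0):
    """Return per-query flags: 1 = full inference paid, 0 = served from memory."""
    rng = random.Random(seed)
    seen = set()
    flags = []
    for _ in range(num_queries):
        q = rng.randrange(distinct_questions)  # pick a question from the pool
        flags.append(0 if q in seen else 1)    # pay only the first time
        seen.add(q)
    return flags


flags = paid_calls(10_000, 500)
early = sum(flags[:1000])    # paid calls in the first 1,000 queries
late = sum(flags[-1000:])    # paid calls in the last 1,000 queries
```

Early in the run, most queries are new and trigger paid inference; by the end, almost none do, and total paid calls can never exceed the number of distinct questions. That is the bent cost curve in miniature.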

The research is consistent.

This is not a theoretical claim. Independent research on semantic caching of LLM responses has produced consistent results across multiple studies and architectures.

A 2024 paper published on arXiv (GPT Semantic Cache, arXiv 2411.05276) found API call reductions of up to 68.8%, with cache hit rates of 61.6–68.8% and positive hit accuracy above 97%. A real-world deployment at Alura found that 45.1% of LLM requests could be served from cache, with response times 4–12x faster on cache hits. The Cache Saver framework, published at EMNLP 2025, found up to 60% cost and carbon emission reduction in LLM reasoning workflows.

Across all of these studies, the finding is the same: semantic similarity between queries is common enough in real workloads to make memory routing economically significant — and accuracy remains high when the routing thresholds are calibrated carefully.

What you should actually expect in production.

Based on external research and the structural economics of enterprise AI workloads, a well-designed memory routing layer in a repetitive document workflow should reasonably target:

- 68%: reduction in API calls in repetitive query categories (GPT Semantic Cache, arXiv 2024).
- 45–72%: share of requests servable from memory in real deployments (Alura + DEV Community 2026).
- 97%+: answer accuracy on memory hits when thresholds are calibrated (GPT Semantic Cache, arXiv 2024).
- 10x: speed improvement on memory hits vs. fresh model inference (Alura deployment, ResearchGate 2024).

These figures reflect published research benchmarks and provider-disclosed pricing economics. MEMStorage is currently in pilot with enterprise customers. First-party results will be published as pilots complete.

What this means for how you think about AI spend.

The standard mental model for AI cost is: cost = queries × price per query. Under that model, the only levers are to reduce query volume (which reduces value) or negotiate a lower rate (which requires scale you may not have).

Memory routing introduces a third lever: treat AI answers as durable assets, not disposable outputs. The first time a question is answered, the cost is normal. Every subsequent time the same question is asked — in any phrasing, by any user — the cost approaches zero.
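The third lever changes the cost formula itself: blended cost per query becomes miss rate × full price + uncertain rate × confirmation price, with hits free. A short sketch, where every price and rate is an assumed input, not a measured figure:

```python
def effective_cost(price, hit_rate, uncertain_rate, confirm_price):
    """Blended per-query cost under three-stage routing: misses pay full
    price, uncertain matches pay only the ~20-token confirmation, and
    high-confidence hits pay nothing."""
    miss_rate = 1.0 - hit_rate - uncertain_rate
    return miss_rate * price + uncertain_rate * confirm_price
```

With an assumed $0.02 full query, a 55% hit rate, and a 10% uncertain band at $0.0005 per confirmation, the blended cost comes out near $0.00705 per query, roughly a 65% reduction, and it keeps falling as the hit rate climbs.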

This reframes the economics entirely. Your AI budget is not a recurring expense that scales with usage. It's an investment that gets more efficient the more your team uses it.

Uber's AI bill is not an inference problem. It's a memory problem. And memory problems have memory solutions.