Uber's CTO told The Information in April 2026 that AI coding tools had already exhausted the company's entire 2026 AI budget just months into the year. This is not an isolated incident. It reflects a structural problem in how enterprise AI inference is billed and consumed.
The problem is not that AI is too expensive per query. The problem is that most enterprise AI workloads contain a high proportion of repeated or semantically equivalent queries that are being sent to the model as if they were entirely new — and billed accordingly.
In document-heavy workflows — contract review, lease abstraction, due diligence, research synthesis — the same questions appear constantly across teams, sessions, and documents. Without a memory routing layer, every one of those queries triggers full inference at full cost.
The strongest signal that repeated queries are economically significant comes from the model providers themselves. Both OpenAI and Anthropic have introduced discounted pricing for cached inputs — a direct acknowledgment that a meaningful share of production AI traffic is repeated context.
Independent academic research on semantic caching of LLM responses has produced consistent results across multiple studies and architectures.
| Study | Key Finding |
|---|---|
| GPT Semantic Cache (arXiv 2411.05276, 2024) | API call reductions up to 68.8%, cache hit rates 61.6–68.8%, positive hit accuracy above 97% |
| Alura / Real-world deployment (ResearchGate, 2024) | 45.1% of LLM requests served from cache; response times 4–12x faster on cache hits |
| Cache Saver framework (EMNLP 2025) | Up to 60% cost and carbon emission reduction in LLM reasoning workflows |
| Developer benchmark (DEV Community, 2026) | 72% API cost reduction using MPNet embeddings with cosine similarity on production query logs |
Across all studies, the consistent finding is that semantically similar queries are common enough in real workloads to make memory routing economically significant — and that accuracy remains high when the routing layer is designed carefully.
MemStorage routes each incoming query through a three-tier decision layer before it reaches the model.
**Tier 1: memory hit.** The incoming query is scored against stored memory using similarity matching. If a high-confidence match is found, the stored answer is returned immediately. No inference tokens are consumed and no model API cost is incurred.
**Tier 2: confirmation.** If the match score falls in an uncertainty band, a lightweight confirmation call uses approximately 20 tokens to verify semantic equivalence before serving the stored answer.
**Tier 3: inference.** If no match exists, the query escalates to the model and the answer is stored for future reuse. Later queries that are semantically similar can then be served from memory without another inference call.
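The three-tier decision described above can be sketched in a few lines. This is a minimal illustration, not MemStorage's actual implementation: the `MemoryRouter` class, the `embed`, `confirm`, and `infer` callables, and the in-memory list are all hypothetical stand-ins. The default thresholds are the ones listed in the benchmark configuration, and falling through to inference when a confirmation fails is an assumption.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class MemoryRouter:
    # All three callables are hypothetical hooks supplied by the caller.
    embed: Callable[[str], List[float]]   # text -> embedding vector
    confirm: Callable[[str, str], bool]   # (new query, stored query) -> equivalent?
    infer: Callable[[str], str]           # full model call
    hit_threshold: float = 0.92
    confirm_threshold: float = 0.78
    memory: List[Tuple[List[float], str, str]] = field(default_factory=list)

    def route(self, query):
        """Return (tier, answer) for one incoming query."""
        vec = self.embed(query)
        best_score, best = 0.0, None
        for stored_vec, stored_query, answer in self.memory:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best = score, (stored_query, answer)

        # Tier 1: high-confidence match, serve the stored answer.
        if best and best_score >= self.hit_threshold:
            return "memory_hit", best[1]
        # Tier 2: uncertainty band, verify equivalence with a cheap probe.
        if best and best_score >= self.confirm_threshold:
            if self.confirm(query, best[0]):
                return "confirmed", best[1]
        # Tier 3: no usable match (or failed confirmation, by assumption).
        answer = self.infer(query)
        self.memory.append((vec, query, answer))
        return "inference", answer
```

In use, `embed` would wrap an embedding API call, `confirm` a short yes/no prompt, and `infer` the full model call over the document. A linear scan over `memory` is only reasonable for small stores; a production layer would likely use a vector index.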
For public reproducibility, we used real commercial lease agreements filed as exhibits in SEC EDGAR filings. These are genuine enterprise documents — not synthetic templates — and they represent the kind of repetitive document workflow that makes memory routing most valuable.
We ran repetitive extraction queries across both documents, simulating how legal, CRE, and operations teams actually use AI in document review workflows. Query categories included: rent schedule and escalation, parking rights, security deposit terms, renewal options, tenant improvement allowances, permitted use, broker identification, and late payment terms.
Each category was tested with the original query and two to three semantic paraphrases — for example "How many parking spaces does the tenant receive?" alongside "What is the parking allocation for the tenant?" and "What parking rights does athenahealth have?" — to measure whether the routing layer correctly identified semantic equivalence across varied phrasing.
You can run this benchmark yourself at memstorage.com/app. The live demo is loaded with both SEC lease documents and routes queries in real time.
This section documents the exact configuration, source documents, prompts, scoring thresholds, and per-query routing decisions used to produce the cost numbers above. Everything below is reproducible without contacting us.
The benchmark uses two real commercial lease agreements filed publicly on SEC EDGAR. We deliberately chose third-party documents the public can pull and verify rather than synthetic templates or our own paperwork.
SEC EDGAR filings are publicly available. No confidential data was used. PDF page counts include exhibits and signature pages.
| Component | Configuration |
|---|---|
| Model (Tier 3 escalation) | OpenAI gpt-4o-mini, temperature 0, max_tokens 800 |
| Embedding model | OpenAI text-embedding-3-small, 1536 dim, normalized |
| Similarity metric | Cosine similarity, single-shot scoring (no re-ranking) |
| Hit threshold | 0.92 (memory_hit returned immediately) |
| Confirm threshold | 0.78 (Tier 2 confirmation: ~20 token yes/no probe) |
| Storage | MEMStorage routing layer, single-region us-east-1 |
| Pricing reference | OpenAI public list as of April 2026: $0.15 / 1M input, $0.60 / 1M output |
| Hardware | Standard cloud — no GPU. All embedding + similarity is CPU-side. |
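Because the configuration uses normalized embeddings, cosine similarity reduces to a plain dot product, which is cheap enough to run on CPU. The sketch below shows that scoring step and how a score maps onto the two thresholds from the table. The function names are illustrative, and treating both thresholds as inclusive lower bounds is an assumption, since the table does not say how boundary scores are handled.

```python
import math

HIT_THRESHOLD = 0.92      # "Hit threshold" row in the configuration table
CONFIRM_THRESHOLD = 0.78  # "Confirm threshold" row in the configuration table


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_normalized(a, b):
    """For unit-length vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def tier_for(score):
    """Map a similarity score to a routing tier (inclusive bounds assumed)."""
    if score >= HIT_THRESHOLD:
        return "memory_hit"
    if score >= CONFIRM_THRESHOLD:
        return "confirm"
    return "inference"
```

Under these assumptions, a score of 0.928 routes as a memory hit and 0.842 falls into the confirmation band.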
We assembled 8 extraction categories that mirror real lease abstraction work — the same questions a CRE analyst, lawyer, or accounts team will ask of every lease they touch. For each category we asked the original phrasing plus 2–3 paraphrases against both source documents, producing 47 total query events. The first occurrence of each category is a Tier 3 inference (no prior memory). Every subsequent semantically equivalent query is the routing test.
A representative slice of the 47-query benchmark. The full log (CSV) is available on request. Cost figures use OpenAI list pricing on input + output token counts measured at the proxy.
| # | Query (paraphrase shown) | Tier | Score | Tokens | Cost (USD) | Notes |
|---|---|---|---|---|---|---|
| 01 | What is the rent escalation clause? | Inference | — | 3,420 in / 180 out | $0.000621 | First touch on athenahealth lease — full read. |
| 02 | How does the rent step up over time? | Memory hit | 0.951 | 0 | $0.000000 | Match against #01. ~14 ms latency. |
| 03 | Are there annual rent increases? | Confirmed | 0.842 | 22 in / 3 out | $0.000005 | Tier 2: 20-token yes/no probe → equivalent. |
| 04 | How many parking spaces does the tenant get? | Inference | — | 3,420 in / 96 out | $0.000571 | New category — first read. |
| 05 | What is the parking allocation? | Memory hit | 0.967 | 0 | $0.000000 | Match against #04. |
| 06 | What parking rights does athenahealth have? | Memory hit | 0.928 | 0 | $0.000000 | Match against #04. Above 0.92 threshold. |
| 07 | What is the security deposit? | Inference | — | 3,420 in / 64 out | $0.000551 | New category. |
| 08 | How much is the deposit? | Memory hit | 0.943 | 0 | $0.000000 | Match against #07. |
| 09 | What does the tenant pay upfront as security? | Confirmed | 0.808 | 21 in / 3 out | $0.000005 | Tier 2 confirmation passed. |
| 10 | What is the renewal option? | Inference | — | 3,420 in / 142 out | $0.000598 | New category. |
| … | 37 additional rows omitted. Aggregate: 67% Tier 1 hits, 15% Tier 2 confirmations, 18% Tier 3 inferences. | |||||
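The per-query cost column can be checked directly from the token counts and the list prices in the configuration table ($0.15 per 1M input tokens, $0.60 per 1M output tokens). The helper name `query_cost` is illustrative; the token counts are copied from the rows above.

```python
INPUT_USD_PER_M = 0.15   # USD per 1M input tokens (pricing reference row)
OUTPUT_USD_PER_M = 0.60  # USD per 1M output tokens (pricing reference row)


def query_cost(input_tokens, output_tokens):
    """Cost in USD of one call at the list prices above."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# (row number, input tokens, output tokens) taken from the log slice
rows = [("01", 3420, 180), ("03", 22, 3), ("04", 3420, 96), ("07", 3420, 64), ("10", 3420, 142)]
for row_id, tokens_in, tokens_out in rows:
    print(f"#{row_id}: ${query_cost(tokens_in, tokens_out):.6f}")
# #01: $0.000621
# #03: $0.000005
# #04: $0.000571
# #07: $0.000551
# #10: $0.000598
```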
The headline $8,400 → $2,100 / month figure projects this benchmark distribution onto a representative enterprise lease workflow: 56,000 queries / month on similar document corpora at the same tier mix.
| Scenario | Calls / events | Avg cost / call | Monthly cost |
|---|---|---|---|
| Without MEMStorage | 56,000 Tier 3 calls (100%) | $0.150 | $8,400 |
| With MEMStorage | 10,080 Tier 3 calls (18%) | $0.150 | $1,512 |
| + Tier 2 confirmation | 8,400 confirmations (15%) | $0.000007 | $0.06 |
| + MEMStorage routing fee | 56,000 events | $0.0105 | $588 |
| Net with MEMStorage | — | — | $2,100 |
Routing fee is the published Pro-tier rate. Avg cost per call uses the benchmark mean of 3,420 input tokens + ~120 output tokens at OpenAI list pricing. Confirmation cost assumes an average 21-token probe at $0.000007 per event.
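The projection arithmetic can be reproduced with a short script. The sketch below takes every input exactly as stated in the projection table (query volume, tier shares, per-call and per-event figures) and only recombines them; the function and key names are illustrative.

```python
def monthly_costs(
    queries=56_000,                  # monthly query volume from the projection
    tier3_share=0.18,                # share of queries escalated to the model
    tier2_share=0.15,                # share of queries needing a confirmation probe
    cost_per_inference=0.150,        # "Avg cost / call" as stated in the table
    cost_per_confirmation=0.000007,  # per confirmation event, as stated
    routing_fee_per_event=0.0105,    # routing fee per event, as stated
):
    """Return the projected monthly cost lines in USD."""
    baseline = queries * cost_per_inference
    inference = queries * tier3_share * cost_per_inference
    confirmation = queries * tier2_share * cost_per_confirmation
    routing = queries * routing_fee_per_event
    return {
        "baseline": baseline,
        "inference": inference,
        "confirmation": confirmation,
        "routing": routing,
        "net": inference + confirmation + routing,
    }


for line, usd in monthly_costs().items():
    print(f"{line:>12}: ${usd:,.2f}")
```

Changing `tier3_share` or `routing_fee_per_event` shows how sensitive the net figure is to the tier mix and the fee, which is the main thing to test against a different corpus.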
Three options, in increasing fidelity:
lease-abstraction namespace, and replay the query log. Full CSV with all 47 queries available on request.
Based on external research and the structural economics of enterprise AI workloads, a well-designed memory routing layer in repetitive document workflows should reasonably target:
If your team runs at least 1M monthly inference calls on a repetitive workflow — legal, CRE, insurance, support, financial services — we'll instrument the routing layer against your corpus and produce your actual savings number against your provider's actual bill. No fee. 30 days. NDA available on request.
We are looking for enterprise teams with repetitive AI workflows — legal, CRE, insurance, financial services, or any document-heavy operation — to run a 30-day pilot. You bring the workflow. We instrument the routing layer. At the end of 30 days you see your actual savings number.