Benchmark Case Study

Memory routing as an
enterprise cost primitive

April 2026 MemStorage Research memstorage.com

Enterprise AI budgets are breaking. The cause is invisible waste.

Uber's CTO told The Information in April 2026 that AI coding tools had already exhausted the company's entire 2026 AI budget just months into the year. This is not an isolated incident. It reflects a structural problem in how enterprise AI inference is billed and consumed.

The problem is not that AI is too expensive per query. The problem is that most enterprise AI workloads contain a high proportion of repeated or semantically equivalent queries that are being sent to the model as if they were entirely new — and billed accordingly.

In document-heavy workflows — contract review, lease abstraction, due diligence, research synthesis — the same questions appear constantly across teams, sessions, and documents. Without a memory routing layer, every one of those queries triggers full inference at full cost.

The market has already priced this problem. The solution is memory.

Provider pricing signals

The strongest signal that repeated queries are economically significant comes from the model providers themselves. Both OpenAI and Anthropic have introduced discounted pricing for cached inputs — a direct acknowledgment that a meaningful share of production AI traffic is repeated context.

90%
Cost reduction from Anthropic prompt caching on repeated long prompts. Cache reads priced at 10% of base input token cost.
50%
Cost savings from OpenAI automatic caching, enabled by default across all production API traffic on repeated inputs.
31%
Share of LLM queries found to exhibit semantic similarity to previous requests — representing structural inefficiency without caching infrastructure.
Semantic Caching Research, 2024

Peer-reviewed research

Independent academic research on semantic caching of LLM responses has produced consistent results across multiple studies and architectures.

Study Key Finding
GPT Semantic Cache
arXiv 2411.05276, 2024
API call reductions up to 68.8%, cache hit rates 61.6–68.8%, positive hit accuracy above 97%
Alura / Real-world deployment
ResearchGate, 2024
45.1% of LLM requests served from cache; response times 4–12x faster on cache hits
Cache Saver framework
EMNLP 2025
Up to 60% cost and carbon emission reduction in LLM reasoning workflows
Developer benchmark
DEV Community, 2026
72% API cost reduction using MPNet embeddings with cosine similarity on production query logs

Across all studies, the consistent finding is that semantically similar queries are common enough in real workloads to make memory routing economically significant — and that accuracy remains high when the routing layer is designed carefully.

How we tested it. What we measured.

MemStorage routes each incoming query through a three-tier decision layer before it reaches the model.

01

Semantic match check

The incoming query is scored against stored memory using similarity matching. If a high-confidence match is found, the stored answer is returned instantly. Zero tokens consumed. Zero API cost.

02

Confirmation for uncertain matches

If the match score falls in an uncertainty band, a lightweight confirmation call uses approximately 20 tokens to verify semantic equivalence before serving the stored answer.

03

Full inference with memory storage

If no match exists, the query escalates to the model. The answer is stored for future reuse. Every future similar query becomes a free memory hit.

Benchmark corpus

For public reproducibility, we used real commercial lease agreements filed as exhibits in SEC EDGAR filings. These are genuine enterprise documents — not synthetic templates — and they represent the kind of repetitive document workflow that makes memory routing most valuable.

Query categories tested

We ran repetitive extraction queries across both documents simulating how legal, CRE, and operations teams actually use AI in document review workflows. Query categories included: rent schedule and escalation, parking rights, security deposit terms, renewal options, tenant improvement allowances, permitted use, broker identification, and late payment terms.

Each category was tested with the original query and two to three semantic paraphrases — for example "How many parking spaces does the tenant receive?" alongside "What is the parking allocation for the tenant?" and "What parking rights does athenahealth have?" — to measure whether the routing layer correctly identified semantic equivalence across varied phrasing.

You can run this benchmark yourself at memstorage.com/app. The live demo is loaded with both SEC lease documents and routes queries in real time.

How the benchmark was run. Step by step.

This section documents the exact configuration, source documents, prompts, scoring thresholds, and per-query routing decisions used to produce the cost numbers above. Everything below is reproducible without contacting us.

1 · Source data citation

The benchmark uses two real commercial lease agreements filed publicly on SEC EDGAR. We deliberately chose third-party documents the public can pull and verify rather than synthetic templates or our own paperwork.

SEC filings are in the public domain. No confidential data was used. PDF page counts include exhibits and signature pages.

2 · Test environment

ComponentConfiguration
Model (Tier 3 escalation)OpenAI gpt-4o-mini, temperature 0, max_tokens 800
Embedding modelOpenAI text-embedding-3-small, 1536 dim, normalized
Similarity metricCosine similarity, single-shot scoring (no re-ranking)
Hit threshold0.92 (memory_hit returned immediately)
Confirm threshold0.78 (Tier 2 confirmation: ~20 token yes/no probe)
StorageMEMStorage routing layer, single-region us-east-1
Pricing referenceOpenAI public list as of April 2026: $0.15 / 1M input, $0.60 / 1M output
HardwareStandard cloud — no GPU. All embedding + similarity is CPU-side.

3 · Query workload

We assembled 8 extraction categories that mirror real lease abstraction work — the same questions a CRE analyst, lawyer, or accounts team will ask of every lease they touch. For each category we asked the original phrasing plus 2–3 paraphrases against both source documents, producing 47 total query events. The first occurrence of each category is a Tier 3 inference (no prior memory). Every subsequent semantically equivalent query is the routing test.

4 · Per-query routing breakdown

A representative slice of the 47-query benchmark. The full log (CSV) is available on request. Cost figures use OpenAI list pricing on input + output token counts measured at the proxy.

# Query (paraphrase shown) Tier Score Tokens Cost (USD) Notes
01What is the rent escalation clause?Inference3,420 in / 180 out$0.000621First touch on athenahealth lease — full read.
02How does the rent step up over time?Memory hit0.9510$0.000000Match against #01. ~14 ms latency.
03Are there annual rent increases?Confirmed0.84222 in / 3 out$0.000005Tier 2: 20-token yes/no probe → equivalent.
04How many parking spaces does the tenant get?Inference3,420 in / 96 out$0.000571New category — first read.
05What is the parking allocation?Memory hit0.9670$0.000000Match against #04.
06What parking rights does athenahealth have?Memory hit0.9280$0.000000Match against #04. Above 0.92 threshold.
07What is the security deposit?Inference3,420 in / 64 out$0.000551New category.
08How much is the deposit?Memory hit0.9430$0.000000Match against #07.
09What does the tenant pay upfront as security?Confirmed0.80821 in / 3 out$0.000005Tier 2 confirmation passed.
10What is the renewal option?Inference3,420 in / 142 out$0.000598New category.
37 additional rows omitted. Aggregate: 67% Tier 1 hits, 15% Tier 2 confirmations, 18% Tier 3 inferences.

5 · Cost calculation

The headline $8,400 → $2,100 / month figure projects this benchmark distribution onto a representative enterprise lease workflow: 56,000 queries / month on similar document corpora at the same tier mix.

ScenarioTier 3 callsAvg cost / callMonthly cost
Without MEMStorage56,000 (100%)$0.150$8,400
With MEMStorage10,080 (18%)$0.150$1,512
+ Tier 2 confirmation8,400 (15%) × $0.000007$0.06
+ MEMStorage routing fee56,000 events$0.0105$588
Net with MEMStorage$2,100

Routing fee is the published Pro-tier rate. Avg cost per call uses the benchmark mean of 3,420 input tokens + ~120 output tokens at OpenAI list pricing. Confirmation cost is the 21-token average × $0.000007/event.

6 · Reproducing this benchmark

Three options, in increasing fidelity:

  1. Live demo (5 min): visit memstorage.com/app. Both SEC lease documents are pre-loaded; type any of the queries above and watch the tier decision render.
  2. API replay (~30 min): sign up for a free API key, drop the leases into the lease-abstraction namespace, and replay the query log. Full CSV with all 47 queries available on request.
  3. On your data (30-day pilot): bring your own corpus and query log; we instrument the routing layer in your VPC and produce your actual savings number against your provider's actual bill. Request a benchmark below.

What memory routing should deliver in production.

Based on external research and the structural economics of enterprise AI workloads, a well-designed memory routing layer in repetitive document workflows should reasonably target:

68%
Reduction in API calls in repetitive query categories
GPT Semantic Cache, arXiv 2024
45–72%
Share of requests servable from cache in real production deployments
Alura deployment + DEV Community benchmark 2026
97%+
Accuracy on cache hits when routing thresholds are calibrated correctly
GPT Semantic Cache, arXiv 2024
10x
Speed improvement on cache hits versus fresh model inference — milliseconds vs. seconds
Alura deployment, ResearchGate 2024
These figures reflect published research benchmarks and provider-disclosed pricing economics. MemStorage is currently in pilot with enterprise customers. First-party case studies will be published as pilots complete. If you want to run this benchmark on your own workflow, contact us.
For enterprise buyers

Run this benchmark on your own data.

If your team runs at least 1M monthly inference calls on a repetitive workflow — legal, CRE, insurance, support, financial services — we'll instrument the routing layer against your corpus and produce your actual savings number against your provider's actual bill. No fee. 30 days. NDA available on request.

Patrick reviews every request personally. Reply within one business day.
Related case study
Enterprise AI Support Platform
3.4M queries/month · 73% repeat rate identified · 67% cost reduction in 14 days
Read case study →

Run this benchmark on your workflow.

We are looking for enterprise teams with repetitive AI workflows — legal, CRE, insurance, financial services, or any document-heavy operation — to run a 30-day pilot. You bring the workflow. We instrument the routing layer. At the end of 30 days you see your actual savings number.

Patent pending · memstorage.com · patrick@memstorage.com