Benchmark Case Study

Section 01

Enterprise AI budgets are breaking. The cause is invisible waste.

Uber's CTO told The Information in April 2026 that AI coding tools had already exhausted the company's entire 2026 AI budget just months into the year. This is not an isolated incident. It reflects a structural problem in how enterprise AI inference is billed and consumed.

The problem is not that AI is too expensive per query. The problem is that most enterprise AI workloads contain a high proportion of repeated or semantically equivalent queries that are being sent to the model as if they were entirely new — and billed accordingly.

In document-heavy workflows — contract review, lease abstraction, due diligence, research synthesis — the same questions appear constantly across teams, sessions, and documents. Without a memory routing layer, every one of those queries triggers full inference at full cost.

Section 02

The market has already priced this problem. The solution is memory.

Provider pricing signals

The strongest signal that repeated queries are economically significant comes from the model providers themselves. Both OpenAI and Anthropic have introduced discounted pricing for cached inputs — a direct acknowledgment that a meaningful share of production AI traffic is repeated context.

90%

Maximum input-cost reduction from Anthropic prompt caching on repeated long prompts: cache reads are priced at 10% of the base input token cost.

Anthropic, 2024

50%

Discount on cached input tokens under OpenAI automatic prompt caching, which is enabled by default for repeated inputs across production API traffic.

OpenAI, 2024

31%

Share of LLM queries found to exhibit semantic similarity to previous requests — representing structural inefficiency without caching infrastructure.

Semantic Caching Research, 2024

Peer-reviewed research

Independent academic research on semantic caching of LLM responses has produced consistent results across multiple studies and architectures.

Study	Source	Key Finding
GPT Semantic Cache	arXiv 2411.05276, 2024	API call reductions up to 68.8%, cache hit rates 61.6–68.8%, positive hit accuracy above 97%
Alura / Real-world deployment	ResearchGate, 2024	45.1% of LLM requests served from cache; response times 4–12x faster on cache hits
Cache Saver framework	EMNLP 2025	Up to 60% cost and carbon emission reduction in LLM reasoning workflows
Developer benchmark	DEV Community, 2026	72% API cost reduction using MPNet embeddings with cosine similarity on production query logs

Across all studies, the consistent finding is that semantically similar queries are common enough in real workloads to make memory routing economically significant — and that accuracy remains high when the routing layer is designed carefully.

Section 03

How we tested it. What we measured.

MEMStorage routes each incoming query through a three-tier decision layer before it reaches the model.

01

Semantic match check

The incoming query is scored against stored memory using similarity matching. If a high-confidence match is found, the stored answer is returned instantly. Zero tokens consumed. Zero API cost.

02

Confirmation for uncertain matches

If the match score falls in an uncertainty band, a lightweight confirmation call uses approximately 20 tokens to verify semantic equivalence before serving the stored answer.

03

Full inference with memory storage

If no match exists, the query escalates to the model, and the answer is stored for future reuse. Later queries that match it can be served from memory without a model call.
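The three tiers above can be sketched as a single routing function. This is a minimal illustration, not the product's implementation: `route_query`, `RoutingResult`, and the `confirm`, `infer`, and `store` callbacks are hypothetical names, the 0.92 and 0.78 cut-offs come from the test environment table in Section 03b, and falling back to full inference when a confirmation fails is an assumption the text does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Thresholds from the test environment table; treating them as
# inclusive lower bounds is an assumption.
HIT_THRESHOLD = 0.92
CONFIRM_THRESHOLD = 0.78


@dataclass
class RoutingResult:
    tier: str    # "memory_hit", "confirmed", or "inference"
    answer: str


def route_query(
    query: str,
    best_match: Optional[Tuple[float, str]],  # (similarity score, stored answer)
    confirm: Callable[[str], bool],           # hypothetical Tier 2 yes/no probe
    infer: Callable[[str], str],              # hypothetical Tier 3 model call
    store: Callable[[str, str], None],        # hypothetical memory write
) -> RoutingResult:
    if best_match is not None:
        score, stored_answer = best_match
        # Tier 1: high-confidence match, return the stored answer directly.
        if score >= HIT_THRESHOLD:
            return RoutingResult("memory_hit", stored_answer)
        # Tier 2: uncertainty band, serve the stored answer only if the probe agrees.
        if score >= CONFIRM_THRESHOLD and confirm(query):
            return RoutingResult("confirmed", stored_answer)
    # Tier 3: no usable match (or a failed confirmation), so run full
    # inference and store the answer for later reuse.
    answer = infer(query)
    store(query, answer)
    return RoutingResult("inference", answer)
```

Passing the model call and the confirmation probe in as callbacks keeps the routing logic testable without network access. For example, `route_query("What is the parking allocation?", (0.967, "stored answer"), confirm, infer, store)` returns a `memory_hit` result without invoking `confirm`, `infer`, or `store`.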

Benchmark corpus

For public reproducibility, we used real commercial lease agreements filed as exhibits in SEC EDGAR filings. These are genuine enterprise documents — not synthetic templates — and they represent the kind of repetitive document workflow that makes memory routing most valuable.

athenahealth Inc / Ponce City Market, Atlanta GA: office lease, 75,000 sq ft, SEC EDGAR Exhibit 10.3, filed July 2013.
SIGA Technologies Inc / Research Way Investments, Corvallis OR: commercial lease, 10,276 sq ft, SEC EDGAR Exhibit 10.2, filed November 2017.

Query categories tested

We ran repetitive extraction queries across both documents, simulating how legal, commercial real estate (CRE), and operations teams actually use AI in document review workflows. Query categories included: rent schedule and escalation, parking rights, security deposit terms, renewal options, tenant improvement allowances, permitted use, broker identification, and late payment terms.

Each category was tested with the original query and two to three semantic paraphrases — for example "How many parking spaces does the tenant receive?" alongside "What is the parking allocation for the tenant?" and "What parking rights does athenahealth have?" — to measure whether the routing layer correctly identified semantic equivalence across varied phrasing.

You can run this benchmark yourself at memstorage.com/app. The live demo is loaded with both SEC lease documents and routes queries in real time.

Section 03b · Methodology

How the benchmark was run. Step by step.

This section documents the exact configuration, source documents, prompts, scoring thresholds, and per-query routing decisions used to produce the cost numbers above. Everything below is reproducible without contacting us.

1 · Source data citation

The benchmark uses two real commercial lease agreements filed publicly on SEC EDGAR. We deliberately chose third-party documents the public can pull and verify rather than synthetic templates or our own paperwork.

athenahealth Inc: Office Lease, Ponce City Market, Atlanta GA. Exhibit 10.3 to Form 10-Q for the quarter ended June 30, 2013. Filed July 25, 2013. CIK 1131096. ~62 pages, 75,000 sq ft, 11-year base term.
SIGA Technologies Inc: Commercial Lease, Research Way Investments, Corvallis OR. Exhibit 10.2 to Form 10-Q for the quarter ended September 30, 2017. Filed November 9, 2017. CIK 1010086. ~38 pages, 10,276 sq ft, 5-year base term.

SEC filings are in the public domain. No confidential data was used. PDF page counts include exhibits and signature pages.

2 · Test environment

Component	Configuration
Model (Tier 3 escalation)	OpenAI `gpt-4o-mini`, temperature 0, max_tokens 800
Embedding model	OpenAI `text-embedding-3-small`, 1536 dim, normalized
Similarity metric	Cosine similarity, single-shot scoring (no re-ranking)
Hit threshold	0.92 (memory_hit returned immediately)
Confirm threshold	0.78 (Tier 2 confirmation: ~20 token yes/no probe)
Storage	MEMStorage routing layer, single-region us-east-1
Pricing reference	OpenAI public list as of April 2026: $0.15 / 1M input, $0.60 / 1M output
Hardware	Standard cloud — no GPU. All embedding + similarity is CPU-side.

Note on product state: this benchmark documents the reference routing methodology. The shipping v1 product uses confidence-scored matching, with embedding-based retrieval on the roadmap.
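For the reference configuration in the table above, similarity scoring reduces to a cosine comparison between the query embedding and each stored embedding. The sketch below shows that step in plain Python under stated assumptions: `cosine_similarity` and `best_match` are hypothetical helper names, the short toy vectors stand in for the 1536-dimensional `text-embedding-3-small` output, and a linear scan replaces whatever index a production store would use.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_match(
    query_vec: Sequence[float],
    memory: List[Tuple[Sequence[float], str]],  # (stored embedding, stored answer)
) -> Optional[Tuple[float, str]]:
    """Single-shot scoring: return (score, answer) for the closest entry, or None."""
    best: Optional[Tuple[float, str]] = None
    for stored_vec, answer in memory:
        score = cosine_similarity(query_vec, stored_vec)
        if best is None or score > best[0]:
            best = (score, answer)
    return best


# Toy memory with made-up vectors; real entries would hold embedding output.
memory = [
    ([1.0, 0.0, 0.0], "answer about rent escalation"),
    ([0.0, 1.0, 0.0], "answer about parking"),
]
score, answer = best_match([0.1, 0.9, 0.0], memory)
print(f"{score:.3f} -> {answer}")
```

Because the embeddings in the reference setup are normalized, the cosine reduces to a dot product; the explicit norms here only keep the helper safe for unnormalized input. The resulting score is what gets compared against the 0.92 hit threshold and the 0.78 confirm threshold.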

3 · Query workload

We assembled 8 extraction categories that mirror real lease abstraction work — the same questions a CRE analyst, lawyer, or accounts team will ask of every lease they touch. For each category we asked the original phrasing plus 2–3 paraphrases against both source documents, producing 47 total query events. The first occurrence of each category is a Tier 3 inference (no prior memory). Every subsequent semantically equivalent query is the routing test.

4 · Per-query routing breakdown

The table below shows a representative slice of the 47-query benchmark. The full log (CSV) is available on request. Cost figures apply OpenAI list pricing to the input and output token counts measured at the proxy.

#	Query (paraphrase shown)	Tier	Score	Tokens	Cost (USD)	Notes
01	What is the rent escalation clause?	Inference	—	3,420 in / 180 out	$0.000621	First touch on athenahealth lease — full read.
02	How does the rent step up over time?	Memory hit	0.951	0	$0.000000	Match against #01. ~14 ms latency.
03	Are there annual rent increases?	Confirmed	0.842	22 in / 3 out	$0.000005	Tier 2: 20-token yes/no probe → equivalent.
04	How many parking spaces does the tenant get?	Inference	—	3,420 in / 96 out	$0.000571	New category — first read.
05	What is the parking allocation?	Memory hit	0.967	0	$0.000000	Match against #04.
06	What parking rights does athenahealth have?	Memory hit	0.928	0	$0.000000	Match against #04. Above 0.92 threshold.
07	What is the security deposit?	Inference	—	3,420 in / 64 out	$0.000551	New category.
08	How much is the deposit?	Memory hit	0.943	0	$0.000000	Match against #07.
09	What does the tenant pay upfront as security?	Confirmed	0.808	21 in / 3 out	$0.000005	Tier 2 confirmation passed.
10	What is the renewal option?	Inference	—	3,420 in / 142 out	$0.000598	New category.
…	37 additional rows omitted. Aggregate: 67% Tier 1 hits, 15% Tier 2 confirmations, 18% Tier 3 inferences.
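The per-row costs in the table can be reproduced from the token counts and the list prices given in the test environment table ($0.15 per 1M input tokens, $0.60 per 1M output tokens). The helper name `call_cost` is ours and is used for illustration only.

```python
# List prices from the test environment table, in USD per 1M tokens.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one model call at list pricing."""
    return (
        input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# (row number, input tokens, output tokens) for the billed rows shown above.
billed_rows = [
    ("01", 3420, 180),
    ("03", 22, 3),
    ("04", 3420, 96),
    ("07", 3420, 64),
    ("09", 21, 3),
    ("10", 3420, 142),
]

for row, tokens_in, tokens_out in billed_rows:
    print(f"#{row}: ${call_cost(tokens_in, tokens_out):.6f}")
# #01: $0.000621
# #03: $0.000005
# #04: $0.000571
# #07: $0.000551
# #09: $0.000005
# #10: $0.000598
```

Memory hits (rows 02, 05, 06, 08) consume no model tokens, so they do not appear in the list.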

5 · Cost calculation

The headline $8,400 → $2,100 / month figure projects this benchmark distribution onto a representative enterprise lease workflow: 56,000 queries / month on similar document corpora at the same tier mix.

Scenario	Calls / events	Avg cost / unit	Monthly cost
Without MEMStorage	56,000 Tier 3 calls (100%)	$0.150	$8,400
With MEMStorage	10,080 Tier 3 calls (18%)	$0.150	$1,512
+ Tier 2 confirmation	8,400 confirmations (15%)	$0.000007	$0.06
+ MEMStorage routing fee	56,000 events	$0.0105	$588
Net with MEMStorage	—	—	$2,100

Routing fee is the published Pro-tier rate. Avg cost per call uses the benchmark mean of 3,420 input tokens + ~120 output tokens at OpenAI list pricing. Confirmation cost is $0.000007 per event, based on the 21-token average probe.
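The projection in the table is straightforward arithmetic over the tier mix, and can be checked with a few lines of Python. This sketch takes every input as given in the table, including the $0.150 per-call figure, rather than deriving anything from token counts; the function and parameter names are hypothetical.

```python
def monthly_costs(
    queries: int = 56_000,
    tier3_share: float = 0.18,            # full inference
    tier2_share: float = 0.15,            # confirmation probes
    cost_per_inference: float = 0.150,    # table's avg cost per call, taken as given
    cost_per_confirmation: float = 0.000007,
    routing_fee_per_event: float = 0.0105,
) -> dict:
    """Reproduce the monthly cost rows from the tier mix and unit costs."""
    baseline = queries * cost_per_inference
    inference = queries * tier3_share * cost_per_inference
    confirmation = queries * tier2_share * cost_per_confirmation
    routing = queries * routing_fee_per_event
    net = inference + confirmation + routing
    return {
        "baseline": baseline,
        "inference": inference,
        "confirmation": confirmation,
        "routing": routing,
        "net": net,
        "savings_share": 1 - net / baseline,
    }


costs = monthly_costs()
for name, value in costs.items():
    if name == "savings_share":
        print(f"{name}: {value:.1%}")
    else:
        print(f"{name}: ${value:,.2f}")
```

With the table's inputs this yields a net of roughly $2,100 per month against an $8,400 baseline, a reduction of about 75%. Since the routing fee is charged on every event, it accounts for most of the remaining cost after the 18% of queries that still reach the model.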

6 · Reproducing this benchmark

Three options, in increasing fidelity:

Live demo (5 min): visit memstorage.com/app. Both SEC lease documents are pre-loaded; type any of the queries above and watch the tier decision render.
API replay (~30 min): sign up for a free API key, drop the leases into the lease-abstraction namespace, and replay the query log. Full CSV with all 47 queries available on request.
On your data (30-day pilot): bring your own corpus and query log; we instrument the routing layer in your VPC and produce your actual savings number against your provider's actual bill. Request a benchmark below.

Section 04

What memory routing should deliver in production.

Based on external research and the structural economics of enterprise AI workloads, a well-designed memory routing layer in repetitive document workflows should reasonably target:

68%

Reduction in API calls in repetitive query categories

GPT Semantic Cache, arXiv 2024

45–72%

Share of requests served from cache (Alura) or API cost reduction (DEV Community) in real production deployments

Alura deployment + DEV Community benchmark 2026

97%+

Accuracy on cache hits when routing thresholds are calibrated correctly