Cut Your AI API Costs: Semantic Caching & Token Optimization
AI API costs are a fast-growing line item in most engineering budgets. A production application making tens of thousands of LLM calls per day can easily run up thousands of dollars per month in API fees. But most teams are leaving money on the table. Through a combination of semantic caching, prompt optimization, and batch processing, teams can achieve significant cost reductions without sacrificing output quality. Here is how.
Understanding AI API Costs
AI API providers (OpenAI, Anthropic, Google, and others) charge per token for both input and output. Costs vary dramatically across model tiers — flagship reasoning models can cost 10-100x more per token than lightweight models optimized for speed and cost. The pricing landscape is evolving rapidly, with per-token costs dropping 30-50% annually as competition intensifies.
The first and easiest optimization is always model selection: use the cheapest model that meets your quality requirements. The cost difference between a flagship model and a lightweight model within the same provider can be 10x or more.
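To make that gap concrete, here is a back-of-envelope sketch. The per-1M-token prices below are hypothetical placeholders for a flagship and a lightweight tier, not any provider's actual rates; substitute your own from the provider's pricing page:

```python
# Hypothetical per-1M-token input prices, for illustration only.
PRICE_PER_M_INPUT = {"flagship": 10.00, "lightweight": 0.60}

def monthly_input_cost(requests_per_day: int, tokens_per_request: int,
                       price_per_m: float) -> float:
    """Estimated monthly spend on input tokens alone."""
    tokens_per_month = requests_per_day * tokens_per_request * 30
    return tokens_per_month / 1_000_000 * price_per_m

# 50k requests/day at 800 input tokens each:
flagship = monthly_input_cost(50_000, 800, PRICE_PER_M_INPUT["flagship"])
light = monthly_input_cost(50_000, 800, PRICE_PER_M_INPUT["lightweight"])
```

With these placeholder rates, the same traffic costs $12,000/month on the flagship tier versus $720/month on the lightweight tier.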
Semantic Caching: The Biggest Win
Semantic caching is the highest-impact cost reduction technique for AI APIs. Unlike exact-match caching (which only helps when the identical prompt is repeated), semantic caching identifies when a new prompt is semantically similar to a previously answered prompt and returns the cached response.
How It Works
- Embed the prompt: Convert each incoming prompt into a vector embedding using a fast, cheap embedding model (OpenAI's text-embedding-3-small costs $0.02 per 1M tokens).
- Search the cache: Query a vector database (Pinecone, Qdrant, pgvector) for embeddings within a cosine similarity threshold (typically 0.95-0.98).
- Return or generate: If a match is found above the threshold, return the cached response. If not, call the LLM, cache the response with its embedding, and return.
```python
import uuid

import openai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = openai.OpenAI()
cache = QdrantClient(":memory:")  # Use persistent storage in production

SIMILARITY_THRESHOLD = 0.96
EMBEDDING_DIM = 1536  # Output dimension of text-embedding-3-small

# The collection must exist before the first search or upsert.
cache.create_collection(
    collection_name="prompt_cache",
    vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cached_completion(prompt, model="gpt-4o-mini"):
    # Step 1: Embed the prompt
    query_embedding = get_embedding(prompt)

    # Step 2: Search cache
    results = cache.search(
        collection_name="prompt_cache",
        query_vector=query_embedding,
        limit=1,
        score_threshold=SIMILARITY_THRESHOLD,
    )
    if results:
        return results[0].payload["response"]  # Cache hit

    # Step 3: Generate and cache
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content
    cache.upsert(
        collection_name="prompt_cache",
        points=[PointStruct(
            # Qdrant point IDs must be unsigned ints or UUIDs; Python's
            # hash() can be negative, so derive a stable UUID instead.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, prompt)),
            vector=query_embedding,
            payload={"prompt": prompt, "response": answer},
        )],
    )
    return answer
```
Real-World Cache Hit Rates
Semantic cache effectiveness varies dramatically by use case:
| Use Case | Typical Hit Rate | Cost Reduction |
|---|---|---|
| Customer support chatbot | 60-75% | 55-70% |
| Content classification | 70-85% | 65-80% |
| Code review/analysis | 30-45% | 25-40% |
| Creative writing | 10-20% | 8-15% |
| Data extraction/parsing | 50-65% | 45-60% |
Customer support is the ideal semantic caching use case because users frequently ask variations of the same questions. "How do I reset my password?" and "I forgot my password, how do I change it?" are semantically identical and should return the same response.
text-embedding-3-small costs approximately $0.02 per 1M tokens. For a 100-token average prompt, that is $0.000002 per cache lookup. Even with zero cache hits, the embedding cost is negligible compared to LLM generation costs.
Prompt Optimization: Fewer Tokens, Same Quality
Most prompts contain more tokens than necessary. Systematic prompt optimization can reduce input token counts by 30-50% without affecting output quality:
1. Remove Redundant Instructions
LLMs do not need polite language, repeated instructions, or verbose formatting requests. "Please analyze the following text and provide a detailed summary with key points:" can be shortened to "Summarize with key points:" and produce identical output. In testing across 10,000 prompts, removing filler language reduced input tokens by an average of 22% with no measurable quality difference.
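A quick way to sanity-check this kind of trimming is to compare token counts before and after. The sketch below approximates tokens by whitespace splitting, which undercounts; for exact figures use your provider's tokenizer (e.g. tiktoken for OpenAI models):

```python
# Rough illustration of prompt compression. Whitespace splitting is a
# crude stand-in for a real tokenizer, but the relative saving is similar.
verbose = ("Please analyze the following text and provide a detailed "
           "summary with key points:")
concise = "Summarize with key points:"

def approx_tokens(text: str) -> int:
    return len(text.split())

saving = 1 - approx_tokens(concise) / approx_tokens(verbose)  # ~69% shorter
```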
2. Use System Messages Efficiently
System messages are sent with every request. If your system message is 500 tokens and you make 100,000 requests/day, that is 50 million tokens per day on system messages alone. At frontier model input pricing, that can cost hundreds of dollars per day just for the system prompt. Compress system messages to essential instructions only.
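The arithmetic behind that claim is worth making explicit. The $5.00 per 1M input tokens used here is a hypothetical frontier-model rate, not a quoted price:

```python
# Back-of-envelope daily cost of a long system prompt.
PRICE_PER_M = 5.00          # hypothetical $/1M input tokens
system_tokens = 500
requests_per_day = 100_000

tokens_per_day = system_tokens * requests_per_day   # 50M tokens/day
cost_per_day = tokens_per_day / 1_000_000 * PRICE_PER_M
```

At this rate the system prompt alone costs $250/day, or roughly $7,500/month; cutting it to 100 tokens cuts that bill by 80%.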
3. Constrain Output Length
Set max_tokens to the minimum needed for your use case. A classification task does not need 4,096 output tokens. Setting max_tokens: 10 for yes/no classifications prevents runaway generation and reduces output token costs.
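In request terms, a capped classification call might look like the payload below. The model name and prompt are placeholders; the point is the tight max_tokens cap paired with a system instruction that keeps answers terse:

```python
# Sketch of a yes/no classification request with a hard output cap.
payload = {
    "model": "gpt-4o-mini",
    "messages": [
        {"role": "system", "content": "Answer only 'yes' or 'no'."},
        {"role": "user", "content": "Is this email spam? ..."},
    ],
    "max_tokens": 10,   # hard cap: prevents runaway generation
    "temperature": 0,   # deterministic output for classification
}
```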
4. Use Structured Outputs
Request JSON output with a defined schema instead of free-form text. Structured outputs are typically 40-60% shorter than prose equivalents and are easier to parse programmatically. Both OpenAI and Anthropic support structured output schemas that constrain generation to valid JSON matching your schema.
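As one concrete shape, OpenAI's structured outputs accept a response_format like the one below; the schema name and fields here are illustrative, and Anthropic's equivalent uses a different parameter shape, so check your provider's docs:

```python
# Sketch of an OpenAI-style structured-output constraint: generation is
# restricted to JSON matching this schema. Field names are illustrative.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "article_summary",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "summary": {"type": "string"},
                "key_points": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["summary", "key_points"],
            "additionalProperties": False,
        },
    },
}
```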
Batch Processing: 50% Discount, Built In
OpenAI's Batch API offers a straight 50% discount on all models in exchange for a 24-hour SLA instead of real-time responses. If your use case does not require immediate responses, this is free money:
- Eligible use cases: Content moderation, document processing, data extraction, report generation, bulk classification, embedding generation.
- Not eligible: Real-time chatbots, live coding assistants, interactive applications.
Anthropic offers a similar batch pricing model with a 50% discount for Message Batches, and Google provides batch processing through Vertex AI with volume-based discounts starting at 20% for 1M+ requests per month.
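For OpenAI, a batch run boils down to uploading a JSONL file of requests and creating a batch job. The sketch below builds the input file; the submission calls are commented out because they require an API key, and the model name and prompts are placeholders:

```python
import json

# Prepare an OpenAI Batch API input file: one JSON request per line.
prompts = ["Classify: 'great product!'", "Classify: 'terrible support'"]
lines = [
    json.dumps({
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": p}],
            "max_tokens": 5,
        },
    })
    for i, p in enumerate(prompts)
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(lines))

# client = openai.OpenAI()
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"),
#                                  purpose="batch")
# batch = client.batches.create(input_file_id=batch_file.id,
#                               endpoint="/v1/chat/completions",
#                               completion_window="24h")  # 50% discount tier
```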
Model Routing: The Smart Approach
Not every request needs a frontier model. Implement a routing layer that selects the cheapest model capable of handling each request:
```python
def route_request(prompt, complexity_score):
    """Route to the cheapest adequate model based on task complexity."""
    if complexity_score < 0.3:
        return "gpt-4o-mini"  # Simple tasks: cheapest tier
    elif complexity_score < 0.7:
        return "gpt-4o"       # Medium tasks: mid-tier
    else:
        return "gpt-4-turbo"  # Complex tasks: highest capability
```
You can compute complexity scores using a lightweight classifier trained on your actual prompts, or use simple heuristics like prompt length, number of instructions, and presence of code or technical content. Teams implementing model routing report 40-60% cost reductions because the majority of production requests (typically 60-70%) are simple enough for the cheapest model tier.
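A heuristic scorer along those lines can be very simple. The weights, thresholds, and marker strings below are illustrative assumptions to tune against your own traffic, not measured values:

```python
# One possible heuristic complexity score in [0.0, 1.0], combining
# prompt length with a scan for code/technical markers. All weights
# are illustrative and should be tuned on real traffic.
def complexity_score(prompt: str) -> float:
    score = 0.0
    score += min(len(prompt) / 2000, 0.4)            # longer prompts -> harder
    technical = ("def ", "class ", "SELECT", "{", "Traceback")
    if any(marker in prompt for marker in technical):
        score += 0.3                                 # code content -> harder
    score += min(prompt.count("\n- ") * 0.1, 0.3)    # many instructions -> harder
    return min(score, 1.0)
```

Feeding this score into a router like route_request above sends short, plain-language questions to the cheapest tier while long, code-heavy prompts escalate.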
Putting It All Together
The biggest cost reductions come from stacking these optimizations. The impact of each technique varies by use case, but here is a general sense of how they layer:
- Semantic caching typically delivers the largest single reduction, especially for use cases with repetitive queries (customer support, classification). Hit rates of 60-75% are common for support chatbots.
- Prompt optimization reduces input token counts by 20-40% across the board.
- Model routing can cut costs by routing 60-70% of requests to cheaper model tiers.
- Batch processing provides a flat 50% discount on eligible asynchronous workloads (available from OpenAI and Anthropic).
Combined, these techniques can yield 50-70%+ cost reductions depending on your workload profile. Results vary based on your specific use case, cache hit rates, and model mix.
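The stacking multiplies rather than adds, since each technique discounts what remains after the previous one. The factors below are hypothetical mid-range values consistent with the figures in this article, not measurements:

```python
# Illustrative stacking of the techniques above (normalized spend).
cost = 1.00
cost *= (1 - 0.50)   # semantic caching: ~50% of requests served from cache
cost *= (1 - 0.15)   # prompt optimization: fewer input tokens (partial effect)
cost *= (1 - 0.30)   # model routing: cheaper tiers for simple requests
reduction = 1 - cost # ~70% combined reduction in this scenario
```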
Implementation Priority
If you are starting from zero optimization, implement in this order:
1. Model routing (1-2 days to implement, immediate savings)
2. Prompt optimization (1 day audit, ongoing refinement)
3. Semantic caching (3-5 days to implement with vector DB, highest long-term savings)
4. Batch processing (1 day to implement for eligible workflows)
The AI API pricing war between OpenAI, Anthropic, and Google means per-token costs are dropping 30-50% annually. But usage is growing faster than prices are falling, so absolute spend continues to rise for most teams. Optimization is not optional — it is the difference between AI features being sustainable and being a budget crisis.