Cut Your AI API Costs: Semantic Caching & Token Optimization

AI API costs are a fast-growing line item in most engineering budgets. A production application making tens of thousands of LLM calls per day can easily run up thousands of dollars per month in API fees. But most teams are leaving money on the table. Through a combination of semantic caching, prompt optimization, and batch processing, teams can achieve significant cost reductions without sacrificing output quality. Here is how.

Understanding AI API Costs

AI API providers (OpenAI, Anthropic, Google, and others) charge per token for both input and output. Costs vary dramatically across model tiers — flagship reasoning models can cost 10-100x more per token than lightweight models optimized for speed and cost. The pricing landscape is evolving rapidly, with per-token costs dropping 30-50% annually as competition intensifies.

Check current pricing: AI API pricing changes frequently. Always refer to the official pricing pages at openai.com/pricing, anthropic.com/pricing, and ai.google.dev/pricing for the latest rates.

The first and easiest optimization is always model selection: use the cheapest model that meets your quality requirements. The cost difference between a flagship model and a lightweight model within the same provider can be 10x or more.

Semantic Caching: The Biggest Win

Semantic caching is the highest-impact cost reduction technique for AI APIs. Unlike exact-match caching (which only helps when the identical prompt is repeated), semantic caching identifies when a new prompt is semantically similar to a previously answered prompt and returns the cached response.

How It Works

  1. Embed the prompt: Convert each incoming prompt into a vector embedding using a fast, cheap embedding model (OpenAI's text-embedding-3-small costs $0.02 per 1M tokens).
  2. Search the cache: Query a vector database (Pinecone, Qdrant, pgvector) for embeddings within a cosine similarity threshold (typically 0.95-0.98).
  3. Return or generate: If a match is found above the threshold, return the cached response. If not, call the LLM, cache the response with its embedding, and return.
A minimal implementation using the OpenAI and Qdrant Python clients. Note that the collection must be created before first use, and that Python's built-in hash() is not stable across processes (and can be negative), so a deterministic UUID derived from the prompt serves as the point ID:

import uuid

import openai
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = openai.OpenAI()
cache = QdrantClient(":memory:")  # Use persistent storage in production
SIMILARITY_THRESHOLD = 0.96
EMBEDDING_DIM = 1536  # Output dimension of text-embedding-3-small

# Create the cache collection; cosine distance matches the threshold above
cache.create_collection(
    collection_name="prompt_cache",
    vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def cached_completion(prompt, model="gpt-4o-mini"):
    # Step 1: Embed the prompt
    query_embedding = get_embedding(prompt)

    # Step 2: Search the cache for a semantically similar prompt
    results = cache.search(
        collection_name="prompt_cache",
        query_vector=query_embedding,
        limit=1,
        score_threshold=SIMILARITY_THRESHOLD,
    )
    if results:
        return results[0].payload["response"]  # Cache hit

    # Step 3: Generate, cache, and return
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = response.choices[0].message.content

    cache.upsert(
        collection_name="prompt_cache",
        points=[PointStruct(
            # Deterministic ID: same prompt always maps to the same point
            id=str(uuid.uuid5(uuid.NAMESPACE_DNS, prompt)),
            vector=query_embedding,
            payload={"prompt": prompt, "response": answer},
        )],
    )
    return answer

Real-World Cache Hit Rates

Semantic cache effectiveness varies dramatically by use case:

| Use Case | Typical Hit Rate | Cost Reduction |
| --- | --- | --- |
| Customer support chatbot | 60-75% | 55-70% |
| Content classification | 70-85% | 65-80% |
| Code review/analysis | 30-45% | 25-40% |
| Creative writing | 10-20% | 8-15% |
| Data extraction/parsing | 50-65% | 45-60% |

Customer support is the ideal semantic caching use case because users frequently ask variations of the same questions. "How do I reset my password?" and "I forgot my password, how do I change it?" are semantically identical and should return the same response.

Cost of the cache itself: Embedding each prompt with text-embedding-3-small costs approximately $0.02 per 1M tokens. For a 100-token average prompt, that is $0.000002 per cache lookup. Even with zero cache hits, the embedding cost is negligible compared to LLM generation costs.
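The arithmetic behind that per-lookup figure is worth making explicit:

```python
# Cost of one cache lookup: a 100-token prompt embedded at $0.02 per 1M tokens
EMBED_PRICE_PER_M = 0.02    # USD per 1M tokens (text-embedding-3-small)
AVG_PROMPT_TOKENS = 100

cost_per_lookup = AVG_PROMPT_TOKENS / 1_000_000 * EMBED_PRICE_PER_M
# cost_per_lookup == $0.000002, i.e. 500,000 lookups per dollar
```

At that rate, even a cache with a poor hit rate pays for itself on the first avoided generation call.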

Prompt Optimization: Fewer Tokens, Same Quality

Most prompts contain more tokens than necessary. Systematic prompt optimization can reduce input token counts by 30-50% without affecting output quality:

1. Remove Redundant Instructions

LLMs do not need polite language, repeated instructions, or verbose formatting requests. "Please analyze the following text and provide a detailed summary with key points:" can be shortened to "Summarize with key points:" and produce identical output. In testing across 10,000 prompts, removing filler language reduced input tokens by an average of 22% with no measurable quality difference.
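A quick way to estimate savings before committing to a trimmed prompt is a rough character-based token heuristic (an approximation; use your provider's tokenizer for exact counts):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text
    # (an approximation, not a real tokenizer)
    return max(1, len(text) // 4)

verbose = "Please analyze the following text and provide a detailed summary with key points:"
terse = "Summarize with key points:"

savings = 1 - approx_tokens(terse) / approx_tokens(verbose)
# The terse version uses well under half the input tokens
```

Run trimmed prompts through your quality evals before shipping them; the goal is fewer tokens at equal output quality, not just fewer tokens.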

2. Use System Messages Efficiently

System messages are sent with every request. If your system message is 500 tokens and you make 100,000 requests/day, that is 50 million tokens per day on system messages alone. At frontier model input pricing, that can cost hundreds of dollars per day just for the system prompt. Compress system messages to essential instructions only.
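The arithmetic, using an assumed input price of $2.50 per 1M tokens (check your provider's current rates):

```python
# Daily cost attributable to the system prompt alone
SYSTEM_PROMPT_TOKENS = 500
REQUESTS_PER_DAY = 100_000
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens (assumed rate)

tokens_per_day = SYSTEM_PROMPT_TOKENS * REQUESTS_PER_DAY   # 50,000,000
cost_per_day = tokens_per_day / 1_000_000 * PRICE_PER_M_INPUT  # $125.00/day
```

Cutting that system message from 500 tokens to 150 saves 70% of this line item with a one-line change.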

3. Constrain Output Length

Set max_tokens to the minimum needed for your use case. A classification task does not need 4,096 output tokens. Setting max_tokens: 10 for yes/no classifications prevents runaway generation and reduces output token costs.
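As a sketch, the request parameters for a yes/no classifier might look like this (model name and prompt wording are illustrative):

```python
def classification_request(question: str) -> dict:
    """Build request parameters for a capped yes/no classification call."""
    return {
        "model": "gpt-4o-mini",
        "messages": [
            {"role": "user", "content": f"Answer yes or no only: {question}"}
        ],
        "max_tokens": 10,   # a yes/no answer never needs more
        "temperature": 0,   # deterministic output for classification
    }

params = classification_request("Is this email spam?")
```

The cap is a safety net as much as a cost control: a misbehaving prompt can otherwise generate thousands of unwanted output tokens per request.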

4. Use Structured Outputs

Request JSON output with a defined schema instead of free-form text. Structured outputs are typically 40-60% shorter than prose equivalents and are easier to parse programmatically. Both OpenAI and Anthropic support structured output schemas that constrain generation to valid JSON matching your schema.
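An OpenAI-style response_format schema looks roughly like this (field names in the schema are illustrative; consult your provider's structured-output documentation for the exact envelope):

```python
# Constrains generation to a small JSON object instead of free-form prose
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_classification",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string"},
                "urgent": {"type": "boolean"},
            },
            "required": ["category", "urgent"],
            "additionalProperties": False,
        },
    },
}
```

A two-field JSON object is a handful of output tokens; the prose equivalent ("Based on my analysis, this ticket appears to be...") is easily ten times that.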

Batch Processing: 50% Discount, Built In

OpenAI's Batch API offers a flat 50% discount across models in exchange for a 24-hour completion window instead of real-time responses. If your use case does not require immediate responses, this is free money.
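Batch requests are submitted as a JSONL file of individual requests. A sketch of building one (prompts and custom_id values are illustrative; the upload step requires an API key and is shown as comments):

```python
import json

# Each line is one request; custom_id lets you match results back later
requests = [
    {
        "custom_id": f"task-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 100,
        },
    }
    for i, prompt in enumerate(["Summarize ticket 1", "Summarize ticket 2"])
]

with open("batch_input.jsonl", "w") as f:
    for req in requests:
        f.write(json.dumps(req) + "\n")

# Upload and submit (shown for completeness; requires credentials):
# client = openai.OpenAI()
# batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
# client.batches.create(
#     input_file_id=batch_file.id,
#     endpoint="/v1/chat/completions",
#     completion_window="24h",
# )
```

Good batch candidates include nightly classification runs, embedding backfills, and report generation — anything where nobody is waiting on the response.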

Anthropic offers a similar batch pricing model with a 50% discount for Message Batches, and Google provides batch processing through Vertex AI with volume-based discounts starting at 20% for 1M+ requests per month.

Model Routing: The Smart Approach

Not every request needs a frontier model. Implement a routing layer that selects the cheapest model capable of handling each request:

def route_request(prompt, complexity_score):
    """Route to cheapest adequate model based on task complexity."""
    if complexity_score < 0.3:
        return "gpt-4o-mini"     # Simple tasks: cheapest tier
    elif complexity_score < 0.7:
        return "gpt-4o"          # Medium tasks: mid-tier
    else:
        return "gpt-4-turbo"     # Complex tasks: highest capability

You can compute complexity scores using a lightweight classifier trained on your actual prompts, or use simple heuristics like prompt length, number of instructions, and presence of code or technical content. Teams implementing model routing report 40-60% cost reductions because the majority of production requests (typically 60-70%) are simple enough for the cheapest model tier.
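A naive heuristic scorer might look like this (the signals and weights are illustrative assumptions and should be tuned against your actual traffic):

```python
def complexity_score(prompt: str) -> float:
    """Estimate task complexity in [0, 1] from cheap surface features."""
    score = 0.0
    if len(prompt) > 1000:
        score += 0.3          # long prompts tend to carry harder tasks
    if "```" in prompt or "def " in prompt:
        score += 0.4          # code content usually needs a stronger model
    analysis_words = sum(
        prompt.lower().count(w) for w in ("analyze", "compare", "explain why")
    )
    score += min(0.3, 0.1 * analysis_words)   # multi-step reasoning cues
    return min(1.0, score)
```

Feeding complexity_score(prompt) into route_request above completes the router; start with heuristics, then replace them with a trained classifier once you have labeled routing outcomes.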

Putting It All Together

The biggest cost reductions come from stacking these optimizations: caching cuts the number of paid calls, routing and prompt optimization cut the cost of each remaining call, and batching discounts whatever eligible work is left. The impact of each technique varies by use case.

Combined, these techniques can yield 50-70%+ cost reductions depending on your workload profile. Results vary based on your specific use case, cache hit rates, and model mix.

Implementation Priority

If you are starting from zero optimization, implement in this order:

  1. Model routing (1-2 days to implement, immediate savings)
  2. Prompt optimization (1 day audit, ongoing refinement)
  3. Semantic caching (3-5 days to implement with vector DB, highest long-term savings)
  4. Batch processing (1 day to implement for eligible workflows)

The AI API pricing war between OpenAI, Anthropic, and Google means per-token costs are dropping 30-50% annually. But usage is growing faster than prices are falling, so absolute spend continues to rise for most teams. Optimization is not optional — it is the difference between AI features being sustainable and being a budget crisis.