Prompt Caching

Leverage provider-side prompt caching for significant cost and latency savings on large, repeated prompts.

To save on inference costs, you can leverage prompt caching on supported providers and models. When a provider supports it, LangDB makes a best-effort attempt to route subsequent requests to the same provider so they can reuse the warm cache.

Most providers automatically enable prompt caching for large prompts, but some, like Anthropic, require you to enable it on a per-message basis.
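
For auto-caching providers no special flags are needed; you simply send a normal chat completion with the static content up front. The sketch below is a minimal illustration using Python and an OpenAI-compatible chat completions route; the base URL, environment variable, and model name are placeholders, not confirmed LangDB values.

```python
import os
import requests

# Placeholder endpoint: assumes an OpenAI-compatible /v1/chat/completions route.
BASE_URL = "https://api.example-langdb-gateway.com/v1"
API_KEY = os.environ["LANGDB_API_KEY"]

# A long, static system prompt (over the provider's threshold, e.g. 1024 tokens)
# is what makes the request eligible for a provider-side cache write and later reads.
STATIC_SYSTEM_PROMPT = "You are a support agent. ... (long, unchanging instructions) ..."

def ask(question: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "openai/gpt-4o-mini",  # hypothetical model name
            "messages": [
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # static prefix first
                {"role": "user", "content": question},                # variable part last
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Repeated calls share the same prefix, so later calls can hit the warm cache.
print(ask("How do I reset my password?"))
print(ask("What is your refund policy?"))
```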

How Caching Works

Automatic Caching

Providers like OpenAI, Grok, and DeepSeek (with Google Gemini coming soon) enable caching by default once your prompt exceeds a certain length (e.g., 1,024 tokens).

  • Activation: No change needed. Any prompt over the length threshold is written to cache.

  • Best Practice: Put your static content (system prompts, RAG context, long instructions) at the start of the prompt, as in the sketch above; automatic caching matches on the prompt prefix, so keeping that prefix stable across requests maximizes cache hits.

  • Pricing:

    • Cache Write: Usually no additional cost; the prompt is billed at the standard input-token rate.

    • Cache Read: Deeply discounted compared with uncached input tokens (see the cache-hit check after this list).
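
To confirm that a request actually hit the cache, inspect the usage block of the response. The helper below is a small sketch that assumes an OpenAI-style usage.prompt_tokens_details.cached_tokens field; other providers report cached tokens under different names.

```python
# Minimal helper to check for a cache hit on an OpenAI-style response payload.
# Assumes usage.prompt_tokens_details.cached_tokens; field names vary by provider.
def cached_token_count(response_json: dict) -> int:
    usage = response_json.get("usage", {})
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0)

def report_cache_status(response_json: dict) -> None:
    cached = cached_token_count(response_json)
    total = response_json.get("usage", {}).get("prompt_tokens", 0)
    if cached:
        print(f"Cache HIT: {cached}/{total} prompt tokens read from cache")
    else:
        print("Cache MISS: prompt written to cache (if over the length threshold)")
```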

Example Trace of Cache HIT

Manual Caching

Anthropic’s Claude family requires you to mark which parts of the message are cacheable by adding a cache_control object. You can also set a TTL to control how long the block stays in cache.

  • Activation: You must wrap static blocks in a content array and give them a cache_control entry.

  • TTL: Use {"ttl": "5m"} or {"ttl": "1h"} to control expiration (default 5 minutes).

  • Best For: Huge documents, long backstories, or repeated system instructions.

  • Pricing (see the cost comparison after this list):

    • Cache Write: 1.25× the normal per-token rate

    • Cache Read: 0.1× (10%) of the normal per-token rate

  • Limitations: Caching is ephemeral (blocks expire once the TTL lapses), and only a limited number of blocks per request can be marked cacheable.
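
A quick back-of-the-envelope comparison shows how fast the 1.25× write / 0.1× read multipliers pay off. The numbers below assume a hypothetical $3 per million input tokens base rate and a 100k-token cached document; they are illustrative only, not quoted prices.

```python
# Back-of-the-envelope cost comparison for Anthropic-style manual caching.
# The $3 / 1M input tokens base price is a hypothetical example, not a quoted rate.
BASE_PRICE_PER_TOKEN = 3.00 / 1_000_000
DOC_TOKENS = 100_000           # size of the cached document block
WRITE_MULT, READ_MULT = 1.25, 0.10

def cost_without_cache(requests: int) -> float:
    return requests * DOC_TOKENS * BASE_PRICE_PER_TOKEN

def cost_with_cache(requests: int) -> float:
    write = DOC_TOKENS * BASE_PRICE_PER_TOKEN * WRITE_MULT           # first request
    reads = (requests - 1) * DOC_TOKENS * BASE_PRICE_PER_TOKEN * READ_MULT
    return write + reads

for n in (1, 2, 10):
    print(f"{n} requests: ${cost_without_cache(n):.3f} uncached vs ${cost_with_cache(n):.3f} cached")
# 1 request:   $0.300 uncached vs $0.375 cached  (write premium)
# 2 requests:  $0.600 uncached vs $0.405 cached  (already cheaper)
# 10 requests: $3.000 uncached vs $0.645 cached
```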

Cache write with Anthropic Prompt Caching
In this run the trace shows “Prompt Caching: 99.9% Write” and a small cost increase (~25%) from the cache-write premium.

Caching Example (Anthropic)

Here is an example of caching a large document. The cache_control block can be placed in either the system or user message.

{
  "model": "anthropic/claude-3.5-sonnet",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant that analyzes legal documents. The following is a terms of service document:"
        },
        {
          "type": "text",
          "text": "HUGE DOCUMENT TEXT...",
          "cache_control": {
            "type": "ephemeral",
            "ttl": "1h"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarize the key points about data privacy."
        }
      ]
    }
  ]
}
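
To send this request over HTTP, the payload can be posted as-is to an OpenAI-compatible chat completions route; the cache_control fields pass through with the message content. The base URL below is a placeholder, not a confirmed LangDB endpoint.

```python
import os
import requests

# Placeholder base URL; substitute your gateway's OpenAI-compatible endpoint.
BASE_URL = "https://api.example-langdb-gateway.com/v1"

payload = {
    "model": "anthropic/claude-3.5-sonnet",
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text",
                 "text": "You are a helpful assistant that analyzes legal documents. "
                         "The following is a terms of service document:"},
                {"type": "text",
                 "text": "HUGE DOCUMENT TEXT...",
                 # Marks this block cacheable for one hour; omit ttl for the 5-minute default.
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            ],
        },
        {"role": "user",
         "content": [{"type": "text", "text": "Summarize the key points about data privacy."}]},
    ],
}

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LANGDB_API_KEY']}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```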

Provider Support Matrix

| Provider | Auto-cache? | Manual flag? | TTL | Write cost | Read cost |
| --- | --- | --- | --- | --- | --- |
| OpenAI | Yes | N/A | N/A | standard | 0.25× or 0.5× |
| Grok | Yes | N/A | N/A | standard | 0.25× |
| DeepSeek | Yes | N/A | N/A | standard | 0.25× |
| Anthropic Claude | No | cache_control + TTL | 5 m / 1 h | 1.25× | 0.1× |


For the most up-to-date information on a specific model or provider's caching policy, pricing, and limitations, refer to the model page on LangDB.
