Prompt Caching

Leverage provider-side prompt caching for significant cost and latency savings on large, repeated prompts.

To save on inference costs, you can leverage prompt caching on supported providers and models. When a provider supports it, LangDB makes a best-effort attempt to route subsequent requests to the same provider so they can reuse the warm cache.

Most providers automatically enable prompt caching for large prompts, but some, like Anthropic, require you to enable it on a per-message basis.
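
For auto-caching providers no special flags are needed; you simply send a normal chat completion with the static content up front. The sketch below is a minimal illustration using Python and an OpenAI-compatible chat completions route; the base URL, environment variable, and model name are placeholders, not confirmed LangDB values.

```python
import os
import requests

# Placeholder endpoint: assumes an OpenAI-compatible /v1/chat/completions route.
BASE_URL = "https://api.example-langdb-gateway.com/v1"
API_KEY = os.environ["LANGDB_API_KEY"]

# A long, static system prompt (over the provider's threshold, e.g. 1024 tokens)
# is what makes the request eligible for a provider-side cache write and later reads.
STATIC_SYSTEM_PROMPT = "You are a support agent. ... (long, unchanging instructions) ..."

def ask(question: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "openai/gpt-4o-mini",  # hypothetical model name
            "messages": [
                {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # static prefix first
                {"role": "user", "content": question},                # variable part last
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Repeated calls share the same prefix, so later calls can hit the warm cache.
print(ask("How do I reset my password?"))
print(ask("What is your refund policy?"))
```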

How Caching Works

Automatic Caching

Providers like OpenAI, Grok, and DeepSeek (with Google Gemini coming soon) enable caching by default once your prompt exceeds a certain length (e.g., 1,024 tokens).

  • Activation: No change needed. Any prompt over the length threshold is written to cache.

  • Best Practice: Put your static content (system prompts, RAG context, long instructions) at the start of the prompt, as in the sketch above; automatic caching matches on the prompt prefix, so keeping that prefix stable across requests maximizes cache hits.

  • Pricing:

    • Cache Write: Usually no additional cost; the prompt is billed at the standard input-token rate.

    • Cache Read: Deeply discounted compared with uncached input tokens (see the cache-hit check after this list).
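
To confirm that a request actually hit the cache, inspect the usage block of the response. The helper below is a small sketch that assumes an OpenAI-style usage.prompt_tokens_details.cached_tokens field; other providers report cached tokens under different names.

```python
# Minimal helper to check for a cache hit on an OpenAI-style response payload.
# Assumes usage.prompt_tokens_details.cached_tokens; field names vary by provider.
def cached_token_count(response_json: dict) -> int:
    usage = response_json.get("usage", {})
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0)

def report_cache_status(response_json: dict) -> None:
    cached = cached_token_count(response_json)
    total = response_json.get("usage", {}).get("prompt_tokens", 0)
    if cached:
        print(f"Cache HIT: {cached}/{total} prompt tokens read from cache")
    else:
        print("Cache MISS: prompt written to cache (if over the length threshold)")
```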

Example Trace of Cache HIT

Manual Caching

Anthropic’s Claude family requires you to mark which parts of the message are cacheable by adding a cache_control object. You can also set a TTL to control how long the block stays in cache.

  • Activation: You must wrap static blocks in a content array and give them a cache_control entry.

  • TTL: Use {"ttl": "5m"} or {"ttl": "1h"} to control expiration (default 5 minutes).

  • Best For: Huge documents, long backstories, or repeated system instructions.

  • Pricing (see the cost comparison after this list):

    • Cache Write: 1.25× the normal per-token rate

    • Cache Read: 0.1× (10%) of the normal per-token rate

  • Limitations: Caching is ephemeral (blocks expire once the TTL lapses), and only a limited number of blocks per request can be marked cacheable.
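
A quick back-of-the-envelope comparison shows how fast the 1.25× write / 0.1× read multipliers pay off. The numbers below assume a hypothetical $3 per million input tokens base rate and a 100k-token cached document; they are illustrative only, not quoted prices.

```python
# Back-of-the-envelope cost comparison for Anthropic-style manual caching.
# The $3 / 1M input tokens base price is a hypothetical example, not a quoted rate.
BASE_PRICE_PER_TOKEN = 3.00 / 1_000_000
DOC_TOKENS = 100_000           # size of the cached document block
WRITE_MULT, READ_MULT = 1.25, 0.10

def cost_without_cache(requests: int) -> float:
    return requests * DOC_TOKENS * BASE_PRICE_PER_TOKEN

def cost_with_cache(requests: int) -> float:
    write = DOC_TOKENS * BASE_PRICE_PER_TOKEN * WRITE_MULT           # first request
    reads = (requests - 1) * DOC_TOKENS * BASE_PRICE_PER_TOKEN * READ_MULT
    return write + reads

for n in (1, 2, 10):
    print(f"{n} requests: ${cost_without_cache(n):.3f} uncached vs ${cost_with_cache(n):.3f} cached")
# 1 request:   $0.300 uncached vs $0.375 cached  (write premium)
# 2 requests:  $0.600 uncached vs $0.405 cached  (already cheaper)
# 10 requests: $3.000 uncached vs $0.645 cached
```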

Cache write with Anthropic Prompt Caching
In this run the trace shows “Prompt Caching: 99.9% Write” and a small cost increase (~25%) from the cache-write premium.

Caching Example (Anthropic)

Here is an example of caching a large document. The cache_control block can be placed in either the system or user message.

{
  "model": "anthropic/claude-3.5-sonnet",
  "messages": [
    {
      "role": "system",
      "content": [
        {
          "type": "text",
          "text": "You are a helpful assistant that analyzes legal documents. The following is a terms of service document:"
        },
        {
          "type": "text",
          "text": "HUGE DOCUMENT TEXT...",
          "cache_control": {
            "type": "ephemeral",
            "ttl": "1h"
          }
        }
      ]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Summarize the key points about data privacy."
        }
      ]
    }
  ]
}
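
To send this request over HTTP, the payload can be posted as-is to an OpenAI-compatible chat completions route; the cache_control fields pass through with the message content. The base URL below is a placeholder, not a confirmed LangDB endpoint.

```python
import os
import requests

# Placeholder base URL; substitute your gateway's OpenAI-compatible endpoint.
BASE_URL = "https://api.example-langdb-gateway.com/v1"

payload = {
    "model": "anthropic/claude-3.5-sonnet",
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text",
                 "text": "You are a helpful assistant that analyzes legal documents. "
                         "The following is a terms of service document:"},
                {"type": "text",
                 "text": "HUGE DOCUMENT TEXT...",
                 # Marks this block cacheable for one hour; omit ttl for the 5-minute default.
                 "cache_control": {"type": "ephemeral", "ttl": "1h"}},
            ],
        },
        {"role": "user",
         "content": [{"type": "text", "text": "Summarize the key points about data privacy."}]},
    ],
}

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LANGDB_API_KEY']}"},
    json=payload,
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```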

Provider Support Matrix

| Provider | Auto-cache? | Manual flag? | TTL | Write cost | Read cost |
| --- | --- | --- | --- | --- | --- |
| OpenAI | Yes | N/A | N/A | standard | 0.25× or 0.5× |
| Grok | Yes | N/A | N/A | standard | 0.25× |
| DeepSeek | Yes | N/A | N/A | standard | 0.25× |
| Anthropic Claude | No | cache_control + TTL | 5 m / 1 h | 1.25× | 0.1× |


For the most up-to-date information on a specific model or provider's caching policy, pricing, and limitations, refer to the model page on LangDB.
