Run LangDB AI Gateway locally.
LangDB AI Gateway is available as an open-source repository that you can run and configure locally. Own your LLM data and route to 250+ models.
The repository is available at https://github.com/langdb/ai-gateway
Make your first request
# Chat completion with GPT-4o-mini
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}]
  }'
# Or try Claude
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "claude-3-opus",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
The gateway provides the following OpenAI-compatible endpoints:
POST /v1/chat/completions - Chat completions
GET /v1/models - List available models
POST /v1/embeddings - Generate embeddings
POST /v1/images/generations - Generate images
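For example, you can check that the gateway is up and see which models it exposes by calling the models endpoint:
# List the models available through the gateway
curl http://localhost:8080/v1/models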
LangDB allows advanced configuration options to customize its functionality. The three main configuration areas are:
Limits – Control API usage with rate limiting and cost control.
Routing – Define how requests are routed across multiple LLM providers.
Observability – Enable logging and tracing to monitor API performance.
These configurations can be set up using a configuration file (config.yaml) or overridden via command-line options.
Download the sample configuration from our repo.
Copy the example config file:
curl -sL https://raw.githubusercontent.com/langdb/ai-gateway/main/config.sample.yaml -o config.sample.yaml
cp config.sample.yaml config.yaml
Command line options will override corresponding config file settings when both are specified.
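For example, if config.yaml sets rate_limit.hourly: 100, the flag below takes precedence for that setting (a sketch using the --rate-hourly flag covered later in this guide):
# Override the hourly rate limit from config.yaml
ai-gateway serve --rate-hourly 500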
Visit the repository for more details.
Configure tracing and store the data in ClickHouse.
The gateway supports OpenTelemetry tracing with ClickHouse as the storage backend. All traces are stored in the langdb.traces table.
Create the traces table in ClickHouse:
# Create langdb database if it doesn't exist
clickhouse-client --query "CREATE DATABASE IF NOT EXISTS langdb"
# Import the traces table schema
clickhouse-client --query "$(cat sql/traces.sql)"
Enable tracing by providing the ClickHouse URL when running the server:
ai-gateway serve --clickhouse-url "clickhouse://localhost:9000"
Or in config.yaml:
clickhouse:
  url: "http://localhost:8123"
Traces are stored in the langdb.traces table. Example query:
-- Get recent traces
SELECT
    trace_id,
    operation_name,
    start_time_us,
    finish_time_us,
    (finish_time_us - start_time_us) AS duration_us
FROM langdb.traces
WHERE finish_date >= today() - 1
ORDER BY finish_time_us DESC
LIMIT 10;
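The same table supports aggregate queries; as a sketch using only the columns shown above, you can find the slowest operations over the last day:
# Average duration per operation over the last day
clickhouse-client --query "
SELECT operation_name,
       count() AS calls,
       avg(finish_time_us - start_time_us) AS avg_duration_us
FROM langdb.traces
WHERE finish_date >= today() - 1
GROUP BY operation_name
ORDER BY avg_duration_us DESC
LIMIT 5"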
LangDB APIs can be called directly within ClickHouse. Check out our UDF documentation to learn how to use LLMs in SQL queries.
For a complete setup, including ClickHouse for analytics and tracing, follow these steps:
Start the services using Docker Compose:
docker-compose up -d
This will start:
ClickHouse server on port 8123 (HTTP)
All necessary configurations loaded from docker/clickhouse/server/config.d
Build and run the gateway:
ai-gateway run
The gateway will now be running with full analytics and logging capabilities, storing data in ClickHouse.
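As a quick sanity check that traces are being written (a sketch, assuming the langdb.traces table created earlier and tracing enabled):
# Count the traces recorded so far
clickhouse-client --query "SELECT count() FROM langdb.traces"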
Connect to open-source models using Ollama or vLLM with LangDB AI Gateway.
LangDB AI Gateway supports connecting to open-source models through providers like Ollama and vLLM. This allows you to use locally hosted models while maintaining the same OpenAI-compatible API interface.
To use Ollama or vLLM, you need to provide a list of models with their endpoints. By default, ai-gateway loads models from ~/.langdb/models.yaml. You can define your models there in the following format:
- model: gpt-oss
  model_provider: ollama
  inference_provider:
    provider: ollama
    model_name: gpt-oss
    endpoint: https://my-ollama-server.localhost
  price:
    per_input_token: 0.0
    per_output_token: 0.0
  input_formats:
    - text
  output_formats:
    - text
  limits:
    max_context_size: 128000
  capabilities: ['tools']
  type: completions
  description: OpenAI's open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.
All of the following fields are required:
model - The model identifier used in API requests
model_provider - The provider type (e.g., ollama, vllm)
inference_provider - Provider-specific configuration
price - Token pricing (set to 0.0 for local models)
input_formats - Supported input formats
output_formats - Supported output formats
limits - Model limitations (context size, etc.)
capabilities - Model capabilities array (e.g., ['tools'] for function calling)
type - Model type (e.g., completions)
description - Human-readable model description
Once configured, you can use your OSS models through the standard OpenAI-compatible API:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
Ollama configuration:
Provider: ollama
Endpoint: URL to your Ollama server
Model Name: The model name as configured in Ollama
vLLM configuration:
Provider: vllm
Endpoint: URL to your vLLM server
Model Name: The model name as configured in vLLM
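For example, a vLLM entry can be appended to ~/.langdb/models.yaml following the same schema as the Ollama example above (a sketch; the model, endpoint, and context size are placeholders for your own deployment):
# Append a hypothetical vLLM model entry to the models file
cat >> ~/.langdb/models.yaml <<'EOF'
- model: llama-3-8b
  model_provider: vllm
  inference_provider:
    provider: vllm
    model_name: meta-llama/Meta-Llama-3-8B-Instruct
    endpoint: http://localhost:8000
  price:
    per_input_token: 0.0
    per_output_token: 0.0
  input_formats:
    - text
  output_formats:
    - text
  limits:
    max_context_size: 8192
  capabilities: []
  type: completions
  description: Llama 3 8B served locally via vLLM.
EOF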
Local Development: Use localhost or 127.0.0.1 for local Ollama/vLLM instances
Production: Use proper domain names or IP addresses for remote instances
Security: Ensure your OSS model endpoints are properly secured
Performance: Consider the network latency between ai-gateway and your model servers
Monitoring: Use the observability features to monitor OSS model performance
Apply cost control using configuration.
Cost control helps manage API spending by setting daily, monthly, or total cost limits. Configure cost limits using:
# Set daily, monthly, and total cost limits
ai-gateway serve \
  --cost-daily 100.0 \
  --cost-monthly 1000.0 \
  --cost-total 5000.0
Or in config.yaml:
cost_control:
  daily: 100.0    # $100 per day
  monthly: 1000.0 # $1000 per month
  total: 5000.0   # $5000 total
When a cost limit is reached, the API will return a 429 response indicating the limit has been exceeded.
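You can observe this from the status code alone (a sketch; the exact error body may vary):
# Prints 429 once a configured cost limit has been exceeded
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]}'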
Prevents overspending: Ensures budgets are adhered to.
Optimizes usage: Encourages efficient API consumption.
Configure dynamic routing to manage LLM traffic intelligently with fallback, script, and latency strategies in LangDB.
LangDB AI Gateway enables sophisticated routing strategies for LLM requests. You can optimize AI traffic by implementing fallback routing, script-based routing, and latency-based routing.
The self-hosted option enables routing through configuration. Check out the full documentation for more details.
Example Configuration:
This configuration allows multiple targets with specific parameters, ensuring that requests are handled efficiently.
{
  "model": "router/dynamic",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is the formula of a square plot?" }
  ],
  "router": {
    "router": "router",
    "type": "fallback",
    "targets": [
      { "model": "openai/gpt-4o-mini", "temperature": 0.9, "max_tokens": 500, "top_p": 0.9 },
      { "model": "deepseek/deepseek-chat", "frequency_penalty": 1, "presence_penalty": 0.6 }
    ]
  },
  "stream": false
}
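To try this, save the body above to a file (here assumed to be router-request.json) and send it to the gateway's chat completions endpoint:
# Send the fallback-routing request shown above
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @router-request.json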
Apply rate limiting, cost control and more.
Rate limiting is an essential mechanism to prevent API abuse by controlling the number of requests allowed within a specific time frame. You can configure rate limits by setting hourly, daily, and monthly limits.
This ensures fair usage and helps maintain system performance and stability.
# Set hourly, daily, and monthly request limits
ai-gateway serve \
  --rate-hourly 1000 \
  --rate-daily 1000 \
  --rate-monthly 1000
Or in config.yaml:
rate_limit:
  hourly: 100
  daily: 1000
  monthly: 10000
When a rate limit is exceeded, the API will return a 429 (Too Many Requests) response.
Prevents excessive LLM API usage: Controls the number of requests per user to avoid resource exhaustion.
Optimizes model inference efficiency: Ensures that LLM requests are processed smoothly without congestion.
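On the client side, a minimal pattern is to back off and retry when a 429 is returned (a sketch; tune the retry policy to your needs):
# Retry with simple exponential backoff while rate limited
for delay in 1 2 4 8; do
  code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "hi"}]}')
  [ "$code" != "429" ] && break
  sleep "$delay"
done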
Leveraging AI functions directly in your ClickHouse environment.
langdb_udf adds support for AI operations directly within ClickHouse through User Defined Functions (UDFs). This enables running AI completions and embeddings natively in your SQL queries. You can access 250+ models directly in ClickHouse.
Check the full list of supported models here.
You can find the full instructions in our AI Gateway repository.
ai_completions: Generate AI completions from various models
ai_embed: Create embeddings from text
LangDB UDFs are particularly powerful for running LLM-based evaluations and analysis directly within your ClickHouse environment:
Native Integration: Run AI operations directly in SQL queries without data movement
Batch Processing: Efficiently process and analyze large datasets with LLMs
Real-time Analysis: Perform content moderation, sentiment analysis, and other AI tasks as part of your data pipeline
Model Comparison: Easily compare results across different LLM models in a single query
Scalability: Leverage ClickHouse's distributed architecture for parallel AI processing
Get your LangDB credentials:
Sign up at LangDB
Get your LANGDB_PROJECT_ID and LANGDB_API_KEY
Download the latest langdb_udf binary
Set up environment variables:
export LANGDB_PROJECT_ID=your_project_id
export LANGDB_API_KEY=your_api_key
# Clone the repository
git clone [email protected]:langdb/ai-gateway.git
cd ai-gateway
# Create directory for ClickHouse user scripts
mkdir -p docker/clickhouse/user_scripts
# Download the latest UDF
curl -sL https://github.com/langdb/ai-gateway/releases/download/0.1.0/langdb_udf \
-o docker/clickhouse/user_scripts/langdb_udf
# Start ClickHouse with LangDB UDF
docker compose up -d
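As a quick smoke test that the UDF is registered (a sketch using the ai_embed syntax described below):
# Should return an embedding vector if langdb_udf is wired up correctly
clickhouse-client --query "SELECT ai_embed('{\"model\":\"text-embedding-3-small\"}')('hello world') AS embedding"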
ai_completions
Basic example with system prompt:
-- Set system prompt
SET param_system_prompt = 'You are a helpful assistant. You will return only a single value sentiment score between 1 and 5 for every input and nothing else.';
-- Run completion
SELECT ai_completions('{"model": "gpt-4o-mini", "max_tokens": 1000}')({system_prompt:String}, 'You are very rude') as score
You can specify additional parameters like thread_id and run_id:
-- Set parameters
SET param_system_prompt = 'You are a helpful assistant. You will return only a single value sentiment score between 1 and 5 for every input and nothing else.';
-- Generate UUIDs for tracking
SELECT generateUUIDv4();
SET param_thread_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';
SET param_run_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';
-- Run completion with parameters
SELECT ai_completions('{"model": "gpt-4o-mini", "max_tokens": 1000, "thread_id": "' || {thread_id:String} || '", "run_id": "' || {run_id:String} || '"}')({system_prompt:String}, 'You are very rude') as score
ai_embed
Generate embeddings from text:
SELECT ai_embed('{"model":"text-embedding-3-small"}')('Life is beautiful') as embed_text
This example shows how to score HackerNews comments for harmful content:
-- Create and populate table
CREATE TABLE hackernews
ENGINE = MergeTree
ORDER BY id
SETTINGS allow_nullable_key = 1 EMPTY AS
SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet', 'Parquet');
-- Insert sample data
INSERT INTO hackernews SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet', 'Parquet')
LIMIT 100;
-- Set up parameters
SET param_system_prompt = 'You are a helpful assistant. You will return only a single value score between 1 and 5 for every input and nothing else based on malicious behavior. 0 being ok, 5 being the most harmful';
SET param_thread_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';
SET param_run_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';
-- Score content using multiple models
WITH tbl AS (SELECT * FROM hackernews LIMIT 5)
SELECT
    id,
    left(text, 100) AS text_clip,
    ai_completions('{"model": "gpt-4o-mini", "max_tokens": 1000, "thread_id": "' || {thread_id:String} || '", "run_id": "' || {run_id:String} || '"}')({system_prompt:String}, text) AS gpt_4o_mini_score,
    ai_completions('{"model": "gemini/gemini-1.5-flash-8b", "max_tokens": 1000, "thread_id": "' || {thread_id:String} || '", "run_id": "' || {run_id:String} || '"}')({system_prompt:String}, text) AS gemini_15flash_score
FROM tbl
FORMAT PrettySpace
   id        text_clip                                            gpt_4o_mini_score   gemini_15flash_score
1. 7544833   This is a project for people who like to read and    2                   2
2. 7544834   I appreciate your efforts to set the facts straigh   2                   2
3. 7544835   Here in Western Europe, earning $100,000 per year    1                   2
4. 7544836   Haha oh man so true. This is why I've found i        3                   2
5. 7544837   The thing is it's gotten more attention from         1                   2
If tracing is enabled, you'll be able to view several metrics about the request, such as cost, duration, and Time to First Token, at https://app.langdb.ai/.
API Reference for LangDB
GET /pricing - Returns the pricing details for LangDB services.
GET /pricing HTTP/1.1
Host: api.us-east-1.langdb.ai
Accept: */*
Successful retrieval of pricing information
{
  "model": "gpt-3.5-turbo-0125",
  "provider": "openai",
  "price": {
    "per_input_token": 0.5,
    "per_output_token": 1.5,
    "valid_from": null
  },
  "input_formats": [
    "text"
  ],
  "output_formats": [
    "text"
  ],
  "capabilities": [
    "tools"
  ],
  "type": "completions",
  "limits": {
    "max_context_size": 16385
  }
}
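The same request expressed with curl (assuming the hosted endpoint shown above):
# Fetch model pricing
curl https://api.us-east-1.langdb.ai/pricing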
The remaining endpoints accept the following parameters:
POST /analytics
X-Project-Id (header): LangDB project ID
start_time_us: Start time in microseconds (e.g., 1693062345678)
end_time_us: End time in microseconds (e.g., 1693082345678)
Returns: Successful response with time-series analytics
POST /analytics/summary
X-Project-Id (header): LangDB project ID
start_time_us: Start time in microseconds (e.g., 1693062345678)
end_time_us: End time in microseconds (e.g., 1693082345678)
groupBy: Fields to group results by (e.g., ["provider"])
Returns: Successful response with an aggregated summary
GET /threads/{thread_id}/messages
thread_id (path): The ID of the thread to retrieve messages from
X-Project-Id (header): LangDB project ID
Returns: A list of messages for the given thread
GET /threads/{thread_id}/cost
thread_id (path): The ID of the thread for which to retrieve cost information
X-Project-Id (header): LangDB project ID
Returns: The total cost and token usage for the specified thread
POST /v1/chat/completions
X-Project-Id (header): LangDB project ID
model: ID of the model to use; either a specific model ID or a virtual model identifier (e.g., gpt-4o)
temperature: Sampling temperature (e.g., 0.8)
Returns: OK
POST /v1/embeddings
Creates an embedding vector representing the input text or token arrays.
model: ID of the model to use for generating embeddings (e.g., text-embedding-ada-002)
input: The text to embed, or an array of text strings to embed
encoding_format: The format to return the embeddings in (e.g., float)
dimensions: The number of dimensions the resulting embeddings should have (e.g., 1536)
Returns: Successful response with embeddings
POST /analytics HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Content-Type: application/json
Accept: */*
Content-Length: 59
{
  "start_time_us": 1693062345678,
  "end_time_us": 1693082345678
}
{
  "timeseries": [
    {
      "hour": "2025-02-20 18:00:00",
      "total_cost": 12.34,
      "total_requests": 1000,
      "avg_duration": 250.5,
      "duration": 245.7,
      "duration_p99": 750.2,
      "duration_p95": 500.1,
      "duration_p90": 400.8,
      "duration_p50": 200.3,
      "total_duration": 1,
      "total_input_tokens": 1,
      "total_output_tokens": 1,
      "error_rate": 1,
      "error_request_count": 1,
      "avg_ttft": 1,
      "ttft": 1,
      "ttft_p99": 1,
      "ttft_p95": 1,
      "ttft_p90": 1,
      "ttft_p50": 1,
      "tps": 1,
      "tps_p99": 1,
      "tps_p95": 1,
      "tps_p90": 1,
      "tps_p50": 1,
      "tpot": 0.85,
      "tpot_p99": 1.5,
      "tpot_p95": 1.2,
      "tpot_p90": 1,
      "tpot_p50": 0.75,
      "tag_tuple": [
        "text"
      ]
    }
  ],
  "start_time_us": 1,
  "end_time_us": 1
}
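The same analytics request expressed with curl (substitute your own token and project ID):
# Query hourly analytics for a time window
curl https://api.us-east-1.langdb.ai/analytics \
  -H "Authorization: Bearer YOUR_SECRET_TOKEN" \
  -H "X-Project-Id: YOUR_PROJECT_ID" \
  -H "Content-Type: application/json" \
  -d '{"start_time_us": 1693062345678, "end_time_us": 1693082345678}'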
POST /analytics/summary HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Content-Type: application/json
Accept: */*
Content-Length: 82
{
  "start_time_us": 1693062345678,
  "end_time_us": 1693082345678,
  "groupBy": [
    "provider"
  ]
}
{
  "summary": [
    {
      "tag_tuple": [
        "openai",
        "gpt-4"
      ],
      "total_cost": 156.78,
      "total_requests": 5000,
      "total_duration": 1250000,
      "avg_duration": 250,
      "duration": 245.5,
      "duration_p99": 750,
      "duration_p95": 500,
      "duration_p90": 400,
      "duration_p50": 200,
      "total_input_tokens": 100000,
      "total_output_tokens": 50000,
      "avg_ttft": 100,
      "ttft": 98.5,
      "ttft_p99": 300,
      "ttft_p95": 200,
      "ttft_p90": 150,
      "ttft_p50": 80,
      "tps": 10.5,
      "tps_p99": 20,
      "tps_p95": 15,
      "tps_p90": 12,
      "tps_p50": 8,
      "tpot": 0.85,
      "tpot_p99": 1.5,
      "tpot_p95": 1.2,
      "tpot_p90": 1,
      "tpot_p50": 0.75,
      "error_rate": 1,
      "error_request_count": 1
    }
  ],
  "start_time_us": 1,
  "end_time_us": 1
}
POST /usage/total HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Content-Type: application/json
Accept: */*
Content-Length: 47
{
  "start_time_us": 1693062345678,
  "end_time_us": 1
}
{
  "models": [
    {
      "provider": "openai",
      "model_name": "gpt-4o",
      "total_input_tokens": 3196182,
      "total_output_tokens": 74096,
      "total_cost": 10.4776979999,
      "cost_per_input_token": 3,
      "cost_per_output_token": 12
    }
  ],
  "total": {
    "total_input_tokens": 4181386,
    "total_output_tokens": 206547,
    "total_cost": 11.8904386859
  },
  "period_start": 1737504000000000,
  "period_end": 1740120949421000
}
POST /threads HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Content-Type: application/json
Accept: */*
Content-Length: 25
{
  "limit": 10,
  "offset": 100
}
{
  "data": [
    {
      "id": "123e4567-e89b-12d3-a456-426614174000",
      "created_at": "2025-10-01T22:21:07.541Z",
      "updated_at": "2025-10-01T22:21:07.541Z",
      "model_name": "text",
      "project_id": "text",
      "score": 1,
      "title": "text",
      "user_id": "text"
    }
  ],
  "pagination": {
    "limit": 10,
    "offset": 100,
    "total": 10
  }
}
GET /threads/{thread_id}/messages HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Accept: */*
[
  {
    "model_name": "gpt-4o-mini",
    "thread_id": "123e4567-e89b-12d3-a456-426614174000",
    "user_id": "langdb",
    "content_type": "Text",
    "content": "text",
    "content_array": [
      "text"
    ],
    "type": "system",
    "tool_call_id": "123e4567-e89b-12d3-a456-426614174000",
    "tool_calls": "text",
    "created_at": "2025-01-29 10:25:00.736000",
    "id": "123e4567-e89b-12d3-a456-426614174000"
  }
]
GET /threads/{thread_id}/cost HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Accept: */*
{
  "total_cost": 0.022226999999999997,
  "total_output_tokens": 171,
  "total_input_tokens": 6725
}
POST /v1/chat/completions HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
X-Project-Id: text
Content-Type: application/json
Accept: */*
Content-Length: 599
{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Write a haiku about recursion in programming."
    }
  ],
  "temperature": 0.8,
  "max_tokens": 1000,
  "top_p": 0.9,
  "frequency_penalty": 0.1,
  "presence_penalty": 0.2,
  "stream": false,
  "response_format": "json_object",
  "mcp_servers": [
    {
      "name": "websearch",
      "type": "in-memory"
    }
  ],
  "router": {
    "name": "kg_random_router",
    "type": "script",
    "script": "const route = ({ request, headers, models, metrics }) => { return {model: 'test'};};"
  },
  "extra": {
    "guards": [
      "word_count_validator_bd4bdnun",
      "toxicity_detection_4yj4cdvu"
    ],
    "user": {
      "id": "7",
      "name": "mrunmay",
      "tags": [
        "coding",
        "software"
      ]
    }
  }
}
{
  "id": "text",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 1,
      "message": {
        "role": "assistant",
        "content": "text"
      },
      "logprobs": {
        "content": [
          {
            "token": "text",
            "logprob": 1
          }
        ],
        "refusal": [
          {
            "token": "text",
            "logprob": 1
          }
        ]
      }
    }
  ],
  "created": 1,
  "model": "text",
  "service_tier": "scale",
  "system_fingerprint": "text",
  "object": "chat.completion",
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 1,
    "prompt_tokens_details": {
      "cached_tokens": 1
    },
    "completion_tokens_details": {
      "reasoning_tokens": 1,
      "accepted_prediction_tokens": 1,
      "rejected_prediction_tokens": 1
    }
  }
}
POST /v1/embeddings HTTP/1.1
Host: api.us-east-1.langdb.ai
Authorization: Bearer YOUR_SECRET_TOKEN
Content-Type: application/json
Accept: */*
Content-Length: 136
{
  "input": "The food was delicious and the waiter was kind.",
  "model": "text-embedding-ada-002",
  "encoding_format": "float",
  "dimensions": 1536
}
{
  "data": [
    {
      "embedding": [
        1
      ],
      "index": 1
    }
  ],
  "model": "text",
  "usage": {
    "prompt_tokens": 1,
    "total_tokens": 1
  }
}