Skip to main content

Clickhouse UDFs

langdb_udf adds support for AI operations directly within ClickHouse through User Defined Functions (UDFs). This enables running AI completions and embeddings natively in your SQL queries. You can access 250+ models directly in Clickhouse.

Features

  • ai_completions: Generate AI completions from various models
  • ai_embed: Create embeddings from text
Using LangDB AI gateway through clickhouse using UDFs

Why Use LangDB UDFs in ClickHouse?

LangDB UDFs are particularly powerful for running LLM-based evaluations and analysis directly within your ClickHouse environment:

  • Native Integration: Run AI operations directly in SQL queries without data movement
  • Batch Processing: Efficiently process and analyze large datasets with LLMs
  • Real-time Analysis: Perform content moderation, sentiment analysis, and other AI tasks as part of your data pipeline
  • Model Comparison: Easily compare results across different LLM models in a single query
  • Scalability: Leverage ClickHouse's distributed architecture for parallel AI processing

Prerequisites

  1. Get your LangDB credentials:
  2. Set up environment variables:
export LANGDB_PROJECT_ID=your_project_id
export LANGDB_API_KEY=your_api_key

Installation

# Clone the repository
git clone git@github.com:langdb/ai-gateway.git
cd ai-gateway

# Create directory for ClickHouse user scripts
mkdir -p docker/clickhouse/user_scripts

# Download the latest UDF
curl -sL https://github.com/langdb/ai-gateway/releases/download/0.1.0/langdb_udf \
-o docker/clickhouse/user_scripts/langdb_udf

# Start ClickHouse with LangDB UDF
docker compose up -d

Usage Examples

Using ai_completions

Basic example with system prompt:

-- Set system prompt
SET param_system_prompt = 'You are a helpful assistant. You will return only a single value sentiment score between 1 and 5 for every input and nothing else.';

-- Run completion
SELECT ai_completions
('{"model": "gpt-4o-mini", "max_tokens": 1000}')
({system_prompt:String}, 'You are very rude') as score

Advanced Parameters

You can specify additional parameters like thread_id and run_id:

-- Set parameters
SET param_system_prompt = 'You are a helpful assistant. You will return only a single value sentiment score between 1 and 5 for every input and nothing else.';

-- Generate UUIDs for tracking
SELECT generateUUIDv4();
SET param_thread_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';
SET param_run_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';

-- Run completion with parameters
SELECT ai_completions
('{"model": "gpt-4o-mini", "max_tokens": 1000, "thread_id": "' || {thread_id:String} || '", "run_id": "' || {run_id:String} || '"}')
({system_prompt:String}, 'You are very rude') as score

Using ai_embed

Generate embeddings from text:

SELECT ai_embed
('{"model":"text-embedding-3-small"}')
('Life is beautiful') as embed_text

Real-world Example: Content Moderation

This example shows how to score HackerNews comments for harmful content:

-- Create and populate table
CREATE TABLE hackernews
ENGINE = MergeTree
ORDER BY id
SETTINGS allow_nullable_key = 1 EMPTY AS
SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet', 'Parquet');

-- Insert sample data
INSERT INTO hackernews SELECT *
FROM url('https://datasets-documentation.s3.eu-west-3.amazonaws.com/hackernews/hacknernews.parquet', 'Parquet')
LIMIT 100;

-- Set up parameters
SET param_system_prompt = 'You are a helpful assistant. You will return only a single value score between 1 and 5 for every input and nothing else based on malicious behavior. 0 being ok, 5 being the most harmful';
SET param_thread_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';
SET param_run_id = '06b66882-e42e-4b17-ba93-4b5260a10ad8';

-- Score content using multiple models
WITH tbl as ( select * from hackernews limit 5)
SELECT
id,
left(text, 100) as text_clip,
ai_completions
('{"model": "gpt-4o-mini", "max_tokens": 1000, "thread_id": "' || {thread_id:String} || '", "run_id": "' || {run_id:String} || '"}')
({system_prompt:String}, text) as gpt_4o_mini_score,
ai_completions
('{"model": "gemini/gemini-1.5-flash-8b", "max_tokens": 1000, "thread_id": "' || {thread_id:String} || '", "run_id": "' || {run_id:String} || '"}')
({system_prompt:String}, text) as gemini_15flash_score
FROM tbl
FORMAT PrettySpace
id   text_clip                                            gpt_4o_mini_score   gemini_15flash_score
1. 7544833 This is a project for people who like to read and 2 2

2. 7544834 I appreciate your efforts to set the facts straigh 2 2

3. 7544835 Here in Western Europe, earning $100,000 per year 1 2

4. 7544836 Haha oh man so true. This is why I've found i 3 2

5. 7544837 The thing is it's gotten more attention from 1 2

Observability

If tracing is enabled you ll be able to view several metrics about the request such as cost, time, Time to First Token etc on https://app.langdb.ai/

References