Configuring Data Retention

Control trace data retention in LangDB with scalable, cost-effective strategies using ClickHouse background TTL processes and tiered materialized views.

Overview

This document outlines LangDB's data retention strategy for tracing information stored in ClickHouse. The strategy uses materialized views to apply retention periods that depend on the user's subscription tier. Data eviction relies on ClickHouse's TTL (Time-To-Live) mechanism and background processes:

  • TTL Definitions: Each table includes TTL expressions that specify when data should expire based on timestamp fields

  • Background Merge Process: ClickHouse automatically runs background processes that merge data parts and remove expired data during these merge operations

  • Resource-Efficient: Eviction runs asynchronously as part of background merges rather than in the query path, which limits its impact on query performance. Expired rows can therefore remain on disk until a merge processes them
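
As a rough illustration of the points above, the statements below show how a TTL can be inspected and adjusted on one of the tier tables defined later in this document. This is a sketch, not LangDB's actual configuration: the one-hour value for merge_with_ttl_timeout is an arbitrary example, and OPTIMIZE ... FINAL appears only as a manual way to trigger a merge.

```sql
-- Show the table definition, including its current TTL clause.
SHOW CREATE TABLE langdb.traces_professional;

-- Change the retention period in place (30 days mirrors the professional tier).
ALTER TABLE langdb.traces_professional
    MODIFY TTL timestamp + INTERVAL 30 DAY;

-- Assumption: example value only. merge_with_ttl_timeout is the minimum delay,
-- in seconds, before ClickHouse repeats a merge that applies delete TTL.
ALTER TABLE langdb.traces_professional
    MODIFY SETTING merge_with_ttl_timeout = 3600;

-- Force a merge so expired rows are dropped now rather than at the next
-- background merge. This rewrites data parts and can be expensive on large tables.
OPTIMIZE TABLE langdb.traces_professional FINAL;
```

Because expired rows are only removed when a merge runs, queries that must never return data past the retention window should also filter on the timestamp column.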

Tracing Data Architecture

LangDB stores and analyzes trace data in three layers:

  • Primary Storage: All trace data is initially stored in the langdb.traces table in ClickHouse

  • Materialized Views: Tier-specific materialized views filter and retain data based on user subscription levels

  • Retention Policies: Automated TTL (Time-To-Live) mechanisms enforce retention periods

Implementation Using Materialized Views

Tier-Specific Materialized Views

Professional Tier View

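-- Create the target table first: the view's TO clause requires it to exist.
CREATE TABLE langdb.traces_professional (
    /* Same structure as base table */
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
TTL timestamp + toIntervalDay(30);  -- rows expire 30 days after `timestamp`

-- Forwards every row inserted into langdb.traces to the professional-tier
-- table; this definition has no tier filter.
CREATE MATERIALIZED VIEW langdb.traces_professional_mv
TO langdb.traces_professional
AS SELECT *
FROM langdb.traces;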

Enterprise Tier View

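-- Create the target table first: the view's TO clause requires it to exist.
CREATE TABLE langdb.traces_enterprise (
    /* Same structure as base table */
) ENGINE = MergeTree()
ORDER BY (timestamp, user_id)
TTL timestamp + toIntervalDay(90);  -- rows expire 90 days after `timestamp`

-- Forwards every row inserted into langdb.traces to the enterprise-tier
-- table; this definition has no tier filter.
CREATE MATERIALIZED VIEW langdb.traces_enterprise_mv
TO langdb.traces_enterprise
AS SELECT *
FROM langdb.traces;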
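
The view definitions in this section select every row from langdb.traces, while the surrounding text describes the views as filtering by subscription level. One way such a filter might look is sketched below. It assumes a hypothetical `tier` column on langdb.traces; the source does not show the base table's schema, so the column name and its values are illustrative only.

```sql
-- Replace the unfiltered view with a filtered one.
DROP VIEW IF EXISTS langdb.traces_professional_mv;

-- Hypothetical: `tier` is an assumed column, not confirmed by the source.
CREATE MATERIALIZED VIEW langdb.traces_professional_mv
TO langdb.traces_professional
AS SELECT *
FROM langdb.traces
WHERE tier = 'professional';
```

A materialized view only processes rows inserted after it is created, so rows already present in langdb.traces are not copied retroactively.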

Data Access Flow

  1. New trace data is inserted into the base langdb.traces table

  2. Materialized views automatically filter and copy relevant data to tier-specific tables

  3. TTL mechanisms remove data older than the specified retention period during background merges, so expired rows may persist for a short time after they expire

  4. Data access APIs query the appropriate table based on the user's subscription tier
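
The final step of this flow could look roughly like the query below for a professional-tier user. It is a sketch under assumptions: the source does not describe the data access API, the String type for user_id is a guess, and the explicit time filter is added because TTL cleanup is not instantaneous.

```sql
-- Read recent traces for one user from the professional-tier table.
-- {user_id:String} is a ClickHouse query parameter; the String type is assumed.
SELECT *
FROM langdb.traces_professional
WHERE user_id = {user_id:String}
  -- Enforce the 30-day window in the query as well, since expired rows
  -- can remain until a background merge removes them.
  AND timestamp >= now() - INTERVAL 30 DAY
ORDER BY timestamp DESC
LIMIT 100;
```

With clickhouse-client, the parameter can be supplied on the command line as --param_user_id=<value>.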

Benefits of This Approach

  • Efficiency: Only store data for the period necessary based on customer tier

  • Performance: Queries run against smaller, tier-specific tables rather than the entire dataset

  • Compliance: Clear retention boundaries help with regulatory compliance

  • Cost-Effective: Optimizes storage costs by aligning retention with customer value

Backup and Disaster Recovery

While the retention strategy focuses on operational access to trace data, a separate backup strategy ensures data can be recovered in case of system failures:

  • Daily snapshots of ClickHouse data

  • Backup retention of 365 days, which covers the longest tier retention period

  • Geo-redundant storage of backups
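
The source does not say which tool produces the daily snapshots. As one possible approach, ClickHouse's built-in BACKUP and RESTORE statements can be used as sketched below. The disk name 'backups' and the archive name are assumptions: the disk must be defined and allowed as a backup destination in the server configuration.

```sql
-- Assumption: a disk named 'backups' is configured and allowed for backups;
-- the archive name is illustrative.
BACKUP TABLE langdb.traces TO Disk('backups', 'traces-daily.zip');

-- Restore under a different name so the copy can be checked before it
-- replaces anything.
RESTORE TABLE langdb.traces AS langdb.traces_restored
    FROM Disk('backups', 'traces-daily.zip');
```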

Monitoring and Management

The retention system includes:

  • Monitoring dashboards for data volume by tier

  • Alerts for unexpected growth or retention failures

  • Regular audits to ensure compliance with retention policies
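
The queries below sketch how the volume and retention checks listed above might be implemented against ClickHouse directly. They are illustrative rather than LangDB's actual dashboards or alerts; the one-day grace period in the second query is an assumed tolerance for merge delay.

```sql
-- Data volume per trace table, counting active data parts only.
SELECT
    table,
    sum(rows) AS total_rows,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE database = 'langdb'
  AND table LIKE 'traces%'
  AND active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;

-- Retention check for the professional tier: rows still present more than
-- one day past the 30-day window suggest that TTL cleanup is lagging.
SELECT
    count() AS overdue_rows,
    min(timestamp) AS oldest_row
FROM langdb.traces_professional
WHERE timestamp < now() - INTERVAL 31 DAY;
```

A non-zero overdue_rows value that persists across several checks is a reasonable signal for a retention-failure alert.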

Future Enhancements

  • Implementation of custom retention periods for specific enterprise customers

  • Cold storage options for extended archival needs

  • Advanced sampling techniques to retain representative trace data beyond standard periods
