Configuring Data Retention
Overview
This document outlines LangDB's data retention strategy for tracing information stored in ClickHouse. The strategy uses materialized views to manage retention periods according to each user's subscription tier. Data eviction relies on ClickHouse's TTL (Time-To-Live) mechanism and its background processes:
TTL Definitions: Each table includes TTL expressions that specify when data should expire based on timestamp fields
Background Merge Process: ClickHouse automatically runs background processes that merge data parts and remove expired data during these merge operations
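Resource-Efficient: Eviction runs asynchronously as part of background merges rather than at query time, which limits its impact on query performance. Expired rows can therefore remain on disk until the next merge touches their data part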
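To make the eviction behavior concrete, the following Python sketch models a TTL rule and a single background merge. It is an illustration only, not ClickHouse code: the function names is_expired and background_merge, the event_time field, and the 30-day retention value are all hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone


def is_expired(event_time: datetime, retention_days: int, now: datetime) -> bool:
    """Model of a TTL rule of the form `event_time + INTERVAL n DAY`."""
    return event_time + timedelta(days=retention_days) < now


def background_merge(rows: list[dict], retention_days: int, now: datetime) -> list[dict]:
    """Model of one merge: rows whose TTL has passed are left out of the result."""
    return [r for r in rows if not is_expired(r["event_time"], retention_days, now)]


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
rows = [
    {"trace_id": "a", "event_time": now - timedelta(days=45)},  # past a 30-day TTL
    {"trace_id": "b", "event_time": now - timedelta(days=5)},   # still within the TTL
]

# Until a merge runs, the expired row is still physically present.
assert len(rows) == 2

# After the simulated merge, only the unexpired row remains.
kept = background_merge(rows, retention_days=30, now=now)
print([r["trace_id"] for r in kept])  # ['b']
```

The point of the model is the gap between expiry and removal: a row can be past its TTL and still be stored until a merge rewrites the part that contains it.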
Tracing Data Architecture
LangDB stores and analyzes trace data using three components:
Primary Storage: All trace data is initially stored in the langdb.traces table in ClickHouse
Materialized Views: Tier-specific materialized views filter and retain data based on user subscription levels
Retention Policies: Automated TTL (Time-To-Live) mechanisms enforce retention periods
Implementation using Materialized Views
Tier-Specific Materialized Views
Professional Tier View
Enterprise Tier View
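As a rough sketch of what a tier-specific view could look like, the Python helper below renders ClickHouse DDL for one tier: a target table carrying the TTL, and a materialized view that copies matching rows from langdb.traces into it. Only langdb.traces and the 365-day figure come from this document. The table naming scheme, the event_time and tier columns, the sort key, and the 90-day value are assumptions made for illustration, and the statements have not been run against a live ClickHouse server.

```python
def tier_view_ddl(tier: str, retention_days: int) -> list[str]:
    """Return DDL for a tier's target table and the materialized view feeding it."""
    if not tier.isidentifier():
        raise ValueError(f"invalid tier name: {tier!r}")
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")

    table = f"langdb.traces_{tier}"  # hypothetical naming scheme

    # Target table: same columns as the base table, plus a TTL on an
    # assumed DateTime column named event_time.
    create_table = (
        f"CREATE TABLE IF NOT EXISTS {table} AS langdb.traces\n"
        "ENGINE = MergeTree\n"
        "ORDER BY event_time\n"
        f"TTL event_time + INTERVAL {retention_days} DAY"
    )

    # Materialized view: copies newly inserted rows for this tier into the
    # target table. The `tier` column is an assumption about the schema.
    create_view = (
        f"CREATE MATERIALIZED VIEW IF NOT EXISTS {table}_mv\n"
        f"TO {table}\n"
        "AS SELECT * FROM langdb.traces\n"
        f"WHERE tier = '{tier}'"
    )
    return [create_table, create_view]


# 90 days is a placeholder. 365 days is the longest retention period named in
# this document; assigning it to the Enterprise tier is an assumption.
for name, days in [("professional", 90), ("enterprise", 365)]:
    for statement in tier_view_ddl(name, days):
        print(statement + ";\n")
```

Keeping the TTL on the target table rather than on the view means each tier's retention can be changed independently without touching the base table.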
Data Access Flow
New trace data is inserted into the base langdb.traces table
Materialized views automatically filter and copy relevant data to tier-specific tables
TTL mechanisms automatically remove data older than the specified retention period
Data access APIs query the appropriate table based on the user's subscription tier
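The last step of this flow, choosing a table from the user's tier, might be sketched as follows. The tier names, the tier-specific table names, and the event_time column are hypothetical; the document does not specify them.

```python
# Hypothetical mapping from subscription tier to its tier-specific table.
TIER_TABLES = {
    "professional": "langdb.traces_professional",
    "enterprise": "langdb.traces_enterprise",
}


def table_for_tier(tier: str) -> str:
    """Return the trace table a data access API should query for this tier."""
    try:
        return TIER_TABLES[tier.lower()]
    except KeyError:
        raise ValueError(f"no trace table configured for tier {tier!r}") from None


def recent_traces_query(tier: str, limit: int = 100) -> str:
    """Build a query for the most recent traces visible to the given tier."""
    table = table_for_tier(tier)
    return f"SELECT * FROM {table} ORDER BY event_time DESC LIMIT {int(limit)}"


print(recent_traces_query("Enterprise", limit=10))
```

Failing loudly on an unknown tier, rather than falling back to the base table, avoids exposing data beyond a user's retention window by accident.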
Benefits of This Approach
Efficiency: Data is stored only for the period each customer tier requires
Performance: Queries run against smaller, tier-specific tables rather than the entire dataset
Compliance: Clear retention boundaries help with regulatory compliance
Cost-Effective: Optimizes storage costs by aligning retention with customer value
Backup and Disaster Recovery
While the retention strategy focuses on operational access to trace data, a separate backup strategy ensures data can be recovered in case of system failures:
Daily snapshots of ClickHouse data
Backup retention aligned with the longest tier retention period (365 days)
Geo-redundant storage of backups
Monitoring and Management
The retention system includes:
Monitoring dashboards for data volume by tier
Alerts for unexpected growth or retention failures
Regular audits to ensure compliance with retention policies
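One way such an audit could work is to compare the oldest row in each tier table against that tier's retention period. The sketch below assumes the oldest timestamps have already been fetched (for example with a MIN query per table); the function name, the input shapes, the grace period, and the sample values are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def retention_violations(
    oldest_row_by_tier: dict[str, datetime],
    retention_days_by_tier: dict[str, int],
    now: datetime,
    grace_days: int = 2,
) -> list[str]:
    """Return tiers whose oldest row is older than retention plus a grace period.

    The grace period allows for TTL removal being deferred until a merge runs.
    """
    violations = []
    for tier, oldest in oldest_row_by_tier.items():
        limit = timedelta(days=retention_days_by_tier[tier] + grace_days)
        if now - oldest > limit:
            violations.append(tier)
    return sorted(violations)


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
oldest = {
    "professional": now - timedelta(days=120),  # sample value
    "enterprise": now - timedelta(days=300),    # sample value
}
retention = {"professional": 90, "enterprise": 365}  # 90 is a placeholder

print(retention_violations(oldest, retention, now))  # ['professional']
```

A non-empty result would be a candidate for the retention-failure alert described above.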
Future Enhancements
Implementation of custom retention periods for specific enterprise customers
Cold storage options for extended archival needs
Advanced sampling techniques to retain representative trace data beyond standard periods