Routing
LangDB AI Gateway optimizes LLM selection based on cost, speed, and availability, ensuring efficient request handling. This guide covers the dynamic routing strategies available in the system, including fallback, script-based, optimized, percentage-based, latency-based, and nested routing, so you can tailor model selection to your application's needs.
Understanding Targets
Before diving into routing strategies, it's essential to understand targets in LangDB AI Gateway. A target is a specific model or endpoint to which requests can be directed. Each target represents a candidate destination within the routing logic; strategies choose among targets to balance performance and reliability.
Target Parameters
Each target in the routing configuration can have custom parameters that define its behavior. These parameters allow fine-tuning of model outputs to align with specific requirements.
Common parameters include:
model: The model identifier (e.g., openai/gpt-4o, deepseek/deepseek-chat).
temperature: Controls the randomness of responses (higher values make responses more creative).
max_tokens: Limits the number of tokens in the response.
top_p: Determines the probability mass for nucleus sampling.
frequency_penalty: Reduces repetition by penalizing frequent tokens.
presence_penalty: Encourages diversity by discouraging token reuse.
Customizing Model Parameters
You can customize parameters for each target model to fine-tune its behavior and output. Parameters such as temperature, max_tokens, and frequency_penalty can be adjusted to meet specific requirements.
Example of customizing model parameters:
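Below is a minimal sketch using an OpenAI-compatible Python client pointed at the gateway. The router/targets request shape, the router/dynamic model name, and the gateway URL are illustrative placeholders, not the authoritative LangDB schema:

```python
from openai import OpenAI

# Placeholder gateway URL and key -- substitute your own deployment's values.
client = OpenAI(
    base_url="https://your-langdb-gateway.example/v1",
    api_key="YOUR_LANGDB_API_KEY",
)

# Each target carries its own model parameters (field names are illustrative).
router_config = {
    "targets": [
        {
            "model": "openai/gpt-4o",
            "temperature": 0.9,        # more creative responses
            "max_tokens": 500,         # cap response length
            "frequency_penalty": 0.5,  # discourage repetition
        },
        {
            "model": "deepseek/deepseek-chat",
            "temperature": 0.3,  # more deterministic responses
            "top_p": 0.9,        # nucleus sampling mass
        },
    ],
}

response = client.chat.completions.create(
    model="router/dynamic",  # placeholder routed-model name
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"router": router_config},  # hypothetical pass-through field
)
print(response.choices[0].message.content)
```

The configuration sketches in the sections below reuse this shape and would be attached to a request the same way.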
Routing Strategies
LangDB AI Gateway supports multiple routing strategies that can be combined and customized to meet your specific needs:
Fallback Routing – Sequentially routes requests through multiple models in case of failure.
Script-Based Routing – Uses custom JavaScript logic to determine the best model dynamically.
Optimized Routing – Selects the best model based on real-time performance metrics.
Percentage-Based Routing – Distributes traffic between multiple models using predefined weightings.
Latency-Based Routing – Chooses the model with the lowest response time for real-time applications.
Nested Routing – Combines multiple routing strategies for flexible traffic management.
Fallback Routing
Fallback routing allows sequential attempts to different model targets in case of failure or unavailability. It ensures robustness by cascading through a list of models in a predefined order.
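A sketch of what a fallback configuration might look like, reusing the illustrative shape above; the "fallback" type value and the try-in-order semantics are assumptions:

```python
# Hypothetical fallback config: targets are tried in order until one succeeds.
fallback_router = {
    "type": "fallback",
    "targets": [
        {"model": "openai/gpt-4o"},           # primary
        {"model": "openai/gpt-4o-mini"},      # tried if the primary fails
        {"model": "deepseek/deepseek-chat"},  # last resort
    ],
}
```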
Script-Based Routing
LangDB AI allows executing custom JavaScript scripts to determine the best model dynamically. The script runs at request time and evaluates multiple parameters, including pricing, latency, and model availability.
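A sketch of a script-based configuration under the same assumed shape; the route(targets, metrics) contract and the fields exposed to the script are hypothetical:

```python
# Hypothetical script-based config: the JavaScript runs at request time and
# returns the name of the target to use.
script_router = {
    "type": "script",
    "script": """
        function route(targets, metrics) {
            // Prefer the cheapest currently-available target;
            // fall back to the first target if none report as available.
            const available = targets.filter(t => metrics[t.model].available);
            available.sort((a, b) => metrics[a.model].price - metrics[b.model].price);
            return available.length > 0 ? available[0].model : targets[0].model;
        }
    """,
    "targets": [
        {"model": "openai/gpt-4o-mini"},
        {"model": "deepseek/deepseek-chat"},
    ],
}
```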
Optimized Routing
Optimized routing automatically selects the best model based on real-time performance metrics such as latency, response time, and cost-efficiency.
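A sketch of an optimized-routing configuration, again with an assumed shape; the metric field value "Ttft" mirrors the metric names listed below:

```python
# Hypothetical optimized-routing config: the gateway picks the target with
# the best current value for the chosen metric.
optimized_router = {
    "type": "optimized",
    "metric": "Ttft",  # Time-to-First-Token (the default metric)
    "targets": [
        {"model": "openai/gpt-3.5-turbo"},
        {"model": "openai/gpt-4o-mini"},
    ],
}
```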
Here, the request is routed to the model with the lowest Time-to-First-Token (TTFT) among gpt-3.5-turbo and gpt-4o-mini.
Metrics:
Requests – Total number of requests sent to the model.
InputTokens – Number of tokens provided as input to the model.
OutputTokens – Number of tokens generated by the model in response.
TotalTokens – Combined count of input and output tokens.
RequestsDuration – Total duration taken to process requests.
Ttft (Time-to-First-Token, default) – Time taken by the model to generate its first token after receiving a request.
LlmUsage – The total computational cost of using the model, often used for cost-based routing.
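Under the same assumed shape, swapping the metric to LlmUsage would steer traffic toward the cheapest target instead of the fastest:

```python
# Same assumed shape as above, but optimizing for cost instead of latency.
cost_router = {
    "type": "optimized",
    "metric": "LlmUsage",  # route to the target with the lowest usage cost
    "targets": [
        {"model": "openai/gpt-4o"},
        {"model": "openai/gpt-4o-mini"},
    ],
}
```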
Percentage-Based Routing
Percentage-based routing distributes requests between models according to predefined weightings, allowing load balancing, A/B testing, or controlled experimentation with different configurations. Each model can have distinct parameters while sharing the request load.
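A sketch of a percentage-based configuration; the weight field is an assumption for how weightings might be expressed:

```python
# Hypothetical percentage-based config: weights control the share of traffic
# each target receives; each target can still carry its own parameters.
percentage_router = {
    "type": "percentage",
    "targets": [
        {"model": "openai/gpt-4o", "weight": 0.8, "temperature": 0.7},
        {"model": "openai/gpt-4o-mini", "weight": 0.2, "temperature": 0.2},
    ],
}
```

Shifting the weights over time is a simple way to run an A/B test or gradually migrate traffic to a new model.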
Latency-Based Routing
Latency-based routing selects the model with the lowest response time, ensuring minimal delay for real-time applications like chatbots and interactive AI systems.
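A sketch of a latency-based configuration under the same assumed shape:

```python
# Hypothetical latency-based config: the gateway tracks response times and
# sends each request to the currently fastest target.
latency_router = {
    "type": "latency",
    "targets": [
        {"model": "openai/gpt-4o-mini"},
        {"model": "deepseek/deepseek-chat"},
    ],
}
```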
Nested Routing
LangDB AI allows nesting of routing strategies, enabling combinations like fallback within script-based selection. This flexibility helps refine model selection based on dynamic business needs.
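A sketch of a nested configuration, assuming a target may carry an inner router object in place of a single model:

```python
# Hypothetical nested config: a percentage router whose second target is
# itself a fallback router, combining two strategies in one definition.
nested_router = {
    "type": "percentage",
    "targets": [
        {"model": "openai/gpt-4o", "weight": 0.7},
        {
            "weight": 0.3,
            "router": {  # nested strategy in place of a single model
                "type": "fallback",
                "targets": [
                    {"model": "openai/gpt-4o-mini"},
                    {"model": "deepseek/deepseek-chat"},
                ],
            },
        },
    ],
}
```

Here the fallback branch receives 30% of traffic, and within that branch requests still cascade to the next model if the first one fails.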