Load Balancer Routing

Distribute requests across a pool of models to maintain high availability, balance load, and optimize performance. The router uses real-time metrics to select the best available model for each request.

Use Cases

  • High availability requirements

  • Load distribution across models

  • Performance optimization

  • Failover scenarios

Configuration

{
  "model": "router/dynamic",
  "router": {
    "type": "conditional",
    "routes": [
      {
        "name": "Balanced",
        "targets": {
          "$any": [
            "openai/gpt-4.1-nano",
            "gemini/gemini-2.0-flash",
            "bedrock/llama3-2-3b-instruct-v1.0"
          ],
          "sort_by": "requests",
          "sort_order": "min"
        }
      }
    ]
  }
}
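
With this configuration in place, clients target the router by model name and the gateway resolves each request to whichever pooled model is least loaded. A minimal request sketch, assuming an OpenAI-compatible chat-completions payload (the surrounding endpoint and field names are an assumption, not confirmed by this page):

{
  "model": "router/dynamic",
  "messages": [
    { "role": "user", "content": "Summarize today's deployment notes." }
  ]
}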

How It Works

  1. Model Pool: The $any list defines three candidate models for load distribution (GPT-4.1-nano, Gemini 2.0 Flash, Llama 3.2 3B)

  2. Load Balancing: sort_by: requests with sort_order: min selects the model currently handling the fewest requests

  3. Automatic Distribution: Because each new request goes to the least-loaded target, traffic spreads across the pool as usage shifts

Variables Used

  • requests: the current load metric for each target model, referenced by sort_by to rank the pool
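
Because requests is referenced by name in the sort_by field, swapping in another exposed metric changes the balancing strategy without touching the pool. A sketch of a cost-optimized route entry, assuming price is exposed the same way (price appears among the sorting strategies listed under Customization below; model IDs are unchanged from the example above):

{
  "name": "Cheapest",
  "targets": {
    "$any": [
      "openai/gpt-4.1-nano",
      "gemini/gemini-2.0-flash",
      "bedrock/llama3-2-3b-instruct-v1.0"
    ],
    "sort_by": "price",
    "sort_order": "min"
  }
}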

Customization

  • Adjust health thresholds

  • Add more models to the pool

  • Use different sorting strategies (ttft, price, etc.; see the sketch after this list)

  • Implement weighted load balancing

  • Add geographic considerations
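
For example, a latency-optimized variant of the route above sorts the same pool by ttft (time to first token) and adds a fourth model. A sketch, where the anthropic/claude-3-5-haiku entry is an illustrative model ID introduced here, not one taken from this page:

{
  "name": "FastestFirstToken",
  "targets": {
    "$any": [
      "openai/gpt-4.1-nano",
      "gemini/gemini-2.0-flash",
      "bedrock/llama3-2-3b-instruct-v1.0",
      "anthropic/claude-3-5-haiku"
    ],
    "sort_by": "ttft",
    "sort_order": "min"
  }
}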
