Load Balancer Routing

Distribute requests across a pool of models to maintain high availability, balance load, and optimize performance. The router uses real-time metrics to select the best available model for each request.

Use Cases

  • High availability requirements

  • Load distribution across models

  • Performance optimization

  • Failover scenarios

Configuration

{
  "model": "router/dynamic",
  "router": {
    "type": "conditional",
    "routes": [
      {
        "name": "Balanced",
        "targets": {
          "$any": [
            "openai/gpt-4.1-nano",
            "gemini/gemini-2.0-flash",
            "bedrock/llama3-2-3b-instruct-v1.0"
          ],
          "sort_by": "requests",
          "sort_order": "min"
        }
      }
    ]
  }
}
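
With this configuration in place, clients target the router by model name and the gateway resolves each request to whichever pooled model is least loaded. A minimal request sketch, assuming an OpenAI-compatible chat-completions payload (the surrounding endpoint and field names are an assumption, not confirmed by this page):

{
  "model": "router/dynamic",
  "messages": [
    { "role": "user", "content": "Summarize today's deployment notes." }
  ]
}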

How It Works

  1. Model Pool: The $any list defines three candidate models for load distribution (GPT-4.1-nano, Gemini 2.0 Flash, Llama 3.2 3B)

  2. Load Balancing: sort_by: requests with sort_order: min selects the model currently handling the fewest requests

  3. Automatic Distribution: Because each new request goes to the least-loaded target, traffic spreads across the pool as usage shifts

Variables Used

  • requests: the current load metric for each target model, referenced by sort_by to rank the pool
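
Because requests is referenced by name in the sort_by field, swapping in another exposed metric changes the balancing strategy without touching the pool. A sketch of a cost-optimized route entry, assuming price is exposed the same way (price appears among the sorting strategies listed under Customization below; model IDs are unchanged from the example above):

{
  "name": "Cheapest",
  "targets": {
    "$any": [
      "openai/gpt-4.1-nano",
      "gemini/gemini-2.0-flash",
      "bedrock/llama3-2-3b-instruct-v1.0"
    ],
    "sort_by": "price",
    "sort_order": "min"
  }
}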

Customization

  • Adjust health thresholds

  • Add more models to the pool

  • Use different sorting strategies (ttft, price, etc.; see the sketch after this list)

  • Implement weighted load balancing

  • Add geographic considerations
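
For example, a latency-optimized variant of the route above sorts the same pool by ttft (time to first token) and adds a fourth model. A sketch, where the anthropic/claude-3-5-haiku entry is an illustrative model ID introduced here, not one taken from this page:

{
  "name": "FastestFirstToken",
  "targets": {
    "$any": [
      "openai/gpt-4.1-nano",
      "gemini/gemini-2.0-flash",
      "bedrock/llama3-2-3b-instruct-v1.0",
      "anthropic/claude-3-5-haiku"
    ],
    "sort_by": "ttft",
    "sort_order": "min"
  }
}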
