RouteLLM: Optimizing the Cost-Quality Trade-Off in Large Language Model Deployment

Vivek Pandit
Jul 26, 2024


Introduction

Large Language Models (LLMs) like GPT-4 have demonstrated exceptional capabilities across a wide range of natural language processing tasks, from open-ended conversations and question answering to text summarization and code generation. These advancements have been fueled by innovations in model architectures, such as the Transformer, and the ability to scale up data and training infrastructure. However, the deployment of LLMs comes with significant challenges, particularly when balancing performance and cost. Larger, more powerful models tend to deliver higher quality responses but at a prohibitive cost, while smaller models are more cost-effective but less capable.

In practical applications, this trade-off creates a dilemma. Routing all queries to the most powerful model ensures high-quality results but at a high expense. Conversely, routing to smaller models can significantly reduce costs but may compromise the quality of responses for complex queries. This is where LLM routing becomes crucial. By dynamically selecting between a stronger and a weaker model based on the query’s complexity, we can optimize the balance between cost and response quality.

This article explores the development and evaluation of efficient router models designed to intelligently route queries between LLMs, optimizing both cost and performance. Our approach leverages human preference data and data augmentation techniques to enhance router training, demonstrating significant cost savings without compromising response quality.

The Challenge of LLM Routing

The landscape of LLMs is heterogeneous, with models varying widely in size, capability, and cost. Larger models, such as GPT-4, can cost significantly more per query than smaller models like Mixtral-8x7B. The key challenge is to develop a routing system that can infer the intent, complexity, and domain of incoming queries and understand the strengths and weaknesses of candidate models to make the most appropriate routing decision.

Optimal LLM routing involves achieving the highest possible response quality for a given cost target or minimizing costs for a specified quality target. This requires a robust router model that can efficiently process queries and make rapid decisions. Moreover, the router must be adaptive to the evolving model landscape, as new models with improved capabilities are continually introduced.

Proposed Solution: RouteLLM

Our solution, RouteLLM, involves developing a principled framework for query routing between LLMs. The core objective is to minimize costs while achieving a specific performance target, such as 90% of the stronger model’s performance, by routing simpler queries to the weaker model and reserving complex queries for the stronger model.
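
Concretely, the routing decision reduces to a thresholded win prediction: the router estimates the probability that the stronger model's response would be preferred for a given query, and escalates to the strong model only when that probability clears a cost-controlled threshold. Below is a minimal sketch of this decision rule; the `win_probability` callable and the model names are placeholders for illustration, not RouteLLM's actual API.

```python
def route(query: str, threshold: float, win_probability) -> str:
    """Pick a model for one query.

    `win_probability(query)` is assumed to return the estimated probability
    that the strong model's answer would be preferred over the weak model's.
    Lowering `threshold` sends more traffic to the strong model (higher cost,
    higher quality); raising it saves cost at some risk to quality.
    """
    if win_probability(query) >= threshold:
        return "gpt-4"          # stronger, more expensive model
    return "mixtral-8x7b"       # weaker, cheaper model
```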

Router Training Framework

To train our routers, we use human preference data from platforms like Chatbot Arena, where users compare responses from different models and vote for the better one. This data provides valuable insights into the relative strengths and weaknesses of different models. However, due to the sparsity of direct comparisons between certain models, we employ data augmentation techniques to enrich our training dataset. This includes using golden-labeled datasets, such as MMLU, and synthetic preference labels generated by an LLM judge.
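
For concreteness, a single training example from this kind of preference data can be viewed as a prompt, the two models being compared, and which side the human preferred. The record below is hypothetical, and its field names are illustrative rather than the exact Chatbot Arena schema.

```python
preference_example = {
    "prompt": "Explain the difference between a mutex and a semaphore.",
    "model_a": "gpt-4",           # candidate from the stronger tier
    "model_b": "mixtral-8x7b",    # candidate from the weaker tier
    "winner": "model_a",          # human vote: which response was preferred
}
```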

Evaluation Metrics

We evaluate our routers using several metrics to capture the trade-off between cost and quality:

  1. Cost: Measured as the percentage of queries routed to the stronger model, since strong-model calls dominate spend.
  2. Performance Gain Recovered (PGR): The fraction of the quality gap between the weak and strong models that the router recovers.
  3. Average Performance Gap Recovered (APGR): PGR averaged across cost thresholds, summarizing router quality over the full cost range.
  4. Call-Performance Threshold (CPT): The minimum percentage of calls to the strong model required to reach a desired level of PGR (a short computational sketch of these metrics follows the list).
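
To make these definitions concrete, the sketch below computes PGR, APGR, and CPT from a curve of router performance as a function of the fraction of calls sent to the strong model. The names `perf_at_cost`, `perf_weak`, and `perf_strong` are assumptions for illustration, and the ten-point average used for APGR is a simplification.

```python
import numpy as np

def pgr(perf_router: float, perf_weak: float, perf_strong: float) -> float:
    """Performance Gain Recovered: fraction of the weak-to-strong gap recovered."""
    return (perf_router - perf_weak) / (perf_strong - perf_weak)

def apgr(perf_at_cost, perf_weak: float, perf_strong: float, n: int = 10) -> float:
    """Average PGR over evenly spaced strong-model call fractions in [0, 1]."""
    fractions = np.linspace(0.0, 1.0, n)
    return float(np.mean([pgr(perf_at_cost(f), perf_weak, perf_strong) for f in fractions]))

def cpt(perf_at_cost, perf_weak: float, perf_strong: float, target: float = 0.9) -> float:
    """Call-Performance Threshold: smallest strong-model call fraction reaching `target` PGR."""
    for f in np.linspace(0.0, 1.0, 101):
        if pgr(perf_at_cost(f), perf_weak, perf_strong) >= target:
            return float(f)
    return 1.0
```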

Methodology

Preference Data

Our primary dataset consists of preference data from Chatbot Arena, which includes user queries, responses from different models, and comparison labels. To address label sparsity, we cluster models into tiers based on their Elo scores and derive preference labels for training. This clustering helps reduce the sparsity of comparisons between models and improves the robustness of our training data.
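
A rough sketch of this tiering step is shown below: sort models by Arena Elo and bucket them, then treat any comparison between a model in one tier and a model in another as a tier-level preference label. The Elo values, tier count, and boundaries here are invented for illustration; the real scores come from the Chatbot Arena leaderboard, and the paper's tiering may differ.

```python
# Hypothetical Elo scores; real values come from the Chatbot Arena leaderboard.
elo = {
    "gpt-4-1106": 1250,
    "claude-2.1": 1120,
    "mixtral-8x7b": 1115,
    "llama-2-13b": 1040,
}

def assign_tier(score: float) -> str:
    """Bucket a model into a tier by Elo; the boundaries are illustrative."""
    if score >= 1200:
        return "strong"
    if score >= 1100:
        return "mid"
    return "weak"

tiers = {model: assign_tier(score) for model, score in elo.items()}
# A comparison between any strong-tier and any weak-tier model now contributes
# a "strong vs. weak" preference label, reducing pairwise sparsity.
```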

Data Augmentation

To further enhance our training dataset, we use two key augmentation methods:

  1. Golden-labeled datasets: We integrate datasets with predefined labels, such as the MMLU benchmark, to provide additional training samples (a simplified sketch of this conversion follows the list).
  2. LLM-judge-labeled datasets: We generate synthetic preference labels using an LLM judge for a variety of user queries, creating a rich and diverse training dataset.
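
A simplified view of the first method is sketched below, under the assumption that the weak model's correctness on a golden-labeled question stands in for the comparison outcome: if the weak model already answers correctly, escalating brings no benefit, so the pair is labeled a weak-model win; otherwise the strong model is assumed to win. This is an illustrative reading, not necessarily the paper's exact recipe.

```python
def golden_to_preference(question: str, gold_answer: str, weak_answer: str) -> dict:
    """Convert one golden-labeled example into a synthetic preference label."""
    weak_correct = weak_answer.strip().lower() == gold_answer.strip().lower()
    return {
        "prompt": question,
        # If the weak model is already correct, routing up adds no value.
        "winner": "weak" if weak_correct else "strong",
    }
```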

Routing Approaches

We explore several methods for learning the win prediction model:

  1. Similarity-weighted (SW) ranking: Computes a Bradley-Terry win probability in which each training comparison is weighted by the similarity of its query to the incoming query.
  2. Matrix factorization: Leverages matrix factorization techniques to capture the low-rank structure of preference data and predict win probabilities (a minimal sketch of this approach follows the list).
  3. BERT classifier: Employs a BERT-based architecture for text classification, predicting win probabilities using a logistic regression head.
  4. Causal LLM classifier: Utilizes a causal LLM (e.g., Llama 3 8B) in an instruction-following paradigm to predict win probabilities based on next-token prediction.
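
To illustrate the second approach, the sketch below implements a matrix-factorization-style win predictor: each model receives a learned embedding, the query's text embedding is projected into the same latent space, and the strong model's win probability is the sigmoid of the difference between the two bilinear scores. The latent dimension and the use of a precomputed query embedding are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MFWinPredictor(nn.Module):
    """Matrix-factorization-style win predictor (illustrative sketch)."""

    def __init__(self, num_models: int, query_dim: int, latent_dim: int = 128):
        super().__init__()
        self.model_emb = nn.Embedding(num_models, latent_dim)  # one vector per model
        self.project = nn.Linear(query_dim, latent_dim)        # map query embedding to latent space

    def score(self, model_ids: torch.Tensor, query_vecs: torch.Tensor) -> torch.Tensor:
        q = self.project(query_vecs)       # (batch, latent_dim)
        m = self.model_emb(model_ids)      # (batch, latent_dim)
        return (q * m).sum(dim=-1)         # bilinear score for each (model, query) pair

    def forward(self, strong_ids, weak_ids, query_vecs):
        # P(strong model wins) = sigmoid(score_strong - score_weak)
        return torch.sigmoid(self.score(strong_ids, query_vecs) - self.score(weak_ids, query_vecs))
```

Training such a model amounts to minimizing binary cross-entropy between these predicted probabilities and the observed preference labels; at inference time, the predicted probability feeds the same threshold rule sketched earlier.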

Experiments and Results

Evaluation Benchmarks

We evaluate our routers on three widely-used academic benchmarks: MMLU, MT Bench, and GSM8K. These benchmarks cover a range of tasks, including multiple-choice questions, open-ended questions, and grade school math problems, providing a comprehensive evaluation of router performance.

Results

Our routers demonstrate strong performance across all benchmarks, significantly reducing costs while maintaining high response quality. For instance, on MT Bench, our best-performing router achieves a 75% cost reduction compared to the random baseline while maintaining a performance level close to GPT-4. Similarly, augmenting our training dataset with golden-labeled and LLM-judge-labeled data leads to substantial performance improvements on MMLU and GSM8K.

Cost Analysis

Our routers achieve significant cost savings by intelligently routing queries between models. For example, the average cost per million tokens using GPT-4 and Mixtral 8x7B is estimated to be $24.7 and $0.24, respectively. By optimizing the routing decisions, our approach can reduce costs by up to 3.66 times while maintaining a high level of response quality.
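
As a back-of-the-envelope check on where a figure like that comes from, the blended cost is simply a weighted average of the two models' prices. The 26.6% routing fraction below is chosen purely to reproduce the quoted ratio and is not a measured routing rate.

```python
GPT4_COST = 24.7      # dollars per million tokens (figure quoted above)
MIXTRAL_COST = 0.24   # dollars per million tokens (figure quoted above)

def blended_cost(strong_fraction: float) -> float:
    """Average cost per million tokens when `strong_fraction` of traffic goes to GPT-4."""
    return strong_fraction * GPT4_COST + (1 - strong_fraction) * MIXTRAL_COST

# Illustrative only: sending ~26.6% of traffic to GPT-4 cuts cost by ~3.66x
# relative to sending every query to GPT-4.
print(round(GPT4_COST / blended_cost(0.266), 2))  # -> 3.66
```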

Conclusion

RouteLLM presents a robust framework for optimizing the deployment of LLMs by dynamically routing queries between models based on their complexity. By leveraging human preference data and data augmentation techniques, our routers achieve significant cost savings without compromising quality. This approach provides a scalable and adaptable solution for deploying LLMs in real-world applications, ensuring a balance between performance and cost.

Read More: https://github.com/lm-sys/RouteLLM
