
Solar LLM with Predibase: The Best LLM for Fine-Tuning that beats GPT-4

2024/06/18 | This is a joint blog post written by:
- Kasey Roh, Lucy Park, Junyeop Lee at Upstage
- Arnav Garg, Will Van Eaton at Predibase

For organizations with domain-specific data and use cases, fine-tuning is one of the most performant and cost-effective ways to tailor LLMs for production applications. By fine-tuning small LLMs for specific use cases, teams can achieve performance that surpasses that of massive general-purpose models such as GPT-4.

However, not all LLMs are equally amenable to fine-tuning. Some models are better suited than others because each is developed with a different design philosophy (e.g., a model built for broad, general use cases versus a model designed to be customized for specific applications).

Predibase is the fastest and most efficient way to fine-tune and serve task-specific LLMs, with deep experience fine-tuning an extensive collection of open-source LLMs and small proprietary LLMs that are well suited to fine-tuning. Upstage and Predibase worked together to identify a faster and more efficient way to fine-tune and serve task-specific LLMs.

After running nearly 500 fine-tuning experiments, we can quantifiably demonstrate that Upstage’s Solar LLM is the strongest model for fine-tuning, and we are excited to announce that Solar LLM is now available for teams to fine-tune and serve on Predibase.

Read on for an in-depth look at why Solar LLM excels at fine-tuning, and explore the platform that enables developers to fine-tune the Solar model.

[→ Get Started for Free]




Introducing Upstage’s Solar LLMs

  • Why did Upstage build Solar LLM?

    Upstage is a leading enterprise AI company that has a proven track record of providing powerful custom document processing / LLM solutions for global enterprises across various industries such as financial services, healthcare, supply chain, and legal.

    With its deep roots in the enterprise AI space, Upstage developed Solar LLM with the belief that for mainstream enterprise adoption, enterprises need a powerful, purpose-trainable LLM solution that can be easily trained with their private data and securely hosted on their premises.

    As a base model, Solar is intentionally small and lightweight enough to run on a single GPU, offering strong performance (i.e., accuracy and speed) and cost competitiveness, with the potential for even better performance through fine-tuning.

    With fine-tuning, Upstage has seen further performance improvements on several tasks, including translation, math solving, and categorization, exceeding the performance of GPT-4.

  • How is Solar LLM good for fine-tuning?

    With further customization in mind, Solar LLM has been pre-trained to improve performance on specific downstream tasks through fine-tuning. Specifically, Upstage has invested significant effort in balancing the datasets used in pre-training and instruction tuning, and has evenly regulated the domain distribution to accommodate various enterprise fine-tuning scenarios.

    This approach differs from other general-purpose LLMs, where fine-tuning may not always result in significant performance improvements, as these models are designed for general use cases.


Predibase's Fine-Tuning and Inference Technology

Want to find out which platform offers a fast, reliable, and cost-effective service for fine-tuning Solar LLM?

Predibase is the leading developer platform for fine-tuning and serving LLMs. Built from the ground up to be fast, reliable, and cost-effective, Predibase offers a best-in-class fine-tuning experience. Predibase manages the compute resources required for fine-tuning, so teams don’t need to worry about out-of-memory (OOM) errors and can trust that the right serverless GPU hardware will be used for the job.

Predibase also offers low-latency inference (0.20 seconds to first token) and lightning-fast throughput (200 tokens per second). Plus, teams can serve hundreds of fine-tuned LLMs from a single GPU, whether a high-end A100 or H100 or a commodity A10G, with LoRA eXchange (LoRAX), the open-source serving framework developed by Predibase. This makes Predibase one of the most cost-effective platforms for serving fine-tuned LLMs.
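As a rough sketch of how multi-adapter serving looks from the client side, the snippet below builds a request body for a LoRAX generate endpoint, where each request can name a different LoRA adapter to apply on top of the shared base model. The endpoint shape follows LoRAX's open-source REST schema as we understand it, and the adapter ID and prompt are hypothetical placeholders, not a real deployment:

```python
import json

def build_lorax_request(prompt: str, adapter_id: str, max_new_tokens: int = 64) -> str:
    """Build the JSON body for a LoRAX generate call.

    `adapter_id` selects which fine-tuned LoRA adapter the shared base
    model should apply for this request; successive requests on the same
    GPU can each name a different adapter. The adapter name below is a
    placeholder for illustration only.
    """
    payload = {
        "inputs": prompt,
        "parameters": {
            "adapter_id": adapter_id,          # e.g. a fine-tuned Solar adapter
            "max_new_tokens": max_new_tokens,
        },
    }
    return json.dumps(payload)

body = build_lorax_request("Translate to French: good morning", "solar-translation-v1")
```

Because every request simply names its adapter, hundreds of fine-tuned adapters can share one base model on one GPU.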


Evaluation Results

To evaluate Solar-Mini-Chat, we compared its fine-tuned task-specific performance against 13 popular open-source LLMs of a similar weight class and 2 closed-source base models: GPT-3.5 Turbo and GPT-4.

  • High-Level Experimentation Methodology

Here's a brief overview of our experimentation setup:

  1. Dataset Selection: We meticulously selected 31 diverse datasets spanning 5 categories: natural language understanding, coding, knowledge, reasoning, and math.

  2. Dataset Preparation: Each of the 31 datasets was split into a training set and a held-out evaluation set to ensure robust evaluation.

  3. Model Training: We chose a base model and trained it on each of these datasets, utilizing their respective instruct/chat templates. This process was repeated for every base model included in this experiment.

  4. Batch Evaluation: Post-training, we conducted batch evaluations using the fine-tuned LoRA adapters on the held-out evaluation sets. Depending on the task type, we employed various metrics such as accuracy, ROUGE, HumanEval, among others, to gauge performance effectively.

  5. Results Comparison: Finally, we compiled the results and performed a comparative analysis of the models to identify the top performers.
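The steps above can be sketched as a minimal pipeline. The functions `train_lora_adapter` and `evaluate_adapter` below are illustrative stubs, not the actual Predibase SDK, and the dataset names and split ratio are assumptions for the sake of the example:

```python
def train_lora_adapter(base_model: str, train_set: list) -> str:
    """Step 3: fine-tune a LoRA adapter on one dataset (stubbed)."""
    return f"{base_model}-adapter-{len(train_set)}ex"

def evaluate_adapter(adapter: str, eval_set: list) -> float:
    """Step 4: score the adapter on the held-out split (stubbed metric)."""
    return round(len(eval_set) / (len(eval_set) + 1), 3)

def run_benchmark(base_models, datasets, split=0.8):
    """Steps 2-5: split each dataset, train one adapter per
    (base model, dataset) pair, evaluate on the held-out set,
    and tabulate the results for comparison."""
    results = {}
    for model in base_models:
        for name, examples in datasets.items():
            cut = int(len(examples) * split)
            train_set, eval_set = examples[:cut], examples[cut:]  # held-out eval split
            adapter = train_lora_adapter(model, train_set)
            results[(model, name)] = evaluate_adapter(adapter, eval_set)
    return results

# Toy datasets standing in for the 31 real ones
datasets = {"gsm8k": list(range(100)), "conllpp": list(range(50))}
scores = run_benchmark(["solar-mini-chat", "llama-3-8b"], datasets)
```

The real experiments swap the stubs for actual LoRA fine-tuning runs and task-appropriate metrics (accuracy, ROUGE, HumanEval, etc.), but the control flow is the same: every base model is trained and evaluated on every dataset.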


  • Results

    After tabulating all of these results, we found that Solar-Mini-Chat is the strongest-performing model in the ~11B-parameter weight class, outperforming most other open-source models by a significant margin.

    Here's a deeper look at two slices of the metrics that serve as supporting evidence for the observation above.


  • Slice 1: Overall Performance of Solar Fine-Tunes

    This metric quantifies how often a specific model attains the highest score compared to all other models for a given task. This frequency is summed across all 31 tasks to assess the overall effectiveness of each model. In other words, it measures the number of times model X outperforms its peers across all tasks.


    We found that Solar fine-tunes led with the highest score on 16 of the 31 tasks (approximately 51.6%). Phi-3, Llama-3-8B, Zephyr-7b, and GPT-4 (base) tied for second place, each leading on 3 of the 31 tasks (approximately 9.7%).
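The tally behind this metric is straightforward to reproduce. The sketch below counts top finishes over a toy score table (the model names and numbers are made up for illustration, not the benchmark's actual scores):

```python
from collections import Counter

def count_top_finishes(task_scores: dict) -> Counter:
    """For each task, find the model(s) with the highest score and
    credit each with one top finish; sum the credits across all tasks."""
    wins = Counter()
    for task, scores in task_scores.items():
        best = max(scores.values())
        for model, score in scores.items():
            if score == best:  # ties credit every tied model
                wins[model] += 1
    return wins

# Toy scores for 3 tasks (illustrative only)
task_scores = {
    "task_a": {"solar": 0.91, "phi-3": 0.88, "zephyr": 0.85},
    "task_b": {"solar": 0.70, "phi-3": 0.74, "zephyr": 0.69},
    "task_c": {"solar": 0.83, "phi-3": 0.80, "zephyr": 0.83},
}
wins = count_top_finishes(task_scores)
```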

  • Slice 2: Head to Head Performance of Fine-Tuned Solar

    Slice 2 provides insights into the competitive performance of Solar fine-tunes by quantifying the frequency with which they achieve superior results compared to fine-tunes from other base models. Each percentage value indicates the proportion of tasks where Solar fine-tunes prevail over the competing model.

For instance, a win rate of 83.87% against Phi-2 signifies that Solar fine-tunes outperformed Phi-2 on approximately 83.87% (26/31) of the tasks. Interestingly, Zephyr-7b-beta gives Solar-Mini-Chat the closest competition, while Solar-Mini-Chat fine-tunes almost always beat base GPT-3.5 Turbo.
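The pairwise win rate is computed as the fraction of tasks where one model's fine-tune strictly outscores the other's. A small sketch over toy scores (illustrative values, not the benchmark's actual numbers):

```python
def head_to_head_win_rate(task_scores: dict, model_a: str, model_b: str) -> float:
    """Fraction of tasks where model_a's score strictly exceeds model_b's."""
    tasks = list(task_scores)
    a_wins = sum(1 for t in tasks if task_scores[t][model_a] > task_scores[t][model_b])
    return a_wins / len(tasks)

# Toy scores for 4 tasks (illustrative only)
task_scores = {
    "task_a": {"solar": 0.91, "phi-2": 0.80},
    "task_b": {"solar": 0.70, "phi-2": 0.74},
    "task_c": {"solar": 0.83, "phi-2": 0.78},
    "task_d": {"solar": 0.88, "phi-2": 0.60},
}
rate = head_to_head_win_rate(task_scores, "solar", "phi-2")  # wins 3 of 4 tasks
```

In the actual results, the same calculation over 31 tasks yields the 26/31 (83.87%) figure quoted above.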

For more thorough insights on experimentation setup and results, you can check out this paper and Predibase’s fine-tuning index.

Build your tailored LLM with Solar, the ideal choice for fine-tuning

Are you impressed by the exceptional performance of Solar LLM and Predibase in fine-tuning and creating a customized LLM? Experience the fast, reliable, and cost-effective results of fine-tuning Solar models on Predibase! Solar LLM is already powering services across a range of industries.

With LoRAX, Predibase’s framework for dynamically serving hundreds of fine-tuned LoRA models on a single GPU, you can serve all 31 of the fine-tuned Solar-Mini-Chat LoRA adapters for the cost of a single dedicated LLM deployment.

This deployment can run on hardware as small as an A10G with 24GB of VRAM for cost savings, though it would need to scale up to an A100 to handle very high request volumes (measured in queries per second) and to achieve 2x faster throughput.

To get a feel for response latency with LoRAX, check out LoRA Land, a demonstration of fine-tuned models served via LoRAX. You can also experience speculative decoding via Medusa: Medusa heads trained during fine-tuning yield roughly 3x faster inference throughput for fine-tuned models with no degradation in model quality.

To learn more about Solar by Upstage and Predibase, check out the following resources.
Our upcoming webinar on July 11th will explore how fine-tuned Solar LLM can transform your AI initiatives and accelerate your path to production.