Breaking Barriers: Revolutionize Your Work with Our Next-Level Embedding Model

16/May/2024 | Written By: YoungHoon Jeon, Chloe Hyeonju Lee, Seungwon Jeong, Junhyeon Park

Introducing Solar Embedding-1-large!

Superior performance over OpenAI's text-embedding-3-large? That's just the beginning. We thrive on tackling challenging problems. Upgrade your search system with our brand-new embedding model and experience the transformative difference firsthand.

How Good Is It?

We've benchmarked its performance across English, Korean, and Japanese, and it outperforms OpenAI's embedding model (text-embedding-3-large) in all three languages. Experience the exceptional performance of Solar Embedding-1-large at an affordable price point.

What Exactly is Embedding?

Discovering RAG

Understanding embeddings starts with understanding RAG (Retrieval-Augmented Generation). We provided a detailed explanation of RAG in our previous article; for more insights, check it out here.

In short, RAG addresses the persistent hallucination problem of LLMs by grounding responses in evidence retrieved from your documents. The illustration below elucidates this further.

A depiction of a RAG & Groundedness Check pipeline
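In code terms, the pipeline boils down to "retrieve relevant passages, then answer only from them." Below is a minimal sketch of that flow, assuming an OpenAI-compatible Solar endpoint; the base URL and the chat model name (solar-1-mini-chat) are assumptions here, so check the Upstage documentation for the exact values.

```python
# Minimal retrieve-then-generate sketch of the RAG flow shown above.
# NOTE: the base URL and chat model name below are assumptions for
# illustration; verify them against the Upstage Developers Documentation.
from openai import OpenAI

client = OpenAI(
    api_key="UPSTAGE_API_KEY",                   # your Upstage API key
    base_url="https://api.upstage.ai/v1/solar",  # assumed OpenAI-compatible endpoint
)

def retrieve(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Placeholder retriever: in a real pipeline, rank documents by embedding
    similarity to the query (see the embedding example further below)."""
    return documents[:k]

def answer_with_rag(query: str, documents: list[str]) -> str:
    """Answer a query using only the retrieved context."""
    context = "\n\n".join(retrieve(query, documents))
    resp = client.chat.completions.create(
        model="solar-1-mini-chat",  # assumed chat model name
        messages=[
            {"role": "system",
             "content": "Answer strictly from the context below.\n\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```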

Exploring Embedding Model

So, what role does embedding play in RAG? An embedding model transforms your data into numerical vectors that computers can work with, converting human-readable text into a form that can be searched and delivered to LLMs.

In the past, word embeddings were common, but they failed to convey the contextual information crucial for understanding natural language. Text embedding models like our Solar Embedding-1-large excel at understanding multi-sentence and long text by incorporating context.

Deep Dive into Our Embedding!

Let's delve deeper into our embedding model. The Solar Embeddings API features two models, solar-embedding-1-large-query for user queries and solar-embedding-1-large-passage for document embedding, that map text into a unified vector space and are designed for high-performance text processing tasks.

For developers building search engines or retrieval systems, solar-embedding-1-large-passage is ideal for embedding the searchable content up front. When a user submits a query, solar-embedding-1-large-query embeds it in the same space so it can be matched efficiently and accurately against the embedded content, as in the sketch below.
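Here is a minimal sketch of that workflow: index passages with solar-embedding-1-large-passage, embed the query with solar-embedding-1-large-query, and rank by cosine similarity. It assumes the OpenAI-compatible endpoint described in the Upstage docs; the base URL is an assumption, so verify it in the documentation linked below.

```python
# Embed passages with the passage model, embed the query with the query
# model, and match them by cosine similarity in the shared vector space.
# NOTE: the base URL is an assumption; check the Upstage docs for the exact value.
import numpy as np
from openai import OpenAI

client = OpenAI(
    api_key="UPSTAGE_API_KEY",
    base_url="https://api.upstage.ai/v1/solar",  # assumed OpenAI-compatible endpoint
)

def embed(texts: list[str], model: str) -> np.ndarray:
    """Return one embedding vector per input string."""
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in resp.data])

# 1) Index the searchable content with the passage model.
passages = [
    "Solar Embedding-1-large supports English, Korean, and Japanese.",
    "RAG grounds LLM answers in retrieved documents.",
]
passage_vecs = embed(passages, "solar-embedding-1-large-passage")

# 2) Embed the user query with the query model and rank passages by cosine similarity.
query_vec = embed(["Which languages does the embedding model support?"],
                  "solar-embedding-1-large-query")[0]
scores = passage_vecs @ query_vec / (
    np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec))
best = int(np.argmax(scores))
print(f"Best match: {passages[best]!r} (score={scores[best]:.3f})")
```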

For a detailed list of available models and parameters, refer to the Upstage Developers Documentation here.

Curious About Specific Performance?

We evaluated Solar Embedding-1-large using the MTEB Retrieval section and MIRACL. MTEB is a comprehensive benchmark covering various text embedding tasks and is commonly used to measure English performance, while MIRACL is a representative multilingual retrieval benchmark. Our evaluation on both benchmarks yielded remarkable results.

Not only do we outperform popular embedding models, we excel especially on the most challenging tasks.

Solar embedding accuracy comparison graph (provided by corp.G)

Top-1 accuracy refers to the probability that the single highest-ranked candidate is the correct answer, while Top-4 accuracy is the probability that the correct answer appears among the top four candidates. As depicted, our advantage becomes more pronounced as the tasks become more challenging.
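If you're curious how such Top-k numbers are produced, here is a toy sketch of the computation: for every query, rank all passages by similarity and check whether the gold passage lands in the top k. This is purely illustrative and not the actual MTEB/MIRACL evaluation harness.

```python
# Toy Top-k retrieval accuracy: the fraction of queries whose correct
# (gold) passage appears among the k highest-scoring candidates.
import numpy as np

def top_k_accuracy(query_vecs, passage_vecs, gold_indices, k):
    # Cosine similarity between every query and every passage.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = q @ p.T
    top_k = np.argsort(-sims, axis=1)[:, :k]  # indices of the k best passages per query
    hits = [gold in row for gold, row in zip(gold_indices, top_k)]
    return float(np.mean(hits))

# Random vectors stand in for real embeddings in this example.
rng = np.random.default_rng(0)
queries = rng.normal(size=(5, 8))
passages = rng.normal(size=(20, 8))
print(top_k_accuracy(queries, passages, gold_indices=[0, 3, 7, 11, 19], k=4))
```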

Let’s Sum Up!

LLM performance is rapidly advancing, showing remarkable results across various NLP tasks. To harness that potential effectively, an excellent embedding model is essential. By mapping the complex structure and meaning of text into a numerical vector space, Solar Embedding-1-large is a globally competitive solution adept at capturing the subtle nuances of language, and an optimal choice for building RAG systems with exceptional performance and efficiency.

Our embedding is primed and ready. Now, it's your turn to leverage this model to build the next generation of AI services. We're excited to see what innovations you'll bring to the table. Shock the world with your revolutionary products!

[Attention] If you are currently using our existing embedding model, please migrate to the new model, which shows significant improvements of 4.91 points in English (MTEB) and 7.84 points in Korean (Ko-MIRACL). The old embedding model will be deprecated on June 15th.