Upstage


Open Source LLM and the Ecosystem of Korean LLM

2023/10/26 | Written by: Sungmin Park

The artificial intelligence (AI) market has recently shifted significantly with the emergence of various open-source offerings. Since Meta released LLaMA as an open-source model, LLM developers, particularly those outside big tech companies such as OpenAI and Google, have increasingly released their own models as open source. As a result, smaller organizations and even individuals can now create and fine-tune AI models for their own needs. In this insight blog, we explore the implications of this shift, particularly the emergence of 'open-source LLMs' as the new wave in the era of generative AI, and its potential impact on the LLM ecosystem.


What is Open-Source?

Open-source software (OSS) has been a driving force behind many of the technologies considered essential to the Fourth Industrial Revolution, including AI, big data, cloud computing, and the Internet of Things (IoT). To understand how open-source software emerged, we must look back at the early days of computer software. When computers were first developed, software was primarily created by academics and researchers who freely shared their code. As the commercial software market grew, however, many companies began to keep their code proprietary, leading to a backlash. This prompted Richard Stallman's "Free Software Movement" in the 1980s, which aimed to restore the original model of software production and distribution: sharing information. Relevant foundations and associations were subsequently established, leading to the introduction of the term "open-source."


💡 Open-Source

: A term used to describe software whose source code is made available for anyone to study, modify, and distribute, subject to certain restrictions dictated by the open-source license applicable to each.

Open-source software has gained significant momentum in the LLM market, particularly due to cost considerations as the demand for developing language models surged. With open-source models, developers can rapidly build new models by fine-tuning them without large-scale data training or developing their own systems.


💡 Advantages of Open-Source

  • Faster, more flexible development: Multiple individuals can contribute to a project collaboratively, sharing ideas and solving problems.

  • Scalability: Allows users to customize projects to their needs by modifying or extending the code.

  • Cost-effectiveness: Developers can leverage open-source software without having to develop their own systems, saving time and money.



The Emergence of Open-Source LLMs

Open-source LLMs have rapidly become a key term in the industry, and "sLLMs" (small large language models) have been in the spotlight since February 2023. Meta's decision to give academia access to LLaMA paved the way for numerous sLLMs. These models typically have between 6 billion (6B) and 10 billion (10B) parameters, far fewer than conventional LLMs: OpenAI's "GPT-3" has 175 billion parameters, Google's "LaMDA" has 137 billion, and "PaLM" has 540 billion. Despite their smaller size, sLLMs are reported to deliver comparable performance, making them a cost-effective option.
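To make the size gap concrete, the short sketch below estimates how much memory the weights alone would occupy for the parameter counts quoted above. It assumes 16-bit weights (2 bytes per parameter) and ignores activations, caches, and optimizer state, so the figures are rough lower bounds rather than real deployment requirements. The 7B entry is an illustrative size within the 6B to 10B range, not a specific model.

```python
# Rough weight-memory estimate: parameters x bytes per parameter.
# Assumption: 16-bit (fp16/bf16) weights, i.e. 2 bytes per parameter.
# Activations, KV cache, and optimizer state are deliberately ignored.

BYTES_PER_PARAM_FP16 = 2

# Parameter counts as quoted in the text; "sLLM (7B)" is an illustrative size.
MODELS = {
    "sLLM (7B)": 7e9,
    "LaMDA (137B)": 137e9,
    "GPT-3 (175B)": 175e9,
    "PaLM (540B)": 540e9,
}


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Return the approximate size of the weights in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in MODELS.items():
        print(f"{name:>14}: ~{weight_memory_gb(params):,.0f} GB of weights")
```

Under these assumptions a 7B model needs roughly 14 GB for its weights, while a 175B model needs roughly 350 GB, which is one reason smaller models are much cheaper to fine-tune and serve.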


Mark Zuckerberg, CEO of Meta Platforms, emphasized the importance of open source when he released LLaMA 2 as open-source in July 2023, stating, "We believe deeply in the power of the open source community and that state-of-the-art AI technology is safer and better aligned when it's open and accessible to everyone." Companies like Meta are fostering an environment where a wide range of organizations and developers can compete and innovate freely, ultimately driving growth within the industry.



Open-Source LLMs

Let's explore the major open-source large language models (LLMs) that are widely used today.


1. LLaMA

LLaMA 2-Chat is optimized for two-way conversation through reinforcement learning from human feedback (RLHF) and reward modeling (Source: Llama 2: Open Foundation and Fine-Tuned Chat Models)

One notable example is LLaMA by Meta, which has been pivotal in popularizing open-source language models. LLaMA 2, a version that allows commercial use, was released on July 18, 2023. It uses reinforcement learning from human feedback (RLHF) and reward modeling to produce more useful and safer outputs in tasks such as text generation, summarization, and question answering. LLaMA 2 comes in three sizes, 7B, 13B, and 70B, each named for its number of parameters. Generation time varies with model size, but compared with its predecessor LLaMA 2 has shown improved accuracy and a stronger focus on preventing harmful text. It can also be fine-tuned on multiple platforms such as Azure and Windows, which has led to its broad use in various projects.
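Reward modeling is usually trained on pairs of responses in which human annotators preferred one over the other. The sketch below shows a commonly used pairwise ranking loss for that setup, where the loss is small when the reward model scores the preferred response higher. It is an illustration of the general idea, not Meta's implementation, and the function name and example reward values are hypothetical.

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss shrinks toward 0 as the preferred ("chosen") response is scored
    further above the rejected one, and grows when the order is wrong.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch on the sign of
    # diff so that exp() never receives a large positive argument.
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


if __name__ == "__main__":
    # Hypothetical reward scores for one human-labeled comparison.
    print(pairwise_ranking_loss(2.0, 0.5))  # preferred response scored higher
    print(pairwise_ranking_loss(0.5, 2.0))  # preferred response scored lower
```

Once trained this way, the reward model's scores serve as the feedback signal for the reinforcement learning stage.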

2. MPT-7B

Source: MosaicML Blog

MPT-7B (MosaicML Pretrained Transformer) is another open-source language model, a transformer released by MosaicML and trained on 1 trillion tokens. It is available for commercial use and, in addition to the base model, comes with three derivative models (MPT-7B-Instruct, MPT-7B-Chat, and MPT-7B-StoryWriter-65k+) that can be built upon. According to MosaicML, MPT-7B matches the quality of Meta's LLaMA-7B, which has 7 billion parameters.

3. Alpaca

Source: Stanford University

Alpaca is another open-source language model, released by Stanford University for academic research. The Stanford researchers noted that, despite the advances made by models like ChatGPT, Claude, and Bing Chat, existing models can still generate misinformation or offensive text. They believed that academic involvement was crucial to addressing these concerns and advancing the technology, so they presented Alpaca as a model that could be used for further research. Alpaca is based on Meta's LLaMA-7B and has been fine-tuned on instruction-following data, which is designed to make the language model respond appropriately to user commands.
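Instruction-following data typically consists of records with an instruction, an optional input, and the desired output, which are rendered into a fixed prompt before fine-tuning. The sketch below is modeled on the prompt template published with the Stanford Alpaca project. Treat the exact wording of the template as an assumption, and note that the helper name and the sample records are hypothetical.

```python
# Prompt templates modeled on the Stanford Alpaca project (wording assumed).
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)


def build_prompt(record: dict) -> str:
    """Render one instruction-following record into the prompt the model sees."""
    if record.get("input"):
        return PROMPT_WITH_INPUT.format(
            instruction=record["instruction"], input=record["input"]
        )
    return PROMPT_NO_INPUT.format(instruction=record["instruction"])


if __name__ == "__main__":
    # Hypothetical training record; during fine-tuning the model learns to
    # continue the prompt with the text stored in "output".
    example = {
        "instruction": "Translate the sentence into French.",
        "input": "Good morning.",
        "output": "Bonjour.",
    }
    print(build_prompt(example) + example["output"])
```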

4. Vicuna

(Source: LMSYS.org)

Vicuna, another open-source language model, was developed by LMSYS Org and is also based on LLaMA. The model was fine-tuned on a training set of 70,000 user-shared conversations collected from ShareGPT.com. According to the Vicuna team, a preliminary evaluation using GPT-4 as the judge showed that Vicuna-13B achieved over 90% of the quality of ChatGPT and Google Bard. The online demo and code provided by the team are available for anyone to use, provided that the usage is non-commercial.

5. Falcon

(Source: Technology Innovation Institute)

The Technology Innovation Institute in the United Arab Emirates has also released an open-source language model family called Falcon. Falcon 40B is one of its primary models and is available for both research and commercial use. The largest model, Falcon 180B, has 180 billion parameters and was trained on a massive dataset of 3.5 trillion tokens, resulting in exceptional performance.

The Impact of Open Source on the LLM Ecosystem

As we saw earlier, open source has been adopted in various sectors because of its benefits, such as increased accessibility and transparency in AI technology. It does have disadvantages, such as concerns about misuse. Even so, its positive influence on the LLM ecosystem and its ability to let organizations with less capital efficiently study and create new models or services are what make open-source AI sustainable.

Alongside this trend, "Hugging Face," the largest open-source platform for natural language processing, has also gained attention. Hugging Face operates the "Open LLM Leaderboard," where businesses and research institutions worldwide can evaluate their generative AI models and compete on performance. The leaderboard ranks 500 open-source generative AI models on four benchmarks: ARC, HellaSwag, MMLU, and TruthfulQA. It is open to all and is updated regularly to reflect the evaluation of newly submitted models.

Source: Hugging Face
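As a rough illustration of how a ranking like this can be derived, the sketch below averages the four benchmark scores for each model and sorts by that average. It assumes the leaderboard average is a simple unweighted mean of the four benchmarks. The model names and scores are hypothetical placeholders, not real leaderboard results.

```python
from statistics import mean

# The four benchmarks named in the text.
BENCHMARKS = ("ARC", "HellaSwag", "MMLU", "TruthfulQA")


def average_score(scores: dict) -> float:
    """Unweighted mean over the four benchmarks (assumed aggregation rule)."""
    return mean(scores[name] for name in BENCHMARKS)


def rank_models(results: dict) -> list:
    """Return (model, average) pairs sorted from highest to lowest average."""
    averaged = [(model, average_score(scores)) for model, scores in results.items()]
    return sorted(averaged, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical scores for illustration only.
    results = {
        "model-a": {"ARC": 60.0, "HellaSwag": 80.0, "MMLU": 50.0, "TruthfulQA": 50.0},
        "model-b": {"ARC": 70.0, "HellaSwag": 85.0, "MMLU": 65.0, "TruthfulQA": 60.0},
    }
    for position, (model, avg) in enumerate(rank_models(results), start=1):
        print(f"{position}. {model}: {avg:.1f}")
```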

Among Korean companies, AI startup Upstage caused a stir when its generative AI model took first place on the leaderboard in August 2023, surpassing the score of GPT-3.5, the model that serves as the foundation for ChatGPT. In July 2023, Upstage released a 30-billion-parameter (30B) model through Hugging Face that achieved an average score of 67, outscoring Meta's LLaMA 2 70B model, which was announced on the same day, and becoming the first Korean LLM (large language model) to rank first. Upstage has also released a model fine-tuned from LLaMA 2 70B (70 billion parameters), which recorded a leaderboard score of 72.3 in its bid for the global No. 1 position.

According to the Hugging Face leaderboard, Upstage was the first to surpass the score of GPT-3.5, a reference point for generative AI models, demonstrating its global competitiveness. Upstage's LLM 'SOLAR' was also selected as the main model for the 'Poe' AI platform in September 2023, in recognition of performance comparable to that of ChatGPT, Google PaLM, Meta Llama, and Anthropic Claude. This is a prime example of how even startups with limited capital and manpower can develop globally top-ranked models by building on open-source technology.

Upstage's Solar LLM has surpassed ChatGPT to take first place in the Hugging Face Open LLM leaderboard rankings as of August 2023.

Strengthening Korean AI Competitiveness with ‘Open Ko-LLM Leaderboard’

The open-source AI movement is also expanding in South Korea. Upstage, which ranked first on the Hugging Face Open LLM Leaderboard, recently partnered with the National Information Society Agency (NIA) to launch the Open Ko-LLM Leaderboard. This platform evaluates and compares Korean LLMs in a way that reflects the characteristics of the Korean language and culture. It also adds a metric that evaluates a model's ability to generate commonsense knowledge, which helps identify models that are less prone to hallucination. Together, these provide a more comprehensive, diverse, and accurate evaluation of Korean LLMs. With the Open Ko-LLM Leaderboard, users in South Korea can compare and evaluate Korean LLMs more effectively, leading to better and more reliable AI applications.

The Open Ko-LLM Leaderboard has grown rapidly, with over 100 models registered within just two weeks of its launch. It is quickly becoming the standard for evaluating Korean-specialized LLM performance, as prominent Korean open-source models such as 'Ko-Alpaca', Korea University's 'KULLM', and 'Polyglot-Ko' have joined the platform. This growth, together with the '1T Club' program, which uses revenue sharing to incentivize partners to contribute over 100 million words of Korean data, positions the Open Ko-LLM Leaderboard as a focal point for the open-source LLM ecosystem in South Korea.

As the generative AI market continues to evolve at a rapid pace, open-source initiatives remain an exciting and influential force within the industry, not just in global markets but also within South Korea. We look forward to the future innovations that the open-source ecosystem will bring as we collaborate to expand the LLM ecosystem together.