2023/12/18 | Written By: Jieun Song (People eXperience), Sungmin Park (Content Manager)

Upstage Infusing Vitality into the Korean LLM Ecosystem, "Aiming for 'Open Ko-LLM' to be a Playground for More People to Explore and Enjoy"

The number of models registered on the 'Open Ko-LLM,' a Korean Large Language Model (LLM) evaluation leaderboard, has surpassed 600 in just over three months. This astounding news comes from the Open Ko-LLM leaderboard, which Upstage has been building and operating in conjunction with the National Information Society Agency (NIA) since September!

Moving away from the English-dominated traditional evaluation systems, this leaderboard, based solidly on 'Korean' data, has garnered attention from both academia and industry. It allows anyone to register their Korean LLM and compete with other models.

As the LLM era flourishes, the 'Open Ko-LLM Leaderboard' has become a foundational and empowering force for the Korean LLM ecosystem. As the final guests of Starview in 2023, we introduce the key figures behind this achievement: Mr. Park Chan-Joon and Mr. Kim Hyun-Woo from the LLM team, who played a pivotal role in establishing the 'Open Ko-LLM Leaderboard'.

Q: It's a pleasure to meet you. Could you introduce yourselves?

Park Chan-Joon: Hello, I am Park Chan-Joon, currently serving as the Technical Leader in the LLM team. I lead the Data-Centric LLM part, focusing on data and evaluation related to LLMs. I am leading various internal projects related to LLM data, including the 1T Club and the Open Ko-LLM Leaderboard.

Kim Hyun-Woo: Hello, I am Kim Hyun-Woo, a Research Engineer in the LLM team. I am involved in various projects related to LLM, including the development of MathGPT and the Open Ko-LLM Leaderboard.

Q: What was the motivation behind establishing the Open Ko-LLM Leaderboard?

Open Ko-LLM Leaderboard Logo, Homepage, and Examples of Common Sense Data Creation

Park Chan-Joon: We hoped to establish a Korean LLM evaluation ecosystem through the Open Ko-LLM Leaderboard, where research results could be transparently shared and hidden LLM talents discovered. Our goal was to expand the field of Korean LLMs.

As the era of LLMs has begun flourishing, I believe the most crucial keyword is 'ecosystem'. This becomes evident in OpenAI's release of the GPT series and App Builder, Hugging Face's democratization of NLP, and Naver's creation of a Generative AI ecosystem. We wanted to join this trend by building a Korean LLM data and evaluation ecosystem.

After partnering with the National Information Society Agency (NIA), we were able to launch the leaderboard within a month. We focused on reproducing tasks and a platform on par with Hugging Face's Open LLM Leaderboard, known for its credibility in English LLM benchmarks. To delve deeper into common sense issues, we collaborated with Professor Hee-Seok Lim's research team from Korea University and adopted KoCommonGEN V2 as a leaderboard task.

To successfully operate the leaderboard, robust infrastructure is key. KT boldly decided to support our GPU-related needs, and recently, Hugging Face generously provided CPU upgrades. It's encouraging that we have established direct communication channels with Hugging Face, a global NLP company, and are staying connected, always looking for new initiatives.

Q: What are the strengths of the Open Ko-LLM Leaderboard compared to Hugging Face's existing system?

Park Chan-Joon: Fundamentally, Hugging Face adopts English-based benchmark data, while we adopt Korean benchmark data. A critical difference is that we do not publish the test set. English leaderboards have open test sets, using benchmark datasets made public since 2021. However, we have built all data from scratch for Open Ko-LLM and operate it entirely privately.

Opening it could significantly impact research and increase the benchmark's value, but for this leaderboard, we decided to operate as a closed set to eliminate test set contamination and ensure fair comparison.

Another strength is the involvement of credible institutions in the operation. Besides Upstage, organizations like NIA, KT, and Korea University are involved, which adds trustworthiness.

Kim Hyun-Woo: As Chan-Joon mentioned, the biggest advantage is the ability to evaluate the performance of Korean LLMs through Korean benchmarks. Additionally, we aim to help participants build their credentials by selecting monthly LLM winners and recognizing them on the leaderboard.

November Award Ceremony for Outstanding Developers - 'This Month's LLM' at the Open Ko-LLM Leaderboard (Source: Artificial Intelligence Newspaper)

Q: How has the Open Ko-LLM Leaderboard fared since its launch?

Park Chan-Joon: In just over three months since its establishment, the leaderboard has seen the participation of over 600 models. Initially, we expected about 200 models by the end of the year, so the actual turnout has exceeded our expectations, and we are grateful for everyone's participation. The competition has been particularly impressive, with participants ranging from individual researchers to various organizations like KT, Lotte Information & Communications, Mind AI, 42Maru, the Electronics and Telecommunications Research Institute (ETRI), KAIST, and Korea University. A notable moment was when KT's 'Faith 7B' model took the top spot in the under 7B model category and became accessible to everyone.

By comparison, the original Hugging Face leaderboard currently lists over 2,200 models. We reached about a quarter of that number in just two months, which is quite encouraging. In addition, the Open Ko-LLM Leaderboard has established a direct communication channel with Hugging Face, laying the groundwork for research collaboration and bringing in actual CPU infrastructure support.

The number of models contributed to the Open Ko-LLM leaderboard.

Kim Hyun-Woo: Participation has been high, from companies to individuals. In the first week, there were fewer than 50 submissions, but this number has been increasing steadily, with the biggest week so far seeing over 100 submissions.

The achievements of individual participants have been impressive. It has also been memorable to see the leaderboard serve as a platform for public relations, with participants sharing their accomplishments on social media. Personally, I'm happy to see the leaderboard becoming a great opportunity for many, which is exactly what we hoped for.

Q: What are the evaluation criteria for the Open Ko-LLM Leaderboard?

Park Chan-Joon: The Open Ko-LLM Leaderboard adopts these five types of evaluation methods:

ARC (AI2 Reasoning Challenge):

This test assesses scientific thinking and understanding. It measures the reasoning ability needed to solve scientific problems.
It is used to evaluate complex reasoning, problem-solving capabilities, and understanding of scientific knowledge.

HellaSwag:

Evaluates situational understanding and prediction abilities.
Tests the ability to predict the most likely next scenario in a given situation.
Serves as an indicator of the model's understanding of and reasoning about situations.

MMLU (Massive Multitask Language Understanding):

Assesses language understanding across various topics and domains.
It's a comprehensive test that shows how well the model functions across different domains.

TruthfulQA:

Evaluates the model's truthfulness and factual accuracy.
The ability to provide truthful answers is an important criterion.

KoCommonGEN V2:

Assesses whether LLMs can create outputs in Korean that are consistent with common sense given certain conditions.
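
As a rough illustration of how five task scores might be combined into a single ranking, the sketch below averages them per model and sorts the result. The unweighted mean is an assumption (the interview does not state the aggregation rule), and the model names and scores are hypothetical.

```python
from statistics import mean

# Task names as listed above. The unweighted average is an assumption,
# not a rule confirmed in the interview.
TASKS = ["ARC", "HellaSwag", "MMLU", "TruthfulQA", "KoCommonGEN V2"]


def average_score(scores: dict[str, float]) -> float:
    """Return the unweighted mean over all five tasks."""
    missing = [task for task in TASKS if task not in scores]
    if missing:
        raise ValueError(f"missing task scores: {missing}")
    return mean(scores[task] for task in TASKS)


def rank_models(results: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Sort models by average score, best first."""
    averaged = {name: average_score(s) for name, s in results.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical models and scores, for illustration only.
results = {
    "model-a": {"ARC": 50.0, "HellaSwag": 60.0, "MMLU": 40.0,
                "TruthfulQA": 45.0, "KoCommonGEN V2": 55.0},
    "model-b": {"ARC": 48.0, "HellaSwag": 58.0, "MMLU": 52.0,
                "TruthfulQA": 50.0, "KoCommonGEN V2": 62.0},
}

for name, avg in rank_models(results):
    print(f"{name}: {avg:.1f}")
```

Requiring all five scores before averaging keeps a model from ranking highly simply because a weak task result is missing.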

Q: What are the future plans for the Open Ko-LLM Leaderboard?

Park Chan-Joon:

Firstly, we aim to expand our tasks. Currently, we operate five tasks, but more precise evaluations related to ethical aspects and factual grounding must be conducted. We plan to further expand our tasks in collaboration with various companies and academic institutions.

Secondly, we aim to expand our evaluation targets. We plan to extend beyond Korean to other languages and, given the growing importance of code data, to operate evaluations for code language models as well.

Thirdly, we want to explore new methods of evaluation. Instead of static evaluation methods, we are considering dynamic ways to assess models. We believe that the current leaderboard models have various limitations when considering real-world scenarios:

Outdated Data: Datasets like SQuAD and KLUE are becoming outdated over time. Data continuously evolves and develops, like DNA. However, existing leaderboards remain stuck in a specific era, which makes it difficult for them to reflect the current state of affairs, given that hundreds of data points are created every day in the real world.
Static Data: The data is always static. This is also true for Data-Centric AI. Even if there can be variations, they happen within a confined range. Although there are leaderboards that allow the use of external data, they often face criticism for being unfair. In the real world, data is not static. It continually accumulates, and models evolve through continual learning.
Not Reflecting the Real World: When companies conduct B2B or B2C services, data continuously accumulates through users or industries, and edge cases or outliers constantly emerge. Responding to these effectively is a true corporate competitive edge, but the current leaderboard system lacks a way to measure this capability. Real-world data is constantly created, changing, and evolving.
Is it Truly Meaningful Competition?: Many models are optimized for the test set, which can lead to another form of overfitting within that test set. The current leaderboard system operates in a leaderboard-centric manner, not a real-world-centric manner.

To alleviate these issues, we are considering operating a new paradigm of leaderboard – a self-replicating leaderboard dataset that continuously adds data daily.
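
One way to picture a test set that grows every day is sketched below. It is only an illustration of the idea, not the team's design: the `Item` and `GrowingTestSet` classes, the seven-day window, and the toy model are all hypothetical. The sketch scores a model on the full set and on recently added items, so that a model tuned to old data stands out.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Callable


@dataclass(frozen=True)
class Item:
    """A single dated test question (hypothetical structure)."""
    added_on: date
    question: str
    answer: str


@dataclass
class GrowingTestSet:
    """Hypothetical evaluation set that accepts new items every day."""
    items: list[Item] = field(default_factory=list)

    def add(self, item: Item) -> None:
        self.items.append(item)

    def recent(self, today: date, window_days: int) -> list[Item]:
        """Items added within the last `window_days` days, today included."""
        cutoff = today - timedelta(days=window_days)
        return [i for i in self.items if cutoff < i.added_on <= today]


def accuracy(model: Callable[[str], str], items: list[Item]) -> float:
    """Fraction of items answered exactly right; 0.0 for an empty list."""
    if not items:
        return 0.0
    return sum(model(i.question) == i.answer for i in items) / len(items)


# Toy data: one old item and two items added in the last week.
test_set = GrowingTestSet()
test_set.add(Item(date(2023, 12, 1), "1+1", "2"))
test_set.add(Item(date(2023, 12, 17), "2+3", "5"))
test_set.add(Item(date(2023, 12, 18), "4+4", "8"))


def toy_model(question: str) -> str:
    """Stand-in model that has only memorized the oldest item."""
    return {"1+1": "2"}.get(question, "")


today = date(2023, 12, 18)
all_time = accuracy(toy_model, test_set.items)
fresh = accuracy(toy_model, test_set.recent(today, window_days=7))
print(f"all-time: {all_time:.2f}, last 7 days: {fresh:.2f}")
```

Reporting the recent-window score next to the all-time score is one possible way to reward models that keep up with new data rather than ones fitted to a fixed test set.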

Additionally, it would be good to write a paper encapsulating the know-how of the leaderboard. Finally, rather than a new plan, I think continuity is essential. We will continue to internally deliberate on this idea.

Q: What are the Upstage team’s plans for energizing the domestic LLM ecosystem?

Park Chan-Joon: In the Data-Centric LLM organization I lead, our goal is to go beyond Model Centric & Data Centric approaches to achieve Value Driven LLM. We focus on creating value and an ecosystem with LLMs. We're managing the Up 1T Club to build a data-sharing ecosystem, operating the Open Ko-LLM Leaderboard for Korean LLM evaluation, and running an open-source project called Dataverse for establishing a Korean data preprocessing ecosystem. We're considering how LLMs, as a tool, can be utilized to extract value.

Kim Hyun-Woo: We continually strive to develop superior Korean language models within our team, planning to encapsulate this expertise into papers. Besides the Ko-LLM Leaderboard, we also plan to operate leaderboards for various other tasks.

Q: What are your plans for the upcoming year 2024?

Park Chan-Joon: Plans set now for 2024 might need adjustment due to rapid changes. Regardless, our goal until the end of the year is to build a solid foundation and robustness in our work, enabling quick adaptation to change. As the leader in the data segment, I aim to complete the construction of high-quality Korean data by the end of the year and prepare for the next steps, like multi-language expansion and multimodal projects.

In 2024, our goal is not only to generate the most revenue domestically with LLMs but also to pursue global expansion. We aim to develop an LLM that’s recognized and trusted worldwide. Ultimately, beyond global expansion and maximizing revenue, our goal is to develop an LLM that satisfies customers.

Personally, I hope to contribute to the Korean LLM ecosystem, like with Open Ko-LLM. My personal goal is to create a playground where many people can engage and enjoy themselves.

Kim Hyun-Woo: In the short term, I'm working on the MathGPT development project in collaboration with Qanda, aiming to achieve state-of-the-art (SOTA) results on the Math Leaderboard and secure follow-up contracts. Currently, we are focused on math, but we plan to expand to other areas and combine them to develop an LLM proficient in various fields.

Last year around this time in Starview, I stated my goal was to achieve good results in recommendation competitions and write papers, which I accomplished. Next year, I aim to conduct significant research on the LLM topics we're currently working on and write papers on them.

Q: What’s the Upstage Way for you, and can you share some of your practical know-how?

Park Chan-Joon: From a sharing perspective, we're establishing a culture of 'sorting things out in one go'. We work with the mindset of 'when you think that’s the limit, take one more step'. Leading the way in creating ecosystems at Upstage is both exciting and rewarding. It's challenging, but interpreting situations positively and enjoying the work is crucial. Even when facing things that seem impossible, it's important to try before judging.

Kim Hyun-Woo: Sharing and being One team are crucial. The best part of Upstage's culture is the transparency in what others are doing, accessible through Notion or Upsquare. Everyone readily helps each other without hesitation, which is very inspiring. We try to share our work and experimental results daily or at the end of the week, and I strive to be as helpful as I have been helped.

Q: A message for your fellow Upstage stars!

Park Chan-Joon: Everyone is working so diligently, but I hope you all take care of your health while working. It would be great if you could maintain a positive mindset and find joy in your work. Even in tough situations, remember that every tunnel ends eventually. Don't give up; let's persevere and move forward together!

Kim Hyun-Woo: With the weather getting chillier lately, I've heard that many of you are catching colds. Please be careful and look after your health. I know everyone is working hard to achieve good results in their projects, and I sincerely hope that all your efforts are met with success!

Strengthening Korean AI Competitiveness with ‘Open Ko-LLM Leaderboard’ - [Starview Vol. 10]