
ICDAR 2023: AI Startup Makes a Splash, Sweeping Four Categories - [Starview Vol. 8]

2023/07/06 | Written By: Jieun Song, Sungmin Park

Upstage demonstrated its technical edge in the field of AI OCR by winning four categories of the prestigious 'ICDAR 2023' competition, surpassing global tech giants such as Amazon, NVIDIA, Alibaba, and Huawei.

The 'ICDAR Robust Reading Competition' evaluates participants' text detection and recognition performance on digital images and videos. Upstage secured the top position in four categories: HierText Task 1, HierText Task 2, VQAonBD, and IHTR. Meet our outstanding members behind this remarkable achievement.

[Experience the Power of Our Leading Document AI Now! →]

[HierText-1/2]
Participating in and focusing on a competition is great fun!


Q: It's a pleasure to meet you. Could you introduce yourselves?

Dahyun Kim: Hi, I'm Dahyun Kim, a multimodal AI researcher at Upstage.

Yoonsoo Kim: Hi, I'm Yoonsoo Kim, a member of the AI Challenges team working on AI research and development.

'HierText' extracts everything from textual content in images to their hierarchical structure.

Q. Congratulations on your win again! Could you explain the HierText competition?

Yoonsoo Kim: HierText is a competition hosted by Google Research that extends beyond detecting words in images and aims to extract hierarchical structures. Extracting hierarchical structures involves detecting words and grouping them into lines, and then grouping lines into paragraphs. In Task 1, participants are evaluated based on their ability to detect hierarchical structures, while in Task 2, they are evaluated on their ability to accurately recognize detected words.


Dahyun Kim: To clarify further, in Task 1, participants had to identify the positions of text hierarchically (word, line, paragraph) in a given image. In Task 2, they also had to read the detected words, essentially performing complete OCR. The main difference lies in this reading capability.

Cooperation with other members was crucial in Task 2 because it leveraged the outputs of the models used in Task 1, so we had to agree on how to exchange results with the members working on Task 1. This also differed from Upstage's existing OCR pipeline, which consists of detection followed by recognition (D+R), so I had to adapt the preprocessing of the Task 2 model's input data to the competition data.
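To make the word, line, and paragraph hierarchy described above a little more concrete, here is a minimal Python sketch of how detected words might be grouped into lines and paragraphs and then handed to a recognizer. The data structures, field names, and gap thresholds are illustrative assumptions only, not the official HierText annotation schema or Upstage's actual pipeline.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Illustrative types only -- field names are assumptions,
# not the official HierText annotation schema.
BBox = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

@dataclass
class Word:
    box: BBox
    text: str = ""  # filled in by the recognizer in Task 2

@dataclass
class Line:
    words: List[Word] = field(default_factory=list)

@dataclass
class Paragraph:
    lines: List[Line] = field(default_factory=list)

def group_words(words: List[Word], line_gap: int = 10, para_gap: int = 40) -> List[Paragraph]:
    """Toy grouping: sort detected words top-to-bottom and split them into
    lines and paragraphs by vertical gaps. Real systems learn this grouping;
    the thresholds here are arbitrary placeholders."""
    words = sorted(words, key=lambda w: (w.box[1], w.box[0]))
    paragraphs: List[Paragraph] = []
    prev_bottom = None
    for w in words:
        top, bottom = w.box[1], w.box[3]
        new_para = prev_bottom is None or top - prev_bottom > para_gap
        new_line = new_para or top - prev_bottom > line_gap
        if new_para:
            paragraphs.append(Paragraph())
        if new_line:
            paragraphs[-1].lines.append(Line())
        paragraphs[-1].lines[-1].words.append(w)
        prev_bottom = bottom
    return paragraphs

# Two words on one line, plus a third word far enough below to start a new paragraph.
detections = [Word((0, 0, 40, 20)), Word((50, 2, 90, 22)), Word((0, 80, 60, 100))]
print(group_words(detections))
```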

 

Q. Could you please tell us how you decided to enter HierText-1/2 among other categories, and what your preparations for the competition were like?

Yoonsoo Kim: I don't typically work with OCR, so I wanted to focus on one specific aspect of it rather than a combined task involving detection, recognition, and parsing. When I discovered that HierText Task 1 was primarily a detection task, but also required comprehending the hierarchical structure of words, lines, and paragraphs beyond simple detection, my curiosity was piqued. I dedicated about a month to the competition, and my team was very supportive in allowing me to concentrate on it.

Dahyun Kim, AI Research Engineer

Dahyun Kim: After watching my team members engage in HierText Task 1, I was motivated to participate in Task 2, especially since Upstage is dedicated to enhancing the full OCR pipeline. The postponement of the IHTR competition perfectly aligned with HierText Task 2, enabling me to efficiently manage my time and reduce my workload during participation. Preparing for the ICDAR competitions took approximately two months, and this would not have been possible without the great support from my team and company. They really got behind me and made room for my participation. During the prep time, I immersed myself in tailoring our OCR models and methods at Upstage to mesh well with the specific demands of the competition data.


Q. What do you think was the reason for achieving such a significant score gap against leading global competitors?

Yoonsoo Kim: Upstage managed to secure a remarkable lead over top companies like Naver Cloud, NVIDIA, AWS AI Labs, Alibaba DAMO OCR Team, AntGroup, HUAWEI, etc., by about 10% in Task 1. I believe the key to this success was a combination of existing Upstage OCR technology and know-how, enhanced by innovative approaches. Additionally, the team environment played a crucial role. We dedicated a whole month to improving our competition scores, which allowed us to experiment freely and foster a spirit of sharing and competition among team members, ultimately acting as a catalyst for our outstanding results.

Dahyun Kim: For me, 'One step more' really resonated throughout the competition. Both in the IHTR and HierText competitions, we pushed ourselves to improve performance until the very end. It was an incredibly important experience that I believe will help us make the best choices in an ever-changing environment in the future. 


Q. Could you share any lessons you've learned from participating in this competition?

Yoonsoo Kim: Like Kaggle, the competition had clear and straightforward goals which made it fun and easy to stay focused. It was also a meaningful experience as it reaffirmed that Upstage has the technical competitiveness to stand with the world's leading companies.

Dahyun Kim: The support from our company and team members made these achievements possible and it really highlighted how excellent the infrastructure at Upstage is. Technically, I got to apply various data preprocessing techniques that we don't use in our regular company tasks and it was great to directly experience their pros and cons.

Q. We'd love to hear about what you've got planned for the future and your dreams!

Yoonsoo Kim: Just as ChatGPT has shown its potential, I plan to explore areas like LLM training, or prompt programming with trained LLMs, such as AutoGPT. I want to create models or applications that are truly useful to people.

Dahyun Kim: Having utilized various technologies during the competition, I am motivated to venture further beyond current technological boundaries when developing future products.


[VQAonBD]
We can achieve anything when we work together with our colleagues!


Q. It's a pleasure to meet you! Could you please introduce yourself?

Suwon Shin: Hi, I'm Suwon Shin, working as a Natural Language Processing researcher at Upstage.


Q. Can you tell us about the VQAonBD competition you've participated in?

Suwon Shin: VQAonBD stands for Visual Question Answering on Business Documents. The goal is to answer a question posed in natural language using only the document image as input. In this task, the tables consist mostly of numbers. The questions can be broadly divided into three types. The first type involves identifying the row and column referenced in the question and extracting the value directly from that cell. The second type requires calculating a ratio between two values. The last type involves finding things like the maximum, minimum, average, median, or cumulative total of the values in a given row or column. For example, we might be given a financial statement and asked to find the highest value for the year 2017.

Example of a VQAonBD task
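For readers unfamiliar with the task, here is a minimal sketch of the three question types Suwon describes, assuming the table has already been extracted from the document image into a pandas DataFrame (in the real task, recovering the table from pixels is the hard part). The row labels, years, and values below are made up for illustration.

```python
import pandas as pd

# Hypothetical financial table already extracted from the document image.
table = pd.DataFrame(
    {"2016": [120.0, 45.0, 30.0], "2017": [150.0, 60.0, 25.0]},
    index=["Revenue", "Operating income", "R&D expense"],
)

# Type 1: read a single cell identified by its row and column.
revenue_2017 = table.loc["Revenue", "2017"]  # 150.0

# Type 2: ratio between two values.
opm_2017 = table.loc["Operating income", "2017"] / table.loc["Revenue", "2017"]

# Type 3: aggregate over a row or column (max, min, mean, median, sum).
highest_2017 = table["2017"].max()  # the "highest value for 2017" example above

print(revenue_2017, round(opm_2017, 3), highest_2017)
```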


Q. What are the reasons for choosing the VQAonBD category at ICDAR?

Suwon Shin: Among the various tasks at ICDAR, this one is particularly relevant to our Document AI team's work, which involves extracting information for document automation. It's a domain we knew we'd have to tackle at some point. The preparation involved nearly all the members of the NLP-Engine team, thanks to the substantial support from our company. Even with the competition deadlines being pushed back, we received the consideration needed to stay focused throughout. We dedicated ourselves from mid-March to mid-May, and I'm happy to say that our hard work paid off with some great results.

Q. How did you develop a methodology for the VQAonBD task?

Suwon Shin: The NLP-Engine team members all participated eagerly, resulting in several effective methodologies. We each implemented our ideas, tested their performance, and refined them through meetings. By repeating this process, we ultimately chose the best methodology to submit.

I can't discuss our best methodology without mentioning our team member, Minsoo Kang. Despite a three-week absence due to military training, Minsoo contributed two very effective methodologies. The first involved teaching the model how to interpret tables, and the second utilized the incredibly popular ChatGPT to transform existing data into a format more comprehensible for our model. These two methods provided a solid foundation that all other team members used in developing their methodologies.

Building on these two approaches, we chose a methodology that simplified complex questions for easier learning by the model. We then combined the inferred answers to these simplified questions to arrive at a final result.
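The decompose-and-combine idea can be sketched roughly as follows, under the assumption that a complex ratio question is split into two single-cell lookup questions whose answers are then combined. The `lookup` function and the toy table are stand-ins for the actual model and data, not the team's implementation.

```python
import pandas as pd

# Hypothetical table standing in for what the model reads from the document image.
table = pd.DataFrame({"2017": [150.0, 25.0]}, index=["Revenue", "R&D expense"])

def lookup(row: str, col: str) -> float:
    """Stand-in for the VQA model answering one *simplified* single-cell question."""
    return float(table.loc[row, col])

def answer_ratio(numerator, denominator) -> float:
    """Decompose a ratio question into two lookups, then combine the sub-answers."""
    return lookup(*numerator) / lookup(*denominator)

# "What fraction of 2017 revenue was spent on R&D?"
print(answer_ratio(("R&D expense", "2017"), ("Revenue", "2017")))  # ~0.167
```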

 

Q. Were you expecting the incredible achievement of Upstage winning in four categories at ICDAR?

Suwon Shin: When we first started the competition, I was secretly hopeful because all my teammates were incredibly talented, and I thought we had a real shot at winning. As the leaderboard went live midway through the contest, we could see our rankings in real-time and noticed that we were pulling ahead of the second-place team by a considerable margin. That gave me confidence that if we kept up the hard work, we could actually win this. And in the end, it was the expectations born from having great colleagues that led to our success.


Q. What was the most important aspect of the Upstage Way for you in this competition?

Suwon Shin: To me, the most essential part of the Upstage Way is being 'One team.' I believe that when we share the same goals and can rely on each other, no challenge is too great and we can create excellent products. I do make an effort to help others resolve difficult issues when they arise.

[IHTR]
Mastering Ten Indian Languages with Our OCR Expertise

Hyunsoo Ha, AI Research Engineer

Q. It's great to meet you, Hyunsoo!

Hyunsoo Ha: Hello! My name is Hyunsoo Ha, and I'm in charge of engineering and research at Upstage.

Q. Could you please tell us about IHTR?

Hyunsoo Ha: India is known for its rich linguistic diversity, and the IHTR competition covered 10 officially recognized scripts (Bengali, Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Odia, Tamil, Telugu, Urdu). Our task was to develop OCR technology that could accurately recognize handwritten text in these languages. While these scripts share some similarities, each also has distinct features that make it unique. Additionally, Urdu is written from right to left, which is quite different from the Korean script and added an extra layer of difficulty to the task. As a company specializing in OCR, this was a challenging but exciting project for us at Upstage.

Q. Why did you take on such a difficult task?

Hyunsoo Ha: I believe that Upstage's OCR technology will eventually need to enter the global market. That's why we decided to tackle 'Indian languages,' which seemed to be some of the most challenging. Even though there was other work to do, such as solution development, the team was considerate enough to let me focus on the competition, which allowed us to devote ourselves to preparing for it.

Q. How were you able to build a high-performing model and achieve first place with languages you had never worked with before?

Hyunsoo Ha: I think the OCR-related know-how we accumulated over a year and a half was the biggest help. We had well-established synthetic data technology, which generates artificial data for training to enhance the performance of our general models, and an automated training pipeline in Upstage's AI solution, Document AI. These allowed us to rapidly run a wide range of semi-automated experiments, leading to good results in a relatively unfamiliar field. During the competition, we conducted as many as 2,400 different experiments. The most impactful factors were directly inspecting the differences in image quality and color between the training and test datasets and accounting for them, applying various augmentation methods, and making effective use of a large number of models through different ensemble techniques.
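Since the answer mentions combining a large number of models, here is a minimal sketch of one common ensembling scheme for recognition outputs, a per-word majority vote over the strings predicted by several recognizers. This illustrates the general technique only and is not necessarily the exact ensembling method the team used; the model outputs below are hypothetical.

```python
from collections import Counter
from typing import List

def ensemble_transcriptions(predictions: List[List[str]]) -> List[str]:
    """Majority-vote ensemble over word-level transcriptions.

    predictions[m][i] is model m's transcription of word crop i.
    This is a simplified illustration; score-weighted or character-level
    ensembling are common refinements.
    """
    n_words = len(predictions[0])
    final = []
    for i in range(n_words):
        votes = Counter(model_preds[i] for model_preds in predictions)
        final.append(votes.most_common(1)[0][0])
    return final

# Three hypothetical models disagree on some words; the vote resolves them.
models_out = [
    ["नमस्ते", "भारत", "ਸਤਿ"],
    ["नमस्ते", "भारन", "ਸਤਿ"],
    ["नमस्ते", "भारत", "ਸਤ"],
]
print(ensemble_transcriptions(models_out))  # ['नमस्ते', 'भारत', 'ਸਤਿ']
```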

Q: What’s the Upstage Way for you, and can you share some of your practical know-how?

Hyunsoo Ha: I believe that the best approach for the competition was to always push for "One Step More." In the case of IHTR, the battle for the top spots on the leaderboard was extremely fierce in the last week. In some language groups, even a 0.1% difference was significant enough to affect the final rankings. It was our determination and effort to keep experimenting, to take even one more step in the spirit of "One Step More," that I believe was crucial.

Q. If there's a lesson you’ve learned from this competition, we’d love to hear it.

Hyunsoo Ha: I've realized that 'With great colleagues cheering each other on, the impossible becomes possible!' and 'Even if you don’t speak a foreign language, with enough determination, you can create foreign language OCR!'

Q. What are your plans and aspirations for the future?

Hyunsoo Ha: Building on our experience with Indic languages, we've decided to expand the language capabilities of our Document AI model. It won't be long before our company's Document AI API will offer OCR in various languages! As for competitions, I’d love to win a global contest like Kaggle, which is even bigger than academic conferences. I would be grateful if you could keep an eye on Upstage's future endeavors.