Until the birth of an OCR that recognizes text (Upstage in-house OCR image data collection challenge)
2023/02/21 | Written by: Sungmin Park
Upstage recently signed a contract with Hanwha Life Insurance to supply Upstage’s ‘Document AI,’ an optical character recognition (OCR) solution. With this first deployment, Upstage’s no-code/low-code AI solution took the lead in AI innovation in the finance sector, thanks to its effective processing of insurance claim documents such as medical expense receipts. The commitment of the Upstage stars led to the creation of the highest-performing solution, ‘Document AI,’ and along the way the team held in-house image data collection challenge events to gather training data for the OCR model.
Thanks to the efforts of our stars, Upstage’s Document AI has achieved an unprecedented recognition rate of over 95% accuracy with the base model alone, enabling document automation without human post-processing. We look back on last year’s in-house image data collection challenge, the cornerstone of Document AI, through an interview with Upstage data manager Hyeon Joo.
[ See the highest-performing AI OCR, Upstage’s Document AI → ]
In-House Image Data Collection Event – What is it, and why was it created?
Upstage currently offers ‘Document AI,’ an OCR (Optical Character Recognition) model specialized for Korean and English. OCR, which detects and recognizes text in captured images, is used across various industries as a driver of digital innovation.
Creating and training the OCR model requires large amounts of data. The public datasets we had collected needed to be supplemented in various ways, which is what brought the in-house data collection event into existence. Scores were assigned according to the characteristics of each image, and prizes for the top three finishers, along with smaller prizes for two randomly selected participants, encouraged more people to take part.
In particular, the second in-house image data collection event, held in March, gathered many special cases such as vertical text and handwriting, as well as text from scenes of daily life (horizontal text, signboard text, book text, etc.), with the aim of further developing the OCR model.
What images were collected?
Any photo containing Korean or Roman-alphabet characters could be submitted. To train the model robustly, images were evaluated on a variety of characteristics, such as font size, shape, and angle, and these characteristics formed the basis for scoring. Considering the weaknesses of the existing Upstage model, extra points were awarded for vertical text, handwriting, embossing, engraving, characters composed of dot/line combinations as on digital clocks, and cases where letters cross boundaries such as underlining or highlighting.
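As a rough illustration of how such category-based scoring might work, here is a minimal sketch in Python; the category names and point values are hypothetical assumptions, not the actual weights used in the event.

```python
# Hypothetical scoring sketch for the image collection event.
# Category names and point values are illustrative assumptions,
# not the actual weights Upstage used.
BASE_POINT = 1.0
BONUS_POINTS = {
    "vertical_text": 0.5,
    "handwriting": 0.5,
    "embossed_or_engraved": 0.4,
    "dot_matrix": 0.4,        # e.g., digital-clock style characters
    "crosses_boundary": 0.3,  # underlined or highlighted text
}

def score_image(tags: set[str]) -> float:
    """Return the score for one submitted image given its attribute tags."""
    return BASE_POINT + sum(BONUS_POINTS.get(tag, 0.0) for tag in tags)

# Example: a bookstore photo of a vertical book spine
print(score_image({"vertical_text"}))  # 1.5
```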
In addition, we opened our OCR demo site to the public in order to discover images that the current model has difficulty recognizing. By uploading images directly to the demo site, stars could check the model’s predictions and join in on the fun.
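The workflow of checking predictions through the demo site could look roughly like the sketch below; the endpoint URL and the response fields are placeholders for illustration, not the actual demo API.

```python
# Minimal sketch of submitting an image to an OCR demo endpoint.
# The URL and response format are placeholder assumptions.
import requests

DEMO_URL = "https://example.com/ocr/demo"  # placeholder, not the real demo site

def get_predictions(image_path: str) -> list[dict]:
    """Upload an image and return the predicted text regions."""
    with open(image_path, "rb") as f:
        response = requests.post(DEMO_URL, files={"image": f})
    response.raise_for_status()
    # Assume the demo returns {"predictions": [{"text": ..., "bbox": ...}, ...]}.
    return response.json().get("predictions", [])

for prediction in get_predictions("bookstore_photo.jpg"):
    print(prediction["text"], prediction["bbox"])
```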
The importance of data for AI model training is well established. But how much data is required to build a highly accurate model?
The amount of data needed varies with the desired accuracy, but for general scene-text data, roughly 50,000 images are required to train the model to a level where it can be released to the public. Of course, the more training data the better, which is why Upstage aimed to collect as much image data as possible through in-house events.
How much data was gathered through the event?
Thanks to the participation of so many stars working as a team, we were able to collect an additional 7,570 images. The top two scorers of the event were Van, with 4,326 points, and YuJeong, with 3,373 points. Van won first place through his passionate execution of various strategies, including securing large numbers of vertical-text images by visiting bookstores and focusing on items that earned extra points. Through photos of book titles and the like, he earned a high bonus score in the vertical text category.
In addition, a leaderboard showing the scores of each participant’s submitted images was updated every 30 minutes, leading to fierce competition among the top ranks right up to the deadline. One clever strategy was to submit a high-scoring image just before the deadline, concealing the final result from competitors.
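A leaderboard like the one described could be assembled with a simple aggregation such as the following sketch; the submission format is an assumption for illustration, not the event’s actual implementation.

```python
# Minimal leaderboard aggregation sketch; the submission format is an
# illustrative assumption, not the event's actual implementation.
from collections import defaultdict

def build_leaderboard(submissions):
    """submissions: iterable of (participant, image_score) pairs."""
    totals = defaultdict(float)
    for participant, score in submissions:
        totals[participant] += score
    # Highest total first, as shown on the event leaderboard.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

submissions = [("Van", 2.0), ("YuJeong", 1.5), ("Van", 1.5)]
for rank, (name, total) in enumerate(build_leaderboard(submissions), start=1):
    print(rank, name, total)
```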
What conclusions were reached from the in-house image data collection event?
Since a model’s measured performance varies with the test set and the measurement method, it is difficult to compare, for example, whether performance improved in one domain while deteriorating in another. However, when the data collected through this in-house event was used to train the OCR model, performance improved significantly across all domains.
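To make per-domain comparison concrete, the sketch below computes a simple character-level accuracy for each domain; the metric and the domain labels are illustrative assumptions, not Upstage’s actual evaluation protocol.

```python
# Illustrative per-domain evaluation sketch (not Upstage's actual protocol).
# Computes a simple character-level accuracy per domain so that scores
# from different test sets can be reported side by side.
from collections import defaultdict

def char_accuracy(pred: str, truth: str) -> float:
    """Fraction of ground-truth character positions matched exactly."""
    if not truth:
        return 1.0 if not pred else 0.0
    matches = sum(p == t for p, t in zip(pred, truth))
    return matches / len(truth)

def evaluate_by_domain(samples):
    """samples: iterable of (domain, predicted_text, ground_truth_text)."""
    scores = defaultdict(list)
    for domain, pred, truth in samples:
        scores[domain].append(char_accuracy(pred, truth))
    return {domain: sum(vals) / len(vals) for domain, vals in scores.items()}

samples = [
    ("signboard", "커피", "커피"),
    ("handwriting", "안냥하세요", "안녕하세요"),
    ("vertical_text", "데이터", "데이터"),
]
print(evaluate_by_domain(samples))
```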
In particular, one of the main accomplishments of the event was the ability to check model issues quantitatively, thanks to the extensive image data collected in areas where performance suffered, such as handwriting or text in unusual styles.
In the past, beyond the test sets used to identify such model issues, there was not enough training data, so the model’s performance on these cases could only be assessed empirically. The in-house event collected enough data both to configure proper test sets and to measure the model’s issues on special cases quantitatively.
In this event, text from everyday scenes was collected to improve the performance of our general-purpose OCR model. The shape, size, and other characteristics of such characters differ from standardized document text, giving us the foundation needed to strengthen a variety of tasks in the future.
The OCR model recently supplied to Hanwha Life Insurance is specialized for documents, unlike the general-purpose OCR model. However, in the early days of Document AI development, rather than building a model focused on a single aspect, Upstage built a model with all the essentials in mind, and setting initial hypotheses and goals served as its foundation.
The data team is a strong pillar of Upstage OCR! We are curious about its future plans and aspirations.
Our data team's goal at Upstage this year is to supply the data needed for engine development on time.
Since we collaborate closely with the Document AI Engine team, we are working together to create high-quality data that can solve various challenges. For example, we focus on improving the document-specialized model’s recognition of text such as handwriting, check boxes, and stamps. To provide such diverse data in a timely manner, we are also considering how to automate and streamline the data construction process, as sketched below.
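As one rough example of what such training data can look like, here is a possible annotation record for a single document image; the field names, categories, and coordinates are assumptions for illustration, not Upstage’s actual annotation format.

```python
# Hypothetical annotation record for one document image.
# Field names, categories, and coordinates are illustrative assumptions only.
import json

annotation = {
    "image": "claim_form_0001.jpg",
    "width": 2480,
    "height": 3508,
    "regions": [
        {
            "polygon": [[120, 340], [560, 340], [560, 400], [120, 400]],
            "text": "진료비 영수증",   # printed title: "medical expense receipt"
            "category": "printed_text",
        },
        {
            "polygon": [[130, 980], [700, 980], [700, 1060], [130, 1060]],
            "text": "홍길동",          # a handwritten name
            "category": "handwriting",
        },
        {
            "polygon": [[900, 980], [950, 980], [950, 1030], [900, 1030]],
            "text": "",                # marks like checkboxes carry no transcription
            "category": "checkbox_checked",
        },
    ],
}

print(json.dumps(annotation, ensure_ascii=False, indent=2))
```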
Personally, I focus on designing data suited to each task, such as raw-data formats and annotation methods optimized for model training. I hope to achieve this year’s goals alongside my team members so that Upstage's Document AI can shine even brighter!