Understanding Document Structure with OCR - Document AI Technology for LLM
2023/12/05 | Written By: Lucy Park (Upstage CSO)
The Necessity of Digital Assetization for LLM Development and Utilization
Since last year, interest in LLMs (large language models) has been especially intense. As commercial models like ChatGPT and open-source models like LLaMA have shown, LLMs have potential for commercial applications that goes beyond their academic significance. As the name suggests, they excel at processing language, that is, textual data.
Many organizations are attempting to leverage LLMs in a variety of applications. The challenge is that most real-world documents are more than plain text: they contain visual elements such as tables and graphs, as well as dependencies between paragraphs. For these documents, known in academic circles as VRDs (visually rich documents), extracting only the text can lose crucial information and make it hard to achieve the intended results. This is why thorough digital assetization is important for developing and applying LLMs successfully.
The Digital Assetization Process for LLM Development and Utilization
Developing or using LLMs with VRDs involves the following steps, of which steps 1-2 make up the digital assetization process:
1. Document Structure Analysis (Layout analysis): This is the most critical step. Important elements in the document, such as tables and images, are recognized along with their location, structure, and dependencies. It is crucial to define clearly (1) the elements you wish to include and (2) the elements you want to exclude. Many organizations that use structure analysis for LLMs prefer to include tables but exclude ancillary document information like headers and footers.
2. Markdownification: After the elements are recognized, they are arranged in an order suitable for LLMs and converted into a format that machines can consume easily.
3. Vectorization: Once converted to markdown, the information is chunked (divided into meaningful units) and stored in a database in the desired format. The markdown text can be stored as-is or in vector form.
4. Query Embedding & LLM Inference: Finally, the query received from the user is embedded, and the related items in the database are retrieved and used to return the final result.
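The vectorization and query steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `embed` is a toy word-count stand-in for a real embedding model, the "database" is a plain list, and all function names and the sample document are hypothetical rather than part of any particular product.

```python
import math
import re
from collections import Counter


def chunk_markdown(markdown: str) -> list:
    """Split markdown at blank lines so that each table stays in one chunk."""
    return [block.strip() for block in re.split(r"\n\s*\n", markdown) if block.strip()]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list, top_k: int = 1) -> list:
    """Return the top_k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda chunk: cosine(query_vector, embed(chunk)), reverse=True)
    return ranked[:top_k]


document = """# Quarterly Report

Revenue grew in every region this quarter.

| Region | Revenue |
| --- | --- |
| Asia | 120 |
| Europe | 95 |

Headcount stayed flat."""

chunks = chunk_markdown(document)          # step 3: chunk and store
query = "Asia revenue"
best = retrieve(query, chunks)[0]          # step 4: embed the query and look up related items
prompt = f"Context:\n{best}\n\nQuestion: {query}"  # what would be passed to an LLM
```

In this sketch the table survives as a single chunk, so a question about regional revenue retrieves the whole table rather than a stray row. A production system would likely replace `embed` with a learned embedding model and the list with a vector database.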
Digital Assetization Methods for LLMs
Many organizations apply OCR software for digital assetization aimed at LLMs. For documents that are digital-born, open-source PDF parsing software like pdf-to-text or PyPDF2 is also used. However, typical OCR or PDF parsing software, while successful at extracting text, often fails to extract crucial items like tables and graphs, which frequently contain the most important information. Rule-based extraction can be attempted for key elements, but unless the documents are standardized, the variety of input formats is bound to compromise accuracy.
Standard OCR or PDF parsing software fails to meaningfully extract the position or structure of important items like tables and graphs.
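To make the limits of rule-based extraction concrete, here is a minimal sketch. The rule, the function name, and both sample invoices are hypothetical and invented purely for illustration. A pattern written for one layout works on that layout and silently finds nothing on another.

```python
import re

# Hypothetical rule written for one invoice layout: a line of the form "Total: $10.00".
TOTAL_RULE = re.compile(r"^Total:\s*\$?([\d,]+\.\d{2})$", re.MULTILINE)


def extract_total(text: str):
    """Return the total amount as a string, or None if the rule does not match."""
    match = TOTAL_RULE.search(text)
    return match.group(1) if match else None


# Layout the rule was written for.
invoice_a = "Invoice 001\nItem A  10.00\nTotal: $10.00"
# Same information, different wording and layout.
invoice_b = "Invoice 002\nItem A  10.00\nAmount due  10.00 USD"

total_a = extract_total(invoice_a)  # the rule matches this layout
total_b = extract_total(invoice_b)  # the rule misses this layout and returns None
```

Each new layout needs another rule, which is why this approach tends to hold up only for standardized documents.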
In contrast, software built specifically for document structure analysis extracts essential items like tables and graphs. Some software goes a step further and performs markdown conversion after structure analysis. Because many LLMs have been exposed to markdown-formatted data during training, converting documents to markdown makes them well suited to LLM use.
Software built specifically for document structure analysis extracts key items like tables and graphs, and more advanced software performs markdown conversion, making the result easier for LLMs to use.
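As a rough sketch of what markdown conversion after structure analysis might look like, the Python below takes a list of already-recognized layout elements, drops headers and footers, and emits markdown in reading order. The `Element` class, its category names, and the `order` field are assumptions made for this example, not the output format of any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical output of layout analysis for one document element."""
    category: str                              # "heading", "paragraph", "table", "header", "footer"
    text: str = ""
    rows: list = field(default_factory=list)   # table cells, first row is the header row
    order: int = 0                             # reading order assigned by layout analysis


EXCLUDED = {"header", "footer"}  # ancillary elements to leave out


def table_to_markdown(rows: list) -> str:
    """Render table rows as a pipe-delimited markdown table."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def to_markdown(elements: list) -> str:
    """Convert recognized elements to markdown in reading order."""
    parts = []
    for element in sorted(elements, key=lambda e: e.order):
        if element.category in EXCLUDED:
            continue
        if element.category == "heading":
            parts.append("# " + element.text)
        elif element.category == "table":
            parts.append(table_to_markdown(element.rows))
        else:
            parts.append(element.text)
    return "\n\n".join(parts)


elements = [
    Element("footer", text="Page 1 of 3", order=3),
    Element("table", rows=[["Region", "Revenue"], ["Asia", "120"]], order=2),
    Element("heading", text="Quarterly Report", order=0),
    Element("paragraph", text="Revenue grew in every region.", order=1),
]
markdown = to_markdown(elements)
```

The table keeps its row and column structure in the output, while the page footer is excluded, which mirrors the include and exclude choices described for structure analysis.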
Upstage is also planning to launch a service for document structure analysis. You will be able to send any document to our API and receive a markdown-converted result. Those interested in using this service can contact our Upstage sales team.