
Data-Centric AI in the Real World

2023/04/12 | Written By: Chanjun Park

💡 Just as a car needs fuel to run and a recipe needs ingredients to become a meal, artificial intelligence systems need their own fuel and raw material: data. This post explores the practical applications of data in the real world.

What is Data-Centric AI?

Artificial intelligence is everywhere in our daily lives. Every day we search for information on websites and reach for translators when we hit language barriers. We get engrossed in videos recommended by YouTube's system, which seems to know just what catches our interest, and we use ChatGPT as a handy tool for all kinds of tasks. It's amazing how constantly we run into, and rely on, different AI systems!

So, what elements make up these AI systems that are so deeply woven into our daily routines? At the core, all AI systems are fundamentally composed of Data and Code. The first step involves planning and designing the AI system (Setup). The second step is to gather the relevant Data, which acts as the fuel. In the third step, we write the code needed to train the model, using GPU hardware to teach the AI system what the developers want it to learn. The final step is serving the system so that users or customers can actually use the model.

Does the AI system's lifecycle end with deployment? Not at all! Just as humans need a balanced intake of nutrients to grow, AI systems require ongoing improvement. So what approach makes an AI system more sophisticated? Ultimately, we must improve either the Code or the Data. Data-Centric AI focuses not on improving performance through modeling (the Code), but on enhancing the quality of the Data and applying quality control to the Data to boost the model's performance. In other words, Data-Centric AI is all about fixing the data, not just tweaking the code!

I asked ChatGPT what Data-Centric AI means, and here's what I learned.

Responses from ChatGPT on Data-Centric AI


After looking over the responses, it is evident that Data-Centric AI is an approach that puts data at the center, emphasizing performance gains through data transformation. Data-Centric AI can be summarized in two main categories:

  1. Research methodologies that contemplate improving performance from a data perspective (holding the code/algorithms constant); for example (a code sketch follows this list):

    • Data Management (collecting new data)

    • Data Augmentation (enriching the dataset)

    • Data Filtering (refining the dataset)

    • Synthetic Data (making artificial data)

    • Label Consistency (standardizing labeling methods)

    • Data Consistency

    • Data Tools (labeling tools)

    • Data Measurement and Evaluation

    • Curriculum Learning

    • Active Learning


  2. Research methodologies that explore how to enhance a model's performance without modifying the model itself, such as:

    • Rather than searching for another model, shouldn't the answer be found in the data?

    • AI algorithms that understand data and use that information to improve models.
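
To make the first category concrete, here is a minimal, runnable Python sketch. The dataset, noise rate, and filtering threshold are illustrative assumptions of mine, not anything from the original post or Upstage's pipeline: the model (the Code) is held fixed across both runs, and only the training data changes.

```python
# Minimal sketch: the model (the "Code") is frozen; only the data changes.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate label noise, a common real-world data-quality problem.
rng = np.random.default_rng(0)
flip = rng.random(len(y_train)) < 0.15
y_noisy = np.where(flip, 1 - y_train, y_train)

def fixed_model():
    # The model-centric knobs stay identical across both runs.
    return LogisticRegression(max_iter=1000)

baseline = fixed_model().fit(X_train, y_noisy)
print("noisy data:   ", baseline.score(X_test, y_test))

# Data Filtering: drop samples whose current label the model finds very
# unlikely (a simplified stand-in for confident-learning-style cleaning).
prob_of_label = baseline.predict_proba(X_train)[np.arange(len(y_noisy)), y_noisy]
keep = prob_of_label > 0.2
cleaned = fixed_model().fit(X_train[keep], y_noisy[keep])
print("filtered data:", cleaned.score(X_test, y_test))
```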


How to Apply Data-Centric AI in the Real World

Data-flywheel

Ever wondered how companies are applying Data-Centric AI in the real world? There are many approaches, but one of the most notable is the "Data-flywheel" process. Whether it's a B2B or B2C company, service logs accumulate as AI-based services are provided. Many companies use this growing data to enhance their services.

For example, YouTube's recommendation model reflects our preferences well because it feeds log data back into training to improve user satisfaction. The keywords we search on portal sites, and our search journeys, are also forms of data that accumulate on the platform. In this way, as companies operate services and data accumulates, they process this data into training data for the model. By continuously feeding the model's learning, they naturally improve its performance; this is the essence of the Data-flywheel.

In other words, through repeated iterations of interaction between the model and the data, the quality of both improves. This is the practical way to apply Data-Centric AI in the real world.
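
As a toy illustration, the runnable sketch below simulates a Data-flywheel: each "turn" of service yields fresh logs, the logs are curated into training data, and the model is retrained and redeployed, so accuracy climbs as data accumulates. Treating logs as ready-made labeled examples is a simplifying assumption; real systems need curation and labeling in between.

```python
# A toy, runnable simulation of the Data-flywheel. The data and model are
# illustrative, not a production design.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_all, y_all = make_classification(n_samples=6000, n_features=20, random_state=1)
X_test, y_test = X_all[:1000], y_all[:1000]      # fixed held-out evaluation set
stream_X, stream_y = X_all[1000:], y_all[1000:]  # "logs" arriving over time

train_X = np.empty((0, 20))
train_y = np.empty(0, dtype=int)
model = LogisticRegression(max_iter=1000)

for turn in range(5):
    # 1. Serving produces logs: 500 new user interactions per cycle.
    lo, hi = turn * 500, (turn + 1) * 500
    # 2. Curation turns logs into training data (here: just append them).
    train_X = np.vstack([train_X, stream_X[lo:hi]])
    train_y = np.concatenate([train_y, stream_y[lo:hi]])
    # 3. Retrain on the grown dataset; 4. redeploy, so the next turn's
    #    logs come from users of the improved model.
    model.fit(train_X, train_y)
    print(f"turn {turn}: test accuracy = {model.score(X_test, y_test):.3f}")
```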



How to Make Data in the Real World

Is the Data-flywheel the be-all and end-all of Data-Centric AI in the real world? Not exactly! In the real world, we also make data from scratch. However, because AI research has long centered on model studies, there has been no specific, structured process for the data development life cycle. As a result, relatively little attention has been paid to who creates the data, what constitutes good data, and how to produce it. Recognizing the need for such a process, Upstage is currently designing one, with its data team at the center.

The process of creating data in the real world.


We are conducting research on the entire process of data creation, from A to Z, to better understand how to generate high-quality data and improve our pipelines. Our research, named DMOps (Data Management Operation and Recipes), includes the publication of a paper on the subject.

This competency is distinct from AI modeling and serving competencies, and assembling a team with expertise in these areas can provide a significant competitive advantage for companies. (This topic will be covered in the second part of the series.)

Pipeline structure for creating training data (Source: https://arxiv.org/pdf/2303.10158.pdf)


Various subdomains of Data-Centric AI also contribute during the Collection, Labeling, Preparation, Reduction, and Augmentation stages of the data development process.
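
For illustration only, one might express these stages as a pipeline of functions. The stage bodies below are deliberately trivial, hypothetical stand-ins, not Upstage's actual pipeline code.

```python
# An illustrative pipeline over the stages named above.
def collect(sources):
    # Collection: gather raw records from logs, crawls, vendors, etc.
    return [record for source in sources for record in source]

def label(records, annotate):
    # Labeling: attach labels, ideally under consistency guidelines.
    return [(record, annotate(record)) for record in records]

def prepare(examples):
    # Preparation: normalize and clean (here: strip and lowercase text).
    return [(text.strip().lower(), y) for text, y in examples]

def reduce_(examples):
    # Reduction: drop exact duplicates while preserving order.
    return list(dict.fromkeys(examples))

def augment(examples):
    # Augmentation: add cheap variants (a punctuation tweak stands in
    # for real techniques like back-translation or paraphrasing).
    return examples + [(text + " !", y) for text, y in examples]

raw = collect([["Good product!", "bad product"], ["GOOD PRODUCT!"]])
labeled = label(raw, lambda text: int("good" in text.lower()))
dataset = augment(reduce_(prepare(labeled)))
print(dataset)  # 2 unique examples after reduction, 4 after augmentation
```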



The quantity and quality of data

When creating data, the question arises whether to prioritize quantity or quality. In my experience working with real-world data, quality deserves more weight. While many existing Data-Centric AI studies in academia focus on increasing data quantity through methods like Data Augmentation or synthetic data generation, I have found that label consistency is what matters most in the real world.


To achieve label consistency, we must provide general rules to annotators based on each data type's characteristics, ensuring that individual subjective judgments do not bias the data. Additionally, it is essential to evaluate label consistency using data measurements and improve guidelines based on these evaluations.
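
One common data measurement for label consistency is inter-annotator agreement. As a small example (the labels below are made up), Cohen's kappa corrects raw agreement for chance, and a low score on a pilot batch is a signal to tighten the guidelines.

```python
# A made-up pilot batch labeled independently by two annotators.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["pos", "neg", "pos", "pos", "neu", "neg", "pos", "neu"]
annotator_b = ["pos", "neg", "neu", "pos", "neu", "pos", "pos", "neu"]

# Kappa corrects raw agreement for chance; values well below ~0.6 are a
# common (rule-of-thumb) signal that the guidelines need revision.
print(f"Cohen's kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```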

From my perspective, a more desirable Data-flywheel involves a bi-directional virtuous cycle between data and model, in which not only the data quantity but also the guidelines and processes for data creation are progressively improved based on model performance. The goal is to improve data quality rather than simply increase quantity. By focusing on collecting more error-prone data and consistently refining ambiguous labels, we can have a meaningful impact on model performance.

Bi-directional Data-Flywheel: Rather than a unidirectional approach of simply increasing the amount of data, it's a structure of positive feedback where guidelines and processes for data generation improve incrementally based on the outcomes of the model.
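
Here is a minimal sketch of one step in this cycle, under assumptions of mine rather than a prescribed method: rank training samples by the model's confidence in their current labels, and route the least confident ones back to annotators (and to the guideline authors).

```python
# Hypothetical triage step: rank samples by the model's confidence in
# their *current* labels; the least confident go back for re-annotation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20, random_state=2)
model = LogisticRegression(max_iter=1000).fit(X, y)

label_confidence = model.predict_proba(X)[np.arange(len(y)), y]
review_queue = np.argsort(label_confidence)[:20]  # most ambiguous samples
print("indices to route back to annotators:", review_queue)
```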


Making high-quality data requires the development of annotation tools (Data Tools). These tools should make annotators' work efficient and include features that enforce label consistency. At Upstage, we have developed a data tool called "Labeling Space" and deployed it in our internal data pipelines, where it significantly reduces the time and cost of data production while ensuring high-quality output.


What is good data for AI?

We have explored the elements necessary for making good data.

But what exactly is good data? In the academic world, good data would be benchmark data that can objectively and reliably measure model performance, along with publicly available high-quality training data. In the real world, however, good data can be defined in various ways that do not necessarily fit these criteria.

<Real-World Criteria for Good Data>

  • How informative is the metadata?

  • Is the quantity of data sufficient, and is its cost reasonable?

  • Is the data labeled by workers at a justifiable cost, without unnecessary expenses?

  • Does it have a good versioning system in place?

  • Is the data storage folder structure intuitive and clean?

  • Is there any unnecessary data included?

  • Does it meet the requirements specified in the data brief?

  • Are there any data biases, skews, contamination, or ethical issues?

  • Is data labeling done consistently and correctly?

  • Are intellectual property rights, IP ownership, copyright, confidentiality, and privacy appropriately considered?

As shown above, the quality of data can be evaluated along many dimensions. These may seem obvious, but they are essential to creating good data. While such factors rarely enter academia's definition of good data, they matter a great deal in the real world. In other words, there is a difference between Good Data as defined by academia and Good Data as considered in industry.
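
Several of these criteria can be checked automatically. The sketch below is only an example of such checks; the toy dataset and thresholds are my own illustrative choices, not fixed standards.

```python
# Toy dataset and thresholds; each check maps to one checklist item.
import pandas as pd

df = pd.DataFrame({
    "text":  ["invoice #1", "invoice #1", "receipt", "contract", "receipt"],
    "label": ["invoice",    "invoice",    "receipt", "contract", "receipt"],
})

report = {
    "enough_rows":     len(df) >= 1000,                      # sufficient quantity?
    "duplicate_rows":  int(df.duplicated().sum()),           # unnecessary data?
    "missing_labels":  int(df["label"].isna().sum()),        # labeling complete?
    "max_label_share": df["label"].value_counts(normalize=True).max(),  # skew?
}
print(report)
```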

When I look at current data research in academia, it often seems to create data for data's sake, rather than data for models. Many data-centric studies focus on filtering data based on its inherent characteristics, rather than considering its synergy with models. But when we ask why we want to create good data, the answer is to create good models. Therefore, I believe the criteria for distinguishing good data from bad are most valid when judged against the model's performance.

As mentioned earlier, AI systems consist of code and data. Data is clearly the part that can improve performance quickly, but we should not overlook the code. We therefore need data-centric research that uses the model (the code) as its guidepost. In other words, model-based Data-Centric AI should be pursued.

I believe truly good data is data that has gone through multiple iterations with modelers and been continuously cleaned based on model results to improve the model's performance. The human-in-the-loop cycle, in which errors are discovered through the model and cleaned by humans, is crucial. Through such continuous cycles, the data becomes not only error-free but also well aligned with the model's results, making it highly informative and useful for accurate predictions. In other words, we need to revisit Data-Centric AI: true Data-Centric AI should not stop once the data has been produced, but should also satisfy the factors mentioned above.
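As a toy version of this human-in-the-loop cycle (the injected noise and the automatic "fix" below stand in for real annotators working through a review queue; nothing here is Upstage's actual workflow):

```python
# Toy human-in-the-loop cycle: the model surfaces likely label errors,
# "humans" correct them, and the model is retrained on the cleaned data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=2000, n_features=20, random_state=3)
rng = np.random.default_rng(3)
labels = np.where(rng.random(len(y_true)) < 0.1, 1 - y_true, y_true)  # noisy

for cycle in range(3):
    model = LogisticRegression(max_iter=1000).fit(X, labels)
    # Flag samples where the model has least confidence in the current
    # label; these go to human review (simulated here by restoring truth).
    confidence = model.predict_proba(X)[np.arange(len(labels)), labels]
    suspects = np.argsort(confidence)[:100]
    labels[suspects] = y_true[suspects]
    print(f"cycle {cycle}: accuracy vs. true labels = {model.score(X, y_true):.3f}")
```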

To summarize, the most important ingredients of good data are:

1) A systematic process, governed by DMOps

2) Guidelines that ensure label consistency, so that human annotators' subjectivity does not bias the data

3) A tool that makes it easy and efficient to produce data

4) Data that has undergone a continuous cleaning process, with the model's results as its guide

In this way, good data can be defined as data produced within a mutual virtuous cycle in which data quality, guidelines, and model performance improve together through the cleaning process. Its value will ultimately be judged by models in the market. In the end, I believe that those who excel at both models and data, even among AI companies, will be the ones to thrive.

Conclusion (What about Upstage?)

Upstage is a company that excels at both models and data. Our "Upstage AI Pack" combines real-world data-centric AI techniques in a user-friendly, all-in-one package. It's designed to simplify AI system creation, even for those new to the field. By using our AI Pack, you can create not just good data, but great data that reflects real-world scenarios. Stay tuned for our next post, where we'll dive into DMOps and data management.