Three issues with handling data in AI models (Data! Data! Data!)
2022/11/14 | Written By: Sungmin Park
In our previous post, we walked through the steps involved in creating AI models for services. Data comes up constantly in AI model development for service improvement, yet while models and optimization methods are covered in school courses and online lectures, practical methodologies for data are hard to find because the work itself is hard. Let's take a closer look at why data-related tasks are so difficult when developing AI models.
Why Data-Related Tasks in AI Model Development Prove Difficult
1. The Best Approach is Unknown
To ensure the quality of AI models for services, a lot of effort goes into handling data, both before and after service launch. However, compared to the many AI papers focusing on model structures and learning methods, relatively few deal with data. This lack of precedent means you may need to tread new ground and invest more time than expected.
As the saying goes, "Few will teach you how to improve your dataset, since academia doesn't have many resources to gather meaningful data." The task is difficult for three main reasons: high-quality data is hard to gather, labeling is expensive, and building a reliable dataset takes a long time.
According to an April 2021 survey of UC Berkeley graduates who majored in AI, the most challenging tasks when applying AI to services were model maintenance and retraining (61.1%), followed by model monitoring (60.0%) and data labeling (38.2%). All of these tasks consume large amounts of data, and not knowing how that data is actually used in real-world applications makes them harder still.
2. Data Labeling Tasks Are More Challenging Than You Think
Another reason data work is challenging is that data labeling tasks are significantly more difficult than they seem. While outsourcing data annotation might seem straightforward, it requires meticulous attention to detail to maximize performance.
One common misconception is that more data always means better model performance. It does not: performance can plateau or even decline as the volume of labeled data grows, even for models that performed well up to 100,000 labeled data points.
So, what makes good data? The first requirement for high-quality data is the absence of labeling noise. Labeling noise refers to the inconsistency in labeling results, and higher levels of noise lead to lower data quality.
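One practical way to measure labeling noise, not covered in the original text but standard practice, is inter-annotator agreement: have two people label the same samples and quantify how often they disagree beyond chance. The sketch below is a minimal example using scikit-learn's `cohen_kappa_score`; the labels themselves are invented for illustration.

```python
# A minimal sketch of quantifying labeling noise via inter-annotator
# agreement. The labels below are made up for illustration.
from sklearn.metrics import cohen_kappa_score

# Two annotators label the same 10 samples (e.g., sentiment: 0/1/2).
annotator_a = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]
annotator_b = [0, 1, 2, 0, 0, 2, 1, 1, 0, 2]

# Cohen's kappa corrects raw agreement for chance; values near 1.0
# indicate consistent labels, values near 0 indicate noisy labels.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

A low kappa on a pilot batch is a signal to tighten the work guide before scaling up the labeling effort.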
Let's take regression analysis as an example to clarify. Assume the data we have collected is used to model the "true result," represented by the green line on the graph above. With only a small amount of noisily labeled data, different model structures or learning methods can produce very different fits, and generalization performance drops sharply. With a large dataset, even one with noisy labels, the fitted models tend to approximate the true result much more closely, because there are enough clean samples to offset the labeling noise. As a rule of thumb, for training to effectively ignore improperly labeled data, you need at least twice as many accurately labeled samples. The short simulation after the list below makes this concrete.
Small data, clean labels, balanced data: even with a small quantity of data, a good model can be achieved if the data is well distributed.
Small data, clean labels, imbalanced data: generalization performance can degrade depending on the model structure or learning method.
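To make the regression example concrete, here is a small simulation, a sketch under illustrative assumptions of ours (the "true result" is sin(x); the noise level and polynomial degree are arbitrary choices, not from the article): the same model is fit on a small and a large noisily labeled sample and scored against the clean function.

```python
# Illustrative simulation: how sample size interacts with label noise.
# The "true result" is y = sin(x); labels get Gaussian noise added.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def fit_and_score(n_samples, noise_std=0.5):
    # Draw noisily labeled training data from the true function.
    x = rng.uniform(0, 2 * np.pi, size=(n_samples, 1))
    y = np.sin(x).ravel() + rng.normal(0, noise_std, size=n_samples)
    model = make_pipeline(PolynomialFeatures(degree=5), LinearRegression())
    model.fit(x, y)
    # Evaluate against the clean "true result" on a held-out grid.
    x_test = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
    return mean_squared_error(np.sin(x_test).ravel(), model.predict(x_test))

for n in (20, 100_000):
    print(f"n={n:>7}: MSE vs. true function = {fit_and_score(n):.4f}")
```

With the small sample, the noisy labels dominate the fit; with the large one, the clean samples offset the noise and the error against the true function drops.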
Even a limited amount of data can be valuable if it covers a wide variety of cases and is consistently labeled, rather than consisting of many similar samples. Let's look at a specific example to make this clearer.
The graph above assumes that data has been collected for a certain task and illustrates how intense the resulting labeling noise typically is. For frequently encountered cases, labeling noise is minimal: the content is familiar to the labelers, and the work guide was written with those cases in mind. For rare cases, however, the work guide may not address certain issues, and labelers may interpret the task differently, leading to relatively high labeling noise. This naturally happens whenever data collection ignores how often different cases occur. As a more concrete example, consider Optical Character Recognition (OCR), a technology widely used in AI services.
Let's assume that, for an OCR model, we set the labeling guideline to "mark each word with a rectangle." In the first scenario, the data can be annotated without significant issues. In the second scenario, however, the words are not clearly separated, so different workers may interpret the guideline differently: some might skip misspelled words, others might treat quotation marks as separate words. When the same image is labeled differently by different workers, the result is inconsistent annotation, which is exactly what we call labeling noise.
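To see how that inconsistency shows up numerically, here is a minimal sketch; the boxes are hypothetical values we made up, representing two annotators' rectangles around the same word (one included the quotation mark, one did not). An intersection-over-union (IoU) well below 1.0 for the same word flags a disagreement worth resolving in the work guide.

```python
# Illustrative check for OCR annotation consistency: compare the boxes
# two annotators drew around the same word. Boxes are (x1, y1, x2, y2).
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned rectangles."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical boxes: annotator B included the quotation mark, A did not.
annotator_a = (120, 40, 180, 60)
annotator_b = (112, 40, 180, 60)
print(f"IoU: {iou(annotator_a, annotator_b):.2f}")  # < 1.0 => disagreement
```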
3. Balancing Data Is Challenging
To create high-quality data, you must intentionally seek out unusual cases, collect those samples, and fold them into the labeling guide. Balancing data this way is not an easy task.
There are two methods to streamline this process:
First, accumulate experience relevant to the task. The more domain knowledge you have, the more likely you are to anticipate exceptional cases (for example, domain knowledge in voice recognition or in detecting pedestrians for autonomous driving). Tesla, for instance, meticulously manages such cases in autonomous driving by defining 221 specific triggers.
Second, acknowledge that it is impossible to know and label every possible case, and instead establish repeatable, automated processes. Such automated processes are referred to as a "Data Engine" or "Data Flywheel"; a schematic sketch follows below.
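The article doesn't prescribe an implementation, but the shape of such a loop is simple to sketch. In the toy Python below, every component (the model, the low-confidence trigger, the labeling and retraining steps) is a stand-in we invented to illustrate one flywheel iteration, not a real pipeline or API.

```python
import random

# Toy "data flywheel": all components below are invented stand-ins that
# show the shape of the loop, not a real pipeline or library.
random.seed(0)

class ToyModel:
    """Pretends to score samples; confidence is random here."""
    def confidence(self, sample):
        return random.random()

def trigger_low_confidence(sample, model, threshold=0.3):
    # Trigger: flag samples the model is unsure about
    # (cf. Tesla's 221 predefined triggers).
    return model.confidence(sample) < threshold

def label(samples):
    # Stand-in for routing flagged samples to human labelers.
    return [(s, "label") for s in samples]

def retrain(model, labeled):
    # Stand-in for retraining on the enlarged dataset.
    print(f"retraining on {len(labeled)} newly labeled samples")
    return model

def flywheel_iteration(model, production_stream, triggers):
    # 1. Monitor production traffic; collect samples that fire any trigger.
    flagged = [s for s in production_stream if any(t(s, model) for t in triggers)]
    # 2. Label the flagged samples (and update the work guide whenever a
    #    trigger exposes a case the guide does not yet cover).
    labeled = label(flagged)
    # 3. Retrain, then go around the loop again on the next batch.
    return retrain(model, labeled)

model = ToyModel()
stream = [f"sample_{i}" for i in range(100)]
model = flywheel_iteration(model, stream, [trigger_low_confidence])
```

Each pass through the loop concentrates labeling effort on the rare, hard cases, which is exactly where the imbalance problem described above lives.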
Addressing the various challenges of handling data for AI models is a complex task, which is why tools like the "Data Engine" and "Data Flywheel" mentioned above exist. For those who want more hands-on experience with AI model development, Upstage offers Dataverse, a freely accessible open-source project designed to streamline extract, transform, and load (ETL) pipelines in Python. Explore Dataverse at the link below and shape the future of data processing with us!