Reinterpreting the History of NLP-based AI through a Data-Centric Perspective
2023/06/16 | Written By: Chanjun Park (AI Research Engineer)
With the advent of AlphaGo, ChatGPT, HyperClova, and others, we have now entered an era where artificial intelligence is familiar to the public. AI might feel like a discipline that emerged only in the 21st century, but in fact, the term 'artificial intelligence' was born in 1956 during a workshop at Dartmouth College in the United States, initiated by Professor John McCarthy. This means that the field of artificial intelligence has been around for nearly 70 years. In this article, we aim to reinterpret the history of AI, especially the history of natural language processing, from a Data-Centric AI perspective. From a data perspective, this article explores the progression of AI through various stages: starting from rule-based approaches, advancing through statistical and machine learning methods, and culminating in the era of deep learning and Large Language Models.
What is Natural Language Processing (NLP)?
Natural language refers to 'human language' (as opposed to artificial languages like Python, C, etc.), and natural language processing (NLP) is the computer's handling of this human language. It has evolved through fundamental research such as morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis, which in turn has led to applied research areas like machine translation, document summarization, question answering, and dialogue systems. Currently, many subfields of natural language processing are converging toward super-large language models (Large Language Models, LLMs).
The “History of Language Models” for Everyone
In extreme terms, computers understand nothing but 0s and 1s. This means they cannot directly understand human language. So, how can we make computers understand human language? We need a process that converts the system of human knowledge representation, namely the language expression system, into a knowledge representation system that computers can understand. This is precisely the role of a language model. The development of language models has been driven by the question, 'How can we represent human language in a knowledge representation system that computers can understand?' To put it more simply, scholars have continually contemplated how to effectively translate characters (human language) into numbers (something a neural network can process). The ultimate answer to this question appears to be the arrival of the Large Language Model era.
Looking at the technical developments based on these questions, the traditional approach for word representation has primarily used one-hot (or one-of-N) encoding. One-hot encoding is a method of language representation where only the word you want to express is marked with a 1, and all other words are marked with a 0. This requires a vocabulary, and the number of words in the vocabulary naturally becomes the dimension of the vector. For example, if the vocabulary contains only two words, Dog and Cat, one-hot encoding produces two-dimensional vectors: Dog is represented as [1, 0] and Cat as [0, 1].
However, one-hot encoding has the limitation of not being able to represent relationships between words: every pair of distinct word vectors is orthogonal, so Dog is no more similar to Cat than to any other word. Additionally, since the vocabulary size equals the vector dimension, the vectors become extremely high-dimensional and sparse, making them expensive in terms of memory.
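To make this concrete, here is a minimal Python sketch of one-hot encoding with the toy Dog/Cat vocabulary; the helper names are illustrative, not from any particular library.

```python
# A minimal sketch of one-hot encoding with a toy two-word vocabulary.
import numpy as np

vocab = ["dog", "cat"]                      # vocabulary size = vector dimension
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word: str) -> np.ndarray:
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

dog, cat = one_hot("dog"), one_hot("cat")
print(dog, cat)                             # [1. 0.] [0. 1.]
print(np.dot(dog, cat))                     # 0.0 -> orthogonal: no notion of similarity
```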
Word2Vec
To solve these problems, knowledge representation systems incorporating semantic information began to emerge. This started with Professor Yoshua Bengio's 2003 paper 'A Neural Probabilistic Language Model,' and the most representative example is Word2Vec, published ten years later in the paper 'Efficient Estimation of Word Representations in Vector Space.' These studies share the goal of mapping words to a dense real-valued vector space while reflecting the meanings of the words. It is a paradigm in which similar words are trained to lie close to one another in the vector space. Based on this paradigm, various other lines of research such as GloVe and FastText followed.
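As a rough illustration, the sketch below trains a tiny Word2Vec model with the gensim library (one common implementation, not something the papers above prescribe); the corpus and hyperparameters are placeholders chosen only to show that similar words end up close together in the dense vector space.

```python
# A rough sketch of training Word2Vec with the gensim library; the corpus and
# hyperparameters here are illustrative assumptions.
from gensim.models import Word2Vec

# Tiny illustrative corpus of tokenized sentences.
sentences = [
    ["the", "dog", "barks", "at", "the", "cat"],
    ["the", "cat", "chases", "the", "dog"],
    ["a", "puppy", "is", "a", "young", "dog"],
]

# vector_size is the dimensionality of the dense embedding space.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["dog"].shape)                 # (50,) dense real-valued vector
print(model.wv.similarity("dog", "cat"))     # similar words end up close in the space
```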
Unfortunately, this method also had a limitation: it cannot take context into account. For example, consider two sentences: ‘I cannot bear the thought of losing my keys again.’ and ‘There is a bear in the woods behind our cabin.’ To represent these sentences in a knowledge representation system a computer can understand, they need to be converted into high-dimensional vectors. With Word2Vec, however, the word 'bear' receives exactly the same vector in both sentences, because the model assigns a single vector per word regardless of context. Why is this problematic? The 'bear' in the first sentence is a verb meaning to endure an ordeal or difficulty, while the 'bear' in the second refers to a large furry mammal. In other words, the same word has different meanings.
Consequently, a system is required that understands context and represents homonyms with different high-dimensional vectors. Such a system yields a language representation that captures context as well as meaning, paving the way for more advanced natural language processing models. This led to research on 'knowledge representation systems with contextual information,' starting with the paper 'Deep contextualized word representations,' which introduced ELMo (Embeddings from Language Models).
ELMo
ELMo brought about two major paradigms: pretraining and bidirectional learning. First, ELMo popularized the technique of taking a pretrained language model and fine-tuning it for a specific task. This is the Pretrain-Finetuning Approach, which should be familiar to those in the field of natural language processing. Subfields of natural language processing such as morphological analysis, syntactic analysis, document summarization, and machine translation all deal with language. Put simply, if a well-developed language model that integrates contextual information exists and is used effectively, performance naturally improves across these subfields. In other words, by pretraining to teach the computer a context-aware knowledge representation system and then fine-tuning that system with task-specific data, we establish a paradigm that inherently boosts the model's performance.
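As a hedged sketch of the Pretrain-Finetuning flow, the snippet below loads a pretrained checkpoint with the Hugging Face transformers library and performs one illustrative fine-tuning step on placeholder labeled data; the checkpoint name, tiny dataset, and hyperparameters are all assumptions for illustration, not the article's own recipe.

```python
# A hedged sketch of the Pretrain-Finetuning approach with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1) Load a model that has already been *pretrained* on large unlabeled text.
checkpoint = "bert-base-uncased"            # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 2) *Fine-tune* it on a small task-specific labeled dataset (placeholder data).
texts, labels = ["great movie", "terrible plot"], [1, 0]
batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
loss = model(**batch, labels=torch.tensor(labels)).loss
loss.backward()
optimizer.step()                            # one illustrative gradient step
```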
Secondly, bidirectional learning, or biLM (Bidirectional Language Model), involves training in both the forward direction (from the beginning of a sentence to the end) and the backward direction (from the end to the beginning) to capture context. This approach trains a forward and a backward language model simultaneously and concatenates their outputs, enabling a richer and more effective language representation.
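The following is a greatly simplified PyTorch sketch of that biLM idea: one LSTM reads the sentence forward, another reads it backward, and their outputs are concatenated. The dimensions and single-layer setup are illustrative assumptions, not ELMo's actual architecture.

```python
# A greatly simplified PyTorch sketch of the biLM idea behind ELMo.
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 64, 128
embed = nn.Embedding(vocab_size, emb_dim)
forward_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
backward_lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 7))          # a dummy 7-token sentence
x = embed(tokens)

fwd_out, _ = forward_lstm(x)                            # reads start -> end
bwd_out, _ = backward_lstm(torch.flip(x, dims=[1]))     # reads end -> start
bwd_out = torch.flip(bwd_out, dims=[1])                 # re-align to token order

contextual = torch.cat([fwd_out, bwd_out], dim=-1)      # ELMo-style concatenation
print(contextual.shape)                                 # torch.Size([1, 7, 256])
```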
However, because ELMo merely combines separately trained forward and backward language models, it was not a truly 'bidirectional' language model, and being based on LSTMs, it inherited several of their limitations as well.
The Advent of Transformer-Based Language Models
After ELMo, the era of Transformer-based language models began. Notably, OpenAI's GPT and Google's BERT emerged. GPT is based on the Transformer Decoder, while BERT is based on the Transformer Encoder. Natural language processing can be divided into natural language understanding and natural language generation, where the Encoder is for understanding, and the Decoder is for generation.
In terms of language representation, BERT trains using the Masked Language Model (MLM) approach, where it masks random tokens in an input sentence and predicts what those tokens are.
For training data, roughly 15% of the input tokens are simply masked, and the original tokens themselves serve as labels, so no separate annotation process is needed. This makes pretraining on extensive datasets possible. Consequently, through large-scale pretraining and genuinely bidirectional learning via MLM, the model can develop a more refined language representation system. In contrast to ELMo, which approximated bidirectionality by merging two unidirectional models, BERT stands out as a genuinely bidirectional model, learning dependencies in both directions within a single framework.
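As a small, hedged illustration of MLM in action, the snippet below uses the Hugging Face fill-mask pipeline with a publicly available BERT checkpoint (an assumption for illustration) on the two 'bear' sentences from earlier; the masked word is predicted from both its left and right context at once.

```python
# A small illustration of Masked Language Modeling with the Hugging Face
# fill-mask pipeline; "bert-base-uncased" is just one openly available checkpoint.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the hidden token from the left AND right context at once,
# so the two "bear" sentences from earlier receive very different predictions.
for sentence in [
    "I cannot [MASK] the thought of losing my keys again.",
    "There is a [MASK] in the woods behind our cabin.",
]:
    top = fill_mask(sentence)[0]
    print(sentence, "->", top["token_str"], round(top["score"], 3))
```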
GPT, on the other hand, is easier to understand as a natural language generation model: it predicts the next word from the sequence of preceding words. The size of the GPT models and their training data have grown progressively, from GPT-2 and GPT-3 to the current GPT-4.
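For comparison, here is a minimal sketch of that next-word prediction using the openly available GPT-2 checkpoint via the Hugging Face pipeline; GPT-2 is used only because it is freely accessible, not because the article ties its argument to that specific model.

```python
# A minimal sketch of autoregressive (next-token) generation with GPT-2.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Natural language processing is", max_new_tokens=20)
print(result[0]["generated_text"])   # each new token is predicted from the tokens before it
```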
After the dominance of BERT and GPT came a period with a proliferation of diverse research, in two major directions. The first is to 'increase the size of the model'; the second is, 'rather than increasing the size, to compensate for weaknesses or make models lightweight enough for practical services.' The fruits of the first direction are models like ChatGPT, GPT-4, and HyperClova, leading us into the era of Large Language Models (LLMs).
Representative models of the second direction include ALBERT, Linformer, and Performer, along with research on quantization, distillation, and pruning. Additionally, there has been criticism that language models often lack basic human commonsense knowledge (e.g., responding that one can walk from Korea to the USA when asked), and many studies have pursued neural symbolic research, which integrates symbolic knowledge into neural networks. These two directions are showing similar trends even in the era of LLMs.
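As one hedged example of the 'lightweight' direction, the sketch below applies post-training dynamic quantization in PyTorch to a toy model; the layer sizes are placeholders, and this is only one of the many compression techniques (quantization, distillation, pruning) mentioned above.

```python
# One example of the "lightweight" direction: post-training dynamic quantization
# with PyTorch. The toy model here is illustrative, not a production recipe.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Convert Linear layers to int8 for smaller, faster CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)         # same interface, reduced memory footprint
```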
Essential Elements of the LLM Era
We are now in the era of Large Language Models (LLM). To create an LLM, four key elements are necessary. Firstly, 'infrastructure' is crucial. This includes cloud computing on a massive scale, supercomputers, and data centers. In other words, the hardware for LLMs and the operational environment to support them are needed. This suggests a shift in the business paradigm towards AI and cloud computing.
Secondly, the 'backbone model.' For instance, ChatGPT was trained on top of GPT-3.5, and the upcoming HyperClova X and SearchGPT are reportedly based on HyperClova.
The third element is tuning technology: various tuning techniques for cost-efficiency. The big question is 'How do we make it lightweight?' Semiconductor technology for optimizing matrix operations becomes crucial here; Naver's recent MOU with Samsung Electronics highlights the importance of semiconductor technology, as does NVIDIA's surging stock price.
Last but not least is high-quality, abundant training data. This includes instruction data, human feedback data, and the large-scale data for training the backbone model.
Based on these four elements, various companies are competing to create LLMs. As models and datasets grow and get refined, the knowledge representation system for language improves beyond imagination, leading to the emergence of diverse capabilities in models.
In conclusion, a Language Model (LM) 'models' 'language' into a system that a computer can understand. Tracing language models from one-hot encoding to today's GPT-4 should help you grasp the overall arc of this history.
Data Perspective on Defining Humans
After exploring the definition of natural language processing in the first section and the flow of language model-based natural language processing in the second, we now turn to examining the history of natural language processing from a Data-Centric AI perspective.
Before reinterpreting the history of artificial intelligence from the relational perspective of humans and data, we’d like to emphasize that humans and data have been inseparable from the rule-based era to the LLM era.
From a data perspective, the definition of humans can be divided into two categories: the first is 'experts,' and the second is 'the public,' which includes everyone. These two kinds of humans have driven the history of AI data.
Rule-Based Natural Language Processing: The Era of 'Experts'
The rule-based era was the era of 'experts.' During this time, the role of linguists was crucial. Morphological analysis, syntactic analysis, and WordNet were areas only language experts could contribute to. Expertise was extremely important in this era as data had to be represented based on linguistic knowledge.
Statistical-Based Natural Language Processing to the Era of Machine Learning and Deep Learning: The Era of 'the Public'
In contrast, from a data perspective, the era of statistical natural language processing, machine learning, and deep learning was the era of 'the public.' Put simply, our role was significant in this era. As most of our readers know, the convergence of GPUs, algorithms (open source), and big data accelerated deep learning. Among these, big data was in fact created by us. Texts on Wikipedia, Naver blogs, cafes, Knowledge iN (roughly Korea's Quora), and countless other web pages: all of us were unconsciously generating data. This vast amount of data laid the foundation for statistical methodologies to dominate before the transition to deep learning.
The Era of Pretrain-Finetuning: The Age of 'the Public' + 'Experts'
Around the middle of the deep learning era, the Pretrain-Finetuning technique became popular. This era was marked by the coexistence of experts and the general public. Pretraining, as the name suggests, is preliminary learning: the model is trained on large corpora like Wikipedia, created by the general public. The pretrained model is then fine-tuned for the specific tasks users want, such as morphological analysis, syntactic analysis, and document summarization. During this fine-tuning process, data created by experts is still used.
With the advent of the Pretrain-Finetuning technique, the role of data naturally evolved to require a system that could objectively evaluate multiple tasks at once. This is what we know as benchmark data. In this era, benchmark datasets such as Korea's KLUE and the English-language GLUE, SuperGLUE, and SQuAD began to emerge. It was a period when experts and the public both contributed, coinciding with the arrival of data capable of effective evaluation.
The Era of Neural Symbolic: The Age of 'Experts'
During this period, the neural symbolic paradigm also emerged. Notable drawbacks of deep learning models include their deficiency in commonsense knowledge, limited reasoning capabilities, and lack of interpretability. As previously mentioned, one cannot walk from Korea to New York; it requires air or sea travel. Manhattan is located within New York. Such commonsense information is missing from deep learning models. Neural symbolic research constructs this commonsense information in the form of a Knowledge Graph and injects it into deep learning models. This type of data is truly the domain of experts: for instance, creating graph data with commonsense facts like 'the Mona Lisa was painted by Da Vinci and is currently in the Louvre Museum.'
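To make the idea tangible, here is a minimal sketch of commonsense facts stored as (subject, relation, object) triples, the basic unit of a Knowledge Graph; the relation names and lookup helper are hypothetical illustrations, not a real neural symbolic pipeline.

```python
# A minimal sketch of commonsense facts stored as (subject, relation, object)
# triples, the basic unit of a Knowledge Graph. Relation names are illustrative.
triples = [
    ("Mona Lisa", "painted_by", "Leonardo da Vinci"),
    ("Mona Lisa", "located_in", "Louvre Museum"),
    ("Manhattan", "part_of", "New York"),
]

def facts_about(entity: str):
    """Return every triple in which the entity appears as subject or object."""
    return [(s, r, o) for (s, r, o) in triples if entity in (s, o)]

print(facts_about("Mona Lisa"))
```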
The Era of Large Language Models Part 01 - The Era of 'the Public's' 'Unconscious' Data Creation
Subsequently, the era of Large Language Models (LLMs), trained with huge models and big data, emerged. Well-known examples include GPT-3 and HyperClova. By scaling up models and data, an era began in which big models could handle a surprisingly diverse range of tasks. Unlike the previous Pretrain-Finetuning approach, a single model can now handle various tasks without fine-tuning.
How can we define this era from a data perspective? Looking at HyperClova's training data as a typical example, it includes data from Naver blogs, Naver cafes, Naver news, Naver Knowledge iN, and so on. In essence, the era of LLMs has also progressed by training on massive data created by us. In other words, we have continuously, and unconsciously, been generating data for AI training.
The Era of Large Language Models Part 02 - The Era of 'the Public's' 'Conscious' Data Creation
In the era of LLMs, the most innovative product is, as many of you know, 'ChatGPT.' The key to the ChatGPT era is Human Feedback Data, which doesn't necessarily come from experts; it involves all of us.
I previously mentioned that the era of LLMs was characterized by our unconscious data creation. With ChatGPT, however, we have entered an era in which we create 'conscious data' by directly providing feedback. This means the role of data maker is no longer exclusive to experts; all of us can now participate in developing AI models. We are in a 'Data for all' era. As we consciously provide feedback, ChatGPT becomes better able to generate text that resembles human writing.
The history of AI I have discussed here is entirely from the perspective of data. I want to emphasize that the relationship between people and data has been indispensable since the rule-based era, and that the role of humans in data will become even more crucial in the future.
In Conclusion
Why is an LLM like ChatGPT such a hot topic? Personally, I think it may be one of the only cases where a research impact and a world impact (a paradigm shift) have occurred simultaneously. The research impact signifies a technical paradigm shift, while the world impact refers to an event that everyone in the world can feel. Since the advent of deep learning, the biggest research impacts in natural language processing have been 'Word2Vec,' the 'Transformer (Attention),' and 'Reinforcement Learning from Human Feedback (RLHF).' Word2Vec revolutionized the knowledge representation system of language. The Transformer displaced CNN- and RNN-based approaches with attention, and RLHF shifted the paradigm from purely probability-based generation to generation refined by reinforcement learning. RLHF is the technology applied in ChatGPT.
As for world impact, if I were to pick three, they would be IBM Watson, AlphaGo, and ChatGPT. IBM Watson was the first AI to beat human champions on a quiz show, AlphaGo defeated Lee Sedol 9-dan at Go, and ChatGPT has surpassed 100 million monthly users. ChatGPT is particularly significant because it brings the research and world impacts together.
The core of ChatGPT is human feedback data; in other words, quality data. The key competence for AI companies going forward will be whether they have secured quality data. For companies doing AI business, it will be crucial to have a structure that makes data easy to secure and, further, a process for creating quality data.
With the influence of ChatGPT, I believe various new revenue models will emerge, with subscription-based AI business models coming to the forefront. An important point here is that I believe we are entering an era of subscribing to models rather than to data. Sharing data can be burdensome for companies: good data is a competitive edge, and policies such as personal information protection must be considered. That is why I anticipate an era of AIaaS (AI as a Service), in which good data is created and then shared in the form of models. Advertising paradigms will also change: beyond direct and indirect advertising, we may enter an era in which money is made through generated advertising.
Ultimately, securing a data team will become a competitive edge for companies. Excelling at both models and data, and balancing and harmonizing the two, will be key to survival. The emergence of new revenue models, a clear direction, and new job categories all point to data being at the center of the AI era, just as we moved from the SW 1.0 paradigm of 'code in, SW out' to the SW 2.0 era of 'data in, SW out.' I hope this article is helpful to many, and I will conclude by emphasizing the core of this piece: DATA! DATA! DATA!