Everything You Need To Know About How ChatGPT Works
A step-by-step breakdown of how ChatGPT operates, based on an analysis by Assembly AI.
It’s been two months since the release of ChatGPT, which has already surpassed 100 million users, yet few people really understand how the chatbot actually works. We found an article by Assembly AI that analyzes the principles behind the chatbot, and we’re here to break it down step by step to help you understand how ChatGPT operates.
First, let’s try asking ChatGPT how it works.
Q: How does ChatGPT work?
A: ChatGPT is a variant of the Generative Pre-trained Transformer (GPT) architecture, developed by OpenAI. It is a deep learning model that has been trained on a massive amount of text data from the internet to generate human-like text.
The model consists of an encoder-decoder architecture, with the encoder being a stack of Transformer blocks and the decoder being a linear layer. During training, the model is trained to predict the next word in a sequence of text given the previous words, and it's trained on a diverse range of text, including news articles, books, and websites, to help it generate a wide variety of outputs.
At inference time, when you ask a question or prompt the model, it uses the encoding of the input to generate a continuation of the text. The decoder uses the encoder's output to generate a probability distribution over the vocabulary, which is then used to select the most likely next word. This process is repeated until the model generates the desired number of words or reaches an end-of-sequence token.
Overall, ChatGPT leverages the large amounts of text data it was trained on to generate human-like responses to a wide range of questions and prompts, allowing it to converse with users in a way that feels natural and intuitive.
While this response makes sense, it’s still hard to follow, with terms such as “encoder-decoder architecture” and “Transformer blocks.” We don’t know what these terms mean at first glance, and we don’t expect you to either. (The answer isn’t even entirely accurate; GPT models are actually decoder-only Transformers, an early reminder to fact-check what the chatbot tells you.) With that being said, let’s start breaking down how ChatGPT works.
Capability vs Alignment
In the context of machine learning, capability is how well ChatGPT can perform its functions or tasks. Alignment refers to what we humans want the chatbot to do versus what it was actually trained to do. For example, if you ask ChatGPT to write a blog post, it will write one to the best of its capability, simple and outlined, but it may not meet a human’s expectations of a detailed, well-researched, and grammatically correct piece.
ChatGPT’s original Large Language Model, GPT-3, is misaligned because it was trained on raw text data from the internet. The model’s objective is simply to predict each new word in a sequence so that it can string together full sentences. Humans, however, use context and real-world knowledge to produce the best text for a given situation, and that next-word objective alone doesn’t capture this.
The overall effects of misaligned language models include failing to follow specific instructions, making up wrong or fake facts, generating biased output, and a general lack of interpretability.
Training and Misalignment
Language models are trained based on “next-token-prediction” and “masked-language-modeling.”
For “next-token-prediction” training, the model is given a sequence of words and trained to predict the next word in the sequence. For example, given the sequence “The girl was reading at the library. She put her book back on the ____”, the model could come up with “shelf,” “table” or “counter” based on the current sentence alone. Given the previous sentence as additional context, it can estimate that the most likely answer is “shelf.”
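To make this concrete, here is a minimal sketch of next-token prediction using the small, open-source GPT-2 model from Hugging Face’s transformers library. GPT-2 is an earlier, much smaller cousin of the models behind ChatGPT, so treat this as an illustration of the idea rather than ChatGPT’s actual code.

```python
# Score candidates for the next word using GPT-2 (a small, open-source
# stand-in for the much larger models behind ChatGPT).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The girl was reading at the library. She put her book back on the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# The logits at the last position score every token in the vocabulary
# as a candidate for the next word.
probs = torch.softmax(logits[0, -1], dim=-1)

# Print the model's top guesses; we'd expect words like "shelf" to rank high.
top = torch.topk(probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10s}  p={prob:.3f}")
```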
For “masked-language-modeling” training, the model is trained to fill in a word that is missing from the middle of a sequence. The missing word is replaced with a special [MASK] token until the model predicts it. For example, given the sequence “The girl was [MASK] at the library,” the model can estimate from the surrounding words that the masked token is “reading.”
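The same idea in code, again using an open-source stand-in (here BERT, a well-known masked language model) rather than anything ChatGPT-specific:

```python
# Fill in a masked word using BERT via the transformers "fill-mask" pipeline.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the word behind [MASK] from the words on both sides of it.
for guess in fill_mask("The girl was [MASK] at the library."):
    print(f"{guess['token_str']:>10s}  p={guess['score']:.3f}")
```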
Because of how they are trained, these models don’t have a deep understanding of language and can sometimes predict the wrong word. That’s why it’s important to fact-check and edit the content that ChatGPT produces. However, ChatGPT can use human feedback to fix issues with alignment.
Reinforcement Learning From Human Feedback (RLHF)
Reinforcement Learning from Human Feedback, or RLHF, can be broken down into three key steps. We’ll list them below, then go on to further explain each one in a digestible way.
1. Supervised Fine-Tuning
2. Reward Model Creation
3. Proximal Policy Optimization
Before diving in, we want to note that while step one only occurs once, steps two and three can occur repeatedly.
Let's start with supervised fine-tuning. A pre-trained model is tuned on a small set of demonstration data so that it can create a policy for generating output. Now what on earth does that mean? Basically, a model that has already been trained on a large dataset is tuned on a smaller set of practice data, so that the system can form a plan for how it will respond to future prompts. To collect the demonstration data, prompts are selected from a set of researcher-developed prompts (and, in the case of ChatGPT, real prompts that users submitted to GPT-3), and human labelers write out the expected responses. This is a slow process, so the result is a pretty small dataset to tune the model on.
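As a rough illustration, here’s what that tuning loop boils down to. The model and the single demonstration pair below are made-up stand-ins, not OpenAI’s actual data or training code.

```python
# Toy supervised fine-tuning: nudge a pre-trained causal language model
# toward (prompt, human-written response) demonstration pairs.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# A hypothetical demonstration pair; real datasets contain thousands.
demonstrations = [
    ("Q: What is the capital of France?\nA:", " Paris is the capital of France."),
]

model.train()
for prompt, response in demonstrations:
    # Setting labels equal to input_ids trains the model to predict each
    # next token of the full demonstration text.
    ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```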
The second step is the creation of the reward model. Since it would be a very slow and tedious process to have human labelers write a massive dataset of possible outputs, the labelers instead rank various outputs of the supervised fine-tuning (SFT) model in order to better align the model with user preferences. What happens is, the SFT model produces multiple outputs for a single prompt, then the labelers rank those outputs according to provided guidelines, as well as their own individual preferences (which we will discuss further in a little bit). As a result of this more efficient process, the researchers now have a much larger dataset to work with.
The goal here is to create a model that can automatically imitate user preferences, so the new, larger dataset is used to train the reward model to rank some of the original model’s outputs on its own. This is essentially what the human labelers did, so the system is now being tested on whether it can mimic their preferences.
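A common way to train such a reward model is with a pairwise ranking loss: for each pair of responses the labelers compared, the model’s score for the preferred response should beat its score for the rejected one. Here’s a sketch of that loss; the scores below are made up for illustration.

```python
# Pairwise ranking loss for a reward model: reward the model for scoring
# the labeler-preferred ("chosen") response above the "rejected" one.
import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor,
                        rejected_scores: torch.Tensor) -> torch.Tensor:
    # Minimizing -log(sigmoid(r_chosen - r_rejected)) pushes the chosen
    # response's score above the rejected response's score.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Illustrative scores for three compared pairs of responses.
chosen = torch.tensor([2.1, 0.4, 1.3])
rejected = torch.tensor([1.0, 0.9, -0.2])
print(reward_ranking_loss(chosen, rejected))  # smaller when rankings agree
```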
The third step is Proximal Policy Optimization, or PPO. A policy is the strategy a system employs to pursue its goals, and PPO is an algorithm used to refine that policy. PPO is considered an “on-policy” algorithm because it learns from and updates the current policy directly, rather than only learning from past experiences, which improves its ability to adapt in real time. Another advantage of PPO is that it maintains the system’s stability by only allowing the policy to change by a limited amount at each update.
In this step, the reward model is used to further fine-tune the SFT model to create what is called a policy model. There are a bunch of functions involved in this process, but essentially, the system’s policy is updated based on the estimated value of different actions. The model is presented with a prompt, it generates a response, a reward is produced based on the quality of that response, and the policy is updated. A KL penalty, which measures how far the updated policy has drifted from the original SFT model, is also applied; what matters here is that it helps keep the policy from over-optimizing against the reward model.
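To give a feel for how the reward and the KL penalty fit together, here’s a simplified sketch. The beta coefficient and the log-probabilities are illustrative stand-ins, not OpenAI’s actual values.

```python
# Simplified per-response reward used during the PPO step: the learned
# reward model's score minus a KL penalty for drifting from the SFT model.
import torch

def ppo_reward(reward_model_score: torch.Tensor,
               policy_logprobs: torch.Tensor,
               sft_logprobs: torch.Tensor,
               beta: float = 0.02) -> torch.Tensor:
    # Per-token KL estimate between the current policy and the SFT model.
    kl = (policy_logprobs - sft_logprobs).sum()
    # A high reward-model score is good; straying far from the SFT model
    # is penalized, discouraging over-optimization of the reward model.
    return reward_model_score - beta * kl

score = torch.tensor(1.7)                      # reward model's score
policy_lp = torch.tensor([-1.2, -0.8, -2.0])   # token log-probs, new policy
sft_lp = torch.tensor([-1.5, -0.9, -1.8])      # token log-probs, SFT model
print(ppo_reward(score, policy_lp, sft_lp))
```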
Performance Evaluation
The whole process of RLHF is based on interaction with humans, so why should performance evaluation be any different? ChatGPT was evaluated by having the labelers rank the quality of its responses to a variety of prompts, based on the following criteria:
Helpfulness: Can the system follow, or even infer, directions?
Truthfulness: Does it avoid hallucinations (made up facts)?
Harmlessness: Are the system’s responses appropriate?
Evaluations of these criteria are based on datasets including TruthfulQA, RealToxicityPrompts and CrowS-Pairs.
The model was also evaluated for zero-shot performance on question answering, reading comprehension and summarization, meaning it was tested on tasks without being given any worked examples first. In this area, ChatGPT may have suffered from what is called an alignment tax: in its effort to align with human preferences, the system may have regressed on some of these tasks compared to previous models.
Shortcomings of the Methodology
The most prevalent limitation of this process is that the data, from prompt selection to output ranking and performance evaluation, is influenced by the preferences and biases of the labelers and researchers. This is an important consideration because the labelers and researchers may not represent a balanced view of the population at large. Everyone has different preferences, and there is no objectively correct order when it comes to ranking outputs. (If you are interested in seeing how some new research is looking at this problem, take a look at this paper.)
Another interesting limitation is that there is no control study to compare this process against in order to confirm that RLHF is what produced the alignment and performance improvements. While hand-curating a larger SFT dataset would, as mentioned earlier, consume an enormous amount of time, money and energy, without that control study it is difficult to prove that RLHF is what led to the positive outcomes in aligning the system.
We hope this step-by-step breakdown helped you better understand how ChatGPT operates, as well as what to look out for when using the application. Despite some faults, the chatbot has proven helpful in speeding up thought and creative processes. We expect to see many improvements and changes to ChatGPT in the next few months, and we’ll be here to report on every step of the way.
Best,
Ariana and Nina for the Don’t Count Us Out Yet Team