GPT-3 and beyond - Part 1: The basic recipe

Fabian Gringel

Is this what your basement looks like? If not, don't try to train GPT-3 on your own.

GPT-3 is a neural network capable of solving a wide range of natural language processing (NLP) tasks, presented by OpenAI in the summer of 2020 (scaling up its predecessors GPT and GPT-2). It has set new state-of-the-art results on various tasks and is considered by many a substantial step toward artificial general intelligence. “General intelligence” refers to the capability of not only behaving intelligently with respect to one set task, but also adapting to and accomplishing new, unforeseen tasks.

This blog article is the first of a two-article-series on GPT-3. In this first article I will explain

  1. how GPT-3 works,

  2. what it is good at and why some people think it’s dangerous,

  3. and how you can try out a GPT-3-like model for free.

The second part will deal with GPT-3’s weaknesses and where to expect the next breakthrough in the future.

1. How GPT-3 works

The basic recipe

Maybe the most surprising thing about the approach OpenAI took with GPT, GPT-2 and now GPT-3 is that there is no complicated idea behind it. The recipe contains only three main ingredients:

  1. Take a state-of-the-art NLP network with as many parameters as computationally feasible (for all GPT versions a Transformer-based model was used).

  2. Curate a huge text corpus, e.g. consisting of books, Wikipedia etc. (again: the bigger, the better).

  3. Train the model on a simple language modeling task: “predict the next word”.

A typical instance of this task looks as follows: Given a text, e.g.

Its central figure is the aging Dubslav von Stechlin, widowed for thirty years and living alone in his somewhat dilapidated mansion near the shore of Lake Stechlin.

we take its first n words, e.g.

Its central figure is the aging Dubslav von Stechlin, widowed for thirty

and train the model to correctly guess the next word (“years”). (What the model actually learns is a conditional probability distribution over all possible next words, i.e. over the vocabulary. A model with this ability is called a language model.)
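To make “a conditional probability distribution over the vocabulary” concrete, here is a minimal toy sketch. It has nothing to do with GPT-3's architecture: it uses simple bigram counts instead of a neural network, and a made-up three-sentence corpus, but it produces exactly the kind of object a language model learns.

```python
from collections import Counter, defaultdict

# Toy corpus; a real language model is trained on billions of words.
corpus = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the dog sat on the rug ."
).split()

# Count how often each word follows each other word (bigram statistics).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_distribution(word):
    """Conditional distribution over the vocabulary, given the previous word."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

# "cat" gets the highest probability: it follows "the" twice in the corpus.
print(next_word_distribution("the"))
```

GPT-3 does the same thing in spirit, except that it conditions on a long context window rather than a single previous word, and estimates the distribution with billions of learned parameters rather than raw counts.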

Now, assuming that we train a huge version of a very powerful NLP model with an abundance of training data, we can be quite sure that the model will become good at the language modeling task.

The “traditional” approach to exploiting the language understanding the model has acquired would be to adapt the model architecture to the task at hand (e.g. question answering, translation, sentiment analysis) and fine-tune it. In this setting, language modeling plays the role of a pre-training task, and since it consists in generating the next word, OpenAI coined their approach “Generative Pre-Training”. Starting with GPT-2, however, they took a different approach, which I will describe next.

Language modeling as multi-task learning

Following the hypothesis “Language Models are Unsupervised Multitask Learners”, one can consider the trained language model as a ready-to-use multi-task solver. The idea is that in order to become good at the language modeling task, the model has to acquire a wide range of NLP-related skills.

For example, if the text corpus contains a lot of examples of German-to-English translations like

"Scholz packt das an" is the slogan, “Scholz will sort it”.

then the model, in order to do well on the next word prediction task in these examples, has to effectively learn to translate from German to English.

Or consider question answering. In many cases, correctly guessing the next word is only possible if one knows certain things about the objects in question. To predict the next word in the sequence

The current capital of Brazil, since its construction in 1960, is [Brasilia]

you need to know that Brasilia is the capital of Brazil. Hence the language model is forced to learn and store knowledge about the world in its parameters.

It must also learn to draw logical and common sense conclusions:

She wanted to go to China or Japan. For China she couldn’t get a visa, so she went to [Japan]

He had to be at the airport in 90min. Since his car was still in the shop and there were no busses going to the airport, he called a [taxi]

Prompt Engineering

The question is how to use these theoretical capabilities for practical purposes. Recall that the only thing GPT-3 can do is predict the next word for a given sequence of words. So we must cast any task we want it to solve in the form of language modeling. The input sequence we ask the model to continue is called the prompt. Hence the challenge lies in designing prompts that make the model behave as we want it to.

As we have seen above, there is a good chance it will complete the prompt “What is the capital of Brazil? - It is …” with “Brasilia” - since it should have acquired a semantic understanding of how the next word must relate to Brazil, and it knows a word exactly with this relation.

It’s similarly easy to design a prompt for the translation task: Given a German sentence $$G$$ that you want to translate to English you can use the prompt “The German sentence $$G$$ translates to English as …”.

In fact, it has been shown that GPT-3 works best if you describe the task you want it to perform and give some examples of the task being accomplished. So assume we already have three German phrases $$G_i$$ with their English translations $$E_i$$. Then the prompt can look as follows:

“Translate German to English: $$G_1$$ -> $$E_1$$, $$G_2$$ -> $$E_2$$, $$G_3$$ -> $$E_3$$, $$G$$ ->”.

Translate German to English:

Seeotter -> sea otter

Pfefferminze -> peppermint

Hängebrücke -> suspension bridge

Käse ->
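Programmatically, assembling such a few-shot prompt is just string concatenation. A minimal sketch (the function name is illustrative, not part of any official API):

```python
# Build a few-shot prompt: a task description, example pairs, then the query
# the model is asked to complete.
def few_shot_prompt(task_description, examples, query):
    lines = [task_description]
    lines += [f"{src} -> {tgt}" for src, tgt in examples]
    lines.append(f"{query} ->")  # the model continues from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate German to English:",
    [("Seeotter", "sea otter"),
     ("Pfefferminze", "peppermint"),
     ("Hängebrücke", "suspension bridge")],
    "Käse",
)
print(prompt)
```

Feeding this string to the model, a good continuation of the pattern is the English translation of the final query word.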

 GPT-3 produces deceptively real newspaper articles.

Another, and probably the most discussed, capability of GPT-3 concerns text generation. Using a prompt as an input that we interpret as the beginning of a text, we can make GPT-3 finish the text by predicting the rest of it word by word. If we want to specify the way in which a text is supposed to be written, we can do that by adding a natural language instruction, or, apparently yielding better results, by providing examples of the type of the desired generated text. OpenAI describe an interesting case in their GPT-3 paper in which they want GPT-3 to produce news article-styled texts:

The dataset used to train GPT-3 is much less weighted towards news articles [than towards Tweets from Twitter], so [...] GPT-3 often interprets the proposed first sentence of a “news article” as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem we employed GPT-3’s few-shot learning abilities by providing three previous news articles in the model’s context to condition it. With the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the “news” genre.
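The same few-shot pattern applies here, just with whole articles as the examples. A sketch with hypothetical placeholders (OpenAI's actual conditioning articles are not public):

```python
# Condition the model on the "news" genre by prepending complete example
# articles, then the title/subtitle of the article it should write.
def news_prompt(example_articles, title, subtitle):
    parts = list(example_articles)
    parts.append(f"Title: {title}\nSubtitle: {subtitle}\n")
    return "\n\n".join(parts)

prompt = news_prompt(
    ["<full example article 1>",
     "<full example article 2>",
     "<full example article 3>"],
    "United Methodists Agree to Historic Split",
    "Those who oppose gay marriage will form their own denomination",
)
```

The placeholders would be replaced by real, complete news articles; their only job is to pull the model's predictions toward the news genre instead of, say, tweets.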

2. Too powerful to be published?

As we have just seen, in order to use GPT-3 for NLP purposes other than language modeling one needs to come up with a prompt telling the model what to do. This is, in fact, not a trivial task. Some hold that prompt design is a new form of coding or call it “software 3.0”. Even though that might be an exaggeration, it seems that the versatility of GPT-3 comes at the cost of having to “learn its language” to make it perform well.

One of the few things that are clear about prompt design is that GPT-3 benefits from receiving many examples of the task at hand within the prompt (as we have seen above with the translation and news generation tasks). You might wonder whether this amounts to some kind of fine-tuning. But keep in mind that:

  • One needs far fewer examples than for actual fine-tuning; often a few suffice (even if more usually increase performance).

  • Fine-tuning adapts a model's parameters; providing examples within the prompt does not. In particular, the hardware requirements are much lower.

GPT-3’s zero- and few-shot capabilities (where “zero” and “few” refer to the number of examples provided in the prompt) are impressive, but not state-of-the-art for all NLP tasks: in some cases specialized (e.g. fine-tuned) models still perform far better.

One area where it has raised the bar significantly, though, is natural language generation. It is able to produce “newspaper articles” that trick readers into believing they were written by real journalists. The following is an article written by GPT-3 which, according to OpenAI, test readers could not identify as machine-written:

Given the prompt

Title: United Methodists Agree to Historic Split

Subtitle: Those who oppose gay marriage will form their own denomination


GPT-3 produced the following output:

After two days of intense debate, the United Methodist Church has agreed to a historic split - one that is expected to end in the creation of a new denomination, one that will be "theologically and socially conservative," according to The Washington Post. The majority of delegates attending the church's annual General Conference in May voted to strengthen a ban on the ordination of LGBTQ clergy and to write new rules that will "discipline" clergy who officiate at same-sex weddings. But those who opposed these measures have a new plan: They say they will form a separate denomination by 2020, calling their church the Christian Methodist denomination. The Post notes that the denomination, which claims 12.5 million members, was in the early 20th century the "largest Protestant denomination in the U.S.," but that it has been shrinking in recent decades. The new split will be the second in the church's history. The first occurred in 1968, when roughly 10 percent of the denomination left to form the Evangelical United Brethren Church. The Post notes that the proposed split "comes at a critical time for the church, which has been losing members for years," which has been "pushed toward the brink of a schism over the role of LGBTQ people in the church." Gay marriage is not the only issue that has divided the church. In 2016, the denomination was split over ordination of transgender clergy, with the North Pacific regional conference voting to ban them from serving as clergy, and the South Pacific regional conference voting to allow them.

GPT-3’s creators at OpenAI fear that this capability is prone to abuse: people or organizations with bad intentions could flood the internet (e.g. social networks, or specially created “news” websites) with fake news articles, making the task of finding real information among them akin to finding a needle in a haystack.

These concerns are also (part of) the official reason why OpenAI did not release the trained model files of GPT-3. Instead, it can be accessed only via a commercial API (for which there is a - probably very long - waiting list).

3. Trying out GPT-3

As stated above, GPT-3 is not freely accessible. One alternative is EleutherAI's GPT-Neo 2.7B, a down-scaled replica of GPT-3. Thanks to HuggingFace's transformer library it is really easy to use it in Python:

from transformers import pipeline

# Load the model (~10 GB download on first use)
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B')

prompt = "[YOUR PROMPT]"
# Sample a continuation of at least 50 tokens
output = generator(prompt, do_sample=True, min_length=50)
print(output[0]['generated_text'])

Downloading the model might take a while, though, since it is about 10 GB in size. If you don't want to download it, if your hardware doesn't allow you to run it locally, or if inference takes too long, you can also use the web inference API.

EleutherAI has trained an even bigger model, GPT-J 6B, which performs very similarly to the original GPT-3. It is also hosted by HuggingFace and comes with a web API to test the model. Due to its size (22.5 GB), if you seriously want to use it you should do so on a (GPU) server with sufficient RAM.

That's it for now. In the follow-up article I will point out for which tasks and in which ways GPT-3 still fails and what techniques future models could make use of to surpass GPT-3. Stay tuned!