Released two years ago, OpenAI's remarkably capable, if flawed, GPT-3 was perhaps the first to demonstrate that AI can write convincingly -- if not perfectly -- like a human. The successor to GPT-3, most likely called GPT-4, is expected to be unveiled in the near future, perhaps as soon as 2023. But in the meantime, OpenAI has quietly rolled out a series of AI models based on "GPT-3.5," a previously unannounced, improved version of GPT-3.
GPT-3.5 broke cover on Wednesday with ChatGPT, a fine-tuned version of GPT-3.5 that's essentially a general-purpose chatbot. Debuted in a public demo yesterday afternoon, ChatGPT can engage with a range of topics, including programming, TV scripts and scientific concepts.
According to OpenAI, GPT-3.5 was trained on a blend of text and code published prior to Q4 2021. Like GPT-3 and other text-generating AI, GPT-3.5 learned the relationships between sentences, words and parts of words by ingesting huge amounts of content from the web, including hundreds of thousands of Wikipedia entries, social media posts and news articles.
Rather than release the fully trained GPT-3.5, OpenAI used it to create several systems fine-tuned for specific tasks. One -- text-davinci-003 -- can handle more complex instructions than models built on GPT-3, according to the lab, and is measurably better at long-form and "high-quality" writing.
According to OpenAI data scientist Jan Leike, text-davinci-003 is similar but not identical to InstructGPT, a family of GPT-3-based models released by OpenAI earlier this year that are less likely to generate problematic (e.g., toxic and highly biased) text while more closely aligning with a user's intent. Text-davinci-003 -- and by extension GPT-3.5 -- "scores higher on human preference ratings" while suffering from "less severe" limitations, Leike said in a tweet.
Anecdotally, that appears to be the case. Data scientists at Pepper Content, a content marketing platform, report that text-davinci-003 "performs better in understanding the 'context' behind a request and then using that to produce better content" and "hallucinates" less than GPT-3-based models. (Where it concerns text-generating AI, "hallucination" refers to an AI writing inconsistent, often factually incorrect statements.)
In a test on OpenAI's Playground website, which provides a UI frontend for the models, the Pepper Content team fed several prompts to text-davinci-003 and a model based on GPT-3 (text-davinci-002). Given "What is the philosophy behind WeWork?," the GPT-3.5-based text-davinci-003 generated this:
It's not perfect -- note the excess commas and repetitiveness, for one. But the copy's certainly more engaging than what the GPT-3-based text-davinci-002 produced:
GPT-3.5 is also better at generating blog posts, it seems. Here's what the Pepper Content team got when they prompted text-davinci-003 to generate a post about picking a sofa:
Again, it's not perfect. GPT-3.5 oddly added the bit about the "green living room." But again, GPT-3 is more basic (and less grammatical) in its generation:
Experiments beyond Pepper Content's suggest that GPT-3.5 tends to be much more sophisticated and thorough in its responses than GPT-3.
For example, when YouTube channel All About AI prompted text-davinci-003 to write a history of AI, the model's output mentioned key luminaries in the field, including Alan Turing and Arthur Samuelson, while text-davinci-002's did not. All About AI also found that text-davinci-003 tends to have a more nuanced understanding of instructions, for instance providing details such as a title, description, outline, introduction and recap when asked to create a video script.
A hallmark feature of text-davinci-003/GPT-3.5's generations is wordiness, as it turns out. In an analysis, scientists at startup Scale AI found text-davinci-003/GPT-3.5 generates outputs roughly 65% longer than text-davinci-002/GPT-3 under identical prompts.
Perhaps less useful but nonetheless entertaining, text-davinci-003/GPT-3.5 is better at composing songs, limericks and rhyming poetry than its predecessor. Ars Technica reports that commenters on Y Combinator's Hacker News forum used text-davinci-003 to write a poem explaining Albert Einstein's theory of relativity and then re-write the poem in the style of John Keats. See:
The Scale AI team even found that text-davinci-003/GPT-3.5 has a notion of meters like iambic pentameter. See:
Relatedly, text-davinci-003/GPT-3.5 is wittier -- at least subjectively. Asking text-davinci-002/GPT-3 to "tell a joke" usually yields this:
Text-davinci-003/GPT-3.5 has cleverer responses:
Scale AI had it explain Python code as Eminem, a feat which text-davinci-002/GPT-3 couldn't accomplish:
So why is GPT-3.5 so much better than GPT-3 in these particular areas? We can't know the exact answer without additional details from OpenAI, which aren't forthcoming; an OpenAI spokesperson declined our request for comment. But it's safe to assume that GPT-3.5's training approach had something to do with it. Like InstructGPT, GPT-3.5 was trained with the help of human trainers who ranked and rated the way early versions of the model responded to prompts. This information was then fed back into the system, which tuned its answers to match the trainers' preferences.
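To make the ranking step concrete, here is a minimal Python sketch of how trainer rankings can be turned into a training signal. It is an illustration only, not OpenAI's code: it assumes a pairwise preference loss of the kind described in the InstructGPT paper, and the function names and example scores are hypothetical.

```python
import math
from itertools import combinations


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_preferred - score_rejected)).

    The loss is small when the scoring model already rates the
    trainer-preferred response higher, and large when it does not.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def ranking_loss(scores_best_to_worst: list[float]) -> float:
    """Average the pairwise loss over every pair implied by one ranking.

    `scores_best_to_worst` holds the model's scores for several responses
    to the same prompt, ordered from the trainer's favorite to least favorite.
    """
    # combinations() keeps input order, so `better` always outranks `worse`.
    pairs = list(combinations(scores_best_to_worst, 2))
    return sum(pairwise_loss(better, worse) for better, worse in pairs) / len(pairs)


# Hypothetical scores: the first list agrees with the trainer's ranking,
# the second list has it backwards.
agrees = ranking_loss([2.0, 1.0, 0.0])
disagrees = ranking_loss([0.0, 1.0, 2.0])
print(f"agrees: {agrees:.3f}, disagrees: {disagrees:.3f}")
```

A scoring model that agrees with the trainers gets a lower loss than one that disagrees, so minimizing this quantity pushes scores toward the trainers' preferences. In a setup like InstructGPT's, a score of this kind then guides further tuning of the language model itself.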
Of course, this training approach doesn't make GPT-3.5 immune to the pitfalls to which all language models eventually succumb. Because GPT-3.5 merely relies on statistical regularities in its training data rather than a human-like understanding of the world, it's still prone to, in Leike's words, "mak[ing] stuff up a bunch." It also has limited knowledge of the world after 2021 because the training data is more sparse after that year. And its safeguards against toxic output can be straightforwardly circumvented.
Still, GPT-3.5 and its derivative models demonstrate that GPT-4 -- whenever it arrives -- won't necessarily need a huge number of parameters to best the most capable text-generating systems today. (Parameters are the parts of the model learned from historical training data and essentially define the skill of the model on a problem.) While some have predicted that GPT-4 will contain over 100 trillion parameters -- nearly 600 times as many as GPT-3 -- others argue that emerging techniques in language processing, like those seen in GPT-3.5 and InstructGPT, will make that enormous jump unnecessary.
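The "nearly 600 times" figure is easy to check. The short Python sketch below assumes GPT-3's publicly reported size of 175 billion parameters, a number that does not appear in this article, and uses the rumored 100 trillion figure for GPT-4; the names are illustrative.

```python
GPT3_PARAMETERS = 175_000_000_000              # assumption: publicly reported GPT-3 size
RUMORED_GPT4_PARAMETERS = 100_000_000_000_000  # the prediction cited above


def parameter_ratio(larger: int, smaller: int) -> float:
    """Return how many times bigger the larger model is."""
    return larger / smaller


ratio = parameter_ratio(RUMORED_GPT4_PARAMETERS, GPT3_PARAMETERS)
print(f"{ratio:.0f}x")  # prints "571x"
```

That works out to roughly 571 times GPT-3's size, in line with the "nearly 600 times" estimate.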
One of those techniques could involve browsing the web for greater context, a la Meta's ill-fated BlenderBot 3.0 chatbot. John Schulman, a research scientist and co-founder of OpenAI, told MIT Technology Review that OpenAI is continuing work on a language model it announced late last year, WebGPT, that can go and look up information on the web (via Bing) and give sources for its answers. At least one Twitter user appears to have found evidence of the feature undergoing testing for ChatGPT.
OpenAI has another reason to pursue lower-parameter models as it continues to evolve GPT-3: huge costs. A 2020 study from AI21 Labs pegged the expenses for developing a text-generating model with only 1.5 billion parameters at as much as $1.6 million. OpenAI has raised over $1 billion to date from Microsoft and other backers, and it's reportedly in talks to raise more. But all investors, no matter how big, expect to see returns eventually.