The Benefits of Training Tiny Language Models on Children's Stories

Category Computer Science

tldr #

Two Microsoft researchers have introduced a novel technique for training small language models - by raising them on a strict diet of children's stories. Their paper, posted to arxiv.org, shows how language models thousands of times smaller than current state-of-the-art systems can rapidly learn to tell consistent and grammatical stories, hinting at new ways of training large models and understanding their behavior.


content #

Learning English is no easy task, as countless students well know. But when the student is a computer, one approach works surprisingly well: Simply feed mountains of text from the internet to a giant mathematical model called a neural network. That’s the operating principle behind generative language models like OpenAI’s ChatGPT, whose ability to converse coherently (if not always truthfully) on a wide range of topics has surprised researchers and the public over the past year.

Reading children’s stories can enable the training of more generalizable models, since the content of children's stories tends to be contextualized in a way that is simpler for AI models to interpret.

But the approach has its drawbacks. For one thing, the "training" procedure required to transmute vast text archives into state-of-the-art language models is costly and time-intensive. For another, even the people who train large language models find it hard to understand their inner workings; that, in turn, makes it hard to predict the many ways they can fail.

Faced with these difficulties, some researchers have opted to train smaller models on smaller data sets and then study their behavior. "It’s like sequencing the Drosophila genome versus sequencing the human genome," said Ellie Pavlick, a language model researcher at Brown University.

The original idea for training language models on children’s stories was first suggested in 2019 by a three-person team of researchers at Microsoft Research.

Now, in a paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories. Machine learning researchers have long embraced the lesson that bigger is better: GPT-3.5, the large language model that powers the ChatGPT interface, has nearly 200 billion parameters, and it was trained on a data set comprising hundreds of billions of words. (OpenAI hasn’t released the corresponding figures for its successor, GPT-4.) Training such large models typically requires at least 1,000 specialized processors called GPUs running in parallel for weeks at a time. Only a few companies can muster the requisite resources, let alone train and compare different models.
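For a rough sense of that scale, the back-of-envelope sketch below estimates the memory needed just to store the weights; the 16-bit weight format and 80 GB of GPU memory are illustrative assumptions, not published details of GPT-3.5 or its training hardware.

```python
# Rough, illustrative arithmetic: why a model with ~200 billion parameters
# needs a fleet of GPUs. The figures below are assumptions, not published specs.
params = 200e9          # parameter count cited for GPT-3.5 in the article
bytes_per_param = 2     # 16-bit floating point, a common format for model weights
gpu_memory_gb = 80      # memory of one high-end data-center GPU (assumed)

weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB just to hold the weights")
print(f"~{weights_gb / gpu_memory_gb:.0f} GPUs at a bare minimum, before gradients,")
print("optimizer state, or throughput for weeks of training enter the picture")
```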

The neural networks used in language models are loosely inspired by the human brain and are composed of many artificial neurons arranged in layers.

The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way. Their results hint at new research directions that might be helpful for training larger models and understanding their behavior.

"I found this paper very informative," said Chandra Bhagavatula, a language model researcher at the Allen Institute for Artificial Intelligence in Seattle. "The concept itself is super interesting." .


Once Upon a Time

The neural networks at the heart of language models are mathematical structures loosely inspired by the human brain. Each one contains many artificial neurons arranged in layers, with connections between neurons in adjacent layers. The neural network’s behavior is governed by the strength of these connections, called parameters. In a language model, the parameters control which words the model might spit out next, given an initial prompt and the words it has generated already.
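As a toy illustration of that description, here is a minimal sketch in Python (using PyTorch) of a next-word predictor; the vocabulary, layer sizes, and the crude averaging of word vectors are made-up simplifications for illustration, not the architecture of the models in the paper.

```python
# A minimal sketch (not the authors' model) of how a language model's parameters
# turn a prompt into a preference over possible next words. Assumes PyTorch;
# the vocabulary and layer sizes below are invented for illustration.
import torch
import torch.nn as nn

vocab = ["once", "upon", "a", "time", "there", "was", "princess", "."]
word_to_id = {w: i for i, w in enumerate(vocab)}

class TinyLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super().__init__()
        # Every weight below is a "parameter" in the article's sense.
        self.embed = nn.Embedding(vocab_size, embed_dim)  # words -> vectors
        self.hidden = nn.Linear(embed_dim, hidden_dim)    # one layer of artificial neurons
        self.out = nn.Linear(hidden_dim, vocab_size)      # a score for each possible next word

    def forward(self, token_ids):
        # Average the prompt's word vectors (a crude stand-in for real architectures).
        x = self.embed(token_ids).mean(dim=0)
        h = torch.tanh(self.hidden(x))
        return self.out(h)  # unnormalized scores ("logits") over the vocabulary

model = TinyLM(len(vocab))
prompt = torch.tensor([word_to_id[w] for w in ["once", "upon", "a"]])
probs = torch.softmax(model(prompt), dim=-1)
print(vocab[int(probs.argmax())])  # the untrained model's essentially random guess at the next word
```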

Training tiny language models on children's stories can be orders of magnitude faster and cheaper than training larger AI language models.

A model only truly comes to life during training, when it repeatedly compares its own output to the text in its training data set and adjusts its parameters to increase the resemblance. An untrained network with random parameters will always produce nonsensical gibberish.
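Continuing the toy sketch above (and reusing its TinyLM, vocab and word_to_id), the loop below shows that idea in miniature: each step compares the model's guess with the word that actually comes next in a training sentence and nudges the parameters to make that word more likely. The story text, optimizer, learning rate and epoch count are arbitrary illustrative choices, far removed from a real training run.

```python
import torch
import torch.nn as nn

# One tiny "training set": (context, next word) pairs from a single toy sentence.
story = "once upon a time there was a princess .".split()
data = [(story[:i], story[i]) for i in range(1, len(story))]

model = TinyLM(len(vocab))                                 # from the previous sketch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()   # how far the prediction is from the real next word

for epoch in range(200):
    for context, target in data:
        logits = model(torch.tensor([word_to_id[w] for w in context]))
        loss = loss_fn(logits.unsqueeze(0), torch.tensor([word_to_id[target]]))
        optimizer.zero_grad()
        loss.backward()       # work out how each parameter should change
        optimizer.step()      # adjust the parameters to increase the resemblance
```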

