The Benefits of Training Tiny Language Models on Children's Stories
Category: Computer Science. Saturday, October 7, 2023, 20:33 UTC.
Two Microsoft researchers have introduced a novel technique for training small language models: raising them on a strict diet of children's stories. Their paper, posted to arxiv.org, shows how language models thousands of times smaller than current state-of-the-art systems can rapidly learn to tell consistent and grammatical stories, hinting at new ways of training large models and understanding their behavior.
Learning English is no easy task, as countless students well know. But when the student is a computer, one approach works surprisingly well: Simply feed mountains of text from the internet to a giant mathematical model called a neural network. That’s the operating principle behind generative language models like OpenAI’s ChatGPT, whose ability to converse coherently (if not always truthfully) on a wide range of topics has surprised researchers and the public over the past year.
But the approach has its drawbacks. For one thing, the "training" procedure required to transmute vast text archives into state-of-the-art language models is costly and time-intensive. For another, even the people who train large language models find it hard to understand their inner workings; that, in turn, makes it hard to predict the many ways they can fail.
Faced with these difficulties, some researchers have opted to train smaller models on smaller data sets and then study their behavior. "It’s like sequencing the Drosophila genome versus sequencing the human genome," said Ellie Pavlick, a language model researcher at Brown University.
Now, in a paper recently posted to the scientific preprint server arxiv.org, a pair of Microsoft researchers have introduced a new method for training tiny language models: Raise them on a strict diet of children’s stories. That runs counter to the prevailing practice in machine learning, where researchers have embraced scale. GPT-3.5, the large language model that powers the ChatGPT interface, has nearly 200 billion parameters, and it was trained on a data set comprising hundreds of billions of words. (OpenAI hasn’t released the corresponding figures for its successor, GPT-4.) Training such large models typically requires at least 1,000 specialized processors called GPUs running in parallel for weeks at a time. Only a few companies can muster the requisite resources, let alone train and compare different models.
The two researchers showed that language models thousands of times smaller than today’s state-of-the-art systems rapidly learned to tell consistent and grammatical stories when trained in this way. Their results hint at new research directions that might be helpful for training larger models and understanding their behavior.
"I found this paper very informative," said Chandra Bhagavatula, a language model researcher at the Allen Institute for Artificial Intelligence in Seattle. "The concept itself is super interesting."
Once Upon a Time
The neural networks at the heart of language models are mathematical structures loosely inspired by the human brain. Each one contains many artificial neurons arranged in layers, with connections between neurons in adjacent layers. The neural network’s behavior is governed by the strength of these connections, called parameters. In a language model, the parameters control which words the model might spit out next, given an initial prompt and the words it has generated already.
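To make the role of parameters concrete, here is a minimal Python sketch, not taken from the paper. It assumes a toy four-word vocabulary and a hand-written table of scores (both hypothetical), and it looks only at the single previous word, whereas a real language model conditions on the whole prompt through many layers. The point is simply that the numbers in the table determine which word is likely to come next.

```python
import math

# Hypothetical toy vocabulary, chosen only for illustration.
VOCAB = ["once", "upon", "a", "time"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(params, prev_word):
    """Look up the row of parameters for prev_word and normalize it."""
    row = params[VOCAB.index(prev_word)]
    return dict(zip(VOCAB, softmax(row)))

# params[i][j] is the score for word j following word i.
# These values are made up; training is what would normally set them.
params = [
    [0.0, 3.0, 0.0, 0.0],  # after "once", favor "upon"
    [0.0, 0.0, 3.0, 0.0],  # after "upon", favor "a"
    [0.0, 0.0, 0.0, 3.0],  # after "a", favor "time"
    [3.0, 0.0, 0.0, 0.0],  # after "time", favor "once"
]

probs = next_word_probs(params, "once")
most_likely = max(probs, key=probs.get)
print(most_likely, probs)
```

Changing any entry in `params` shifts the probabilities, which is the sense in which the parameters control what the model might say next.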
A model only truly comes to life during training, when it repeatedly compares its own output to the text in its training data set and adjusts its parameters to increase the resemblance. An untrained network with random parameters will always produce nonsensical gibberish.
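The training loop described above can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the researchers' method: the training text, the previous-word-only model, the learning rate and the number of passes are all hypothetical choices. It starts from random parameters, measures how poorly the model predicts each next word in the text, and nudges the parameters to improve the match.

```python
import math
import random

# Hypothetical miniature "data set"; real models train on vastly more text.
TEXT = "once upon a time there was a cat . the cat was happy .".split()
VOCAB = sorted(set(TEXT))
INDEX = {word: i for i, word in enumerate(VOCAB)}
# Each training example is (previous word, actual next word).
PAIRS = [(INDEX[a], INDEX[b]) for a, b in zip(TEXT, TEXT[1:])]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_loss(params):
    """Cross-entropy: low when the model gives the true next word high probability."""
    return -sum(math.log(softmax(params[a])[b]) for a, b in PAIRS) / len(PAIRS)

def train_step(params, learning_rate=0.5):
    """One pass over the text, adjusting parameters after each comparison."""
    for a, b in PAIRS:
        probs = softmax(params[a])
        for j in range(len(VOCAB)):
            # Gradient of cross-entropy with respect to each score.
            grad = probs[j] - (1.0 if j == b else 0.0)
            params[a][j] -= learning_rate * grad

random.seed(0)
# Untrained model: small random parameters, so predictions are near-uniform.
params = [[random.uniform(-0.1, 0.1) for _ in VOCAB] for _ in VOCAB]

loss_before = average_loss(params)
for _ in range(50):
    train_step(params)
loss_after = average_loss(params)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The loss falls as the parameters come to reflect the word patterns in the text. It does not reach zero here because some words, such as "cat", are followed by different words at different points in the text.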