Sam's View: How a Baby's Experience is Teaching AI to Learn Language

Category Artificial Intelligence

tldr #

A new study used footage from a baby's life to train an AI to learn language like a child. The AI was able to grasp basic concepts with just a small portion of the child's experiences. This approach is different from large language models and could shed light on how children rapidly acquire language and concepts.


content #

Sam was six months old when a lightweight camera was first strapped onto his forehead. For the next year and a half, the camera captured snippets of his life. He crawled around the family’s pets, watched his parents cook, and cried on the front porch with grandma. All the while, the camera recorded everything he heard.

What sounds like a collection of cute toddler home videos is actually part of a daring experiment: Can AI learn language the way a child does? The results could also reveal how children rapidly acquire language and concepts at an early age.

A new study in Science describes how researchers used Sam’s recordings to train an AI to understand language. With just a tiny portion of one child’s experiences gathered over a year and a half, the AI was able to grasp basic concepts—for example, a ball, a butterfly, or a bucket.

The AI, called Child’s View for Contrastive Learning (CVCL), roughly mimics how we learn as toddlers by matching sight to audio. It’s a very different approach from the one taken by large language models like the ones behind ChatGPT or Bard. These models’ uncanny ability to craft essays, poetry, or even podcast scripts has thrilled the world. But they need to digest trillions of words from a wide variety of news articles, screenplays, and books to develop these skills.

Kids, by contrast, learn with far less input and rapidly generalize their learnings as they grow. Scientists have long wondered if AI can capture these abilities with everyday experiences alone.

"We show, for the first time, that a neural network trained on this developmentally realistic input from a single child can learn to link words to their visual counterparts," study author Dr. Wai Keen Vong at NYU’s Center for Data Science said in a press release about the research.

Child’s Play

Children easily soak up words and their meanings from everyday experience. At just six months old, they begin to connect words to what they’re seeing—for example, a round bouncy thing is a "ball." By two years of age, they know roughly 300 words and their concepts.

Scientists have long debated how this happens. One theory says kids learn to match what they’re seeing to what they’re hearing. Another suggests language learning requires a broader experience of the world, such as social interaction and the ability to reason.

It's hard to tease these ideas apart with traditional cognitive tests in toddlers. But we may get an answer by training an AI through the eyes and ears of a child.

M3GAN?

The new study tapped a rich video resource called SAYCam, which includes data collected from three kids between 6 and 32 months old using GoPro-like cameras strapped to their foreheads.

Twice every week, the cameras recorded around an hour of footage and audio as the children nursed, crawled, and played. All audible dialogue was transcribed into "utterances"—words or sentences spoken before the speaker or conversation changes. The result is a wealth of multimedia data from the perspective of babies and toddlers.
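
To make that data format concrete, here is a rough Python sketch of how co-occurring frames and utterances could be paired into training examples. The `Utterance` and `Frame` layouts and the one-second window are assumptions for illustration, not the study’s actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str        # transcribed words, e.g. "look at the ball"
    start_s: float   # seconds into the recording when it was spoken

@dataclass
class Frame:
    path: str        # image file extracted from the head-camera video
    time_s: float    # seconds into the recording

def pair_frames_with_utterances(frames, utterances, window_s=1.0):
    """Attach to each utterance the frames captured within `window_s`
    seconds of when it was spoken, yielding (image, text) training
    pairs from the child's point of view."""
    pairs = []
    for utt in utterances:
        nearby = [f for f in frames if abs(f.time_s - utt.start_s) <= window_s]
        pairs.extend((f.path, utt.text) for f in nearby)
    return pairs
    # e.g. pairs == [("frame_0123.jpg", "look at the ball"), ...]
```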

For the new system, the team designed two neural networks with a "judge" to coordinate them. One translated first-person visuals into the whos and whats of a scene—is it a mom cooking? Someone throwing a Frisbee? The other turned the utterances the child heard at that moment into a comparable representation.

The judge assesses how well each image matches each utterance, so both networks learn to become "multimodal"—the name for AI that can link sound and vision to words. The result is a system that connects the words a child hears to what the child is looking at from their own point of view.
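
The "judge" works like a contrastive objective: a frame and the utterance heard alongside it should score higher than any mismatched frame/utterance pair. Below is a minimal, CLIP-style sketch in PyTorch; the tiny encoders, embedding size, and temperature are illustrative stand-ins, not the architecture from the paper—only the overall idea of pulling matching pairs together and pushing mismatched ones apart follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVisionEncoder(nn.Module):
    """Stand-in for the vision network: maps an image to a unit-length embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, images):              # images: (B, 3, H, W)
        return F.normalize(self.net(images), dim=-1)

class TinyTextEncoder(nn.Module):
    """Stand-in for the language network: averages word embeddings."""
    def __init__(self, vocab_size=5000, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.proj = nn.Linear(dim, dim)

    def forward(self, token_ids):           # token_ids: (B, L), 0 = padding
        mask = (token_ids != 0).float().unsqueeze(-1)
        mean = (self.embed(token_ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
        return F.normalize(self.proj(mean), dim=-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """The 'judge': matching frame/utterance pairs (the diagonal) should
    score higher than every mismatched pair in the batch."""
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) similarities
    targets = torch.arange(len(img_emb))              # true matches on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# One illustrative training step on a random batch.
vision, text = TinyVisionEncoder(), TinyTextEncoder()
images = torch.randn(8, 3, 64, 64)                    # 8 head-camera frames
tokens = torch.randint(1, 5000, (8, 6))               # 8 transcribed utterances
loss = contrastive_loss(vision(images), text(tokens))
loss.backward()
```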

The team tested the model on frames and words held out from Sam’s recordings and found that the AI matched spoken words to the correct images well above chance.
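
One way such a test can be scored: for each trial the model sees one utterance and several candidate images, and it gets credit when the true referent ranks highest; random guessing lands at one over the number of candidates. The trial layout and helper names in this sketch are hypothetical, and it reuses the illustrative encoders above.

```python
import torch

def referent_accuracy(model_text, model_vision, trials):
    """Each trial pairs one utterance (token tensor) with a stack of
    candidate images; by convention here, index 0 is the true referent
    and the rest are distractors. Returns the fraction of trials where
    the model ranks the true referent highest."""
    correct = 0
    with torch.no_grad():
        for tokens, candidate_images in trials:
            txt = model_text(tokens.unsqueeze(0))[0]   # (dim,)
            imgs = model_vision(candidate_images)      # (K, dim)
            correct += int((imgs @ txt).argmax() == 0)
    return correct / len(trials)
# With K candidates per trial, chance accuracy is 1/K, so scores well
# above 1/K indicate the model has linked words to their visual referents.
```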

