How do babies learn words? An AI experiment may hold clues

An artificial intelligence model learned words from audio and video of a baby

Babies are prodigious language learners. After being fed the sights and words that a baby encountered, an artificial intelligence model picked up its first words. Vera Livchak/Moment/Getty Images

The AI program was way less cute than a real baby. But like a baby, it learned its first words by seeing objects and hearing words.

After being fed dozens of hours of video of a growing tot exploring his world, an artificial intelligence model could more often than not associate words — ball, cat and car, among others — with their images, researchers report in the Feb. 2 Science. This AI feat, the team says, offers a new window into the mysterious ways that humans learn words (SN: 4/5/17).

Some theories of language learning hold that humans are born with specialized knowledge that allows us to soak up words, says Evan Kidd, a psycholinguist at the Australian National University in Canberra who was not involved in the study. The new work, he says, is “an elegant demonstration of how infants may not necessarily need a lot of in-built specialized cognitive mechanisms to begin the process of word learning.”

The new model keeps things simple, and small — a departure from many of the large language models, or LLMs, that underlie today’s chatbots. Those models learned to talk from enormous pools of data. “These AI systems we have now work remarkably well, but require astronomical amounts of data, sometimes trillions of words to train on,” says computational cognitive scientist Wai Keen Vong, of New York University.

But that’s not how humans learn words. “The input to a child isn’t the entire internet like some of these LLMs. It’s their parents and what’s being provided to them,” Vong says. Vong and his colleagues intentionally built a more realistic model of language learning, one that relies on just a sliver of data. The question is, “Can [the model] learn language from that kind of input?”

To narrow the inputs down from the entirety of the internet, Vong and his colleagues trained an AI program with the actual experiences of a real child, an Australian baby named Sam. A head-mounted video camera recorded what Sam saw, along with the words he heard, as he grew and learned English from 6 months of age to just over 2 years.

Videos taken from a baby named Sam (shown wearing a head-mounted camera) served as the sight and sound input for an AI program. Today, Sam is a happy tween. Courtesy of Sam’s dad

The researchers’ AI program — a type called a neural network — used about 60 hours of Sam’s recorded experiences, connecting objects in Sam’s videos to the words he heard caregivers speak as he saw them. From this data, which represented only about 1 percent of Sam’s waking hours, the model would then “learn” how closely aligned the images and spoken words were.
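The study’s exact architecture isn’t described here, but the alignment step can be sketched, very roughly, as a generic contrastive setup: turn each video frame and each heard utterance into a vector, score how similar the two vectors are, and nudge the scores so that frames and words that occurred together end up more similar than mismatched pairs. In the toy Python sketch below, the encoders are skipped entirely, and the 512-dimensional embeddings, the temperature value and the InfoNCE-style loss are illustrative assumptions, not the published model.

# Toy sketch of scoring "how closely aligned the images and spoken words were."
# The encoders, embedding size, temperature and loss here are illustrative
# stand-ins, not the authors' exact model.
import torch
import torch.nn.functional as F

def alignment_scores(image_embeddings, word_embeddings):
    """Cosine-similarity matrix: entry [i, j] says how well frame i matches utterance j."""
    img = F.normalize(image_embeddings, dim=-1)
    txt = F.normalize(word_embeddings, dim=-1)
    return img @ txt.T

def contrastive_loss(image_embeddings, word_embeddings, temperature=0.07):
    """Pull co-occurring frame/utterance pairs together, push mismatched pairs apart."""
    logits = alignment_scores(image_embeddings, word_embeddings) / temperature
    targets = torch.arange(logits.shape[0])  # frame i is paired with utterance i
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Random stand-in embeddings; a real model would produce these from a vision
# encoder (video frames) and a language encoder (transcribed caregiver speech).
frames = torch.randn(8, 512)      # 8 video frames -> 512-dimensional embeddings
utterances = torch.randn(8, 512)  # the 8 utterances heard alongside those frames
print(contrastive_loss(frames, utterances).item())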

As this process happened iteratively, the model was able to pick up some key words. Vong and his team tested their model with a task similar to a lab test used to find out which words babies know. The researchers gave the model a word — crib, for instance. Then the model was asked to find the picture that contained a crib from a group of four pictures. The model landed on the right answer about 62 percent of the time. Random guessing would have yielded correct answers 25 percent of the time.
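The four-picture test itself is simple to mimic. In this toy sketch, random vectors stand in for a trained model’s word and image embeddings; the function name, the 512-dimensional vectors and the cosine-similarity scoring are assumptions made for the example.

# Toy version of the four-picture test: given a word, pick which of four
# candidate images best matches it. Random placeholder embeddings are used here.
import torch
import torch.nn.functional as F

def pick_image(word_embedding, candidate_image_embeddings):
    """Return the index of the candidate image whose embedding best matches the word."""
    word = F.normalize(word_embedding, dim=-1)
    imgs = F.normalize(candidate_image_embeddings, dim=-1)
    similarities = imgs @ word           # one cosine-similarity score per candidate
    return int(similarities.argmax())

word = torch.randn(512)              # stand-in embedding for a word such as "crib"
candidates = torch.randn(4, 512)     # four candidate images, only one containing a crib
print(pick_image(word, candidates))  # with random embeddings this is right ~25% of the
                                     # time (chance level); the trained model scored ~62%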

To see how well an AI program learned words from video and audio input, researchers used a test like this one, built from video stills of everyday objects such as ball, crib, tree and apple. From each set of four images, the model had to identify the one image that contained a specific object. In multiple tests of a set of 22 words, the model chose the right object more than 60 percent of the time. Wai Keen Vong

“What they’ve shown is, if you can make these associations between the language you hear and the context, then you can get off the ground when it comes to word learning,” Kidd says. Of course, the results can’t say whether children learn words in a similar way, he says. “You have to think of [the results] as existence proofs, that this is a possibility of how children might learn language.”

The model made some mistakes. The word hand proved to be tricky. Most of the training images involving hand were taken at the beach, leaving the model to confuse hand with sand.

Kids get tangled up with new words, too (SN: 11/20/17). A common mistake is overgeneralizing, Kidd says, calling all adult men “Daddy,” for instance. “It would be interesting to know if [the model] made the kinds of errors that children make, because then you know it’s on the right track,” he says.

Verbs might also pose problems, particularly for an AI system that doesn’t have a body. The dataset’s visuals for running, for instance, come from Sam running, Vong says. “From the camera’s perspective, it’s just shaking up and down a lot.”

The researchers are now feeding even more audio and video data to their model.  “There should be more efforts to understand what makes humans so efficient when it comes to learning language,” Vong says.

Laura Sanders is the neuroscience writer. She holds a Ph.D. in molecular biology from the University of Southern California.