AI learned how to sway humans by watching a cooperative cooking game
Offline reinforcement learning could teach bots to collaborate, or manipulate
If you’ve ever cooked a complex meal with someone, you know the level of coordination required. Someone dices this, someone sautés that, as you dance around holding knives and hot pans. Meanwhile, you might wordlessly nudge each other, placing ingredients or implements within the other’s reach when you’d like something done.
How might a robot handle this type of interaction?
Research presented in late 2023 at the Neural Information Processing Systems, or NeurIPS, conference in New Orleans offers some clues. It found that in a simple virtual kitchen, AI can learn how to influence a human collaborator just by watching humans work together.
In the future, humans will increasingly collaborate with artificial intelligence, both online and in the physical world. And sometimes we’ll want an AI to silently guide our choices and strategies, like a good teammate who knows our weaknesses. “The paper addresses a crucial and pertinent problem,” namely how AI can learn to influence people, says Stefanos Nikolaidis, who directs the Interactive and Collaborative Autonomous Robotic Systems (ICAROS) lab at the University of Southern California in Los Angeles and was not involved in the work.
The new work introduces a way for AI to learn to collaborate with humans, without even practicing with us. It could help us improve human-AI interactions, Nikolaidis says, and detect when AI might take advantage of us — whether humans have programmed it to do so, or, someday, it decides to do so on its own.
Learning by watching
There are a few ways researchers have already trained AI to influence people. Many approaches involve what’s called reinforcement learning (RL), in which an AI interacts with an environment — which can include other AIs or humans — and is rewarded for making sequences of decisions that lead to desired outcomes. DeepMind’s program AlphaGo, for example, learned the board game Go using RL.
But training a clueless AI from scratch to interact with people through sheer trial and error can waste a lot of human hours, and can even present risks if there are, say, knives involved (as there might be in a real kitchen). Another option is to train one AI to model human behavior, then use it as a tireless human substitute for another AI to learn to interact with. Researchers have used this method in, for example, a simple game that involved entrusting a partner with monetary units. But realistically replicating human behavior in more complex scenarios, such as a kitchen, can be difficult.
The new research, from a group at the University of California, Berkeley, used what’s called offline reinforcement learning. Offline RL is a method for developing strategies by analyzing previously documented behavior rather than through real-time interaction. Previously, offline RL had been used mostly to help virtual robots move or to help AIs solve mazes, but here it was applied to the tricky problem of influencing human collaborators. Instead of learning by interacting with people, this AI learned by watching human interactions.
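In code, the core of offline RL looks roughly like the sketch below, which estimates the value of actions from a fixed log of past transitions rather than from live play. This is a minimal illustration under assumed names and a simple tabular setup, not the study’s actual algorithm.

```python
# Minimal sketch of the offline RL idea: learn which actions lead to high
# scores purely from a fixed log of past transitions, with no new interaction.
# The tabular setup and names are illustrative, not the study's implementation.
from collections import defaultdict

def offline_q_learning(dataset, actions, gamma=0.95, lr=0.1, epochs=50):
    """dataset: list of (state, action, reward, next_state, done) tuples
    logged from human-human play."""
    q = defaultdict(float)  # (state, action) -> estimated future score
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            # Bootstrap from the best-valued action in the next state
            target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += lr * (target - q[(s, a)])
    return q

def greedy_action(q, state, actions):
    # The learned policy simply picks the highest-value action in each state.
    return max(actions, key=lambda a: q[(state, a)])
```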
Humans already have a modicum of competence at collaboration. So two people working together can demonstrate competent collaboration with far less data than would be needed if one person had to interact with an AI that had never interacted with anyone before.
Making soup
In the study, the UC Berkeley researchers used a video game called Overcooked, where two chefs divvy up tasks to prepare and serve meals, in this case soup, which earns them points. It’s a 2-D world, seen from above, filled with onions, tomatoes, dishes and a stove with pots. At each time step, each virtual chef can stand still, interact with whatever is in front of it, or move up, down, left or right.
The researchers first collected data from pairs of people playing the game. Then they trained AIs using offline RL or one of three other methods for comparison. (In all methods, the AIs were built on a neural network, a software architecture intended to roughly mimic how the brain works.) In one method, the AI just imitated the humans. In another, it imitated only the best human performances. The third method ignored the human data and had AIs practice with each other. The fourth was offline RL, in which the AI does more than just imitate; it pieces together the best bits of what it sees, allowing it to outperform the behavior it observes. It uses a kind of counterfactual reasoning, predicting what score it would have gotten had it followed different paths in certain situations, then adapting.
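To see the gap between imitation and offline RL, consider this illustrative sketch (again, not the paper’s code): plain imitation, or behavior cloning, copies whatever action the humans most often took in each situation, regardless of how well it worked, while the offline RL sketch above scores actions by the outcomes they led to and so can recombine the best pieces of different games.

```python
# Illustrative behavior cloning (not the paper's code): in each observed game
# state, just do what the human players most often did there.
from collections import defaultdict, Counter

def behavior_cloning(dataset):
    """dataset: list of (state, action, ...) tuples from human-human play."""
    counts = defaultdict(Counter)
    for s, a, *_ in dataset:
        counts[s][a] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# Offline RL instead asks "what score would each action have led to?", so it can
# stitch together, say, "set a dish down here" from one game with "pick up a
# nearby dish" from another, even if no human ever chained the two together.
```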
The AIs played two versions of the game. In the “human-deliver” version, the team earned double points if the soup was delivered by the human partner. In the “tomato-bonus” version, soup made with tomato and no onion earned double points. After training, the chefbots played with real people. The scoring rules used during training and evaluation differed from those in place when the initial human data were collected, so the AIs had to extract general principles to score higher. Crucially, the humans in the evaluation didn’t know these rules, such as the no-onion bonus, so the AIs had to nudge them.
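As a toy illustration, the two bonus rules could be encoded in a reward function like the one below; the base point value and field names are assumptions made for clarity, not the study’s actual scoring code.

```python
# Toy encoding of the two game variants' bonus rules; the base value and field
# names are illustrative assumptions, not the study's actual scoring code.
def soup_reward(ingredients, delivered_by, variant, base=20):
    if variant == "human-deliver" and delivered_by == "human":
        return base * 2
    if variant == "tomato-bonus" and "tomato" in ingredients and "onion" not in ingredients:
        return base * 2
    return base
```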
On the human-deliver game, training using offline RL led to an average score of 220, about 50 percent more points than the best comparison methods. On the tomato-bonus game, it led to an average score of 165, or about double the points. To support the hypothesis that the AI had learned to influence people, the paper described how when the bot wanted the human to deliver the soup, it would place a dish on the counter near the human. In the human-human data, the researchers found no instances of one person passing a plate to another in this fashion. But there were events where someone put down a dish and ones where someone picked up a dish, and the AI could have seen value in stitching these acts together.
Nudging human behavior
The researchers also developed a method for the AI to infer, and then influence, humans’ underlying strategies across cooking steps, not just their immediate actions. In real life, if you know that your cooking partner is slow to peel carrots, you might jump on that role each time until your partner stops going for the carrots. The researchers modified the neural network to consider not only the current game state but also a history of the partner’s actions, which offers clues about what the partner’s current strategy is.
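In neural-network terms, that modification might look like the sketch below, which assumes a PyTorch-style model rather than the paper’s exact architecture: a recurrent layer summarizes the partner’s recent actions, and that summary is fed into the policy alongside the current game state.

```python
# Minimal sketch (assumed PyTorch-style model, not the paper's exact architecture):
# condition the policy on the current game state plus a short history of the
# partner's actions, so it can infer the partner's current strategy.
import torch
import torch.nn as nn

class StrategyAwarePolicy(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        # Summarizes a sequence of one-hot partner actions into one vector
        self.history_encoder = nn.GRU(n_actions, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),  # a score for each of the bot's moves
        )

    def forward(self, state, partner_history):
        # state: (batch, state_dim); partner_history: (batch, history_len, n_actions)
        _, h = self.history_encoder(partner_history)
        strategy_summary = h[-1]  # inferred "what is my partner up to?" vector
        return self.head(torch.cat([state, strategy_summary], dim=-1))
```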
Again, the team collected human-human data. Then they trained AIs using either this strategy-inferring offline RL architecture or the previous one. When tested with human partners, inferring the partner’s strategy improved scores by roughly 50 percent on average. In the tomato-bonus game, for example, the bot learned to repeatedly block the onions until people eventually left them alone. That the AI worked so well with humans was surprising, says study coauthor Joey Hong, a computer scientist at UC Berkeley.
“Avoiding the use of a human model is great,” says Rohan Paleja, a computer scientist at MIT Lincoln Laboratory in Lexington, Mass., who was not involved in the work. “It makes this approach applicable to a lot of real-world problems that do not currently have accurate simulated humans.” He also said the system is data-efficient; it achieved its abilities after watching only 20 human-human games (each 1,200 steps long).
Nikolaidis sees potential for the method to enhance AI-human collaboration. But he wishes that the authors had better documented the observed behaviors in the training data and exactly how the new method changed people’s behaviors to improve scores.
For better or worse
In the future, we may be working with AI partners in kitchens, warehouses, operating rooms, battlefields and purely digital domains like writing, research and travel planning. (We already use AI tools for some of these tasks.) “This type of approach could be helpful in supporting people to reach their goals when they don’t know the best way to do this,” says Emma Brunskill, a computer scientist at Stanford University who was not involved in the work. She proposes that an AI could observe data from fitness apps and learn to better nudge people to meet New Year’s exercise resolutions through notifications (SN: 3/8/17). The method might also learn to get people to increase charitable donations, Hong says.
On the other hand, AI influence has a darker side. “Online recommender systems can, for example, try to have us buy more, or watch more TV,” Brunskill says, “not just for this moment, but also to shape us into being people who buy more or watch more.”
Previous work, which was not about human-AI collaboration, has shown how RL can help recommender systems manipulate users’ preferences so that those preferences would be more predictable and satisfiable, even if people didn’t want their preferences shifted. And even if AI means to help, it may do so in ways we don’t like, according to Micah Carroll, a computer scientist at UC Berkeley who works with one of the paper authors. For instance, the strategy of blocking a co-chef’s path could be seen as a form of coercion. “We, as a field, have yet to integrate ways for a person to communicate to a system what types of influence they are OK with,” he says. “For example, ‘I’m OK with an AI trying to argue for a specific strategy, but not forcing me to do it if I don’t want to.’”
Hong is currently looking to use his approach to improve chatbots (SN: 2/1/24). The large language models behind interfaces such as ChatGPT typically aren’t trained to carry out multi-turn conversations. “A lot of times when you ask a GPT to do something, it gives you a best guess of what it thinks you want,” he says. “It won’t ask for clarification to understand your true intent and make its answers more personalized.”
Learning to influence and help people in a conversation seems like a realistic near-term application. “Overcooked,” he says, with its two dimensions and limited menu, “is not really going to help us make better chefs.”