How AI can identify people even in anonymized datasets

Weekly social interactions form unique signatures that make people stand out

Wearing a mask might keep you anonymous in a crowd. But in anonymized mobile phone databases, AI can still find you based on patterns in your social interactions.

leminuit/iStock Unreleased/Getty Images Plus

By Nikk Ogasa

January 25, 2022 at 11:04 am

How you interact with a crowd may help you stick out from it, at least to artificial intelligence.

When fed information about a target individual’s mobile phone interactions, as well as their contacts’ interactions, AI can correctly pick the target out of more than 40,000 anonymous mobile phone service subscribers more than half the time, researchers report January 25 in Nature Communications. The findings suggest humans socialize in ways that could be used to pick them out of datasets that are supposedly anonymized.

It’s no surprise that people tend to remain within established social circles and that these regular interactions form a stable pattern over time, says Jaideep Srivastava, a computer scientist from the University of Minnesota in Minneapolis who was not involved in the study. “But the fact that you can use that pattern to identify the individual, that part is surprising.”

According to the European Union’s General Data Protection Regulation and the California Consumer Privacy Act, companies that collect information about people’s daily interactions can share or sell this data without users’ consent. The catch is that the data must be anonymized. Some organizations might assume that they can meet this standard by giving users pseudonyms, says Yves-Alexandre de Montjoye, a computational privacy researcher at Imperial College London. “Our results are showing that this is not true.”

de Montjoye and his colleagues hypothesized that people’s social behavior could be used to pick them out of datasets containing information on anonymous users’ interactions. To test their hypothesis, the researchers taught an artificial neural network, an AI loosely modeled on the neural circuitry of a biological brain, to recognize patterns in users’ weekly social interactions.

For one test, the researchers trained the neural network with data from an unidentified mobile phone service that detailed 43,606 subscribers’ interactions over 14 weeks. This data included each interaction’s date, time, duration, type (call or text), the pseudonyms of the involved parties and who initiated the communication.
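For readers who think in code, one record in such a dataset might look roughly like the sketch below. The class name, field names and the sample values are all hypothetical, since the article does not describe the operator’s actual schema; the only things taken from the description are the kinds of fields involved.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Interaction:
    """One call or text record. All field names are hypothetical."""
    initiator: str    # pseudonym of the party who started the communication
    recipient: str    # pseudonym of the other party
    start: datetime   # date and time of the interaction
    duration_s: int   # duration in seconds (assumed to be 0 for a text)
    kind: str         # "call" or "text"


def week_index(record, dataset_start):
    """Return the zero-based week a record falls in, counted from dataset_start."""
    return (record.start - dataset_start).days // 7


# Made-up sample record, for illustration only.
example = Interaction("user_a1", "user_b2", datetime(2020, 1, 9, 18, 30), 120, "call")
```

Grouping records by week, as `week_index` does, is one plausible way to get at the weekly patterns the researchers studied.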

Each user’s interaction data were organized into web-shaped data structures consisting of nodes representing the user and their contacts. Strings threaded with interaction data connected the nodes. The AI was shown the interaction web of a known person and then set loose to search the anonymized data for the web that bore the closest resemblance.
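The matching idea can be sketched in a few lines of code. This is not the researchers’ method: they trained a neural network, while the sketch below swaps in a hand-built fingerprint (the sorted number of interactions per contact) and cosine similarity, and it uses only the target’s own interactions rather than those of their contacts. Function names, the record layout and the sample data are all assumptions made for illustration.

```python
from math import sqrt


def build_web(user, records):
    """Build one user's interaction web: contact -> edge statistics.

    `records` holds (initiator, recipient, kind, duration_s) tuples,
    a hypothetical simplification of the real call and text records.
    """
    web = {}
    for initiator, recipient, kind, duration_s in records:
        if user not in (initiator, recipient):
            continue
        contact = recipient if initiator == user else initiator
        edge = web.setdefault(contact, {"calls": 0, "texts": 0, "seconds": 0})
        if kind == "call":
            edge["calls"] += 1
            edge["seconds"] += duration_s
        else:
            edge["texts"] += 1
    return web


def fingerprint(web, size=5):
    """Summarize a web without pseudonyms: top interaction counts, zero-padded."""
    totals = sorted((e["calls"] + e["texts"] for e in web.values()), reverse=True)
    return totals[:size] + [0] * (size - len(totals))


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_match(target_web, anonymized_webs):
    """Return the pseudonym whose web most resembles the target's web."""
    target = fingerprint(target_web)
    return max(anonymized_webs,
               key=lambda p: cosine(target, fingerprint(anonymized_webs[p])))


# Made-up data: a known person, plus an anonymized dataset with two users.
known = [("alice", "bob", "call", 60)] * 5 + [("carol", "alice", "text", 0)]
anonymized = (
    [("p1", "p7", "call", 45)] * 4 + [("p9", "p1", "text", 0)]
    + [("p2", "p3", "call", 30), ("p2", "p4", "call", 30), ("p5", "p2", "call", 30)]
)
webs = {p: build_web(p, anonymized) for p in ("p1", "p2")}
match = best_match(build_web("alice", known), webs)
```

Because pseudonyms differ between the known data and the anonymized data, the fingerprint deliberately ignores contact names and compares only the shape of the web: one heavily used contact and one occasional one, in this toy case.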

When shown interaction webs built only from a target’s own phone interactions, recorded one week after the latest records in the anonymous dataset, the neural network linked just 14.7 percent of individuals to their anonymized selves. But it identified 52.4 percent of people when given not just the target’s interactions but also those of their contacts. When the researchers provided the AI with the target’s and contacts’ interaction data collected 20 weeks after the anonymous dataset, the AI still correctly identified users 24.3 percent of the time, suggesting social behavior remains identifiable for long periods.

To see whether the AI could profile social behavior elsewhere, the researchers tested it on a dataset consisting of four weeks of close-proximity data from the mobile phones of 587 anonymous university students, collected by researchers in Copenhagen. The data included students’ pseudonyms, encounter times and the strength of the received signal, which indicates proximity to other students. These metrics are often collected by COVID-19 contact tracing applications. Given a target and their contacts’ interaction data, the AI correctly identified students in the dataset 26.4 percent of the time.

The findings, the researchers note, probably don’t apply to the contact tracing protocols of Google and Apple’s Exposure Notification system, which protects users’ privacy by encrypting all Bluetooth metadata and banning the collection of location data.

de Montjoye says he hopes the research will help policy makers improve strategies to protect users’ identities. Data protection laws allow the sharing of anonymized data to support useful research, he says. “However, what’s essential for this to work is to make sure anonymization actually protects the privacy of individuals.”