The Cambridge Analytica Data Apocalypse Was Predicted in 2007
In the early 2000s, Alex Pentland was running the wearable computing group at the MIT Media Lab—the place where the ideas behind augmented reality and Fitbit-style fitness trackers got their start. Back then, it was still mostly folks wearing computers in satchels and cameras on their heads. “They were basically cell phones, except we had to solder it together ourselves,” Pentland says. But the hardware wasn’t the important part. The way the devices interacted was. “You scale that up and you realize, holy crap, we’ll be able to see everybody on Earth all the time,” he says—where they went, who they knew, what they bought.
And so by the middle of the decade, when massive social networks like Facebook were taking off, Pentland and his fellow social scientists were beginning to look at network and cell phone data to see how epidemics spread, how friends relate to each other, and how political alliances form. “We’d accidentally invented a particle accelerator for understanding human behavior,” says David Lazer, a data-oriented political scientist then at Harvard. “It became apparent to me that everything was changing in terms of understanding human behavior.” In late 2007 Lazer put together a conference entitled “Computational Social Science,” along with Pentland and other leaders in analyzing what people today call big data.
In early 2009 the attendees of that conference published a statement of principles in the prestigious journal Science. In light of the role of social scientists in the Facebook-Cambridge Analytica debacle—slurping up data on online behavior from millions of users, figuring out the personalities and predilections of those users, and nominally using that knowledge to influence elections—that article turns out to be prescient.
“These vast, emerging data sets on how people interact surely offer qualitatively new perspectives on collective human behavior,” the researchers wrote. But, they added, this emerging understanding came with risks. “Perhaps the thorniest challenges exist on the data side, with respect to access and privacy,” the paper said. “Because a single dramatic incident involving a breach of privacy could produce rules and statutes that stifle the nascent field of computational social science, a self-regulatory regime of procedures, technologies, and rules is needed that reduces this risk but preserves research potential.”
Oh. You don’t say?
Possibly even more disturbing than the idea that Cambridge Analytica tried to steal an election—something lots of people say probably isn’t possible—is the role of scientists in facilitating the ethical breakdowns behind it. When Zeynep Tufekci argues that what Facebook does with people’s personal data is so pervasive and arcane that people can’t possibly give informed consent to it, she’s employing the language of science and medicine. Scientists are supposed to have acquired, through painful experience, the knowledge of how to treat human subjects in their research. Because it can go terribly wrong.
Here’s what’s worse: The scientists warned us about big data and corporate surveillance. They tried to warn themselves.
In big data and computation, the social sciences saw a chance to grow up. “Most of the things we think we know about humanity are based on pitifully little data, and as a consequence they’re not strong science,” says Pentland, an author of the 2009 paper. “It’s all stories and heuristics.” But data and computational social science promised to change that. It’s what science always hopes for—not merely to quantify the now but to calculate what’s to come. Scientists can do it for stars and DNA and electrons; people have been more elusive.
Then they’d take the next quantum leap. Observation and prediction, if you get really good at them, lead to the ability to act upon the system and bring it to heel. It’s the same progression that leads from understanding heritability to sequencing DNA to genome editing, or from Newton to Einstein to GPS. That was the promise of Cambridge Analytica: to use computational social science to influence behavior. Cambridge Analytica said it could do it. It apparently cheated to get the data. And the catastrophe that the authors of that 2009 paper warned of has come to pass.
Pentland puts it more pithily: “We called it.”
The 2009 paper recommends that researchers be better trained, both in big-data methods and in the ethics of handling such data. It suggests that the infrastructure of science, like granting agencies and institutional review boards, should get stronger at handling new demands, because data spills and difficulties in anonymizing bulk data were already starting to slow progress.
Historically, when some group recommends self-regulation and new standards, it’s because that group is worried someone else is about to do it for them—usually a government. In this case, though, the scientists were worried, they wrote, about Google, Yahoo, and the National Security Agency. “Computational social science could become the exclusive domain of private companies and government agencies. Alternatively, there might emerge a privileged set of academic researchers presiding over private data from which they produce papers that cannot be critiqued or replicated,” they wrote. Only strong rules for collaborations between industry and academia would allow access to the numbers the scientists wanted but also protect consumers and users.
“Even when we were working on that paper we recognized that with great power comes great responsibility, and any technology is a dual-use technology,” says Nicholas Christakis, head of the Human Nature Lab at Yale, one of the participants in the conference, and a co-author of the paper. “Nuclear power is a dual-use technology. It can be weaponized.”
Welp. “It is sort of what we anticipated, that there would be a Three Mile Island moment around data sharing that would rock the research community,” Lazer says. “The reality is, academia did not build an infrastructure. Our call for getting our house in order? I’d say it has been inadequately addressed.”
Cambridge Analytica’s scientific foundation—as reporting from The Guardian has shown—seems to mostly derive from the work of Michal Kosinski, a psychologist now at the Stanford Graduate School of Business, and David Stillwell, deputy director of the Psychometrics Centre at Cambridge Judge Business School (though neither worked for Cambridge Analytica or affiliated companies). In 2013, when they were both working at Cambridge, Kosinski and Stillwell were co-authors on a big study that attempted to connect the language people used in their Facebook status updates with the so-called Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, and neuroticism). They’d gotten permission from Facebook users to ingest status updates via a personality quiz app.
Along with another researcher, Kosinski and Stillwell also used a related dataset to, they said, determine personal traits like sexual orientation, religion, politics, and other personal stuff using nothing but Facebook Likes.
Supposedly it was this idea—that you could derive highly detailed personality information from social media interactions and personality tests—that led another social science researcher, Aleksandr Kogan, to develop a similar approach via an app, get access to even more Facebook user data, and then hand it all to Cambridge Analytica. (Kogan denies any wrongdoing and has said in interviews that he is just a scapegoat.)
But take a beat here. That initial Kosinski paper is worth a look. It asserts that Likes enable a machine learning algorithm to predict attributes like intelligence. The best predictors of intelligence, according to the paper? They include thunderstorms, The Colbert Report, science, and … curly fries. Low intelligence: Sephora, ‘I love being a mom,’ Harley Davidson, and Lady Antebellum. The paper looked at sexuality, too, finding that male homosexuality was well predicted by liking the No H8 campaign, Mac cosmetics, and the musical Wicked. Strong predictors of male heterosexuality? Wu-Tang Clan, Shaq, and ‘being confused after waking up from naps.’
Ahem. If that feels like you might have been able to guess any of those things without a fancy algorithm, well, the authors acknowledge the possibility. “Although some of the Likes clearly relate to their predicted attribute, as in the case of No H8 Campaign and homosexuality,” the paper concludes, “other pairs are more elusive; there is no obvious connection between Curly Fries and high intelligence.”
Kosinski and his colleagues went on, in 2017, to make more explicit the leap from prediction to control. In a paper titled “Psychological Targeting as an Effective Approach to Digital Mass Persuasion,” they exposed people with specific personality traits—extraverted or introverted, high openness or low openness—to advertisements for cosmetics and a crossword puzzle game tailored to those traits. (An aside for my nerds: Likes for “Stargate” and “computers” predicted introversion, but Kosinski and colleagues acknowledged that a potential weakness is that Likes could change in significance over time. “Liking the fantasy show Game of Thrones might have been highly predictive of introversion in 2011,” they wrote, “but its growing popularity might have made it less predictive over time as its audience became more mainstream.”)
Now, clicking on an ad doesn’t necessarily show that you can change someone’s political choices. But Kosinski says political ads would be even more potent. “In the context of academic research, we cannot use any political messages, because it would not be ethical,” says Kosinski. “The assumption is that the same effects can be observed in political messages.” But it’s true that his team saw more responses to tailored ads than mistargeted ads. (To be clear, this is what Cambridge Analytica said it could do, but Kosinski wasn’t working with the company.)
Reasonable people could disagree. As for the 2013 paper, “all it shows is that algorithmic predictions of Big 5 traits are about as accurate as human predictions, which is to say only about 50 percent accurate,” says Duncan Watts, a sociologist at Microsoft Research and one of the inventors of computational social science. “If all you had to do to change someone’s opinion was guess their openness or political attitude, then even really noisy predictions might be worrying at scale. But predicting attributes is much easier than persuading people.”
Watts says that the 2017 paper didn’t convince him the technique could work, either. The results barely improve click-through rates, he says—a far cry from predicting political behavior. And more than that, Kosinski’s mistargeted openness ads—that is, the ads tailored for the opposite personality characteristic—far outperformed the targeted extraversion ads. Watts says that suggests other, uncontrolled factors are having unknown effects. “So again,” he says, “I would question how meaningful these effects are in practice.”
To the extent a company like Cambridge Analytica says it can use similar techniques for political advantage, Watts says that seems “shady,” and he’s not the only one who thinks so. “On the psychographic stuff, I haven’t seen any science that really aligns with their claims,” Lazer says. “There’s just enough there to make it plausible and point to a citation here or there.”
Kosinski disagrees. “They’re going against an entire industry,” he says. “There are billions of dollars spent every year on marketing. Of course a lot of it is wasted, but those people are not morons. They don’t spend money on Facebook ads and Google ads just to throw it away.”
Even if trait-based persuasion doesn’t work as Kosinski and his colleagues hypothesize and Cambridge Analytica claimed, the troubling part is that another trained researcher—Kogan—allegedly delivered data and similar research ideas to the company. In a press release posted on the Cambridge Analytica website on Friday, the acting CEO and former chief data officer of the company denied wrongdoing and insisted that the company deleted all the data they were supposed to according to Facebook’s changing rules. And as for the data that Kogan allegedly brought in via his company GSR, he wrote, Cambridge Analytica “did not use any GSR data in the work we did in the 2016 US presidential election.”
Either way, the overall idea of using human behavioral science to sell ads and products without oversight is still the core of Facebook’s business model. “Clearly these methods are being used currently. But those aren’t examples of the methods being used to understand human behavior,” Lazer says. “They’re not trying to create insights but to use methods out of the academy to optimize corporate objectives.”
Lazer is being circumspect; let me put that a different way: They are trying to use science to manipulate you into buying things.
So maybe Cambridge Analytica wasn’t the Three Mile Island of computational social science. But that doesn’t mean it isn’t a signal, a ping on the Geiger counter. It shows people are trying.
Facebook knows that the social scientists have tools the company can use. Late in 2017, a Facebook blog post admitted that maybe people were getting a little messed up by all the time they spend on social media. “We also worry about spending too much time on our phones when we should be paying attention to our families,” wrote David Ginsberg, Facebook’s director of research, and Moira Burke, a Facebook research scientist. “One of the ways we combat our inner struggles is with research.” And with that they laid out a short summary of existing work, and name-checked a bunch of social scientists with whom the company is collaborating. This, it strikes me, is a little bit like a member of Congress caught in a bribery sting insisting he was conducting his own investigation. It’s also, of course, exactly what the social scientists warned of a decade ago.
But those social scientists, it turns out, worry a lot less about Facebook Likes than they do about phone calls and overnight deliveries. “Everybody talks about Google and Facebook, but the things that people say online are not nearly as predictive as, say, what your telephone company knows about you. Or your credit card company,” Pentland says. “Fortunately telephone companies, banks, things like that are very highly regulated companies. So we have a fair amount of time. It may never happen that the data gets loose.”
Here, Kosinski agrees. “If you use data more intrusive than Facebook Likes, like credit card records, if you use methods better than just posting an ad on someone’s Facebook wall, if you spend more money and resources, if you do a lot of A-B testing,” he says, “of course you would boost the efficiency.” Using Facebook Likes is the kind of thing an academic does, Kosinski says. If you really want to nudge a network of humans, he recommends buying credit card records.
Kosinski also suggests hiring someone slicker than Cambridge Analytica. “If people say Cambridge Analytica won the election for Trump, it probably helped, but if he had hired a better company, the efficiency would be even higher,” he says.
That’s why social scientists are still worried. They worry about someone taking that quantum leap to persuasion and succeeding. “I spent quite some time and quite some effort reporting what Dr. Kogan was doing, to the head of the department and legal teams at the university, and later to press like the Guardian, so I’m probably more offended than average by the methods,” Kosinski says. “But the bottom line is, essentially they could have achieved the same goal without breaking any rules. It probably would have taken more time and cost more money.”
Pentland says the next frontier is microtargeting, when political campaigns and extremist groups sock-puppet social media accounts to make it seem like an entire community is spontaneously espousing similar beliefs. “That sort of persuasion, from people you think are like you having what appears to be a free opinion, is enormously effective,” Pentland says. “Advertising, you can ignore. Having people you think are like you have the same opinion is how fads, bubbles, and panics start.” For now it’s only working on edge cases, if at all. But next time? Or the time after that? Well, they did try to warn us.