In 2007, DNA pioneer James Watson became the first person to have his entire genome sequenced—making all of his 6 billion base pairs publicly available for research. Well, almost all of them. He left one spot blank, on the long arm of chromosome 19, where a gene called APOE lives. Certain variations in APOE increase your chances of developing Alzheimer’s, and Watson wanted to keep that information private.
Except it wasn’t. Researchers quickly pointed out that Watson’s APOE variant could be predicted from signatures in the surrounding DNA. They didn’t actually do it, but database managers wasted no time in redacting another two million base pairs surrounding the APOE gene.
This is the dilemma at the heart of precision medicine: It requires people to give up some of their privacy in service of the greater scientific good. To completely eliminate the risk of outing an individual based on their DNA records, you’d have to strip the records of the same identifying details that make them scientifically useful. But now, computer scientists and mathematicians are working toward an alternative solution. Instead of stripping genomic data, they’re encrypting it.
Gill Bejerano leads a developmental biology lab at Stanford that investigates the genetic roots of human disease. In 2013, when he realized he needed more genomic data, his lab joined Stanford Hospital’s Pediatrics Department, an arduous process that required extensive vetting of his equipment and training for all his staff. This is how most institutions solve the privacy perils of data sharing. They limit access to all the genomes in their possession to a trusted few, and only share obfuscated summary statistics more widely.
So when Bejerano found himself sitting in on a faculty talk given by Dan Boneh, head of the applied cryptography group at Stanford, he was struck with an idea. He scribbled down a mathematical formula for one of the genetic computations he uses often in his work. Afterward, he approached Boneh and showed it to him. “Could you compute these outputs without knowing the inputs?” he asked. “Sure,” said Boneh.
Last week, Bejerano and Boneh published a paper in Science that did just that. Using a cryptographic “genome cloaking” method, the scientists were able to do things like identify the responsible mutations in groups of patients with rare diseases and compare groups of patients at two medical centers to find shared mutations associated with shared symptoms, all while keeping 97 percent of each participant’s unique genetic information completely hidden. They accomplished this by converting the variations in each genome into a linear series of values, which let them conduct any analyses they needed while revealing only the genes relevant to that particular investigation.
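Genome cloaking belongs to the family of secure multiparty computation techniques. As a loose illustration of the underlying principle, and not the paper’s actual protocol, here is a toy additive secret-sharing sketch in which two hospitals learn a combined variant-carrier count without either side ever seeing the other’s per-patient bits (all names are hypothetical):

```python
import secrets

# Toy additive secret sharing over a prime field. Each hospital holds a
# 0/1 vector marking which of its patients carry a candidate variant.
# The goal: reveal only the total carrier count, never individual bits.
P = 2**61 - 1  # a Mersenne prime serving as the modulus

def share(value):
    """Split a value into two random shares that sum to it mod P."""
    r = secrets.randbelow(P)
    return r, (value - r) % P

def count_carriers(vec_a, vec_b):
    # Each party splits every bit into two shares; in a real deployment
    # the shares would be distributed to separate, non-colluding servers.
    shares_a = [share(v) for v in vec_a]
    shares_b = [share(v) for v in vec_b]
    # Each server sums only the shares it holds -- individually, these
    # sums are uniformly random and leak nothing about any patient.
    side1 = sum(s[0] for s in shares_a + shares_b) % P
    side2 = sum(s[1] for s in shares_a + shares_b) % P
    # Only the recombined aggregate is ever revealed.
    return (side1 + side2) % P

hospital_a = [1, 0, 1, 1]  # 3 carriers
hospital_b = [0, 1, 0]     # 1 carrier
print(count_carriers(hospital_a, hospital_b))  # -> 4
```

The key property is that each share on its own is indistinguishable from random noise; only the final aggregate, the one number the analysis actually needs, is decoded.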
“Just like programs have bugs, people have bugs,” says Bejerano. Finding disease-causing genetic traits is a lot like spotting flaws in computer code. You have to compare code that works to code that doesn’t. But genetic data is much more sensitive, and people (rightly) worry that it might be used against them by insurers, or even stolen by hackers. If a patient held the cryptographic key to their data, they could get a valuable medical diagnosis while not exposing the rest of their genome to outside threats. “You can make rules about not discriminating on the basis of genetics, or you can provide technology where you can’t discriminate against people even if you wanted to,” says Bejerano. “That’s a much stronger statement.”
The National Institutes of Health has been working toward such a technology since reidentification researchers first began connecting the dots in “anonymous” genomics data. In 2010, the agency founded a national center for Integrating Data for Analysis, Anonymization and Sharing (iDash), housed on the campus of UC San Diego. And since 2015, iDash has been funding annual competitions to develop privacy-preserving genomics protocols. Another promising approach iDash has supported is something called fully homomorphic encryption, which allows users to run any computation they want on totally encrypted data without losing years of computing time.
Kristin Lauter, head of cryptography research at Microsoft, focuses on this form of encryption, and her team has taken home the iDash prize two years running. Critically, the method encodes the data in such a way that scientists don’t lose the flexibility to perform medically useful genetic tests. Unlike previous encryption schemes, Lauter’s tool preserves the underlying mathematical structure of the data. That allows computers to do the math that delivers genetic diagnoses, for example, on totally encrypted data. Scientists get a key to decode the final results, but they never see the source.
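“Preserving mathematical structure” has a precise meaning here: an operation on ciphertexts corresponds to an operation on the hidden plaintexts. A minimal sketch using the classic Paillier cryptosystem, which is only additively homomorphic, not the fully homomorphic schemes Lauter’s team works on, and which uses toy-sized keys here, shows the idea:

```python
import math
import secrets

# Toy Paillier cryptosystem: multiplying two ciphertexts yields an
# encryption of the SUM of the plaintexts, so an untrusted server can
# total encrypted genotype counts it can never read.
# Fixed small primes for illustration only -- real keys are 2048+ bits.
p, q = 1_000_003, 1_000_033
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)   # private key
mu = pow(lam, -1, n)           # precomputed decryption factor

def encrypt(m):
    r = secrets.randbelow(n)
    while math.gcd(r, n) != 1:  # r must be invertible mod n
        r = secrets.randbelow(n)
    return pow(1 + n, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    L = (pow(c, lam, n2) - 1) // n
    return L * mu % n

# Homomorphic addition: multiply ciphertexts, decrypt the sum.
c = encrypt(17) * encrypt(25) % n2
print(decrypt(c))  # -> 42
```

Fully homomorphic schemes extend this to both addition and multiplication, which is enough to evaluate arbitrary computations, at the cost of far heavier machinery; the one-operation Paillier version is just the simplest structure-preserving example.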
This is extra important as more and more genetic data moves off local servers and into the cloud. The NIH lets users download human genomic data from its repositories, and in 2014, the agency started letting people store and analyze that data in private or commercial cloud environments. But under NIH’s policy, it’s the scientists using the data, not the cloud service provider, who are responsible for ensuring its security. Cloud providers can get hacked or subpoenaed by law enforcement, and researchers have no control over either. That is, unless there’s a viable way to encrypt the data while it sits in the cloud.
“If we don’t think about it now, in five to 10 years a lot of people’s genomic information will be used in ways they did not intend,” says Lauter. But encryption is a funny technology to work with, she says. One that requires building trust between researchers and consumers. “You can propose any crazy encryption you want and say it’s secure. Why should anyone believe you?”
That’s where federal review comes in. In July, Lauter’s group, along with researchers from IBM and academic institutions around the world, launched a process to standardize homomorphic encryption protocols. The National Institute of Standards and Technology will now begin reviewing draft standards and collecting public comments. If all goes well, genomics researchers and privacy advocates might finally have something they can agree on.