The threat of data re-identification is real. Just ask Netflix. Today, the streaming giant’s powers of personalization inspire worship in product and data circles, but back in 2007, when it was still developing its recommendation engine, it resorted to asking the public for help.
That year’s so-called Netflix Prize offered up $1 million to whoever could devise the best collaborative filtering algorithm. But two researchers at the University of Texas–Austin took the challenge in a different direction.
Netflix released a selection of user data for the competition, but stripped away the obvious personal identifiers, swapping subscriber names for anonymous numeric IDs. Still, the researchers were able to de-anonymize a number of users in the nominally anonymous data set. All they had to do was scrape publicly posted ratings from IMDb and compare those rating patterns with the rating patterns in the Netflix data.
More than 10 years later, machine learning has made building complex matching algorithms a cinch, and such re-identification gambits are even easier to pull off.
“Just erasing a piece of someone’s fingerprint doesn’t get rid of the whole thing,” said Andrew Trask, who leads OpenMined, an open-source community that builds privacy tools for artificial intelligence. “So the question becomes, ‘How can I make formal claims over what actually removes someone’s fingerprint from a document?’”
For Trask and a growing number of experts, a big part of the answer is an approach called differential privacy. The potentially watershed method masks data by deliberately injecting noise into a data set, but in a way that still lets engineers run all manner of useful statistical analysis on it. When administered properly, the method ensures that an observer can do little better than random guessing at whether any given individual is in the data set.
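To make the centralized flavor of the idea concrete, here is a minimal Python sketch of the classic Laplace mechanism applied to a single count query. The data, the epsilon value and the function name are illustrative, not any particular library’s API.

```python
import numpy as np

def private_count(values, predicate, epsilon):
    """Return a differentially private count of the items matching a predicate.

    Adding or removing one person changes a count by at most 1, so Laplace
    noise with scale 1/epsilon gives epsilon-differential privacy for this
    single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical example: how many users gave a title five stars?
ratings = [5, 3, 4, 5, 2, 5, 1, 5, 4, 5]
print(private_count(ratings, lambda r: r == 5, epsilon=0.5))
```

Run it twice and it returns two slightly different answers; that randomness is what keeps anyone from inferring whether a particular person’s record contributed to the result.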
Needless to say, it’s a staggering improvement over the anonymization techniques seen in the Netflix Prize contest.
The principle has been around for decades, but the tools it requires have only recently begun to mature. As it comes of age, it’s already proving useful in some of the biggest challenges of 2020. For the first time, the Census Bureau will fully incorporate differential privacy as it collects population data for this year’s census, after experiments showed that its older, non-differentially private data sets were indeed vulnerable to re-identification.
Google prominently used it this year, too, when it published mobility reports, which visualized our aggregate movement patterns — or lack thereof — in these days of social distancing. It could even help with COVID-19 exposure notification.
But differential privacy’s selling point, capitalistically speaking, is that it seems to do that rare trick of actually incentivizing privacy. If the data is cloaked so that no one can pick out an individual, it can be shared — and therefore analyzed and monetized — around the globe, even if it’s “going” to a place with stringent privacy regulations.
The technique could take hold “for purely economic reasons, which is really good news for people who care about privacy,” Trask said.
Right now only a handful of companies have used differential privacy. Outside of a few startups that are building platforms, they’re mostly giant tech firms — Apple, Facebook, LinkedIn, Uber, Google. (Google was first to put the method to commercial use, in 2014, with RAPPOR: its tool for studying Chrome user data without endangering privacy; Apple’s first use followed in 2016.)
That’s partially because the method needs a lot of data to work — and those companies have a lot of data. But given how it dovetails privacy with profitability, plus its increasing technological maturity, we’re standing at “a potential turning point,” according to Aaron Roth, a computer science professor at the University of Pennsylvania and one of the world’s foremost experts on differential privacy.
How Tight Is Our ‘Privacy Budget’?
Differential privacy works in one of two basic fashions. The noise that protects the data set is either added after the fact by the party that collected the information (known as centralized differential privacy), or it’s built directly into the act of collecting the data (local differential privacy, which relies on a technique called randomized response). In the local version, there’s not even an original “true” database to safeguard — the holder of the information never collects the raw values in the first place.
The latter might seem particularly counterintuitive (how can collecting “noisy” data produce anything accurate?), but it’s statistically sound. A common example used to explain how it works: imagine a surveyor asking how many people have cheated on a partner or committed a crime — situations in which respondents might lie.
The surveyor asks each respondent to flip a coin. Heads means answer honestly; tails means flip again and answer yes on heads or no on tails, regardless of the truth. The surveyor sees only the responses, never the coin flips that steered them. Because the coin’s behavior is known (a quarter of respondents are forced to answer yes and a quarter are forced to answer no), the surveyor can still back out an accurate estimate of the true rate, even though no individual answer can be trusted.
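A rough simulation makes that arithmetic visible. This is a sketch of the coin protocol described above, not survey software, and the 30 percent “true” rate is made up for illustration.

```python
import random

def randomized_response(truthful_answer: bool) -> bool:
    """One respondent following the coin protocol."""
    if random.random() < 0.5:        # first flip lands heads: answer honestly
        return truthful_answer
    return random.random() < 0.5     # tails: a second flip dictates the answer

def estimate_true_rate(responses):
    """Invert the known coin probabilities to recover the population rate.

    P(reported yes) = 0.5 * true_rate + 0.25, so the true rate is roughly
    (observed_yes_rate - 0.25) / 0.5.
    """
    observed = sum(responses) / len(responses)
    return (observed - 0.25) / 0.5

# Simulate 100,000 respondents, 30 percent of whom would truthfully say yes.
population = [random.random() < 0.30 for _ in range(100_000)]
responses = [randomized_response(answer) for answer in population]
print(round(estimate_true_rate(responses), 3))  # lands close to 0.300
```

No single response can be held against anyone, since any “yes” may simply be the coin’s doing, yet the aggregate estimate comes out accurate.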
Still, there’s no surefire way to implement differential privacy. As with any data science method, there’s a tradeoff: the more privacy one builds into the system, the less accurate its outputs become. Too much noise makes the data unreliable; not enough diminishes privacy. That balance swings on a metric called the privacy-loss parameter (often written as epsilon), also known as the privacy budget. The lower the number, the stronger the privacy guarantee. Think single digits, with less than one being ideal. But a budget that low also requires serious amounts of data to keep the results useful.
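To see how the budget governs that tradeoff, here is a small illustration in the same sketch-level spirit, with epsilon values chosen purely for demonstration: the tighter the budget, the wider the Laplace noise on a count query, and only a large underlying count keeps the relative error tolerable.

```python
import numpy as np

rng = np.random.default_rng(0)
true_count = 10_000  # imagine a count computed over a large data set

for epsilon in (0.1, 1.0, 10.0):
    # For a count query, the Laplace noise scale is 1/epsilon: a smaller
    # privacy budget means wider noise and therefore less accuracy.
    errors = np.abs(rng.laplace(scale=1.0 / epsilon, size=10_000))
    typical = np.median(errors)
    print(f"epsilon={epsilon:>4}: typical error ~{typical:.1f} "
          f"({typical / true_count:.3%} of the true count)")
```

At an epsilon of 0.1, the typical error is around seven, which is negligible against a count of 10,000 but would swamp a count of 20. That is why tight budgets demand big data sets.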
“There’s no universal answer to that question,” Roth said. “In some cases, you might prioritize privacy over accuracy or vice versa, and the exact tradeoff depends on what you’re trying to do. How much quantitative privacy risk you’re willing to tolerate depends on how sensitive the data is. What is the bad thing that can happen? If the bad thing is really bad, then you might want to be more conservative. Otherwise you might want to be less conservative.”
The theory is sound, Trask notes, but we’re still figuring out precisely how to best apply it. He draws a spacecraft analogy: “We have a pretty good definition of gravity, but we’re still iterating on the best possible rocket to get to the moon.”
Indeed, the privacy/accuracy tradeoff generated not-insignificant backlash against Facebook, which last year released a data set to researchers studying election manipulation on the social media platform. When Facebook ran the data through differential privacy, some of the researchers tasked with interpreting it reportedly groused about its usability. A similar affair played out among some researchers who rely on census data, who argued that the method restricted the data’s usefulness and solved a problem that didn’t exist. The fact that organizations as data-saturated as those are still fine-tuning their approaches illustrates that engineers, broadly, are still working out how best to set the parameter.
Wherever an organization decides to set that figure, it should be transparent about it, lest it push differential privacy toward the kind of buzzword-ization that sometimes cheapens emergent tech concepts. So far that hasn’t happened, primarily because it’s so labor-intensive just to implement a fully differentially private pipeline correctly, even with high, double-digit privacy parameters.
“We haven’t really seen a lot of examples of companies that really don’t care about privacy just doing that to say their algorithm is differentially private — but certainly that could happen,” Roth said. “If you tell me you’re using differential privacy, but don’t tell me the privacy parameter, then you haven’t told me much.”
The Census Bureau, for one, would seem to agree. The committee charged with setting the privacy budget is still finalizing the figure; it’s analyzing research based on a demonstration experiment run on 2010 data and collecting public feedback. But the bureau will eventually make public the privacy-budget number it institutes, a Census spokesperson told Built In.
How Robust Is the Tech?
The impulse toward transparency is making its mark on certain notable corners of differential privacy development, too. Right now, there appears to be only one major, thoroughly battle-tested, open-source library for commercial-focused differential privacy: Google’s C++ library. But Trask and his OpenMined colleagues are working to change that. For the past several months, the team has been wrapping that library in a variety of languages, developing wrappers for Python, JavaScript, Swift, Scala and Kotlin — a suite that spans machine learning, web apps, mobile apps and IoT devices.
Even though differential privacy is conceptually robust, the limited number of reliable implementations currently available leaves a lot of room for trouble, Trask said.
“I could take the tutorial for differential privacy and write something in Python, but the way cryptography works, you could implement the algorithm correctly, but still accidentally build out a library that can be hacked,” he said. “It could have vulnerabilities in it through very small and difficult-to-notice flaws.”
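To illustrate the kind of hand-rolled code Trask is describing, here is a naive, purely illustrative Python stand-in for the sort of primitive a vetted wrapper would expose. The math below is textbook-correct on paper, but published attacks (on the floating-point representation of naively sampled Laplace noise, for instance) are exactly why the wrappers delegate to Google’s audited C++ implementation instead of reimplementing it like this. The class name and interface are invented for the example.

```python
import numpy as np

class NaiveBoundedMean:
    """Toy differentially private mean; for illustration only."""

    def __init__(self, epsilon: float, lower: float, upper: float):
        self.epsilon = epsilon
        self.lower = lower
        self.upper = upper

    def result(self, values):
        clipped = np.clip(values, self.lower, self.upper)
        # With the data set size public, replacing one record moves a
        # bounded mean by at most (upper - lower) / n.
        sensitivity = (self.upper - self.lower) / len(clipped)
        noise = np.random.laplace(scale=sensitivity / self.epsilon)
        return float(np.mean(clipped)) + noise

ages = [23, 35, 42, 29, 51, 38, 44, 31]
print(NaiveBoundedMean(epsilon=1.0, lower=18, upper=90).result(ages))
```

Nothing in that code looks wrong, which is the point: the flaws that break implementations like this live in details the tutorial never mentions.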
Those are the flaws he aims to sidestep with OpenMined’s wrapper roadmap.
OpenMined aims to have a tutorial built for each by early May, with full public releases of each wrapper to follow in June and July. From there, the task becomes connecting to previous OpenMined Python projects and making everything available through all the popular package managers.
“This theme is all about making sure what we do can actually be used in a practical way by the rest of the world,” OpenMined member Benjamin Szymkow wrote on the community’s blog. (They’ll be publishing about various roadblocks and solutions they encounter throughout the process at blog.openmined.org.)
The maturation of these kinds of tools — and engineers’ ability to draw experience from them — will be instrumental to how quickly differential privacy expands beyond tech’s big corridors of power. To circle back to recommendation engines, one could see how a major retailer that also has reams of data (your Targets or Home Depots, perhaps) would want the kind of deeper access that differential privacy affords in order to build a best-in-class recommender system — if only they had the institutional knowledge.
Startups like Privitar and LeapYear do offer some differential privacy services in sectors such as banking and healthcare, but otherwise “it’s still not quite plug and play,” Roth said. (A LeapYear spokesperson declined an interview with Built In.) “Even if the thing you want to do is one of the few things for which there are good [open-source] tools, you still need some sort of expertise to figure out how to use them correctly,” he said.
Can This Help COVID-19 Contact Tracing?
Differential privacy is getting its close-up thanks to the census, but an unexpected factor is also contributing: the pandemic. Strictly speaking, differential privacy isn’t compatible with contact tracing — that is, identifying direct, one-to-one contact between a sick person and a susceptible person — but it could be incorporated into higher-level exposure and proximity notification systems.
An app that incorporates differential privacy could map out hotspots and let a user know if they spent time in one. “It could tell when you’ve been in places that had a high incidence of COVID-19, so you might be at risk,” Roth said. “That’s different from contact tracing, but it’s something that could be done with differential privacy.”
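A bare-bones sketch of that hotspot idea might look like the following. The neighborhoods, counts and threshold are invented, and a real exposure-notification system would involve far more than a noisy count per region.

```python
import numpy as np

rng = np.random.default_rng(42)
epsilon = 1.0  # illustrative privacy budget for this one release

# Hypothetical counts of reported cases per neighborhood.
case_counts = {"Riverside": 4, "Old Town": 51, "Harbor": 12, "Midtown": 87}

# Each person contributes at most one report, so these counts have
# sensitivity 1 and Laplace noise with scale 1/epsilon suffices.
noisy_counts = {
    region: count + rng.laplace(scale=1.0 / epsilon)
    for region, count in case_counts.items()
}

HOTSPOT_THRESHOLD = 25
hotspots = [region for region, count in noisy_counts.items()
            if count > HOTSPOT_THRESHOLD]
print("Published hotspot list:", hotspots)

# The app would compare a user's locally stored location history against
# the published list, so the history itself never leaves the phone.
```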
Or, in the context of more direct COVID-tracing, one could guarantee differential privacy up to a point. For instance, an app could keep all information noisy and differentially private, but if a person tested positive and subsequently alerted contacts, their information would lose the DP safeguard while the notified parties would still retain it. “That still allows you to do some second-degree contact tracing,” Roth said. (The Massachusetts Institute of Technology’s in-the-works exposure-tracking app reportedly incorporates differential privacy, for example.)
Of course, technologists are also working on an adjacent problem: proving one fact about yourself without revealing everything else. OpenMined is developing a so-called private identity server, or “the most mature manifestation” of self-sovereign identity (SSI). SSI is essentially a mechanism that allows someone to authenticate a certain aspect of their identity while withholding others, via a cryptographic signature between a group (perhaps a medical organization, in this context) and an individual. The setup helps “preserve privacy and prevent forgery,” Trask said.
By way of analogy, imagine an ID card for going to bars that didn’t have extraneous (in that context) personal information. “When I enter a bar, why do I need to show the bouncer my name, where I live and other details about me, when all they really need to know is whether I’m of legal age?” said Emma Bluemke, OpenMined’s lead of partnerships and writing, who pointed to the Covid Credentials Initiative as a prominent example of SSI-focused health-status verification efforts.
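To make the bouncer analogy concrete, here is a deliberately simplified sketch of a signed attribute claim using the Python cryptography package. It is not OpenMined’s identity server or any real credential format (real SSI systems layer on revocation, selective disclosure and standardized schemas), but it shows the core move: an issuer signs a minimal claim, and a verifier checks the signature without ever seeing a name or address.

```python
import json

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# The issuer (a DMV, or a medical organization in the health-status case)
# holds a signing key whose public half the verifier already trusts.
issuer_key = Ed25519PrivateKey.generate()
issuer_public_key = issuer_key.public_key()

# The credential states only the attribute that matters, nothing else.
claim = json.dumps({"attribute": "over_21", "value": True}).encode()
signature = issuer_key.sign(claim)

# The verifier (the bouncer) checks the issuer's signature on the bare claim.
try:
    issuer_public_key.verify(signature, claim)
    print("Claim verified: holder is of legal age.")
except InvalidSignature:
    print("Claim rejected.")
```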
What Does This Have to Do with a Serial Killer?
As promising as differential privacy is, its arrival doesn’t mean that people suddenly have full control over their data. Roth, in his recent book, The Ethical Algorithm, co-written with Michael Kearns, draws quite the parallel: the Golden State Killer.
The accused serial killer Joseph James DeAngelo was never so careless as to hand over his own DNA to a public genetics database. But several of his distant relatives had handed over theirs. Investigators in 2018 were able to build a family tree, eventually find DeAngelo and charge him with eight counts of first-degree murder, nearly 50 years after the original crimes. Even fully opting out doesn’t mean interested parties can’t still learn a lot about you.
“[T]he world is full of curious correlations, and as machine learning gets more powerful and data sources become more diverse, we are able to learn more and more facts about the world that let us infer information about individuals that they might have wanted to keep private,” they wrote.
Wait, Is This Good?
Unexpected connections aside, even if engineers consistently ace the challenge of calibrating privacy budgets, some data minimalists might still look skeptically upon differential privacy at first blush. Doesn’t it seem a little convenient that the great leap forward in privacy protection happens to also neatly align with commercial profit motives? Not to mention the fact that it asks for more data in order to beef up privacy?
To be sure, as Roth notes in The Ethical Algorithm, when Google first implemented differential privacy, it did so on user data that it had never gathered before; differential privacy provided it with something new. (It’s worth noting that Google used a local, rather than centralized, gathering method here, so it never “knew” any non-noisy, identifiable data.)
For those who generally wish to discourage big data, “it could be viewed as bad from the perspective of making gathering data less harmful and therefore encouraging more of it,” acknowledged Roth.
But that’s a narrow framing for advocates of differential privacy.
“We’ve seen the alternative, when there were no privacy-preserving technologies,” Roth said. “It wouldn’t be that nobody gathers data; it would be that people gather data in a more harmful way. There’s certainly no lack of companies willing to just pull data off your phone and sell it to third parties. So developing technologies to mitigate that problem is a good thing.”