If you're not worried about your identity being stolen, think again: One privacy expert says ‘reidentification’ (identifying real people from so-called anonymous databases) is a greater risk in a connected world.
I spoke last week with Paul Ohm, associate professor of law at the University of Colorado Law School and a former trial attorney in the U.S. Department of Justice’s Computer Crime and Intellectual Property Section. Excerpts of our conversation are below.
You authored a paper last year that said deleting information like names and Social Security Numbers in large databases does not actually protect our privacy in the way we thought. What allowed this to happen in the first place?
Computer scientists have been thinking about this for a long time. They’ve always known it’s theoretically possible to take a piece of information that looks anonymous and reattach the information that goes with that anonymous data.
But up until about 10 years ago, they thought it wasn’t likely to happen very often because computers just weren’t that powerful. Over the last 10 years two things have happened: Computers have gotten much faster, but more interestingly and importantly, the amount of outside information we have about people has just exploded—from the Internet, social networking trends… people are volunteering more information about themselves.
So over the last 10 years we’ve learned not only that it’s possible but that it’s much easier. So in my paper I make the argument that reidentification can be done more easily, more quickly and more cheaply today.
Why are regulators paying so little attention to it?
It’s not just regulators. Even computer scientists who know a lot about data are not aware of how easy reidentification has become. The regulators can only follow what the experts tell them. And even today the experts aren’t quite good enough at understanding the risk. A very small subset does understand. Their results always surprise other people who you would think are experts. So you don’t blame the regulators for being a little bit behind the curve.
Why is this still so surprising to people?
It’s just counterintuitive. We’ve understood that if you delete a name, Social Security Number and home address, the stuff you leave behind is very useful (i.e. you can make a lot of money on it), but we also just feel like it’s very protected. We answer all sorts of personal questions if the person says, “Don’t worry, we’ll remove your identity.” So it’s going to take a little while to shake this.
It takes a little bit of effort to reidentify. It doesn’t just happen spontaneously. You need someone who has the motive and time to reidentify. So some people will say, “Oh, you’re just assuming the worst, why does anyone have the motive to reidentify?” I obviously disagree with that.
Explain how reidentification works.
You have one anonymized database, such as the Netflix database of movie ratings. The key is this: if I know that someone is in the Netflix database and I know a little bit about the movies that person likes and dislikes (maybe I read Joe’s blog or I’m his Facebook friend or he was over at my house for dinner), I can identify him. It turns out I don’t need to know much about his movie preferences. If I know three or four movies, I stand a good chance of reidentifying him. If it’s six to eight, I have an excellent chance.
This ability to reidentify is possible because there are other databases that provide missing information. So by putting together two databases I’ve actually learned more than either database can reveal by itself.
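The matching step Ohm describes can be sketched in a few lines of Python. This is only an illustration, not the method used in any real study: the records, the titles, the ratings and the helper `matching_records` are all made up, and real attacks have to tolerate fuzzy or approximate matches.

```python
# Hypothetical toy data: anonymous subscriber IDs mapped to movie ratings (1-5).
ANONYMIZED_RATINGS = {
    "user_001": {"Alien": 5, "Amelie": 2, "Heat": 4, "Brazil": 5},
    "user_002": {"Alien": 5, "Amelie": 4, "Heat": 4, "Fargo": 3},
    "user_003": {"Alien": 2, "Amelie": 5, "Fargo": 5, "Brazil": 1},
}


def matching_records(known_ratings, database):
    """Return IDs of records consistent with every rating known from outside."""
    return [
        record_id
        for record_id, ratings in database.items()
        if all(ratings.get(title) == score for title, score in known_ratings.items())
    ]


# Outside knowledge about "Joe", e.g. from his blog (made-up values).
joe_one_movie = {"Alien": 5}
joe_three_movies = {"Alien": 5, "Heat": 4, "Brazil": 5}

# One known rating leaves two candidates; three known ratings leave only one.
candidates_one = matching_records(joe_one_movie, ANONYMIZED_RATINGS)
candidates_three = matching_records(joe_three_movies, ANONYMIZED_RATINGS)
```

In this toy data, each extra fact from the outside source shrinks the candidate list, which is why combining two sources reveals more than either does alone.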
It seems surprising that you can figure this out just from a few movie titles.
Here’s the key: The reason reidentification works is that when you get granular enough with the data and if you follow your trail of data, it turns out they are all a tiny bit different from one another. So you don’t have to know many movies Joe likes to know that it’s him.
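The point about granularity can be illustrated with a small Python sketch. It assumes a made-up set of four records and a hypothetical helper, `uniquely_identified`, which counts how many records contain some combination of k movies that no other record contains. It is a toy model of uniqueness, not a measurement of any real database.

```python
from itertools import combinations

# Hypothetical toy data: the set of movies each anonymous record has rated.
RECORDS = {
    "a": {"Alien", "Heat", "Fargo"},
    "b": {"Alien", "Heat", "Brazil"},
    "c": {"Alien", "Fargo", "Brazil"},
    "d": {"Heat", "Fargo", "Brazil"},
}


def uniquely_identified(records, k):
    """Count records that have some k-movie combination no other record contains."""
    count = 0
    for record_id, movies in records.items():
        others = [m for rid, m in records.items() if rid != record_id]
        for combo in combinations(sorted(movies), k):
            # The combination is a fingerprint if no other record includes all of it.
            if not any(set(combo) <= other for other in others):
                count += 1
                break
    return count


unique_with_one = uniquely_identified(RECORDS, 1)
unique_with_two = uniquely_identified(RECORDS, 2)
unique_with_three = uniquely_identified(RECORDS, 3)
```

In this contrived example no record stands out when only one or two movies are known, but every record becomes unique once three are known.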
It’s kind of a happy story of human uniqueness. That’s the silver lining to all this.
Earlier this month the Commerce Department released a green paper that proposes a privacy bill of rights. What are your thoughts on this?
I think it’s great in principle. The devil’s in the details. It depends on what is going into this so-called bill of rights. From the things I’ve seen, I’m not sure they’re sufficiently incorporating the trends I and others are seeing in technology.
We have 100 years of regulating privacy by focusing on the information a particular person has. But real privacy harm will come not from the information they have but the inferences they can draw from the data they have. No law I have ever seen regulates inferences. So maybe in the future we may regulate inferences in a really different way; it seems strange to say you can have all this data but you can’t take this next step. But I think that’s what the law has to do.
What would you like to see from the regulation?
What I’m starting to do now is think about how I’d make more concrete recommendations. One I’ve been tiptoeing around: Quantity is an interesting thing to me. Reidentification is much easier if you have a lot of data, yet I don’t know of many laws that treat you differently once you have more data; our privacy laws are very qualitative, not quantitative. So if you don’t have sensitive information, you can have as much information as you want. For instance, you’re not regulated if you know 10 things about me, but if you know 25 things about me, that might be enough to put you under a stricter form of regulation.
The Commerce report also proposes that industry regulate itself with respect to collection of consumers’ data, saying that this will ensure we have an Internet environment that encourages innovation. With this data having such a critical monetary value, do you think the industry can police itself?
Past history doesn't leave me very optimistic. They’ve been trying to self-regulate for decades, yet we learn more every day about invasive practices.
The other thing I’ve now started to say about deidentification/reidentification is that anonymization was like a silver bullet: it protected privacy but it didn’t require that we give up much. But now we’ve lost our silver bullet and there’s no silver bullet to take its place. We have to ask, what are we willing to give up to protect our privacy?
We have to get used to talking about the price of privacy. People are starting to say, if you have this privacy law, and industry doesn’t have access to this big database, your favorite website will no longer be free. I actually think that’s the right conversation. Maybe we should give up some of the efficiency and convenience of the Internet if we can protect privacy.
It’s all about tradeoffs. Now we have to take tradeoffs more seriously.
What are some examples of what we’d give up for privacy?
So much of the Internet is focused on recommendations. This is the secret sauce of Amazon as well as Netflix. Recommendations are a good thing, but at some point, their ability to make an even better recommendation will cost too much privacy.
Do you think people are willing to have less of that and pay more?
I don’t know. I think it’s a conversation we need to start having.
I think some may start charging, and we have to decide whether we might have to pay for some websites in order to have much better privacy. Then that raises all sorts of questions on class and access. These are all really hard, society-wide discussions we need to have. But if we don’t, it will always mean sacrifice in the name of efficiency and convenience. Today, that’s the status quo.