If you're not worried about your identity being stolen, think again: One privacy expert says 'reidentification'--identifying real people from so-called anonymous databases--is a greater risk in a connected world.
I spoke last week with Paul Ohm, associate professor of law at the University of Colorado Law School and a former trial attorney in the U.S. Department of Justice’s Computer Crime and Intellectual Property Section. Excerpts of our conversation are below.
You authored a paper last year that said deleting information like names and Social Security Numbers in large databases does not actually protect our privacy in the way we thought. What allowed this to happen in the first place?
Computer scientists have been thinking about this for a long time. They’ve always known it’s theoretically possible to take a piece of information that looks anonymous and reattach the information that goes with that anonymous data.
But up until about 10 years ago, they thought it wasn’t likely to happen very often because computers just weren’t that powerful. Over the last 10 years two things have happened: Computers have gotten much faster, but more interestingly and importantly, the amount of outside information we have about people has just exploded—from the Internet, social networking trends… people are volunteering more information about themselves.
So over the last 10 years we’ve learned not only that it’s possible but that it’s much easier. So in my paper I make the argument that reidentification can be done more easily, more quickly and more cheaply today.
Why are regulators paying so little attention to it?
It’s not just regulators. Even computer scientists who know a lot about data are not aware of how easy reidentification has become. The regulators can only follow what the experts tell them. And even today the experts aren’t quite good enough at understanding the risk. A very small subset does understand. Their results always surprise other people who you would think are experts. So you don’t blame the regulators for being a little bit behind the curve.
Why is this still so surprising to people?
It’s just counterintuitive. We’ve understood that if you delete a name, Social Security Number and home address, the stuff you leave behind is very useful (i.e. you can make a lot of money on it), but we also just feel like it’s very protected. We answer all sorts of personal questions if the person says, “Don’t worry, we’ll remove your identity.” So it’s going to take a little while to shake this.
It takes a little bit of effort to reidentify. It doesn't just happen spontaneously. You need someone who has the motive and time to reidentify. So some people will say, "Oh, you’re just assuming the worst, why does anyone have the motive to reidentify?” I obviously disagree with that.
Explain how reidentification works.
You have one anonymized database, such as the Netflix database of movie ratings. The key is--if I know that someone is in the Netflix database and I know a little bit about the movies that that person likes and dislikes—maybe I read Joe’s blog or I’m his Facebook friend or he was over at my house for dinner—I can identify him. It turns out I don’t need to know much about his movie preferences. If I know three or four movies, I stand a good chance of reidentifying him. If it’s six to eight, I have an excellent chance.
This ability to reidentify is possible because there are other databases that provide missing information. So by putting together two databases I’ve actually learned more than either database can reveal by itself.
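The matching step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pseudonymous IDs, the movie titles and the `candidates` helper are all hypothetical, and the real Netflix data and the published attacks on it also used ratings and dates, not just titles. The sketch shows how each extra movie known from an outside source (a blog, a Facebook page, a dinner conversation) shrinks the set of records that could belong to Joe.

```python
# Hypothetical "anonymized" database: pseudonymous ID -> movies that user rated.
# Names and titles are made up for illustration only.
anonymized_ratings = {
    "user_001": {"Movie A", "Movie B", "Movie C", "Movie D"},
    "user_002": {"Movie A", "Movie E", "Movie F"},
    "user_003": {"Movie B", "Movie C", "Movie G", "Movie H"},
    "user_004": {"Movie A", "Movie B", "Movie H"},
}


def candidates(known_movies, ratings):
    """Return the pseudonymous IDs whose record contains every known movie."""
    return sorted(
        user_id
        for user_id, movies in ratings.items()
        if known_movies <= movies  # subset test: all known movies appear
    )


# Outside knowledge about Joe, gathered one movie at a time.
print(candidates({"Movie B"}, anonymized_ratings))
print(candidates({"Movie B", "Movie C"}, anonymized_ratings))
print(candidates({"Movie B", "Movie C", "Movie G"}, anonymized_ratings))
```

With this toy data, one known movie leaves three candidate records, two leave two, and three leave exactly one. At that point the "anonymous" record and everything else in it is tied to Joe.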
It seems surprising that you can figure this out just from a few movie titles.
Here’s the key: The reason reidentification works is that when you get granular enough with the data and follow each person’s trail of data, it turns out people are all a tiny bit different from one another. So you don’t have to know many movies Joe likes to know that it’s him.
It’s kind of a happy story of human uniqueness. That’s the silver lining to all this.
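The uniqueness point can be made concrete with a small sketch. Everything here is an assumption for illustration: the four toy records, the single-letter items and the `unique_fraction` helper are hypothetical, and real datasets are far larger. The function asks, for every record and every combination of k items from it, whether that combination appears in only one record.

```python
from itertools import combinations

# Hypothetical records: each is the set of items (e.g. movies) one person rated.
records = {
    "r1": {"A", "B", "C", "D"},
    "r2": {"A", "B", "E", "F"},
    "r3": {"A", "C", "E", "G"},
    "r4": {"B", "C", "F", "G"},
}


def unique_fraction(records, k):
    """Fraction of k-item combinations, drawn from each record,
    that are contained in exactly one record."""
    total = 0
    unique = 0
    for items in records.values():
        for subset in combinations(sorted(items), k):
            total += 1
            matches = sum(1 for other in records.values() if set(subset) <= other)
            if matches == 1:
                unique += 1
    return unique / total


for k in (1, 2, 3):
    print(k, unique_fraction(records, k))
```

In this toy data, knowing a single item almost never singles anyone out, knowing two does so half the time, and knowing three always does, because no two records share more than two items.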
Earlier this month the Commerce Department released a green paper that proposes a privacy bill of rights. What are your thoughts on this?
I think it’s great in principle. The devil’s in the details. It depends on what is going into this so-called bill of rights. From the things I’ve seen, I’m not sure they’re sufficiently incorporating the trends I and others are seeing in technology.
We have 100 years of regulating privacy by focusing on the information a particular person has. But real privacy harm will come not from the information they have but the inferences they can draw from the data they have. No law I have ever seen regulates inferences. So maybe in the future we may regulate inferences in a really different way; it seems strange to say you can have all this data but you can’t take this next step. But I think that’s what the law has to do.
What would you like to see from the regulation?
What I’m starting to do now is think about how I’d make more concrete recommendations. One I’ve been tiptoeing around: Quantity is an interesting thing to me. Reidentification is much easier if you have a lot of data, yet I don’t know of many laws that treat you differently once you have more data; our privacy laws are very qualitative, not quantitative. So if you don’t have sensitive information, you can have as much information as you want. For instance, you’re not regulated if you know 10 things about me, but if you know 25 things about me, that might be enough to put you under a stricter form of regulation.
The Commerce report also proposes that industry self-regulate with respect to collection of consumers’ data, saying that this will ensure we have an Internet environment that encourages innovation. With this data having such critical monetary value, do you think the industry can police itself?
Past history doesn't leave me very optimistic. They’ve been trying to self-regulate for decades, yet we learn more every day about invasive practices.
The other thing I’ve now started to say about deidentification/reidentification is that anonymization was like a silver bullet: it protected privacy but it didn’t require that we give up much. But now we’ve lost our silver bullet and there’s no silver bullet to take its place. We have to ask, what are we willing to give up to protect our privacy?
We have to get used to talking about the price of privacy. People are starting to say, if you have this privacy law, and industry doesn’t have access to this big database, your favorite website will no longer be free. I actually think that’s the right conversation. Maybe we should give up some of the efficiency and convenience of the Internet if we can protect privacy.
It’s all about tradeoffs. Now we have to take tradeoffs more seriously.
What are some examples of what we’d give up for privacy?
So much of the Internet is focused on recommendations. This is the secret sauce of Amazon as well as Netflix. Recommendations are a good thing, but at some point, their ability to make an even better recommendation will cost too much privacy.
Do you think people are willing to have less of that and pay more?
I don’t know. I think it’s a conversation we need to start having.
I think some may start charging, and we have to decide whether we might have to pay for some websites in order to have much better privacy. Then that raises all sorts of questions on class and access. These are all really hard, society-wide discussions we need to have. But if we don’t, it will always mean sacrifice in the name of efficiency and convenience. Today, that’s the status quo.
Mar 27, 2011
Seems more like a veiled attempt to scare more peeps into agreeing with a pay as you go Internet. And as far as identity goes there is no such thing as privacy on the Internet. Who's being kidded by whom? Wake up sleepy head.
Always mix in some lies and fabrications when participating in online forums and polls. Keep the database miners guessing.
My experience here might make great fodder for someone more capable than I am. Please read on and see if anyone else wishes to post such information, but be sure to use details to support it; things that others can try in order to verify it, as my post does. A site had a policy that it doesn't re-identify. We have their word on that, in plain English and easy-to-understand language. But how do we KNOW they aren't? They could covertly do it for years before they were discovered because they'll be harder to catch than those they re-identified. They're doing it NOW and I doubt we know for sure who's doing it from a verifiable point of view; it's all by inference and hard to detect in most cases. Social sites were obviously missing from the article and I wonder why? Collusion? Fear of lawsuits? What? For example, any site that won't let you re-use a username or password AFTER you've closed out your accounts is likely doing it. Take Facebook: Please! They never actually delete anything, even if the user has deleted it. All they do is remove your access to it, and it's stashed away in some secret cache/stash somewhere. Why do I think that? Well, I can't prove it of course, but they DO tell you certain things in their policies that lend themselves to that very thing; either reidentification or sales of data to someone who does reidentification. One thing that adds to my thoughts on this is that they require a user to accept ALL cookies, including all third party cookies, but will not disclose who those parties might be. If there is a third party, there is likely a fourth party and so on. I joined them once, with ALL cookies accepted and post-session immediate removal of all third-party cookies. There must have been 20 of them though I didn't count. And, it was all phony data except that the IP I spoofed would have been true, although no such person by that name existed online for the full complement of data.
The person was NOT online, didn't own a computer, and gave me permission to use his name as long as I didn't use his phone, street address or state. It was easy to do. I accessed the site three more times, looking around at various areas of the site. The 4th time, I got the dreaded "You do not have cookies enabled ..." message. Obviously because, even though I accepted them, I had post-session deleted them from my machine. I deleted all information and my account. They verified my account was no longer "usable". Their word, not mine. I waited a week til the following Saturday: turned off accepting ANY third party cookies with the intent of opening a new account. I wasn't even able to create an account. It wasn't from my IP; they didn't have MY IP. And besides, I don't have a static IP address of course. I used my own name and information this time. I was unable to get around it. Then: Support told me to turn on accepting ALL cookies and I'd be able to get in. I told them I wouldn't do that, and the conversation was over. Actually, they hung up on me. Obviously what they did was try to place a 3rd party cookie and then read it and when it wasn't there to be read, well... . The interesting thing is that they never said to turn on third party cookies, only "cookies". Now, if third party site cookies are so important to them, there's a very good reason for it and we all know what that is. I'm doing the same thing for the other two major social sites and expecting the same or similar results. I have to wonder too if they aren't exchanging data amongst all three or more of similar social sites. I know there are some strange things with LinkedIn but they do let me in without the 3rd party cookies enabled. I think; it's been over 3 weeks since I was there. The above is all my own opinion based on my own experiences and no one/nothing else.
Opsec (Operations Security) programs are intended to reduce or eliminate the unintentional release of unclassified indicators that can be put together to reach a classified conclusion. Most of us routinely violate the precepts of this type of program in our on-line activities.
What he is saying is really operational analysis type of work. You see A, you see B, but together you can figure out C. Example: A large order of green dye was announced as being placed by an unnamed company. Another large order was also announced, of cotton to a textile mill, enough to make 300,000 suits. You know that a textile mill just received a contract from the government and is now hiring people. Result: now you know 300,000 uniforms are being made for a 100,000-man German Army, so the German Army is expanding beyond the limits of the Treaty of Versailles.
It has always been my experience that different databases DO NOT necessarily talk to each other. In fact, many will not. There are so many database constructs out there that the chances of finding two whose data will converge are slim. Of course, this is only true of small databases. I am not conversant enough with large databases to know what similarities, if any, exist.
Paying for a website will not make it more private. But a website has to be paid for. One way is to sell data about the people who visit the site. If that source of income is gone, another is to charge the users.