
Q&A: How the mythology of big data can blind us

Big data collection is a part of everyday life now. But do the collectors even know what they are looking for?

Written by Sonya James, Contributor June 27, 2014 at 12:00 p.m. PT

Kate Crawford has an unusual talent. She demystifies big data so effectively she finds herself flown around the world to do just that. But there's a catch. Just when you think you understand how big data works, she throws a conceptual wrench in the conversation that is so surprising you walk away thinking, "Wait, what?" And big data is complicated again, albeit in a much more compelling way.

As a researcher at Microsoft Research, a visiting professor at MIT's Center for Civic Media, a senior fellow at NYU's Information Law Institute, and a fully booked public speaker, Crawford professes to have "a ridiculous collection of titles." Most strikingly, they explode the boundaries between the public sphere, the private sector, and academia.

Crawford began writing about big data in 2011. "There is a lot of hand waving to make big data look magical and mysterious, yet there are very pragmatic and practical things happening on the technological side," Crawford said.


In many ways, clarifying how big data functions in society is a highly political act. By focusing on the emotional consequences of an era of surveillance, Crawford has been able to highlight big data's alarming effect on the democratic process. Yet importantly, her work also seeks out the possibilities of a healthy, symbiotic relationship between big data and the huge systems it functions within -- governance, social media, consumer spaces, and humanitarianism to name only a few.

Big Data and our conceptions of Big Brother do not seem far apart. There is something very ambiguous and mysterious about "Big Data" -- it feels like an omnipresent, yet elusive power. Can you demystify what we know about how big data is tracking us?

In a short period of time big data has become a very well known phrase -- really just in the last five years. There is a mythological component to big data that can be blinding in particular ways: with more data comes more truth, and the larger the data set the more objective it is. What I found in my research was that there are some really good examples of where this fails us.

One of the things that I looked at was the Street Bump app. This is a nice example of how large data sets can still have particular kinds of sampling bias in them. Street Bump was designed by the city of Boston. It collects your accelerometer and GPS data as you are driving down the street. It reports every time you hit a bump back to the city. For example, the city sees that a thousand people have gone over a bump and so they know to fill in the pothole.

When you start comparing the statistics around who has a smartphone in Boston you begin to see smartphone ownership still maps more closely to affluent and younger populations. In fact, when we look at older populations and particularly low-income populations, smartphone penetration drops down below 20%. This way of sampling a city through a big data lens is inherently oversampling younger, wealthier people. That has real resource implications if it means roads in those areas are going to get fixed first or get more attention. Fortunately, in the case of Street Bump they found ways to try and offset that problem. They gave city workers the app, and they drive around the entire city. It is a very interesting parable that makes big data personal.

I have also looked at major flood events in Australia and hurricanes in the U.S. If you rely on Twitter data, for example, to try and get a sense of what's happening during a crisis event, you are automatically getting a much younger, urban, privileged account of that particular event.
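The sampling bias described in the Street Bump example can be made concrete with a small sketch. Everything in it is hypothetical and chosen only for illustration: two neighborhoods with the same number of drivers hitting the same pothole, differing only in an assumed smartphone ownership rate. The names `Neighborhood` and `expected_reports` and the rates are assumptions, not figures from Street Bump or the city of Boston.

```python
from dataclasses import dataclass


@dataclass
class Neighborhood:
    name: str
    drivers: int            # people who drive over a given pothole (hypothetical)
    smartphone_rate: float  # share of those drivers running the app (hypothetical)


def expected_reports(n: Neighborhood) -> float:
    """Expected number of bump reports the city receives for one pothole."""
    return n.drivers * n.smartphone_rate


# Illustrative values only; not real Boston data.
neighborhoods = [
    Neighborhood("younger, affluent area", drivers=1000, smartphone_rate=0.8),
    Neighborhood("older, low-income area", drivers=1000, smartphone_rate=0.2),
]

reports = {n.name: expected_reports(n) for n in neighborhoods}
for name, count in reports.items():
    print(f"{name}: about {count:.0f} reports for the same pothole")

# Same road damage and same traffic, yet one area generates several times
# more reports purely because of who carries the sensor.
ratio = reports["younger, affluent area"] / reports["older, low-income area"]
print(f"report ratio: {ratio:.1f}x")
```

Under these assumed rates the identical pothole produces several times more reports in the first area. Sending city workers with the app across every street, as Boston did, is one way to offset that skew; weighting reports by an estimated ownership rate would be another possible approach.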

It sounds as if you are pointing to the unpredictable social consequences that big data collection can overlook. Is that partially because the kinds of questions are not in the interest of those who have the technological resources to be collecting big data in the first place? Is this connected to your phrase, "big data fundamentalism"?

That is a really good set of questions. I use big data fundamentalism as a shorthand for the kinds of celebration of big data that are not grounded in particular realities. This is when we hear people saying, essentially, "Correlation is just as good as causation and the bigger the data set the more objective it becomes."

People are now starting to be more critical of those sorts of claims. Big data fundamentalism is more of a tactic being used, and I think it is good to be very wary and critical when we see people draw on those kinds of ways of understanding the world.

But this other question around how we got there -- it is interesting because you can trace an intellectual history. There was a real break between computer science and statistics and the more humanistic and social science disciplines where you might see things like philosophy and ethics being discussed. The fact that these disciplines are split becomes quite problematic. You really want to have a deep understanding of statistics when you are working with a data set to avoid these kinds of biases and representations. But you also want to have a very firm grounding in ideas of ethics and an understanding of how societies change over time. That comes from the social and humanistic sciences. You can't just look through one of those lenses and think that you're seeing the whole picture.

That answers one side of the question, but the other side is who has the resources to be running these studies. Who has the power to organize the teams?

I would actually say that is the central motivating question that guides my work. When we talk about data we are really talking about power. Data is actually playing a very important role in terms of concentrations of power. That is a very difficult problem to solve. There are certain entities in both the public and private sectors that have enormous amounts of data.

A recent example is Amazon's negotiations with the publisher Hachette regarding their e-book prices. The fact that Amazon has the capacity to say, "Until you agree to our terms, we're going to slow down the delivery of your books by three to four weeks" -- that is extraordinary power.

These very serious concentrations of power have come about through particular arrangements of both technical infrastructures and data. This plays out in consumer spaces, but also government spaces and humanitarianism as well. That's one of the reasons I find crisis data so interesting. Who has access to that data and what does it mean for the community that has been affected by a disaster? How can they reflect on their own data? How can that enrich their own community and their own community responses? Or is it always something that is being used by agents who are far away and able to observe their activity on a dashboard rather than allowing them to use that data themselves?

Could you give us more examples of big data collection going awry versus a successful project?

Let me give you three examples of projects I'm working on right now that touch on these questions in different ways. One is a large international interview-based study that I have been doing with Mike Ananny from the University of Southern California. We have been looking at news app designers -- the people who are creating the apps we use to engage with our news every day. This new ecosystem of news is a key part of how we understand the public sphere.

We can customize an application to acknowledge our preferences and the kind of news stories that we like. But what do you do as a news app designer when you find that someone is just interested in reading stories about the Kardashians and you really want to inform them about what's happening in Syria? On what basis would you give somebody a new story outside of what they have set in their particular personalization? This is a space where we have seen enormous changes based on particular forms of data personalization and aggregation. But the designers are having to make these kinds of decisions as they make these new spaces of engagement.

Are you questioning the role of the designer as a content editor?

It's even bigger than that. To some degree, designers can actually play the role the publishers once played. They get to shape our way of understanding news as a space. That's extraordinarily powerful. Because this new ecology of news has happened so quickly, we don't have much of a public conversation around what we expect from the spaces. We interviewed designers across a range of the most popular news apps. It was really interesting to hear them struggling with similar problems yet finding different kinds of approaches. It really gets down to this basic understanding of what we think news does. What is news for? What do we value in news? That is something designers are grappling with in very idiosyncratic ways.

Another study I am working on looks at high-frequency trading algorithms. You specifically asked about some examples of where this is going horribly wrong. In terms of big data, it is widely known now that high-frequency trading is producing shifts faster than human cognition. That raises a lot of questions in terms of governance, ethics, and what we think it means to have a functioning financial system that we understand. We are looking specifically at the relationship between social media and high-frequency trading algorithms. We are seeing that events on social media can cause rapid market fluctuation and flash crashes.

The most recent example involved the AP. Their feed got hacked and sent out a message that there had been an explosion at the White House, and we saw a massive drop in the Dow Jones. It was an extraordinary market collapse that happened on the basis of a very quick response from both human and nonhuman agents trading on the basis of this false information. Diving into the complexity of this connective tissue between these huge systems is very interesting.

Finally, we just finished a paper on the mechanism of the flag. These are the little flags we use to report offensive content on platforms like Facebook, YouTube, and Twitter. The flag is supposed to be a structure that allows us to shape public discourse -- to say things are inappropriate, pornographic, bullying. This is fascinating because you think you are having an impact on public discourse, but it is also completely opaque because you have no idea how many flags it's going to take for YouTube to remove the video or Facebook to say that post is inappropriate. So it's this tiny mechanism that actually has to do an enormous amount of work.

One of the cases I find fascinating involves the woman who was campaigning to have more women's faces on U.K. currency. She managed to get Jane Austen on one of the bills. I thought this was a fairly inoffensive campaign, but she got completely attacked with hundreds of death and rape threats on Twitter. It forced Twitter in the U.K. to introduce a new system -- to have a flag capability on every individual tweet. Previously if you found something offensive on Twitter you had to just write to them. That doesn't work when you have hundreds of people en masse trying to threaten an individual. The reason why flags are so important is this is where public discourse happens. These are the spaces where important debates happen.

You speak about this emotional, lived, reality of big data, what you call "surveillance anxiety," in a way that touches on the anxieties of both the surveilled and those surveilling. Could you speak a little bit about that?

After Edward Snowden released his trove of documents to the select journalists he wanted to work with, I was really interested to see how people's feelings about data were going to change. The conversation has already taken on a particular kind of awareness of just how much data collection is being done behind the scenes all the time. I am interested in the emotive content of that shift. It is actually quite a significant cultural moment that we have experienced in the last year. In that sense, how is it going to change people's lived experience of data? The more work I did on this, the more I found that data collectors and the surveilled can be mirror images of each other.

By looking at the documents that were released, specifically the Squeaky Dolphin deck [a secret program to monitor millions of YouTube views and Facebook likes in real time] that came from GCHQ, the British intelligence agency, we can actually see into some of the concerns of intelligence agencies. That's not something most of us ever get to do. It is an extraordinary historical moment. We can reflect on how they are thinking about data and what problems they have with these enormous data troves. In many ways they are the old guard of big data. They have been doing this for a long time and facing a lot of the problems that only now some of the private-sector big data collectors are starting to face. Looking at the anxieties they are experiencing gives us a whole new kind of insight into the problems of big data.

What kinds of problems are you referring to?

The key problem or anxiety reflected in the GCHQ PowerPoint is: what can we tell when we have so much data and what happens if we miss the important piece? This is what is so difficult when you have truly staggeringly large data sets. How do you actually resolve the fact that with so much data the critical pieces can be drowned in an enormous fog of correlation?

The way GCHQ is dealing with it is by creating clusters of disciplinary orientation -- people in political science, economics, communications, and anthropology -- to try to make sure things don't slip into the gaps.

Think of the Boston bombing. We know that the Tsarnaev brothers could carry bombs in backpacks into a major street in Boston. We know that planes can go missing. These are the big data black holes that are producing enormous amounts of anxiety for those people collecting big data. In some cases this is intelligence agencies but it goes across the board. So it is in that moment that we can see this mirrored anxiety. For the surveillers it's a question of what can we tell by looking at this data set and what might we miss out on. For those being surveilled the anxiety is, what can they tell about me? Between these two anxieties we are seeing a completely new dialectic being formed.

One of the responses to this anxiety around surveillance culture is to fit in. You mentioned that in the context of big data some people are more vulnerable than others. Can you talk about fitting in as a political strategy?

The cost of standing out is suddenly so much higher. We have an awareness of how much our life is now being recorded, and that data can tell a story that is at once very intimate and also incorrect. We are not our browser history. We are not our entire email archive. But there is the risk that your data can be used against you. That is culturally very profound. It means that there is a big incentive now to blend in, to not stand out, to not cause trouble. That raises a lot of questions around how we think about political change and activism.

One example I give is from the protests in Ukraine in January of this year. Every single protester received the same message on their phone that said, "You have been recorded as an illegal participant in this protest." That is a very spooky experience. But it also has a profound effect on whether you are going to raise your voice and stand up on an issue that is very important to you or whether it is too big a risk to stand out now. This poses a series of questions for the democratic process. How are we going to make big data and democracy work together rather than undermine each other?

That is really the question to land on. Following along this thread, do you think today's "dominant cultural affect" -- this mass anxiety -- is a phenomenon specific to our era? Or have these cultural affects existed throughout history?

I think every era has a particular kind of dominant affect. This has recently been written about very well by the group Plan C in the U.K. They suggest that in the 19th century, particularly with the emergence of industrialism, one of the dominant affects was misery. There were strikes, wage struggles, and a lot of people living below the poverty line. Moving into different stages in the mid-20th century, boredom becomes a kind of affect. And in our current stage, anxiety becomes an affect. I would not necessarily map directly to their choices, but what I think is very interesting is that there are ways of thinking about a dominant aspect of the time. It's often because something so significant is happening that it touches all of us. Big data touches all of us, whether or not we invite it or are even aware of it. If you have a phone, an email account, if you are walking down a street that is using sensors to record foot traffic, you are becoming part of a data set.

The lived experience of that will change the way we think about our relationships to other people, it will change our relationship to our political system, and it will change our relationship to our cities. That is very profound. That is a conversation worth having before we sign up wholesale to a whole range of systems that are operating on the presumption that big data is always a good thing.

This is where the work of people like Virginia Eubanks is so useful. She reminds me that it is always the marginalized populations that are most vulnerable to the uses of surveillance data. She looks specifically at low-income Americans over the last decade and how particular technologies like Electronic Benefit Transfer (EBT) cards were used to produce a culture of surveillance and tracking.

Who is most vulnerable when we turn to these systems? We have to ask those questions first.


Photo: Jonathan McIntosh/Flickr

This post was originally published on Smartplanet.com
