The Bulletin

3 reasons why 'big data' can often be meaningless or misleading

Big data analytics promises great advances in the way we do business. However, observers caution that there are risks in making data-based decisions without proper context, over-relying on algorithms, or cherry-picking data.

A few years back, I was doing some research on service oriented architecture, or SOA, for clients. SOA -- in which essential components of applications are broken down into reusable services -- is the foundation for, and forerunner of, today's cloud computing. Running a query on Google Trends, I found that interest in SOA was highest in the Netherlands, and that Dutch cities outscored every other city in the world -- far surpassing places such as San Francisco and San Jose, the heart of Silicon Valley.

Hmm, I pondered -- why is interest in this new computing model so high in the Netherlands? They must be really far ahead of the technology and innovation curve. (The Dutch are extremely industrious, after all.) Perhaps there are some companies and individuals really pushing the technology envelope there? Is it geographic, perhaps because the Netherlands sits at a crossroads in Europe?

I soon found out that in the Netherlands, 'SOA' are the initials for "seksueel overdraagbare aandoening," or "sexually transmissible disease."

The lesson from my simple search exercise is that big data is essentially meaningless -- or potentially misleading -- without proper context. A global search for a term or concept without translation and cultural context can go seriously wrong. In a recent post, Nick Bilton of The New York Times also cautioned against reading too much into Google Flu Trends data, which attempted to track the spread of the virus via an algorithm that counted flu-related searches. Bilton quotes Nature's Declan Butler:

"'Several researchers suggest that the problems may be due to widespread media coverage of this year's severe U.S. flu season,' Declan Butler wrote in Nature. Then add social media, which helped news of the flu spread quicker than the virus itself. In other words, Google's algorithm was looking only at the numbers, not at the context of the search results."

Context is one element of big data that needs to be better understood. Over-reliance on big data analytics is a second peril facing businesses and society. In another New York Times post, Steve Lohr points to big data as a means to better allocate government resources and understand patterns within society. But, he cautions, relying on algorithms carries its own risk, since they are "created by people and they contain inferences and assumptions coded in. Those coded-in values shape the output -- computer-generated predictions, recommendations and simulations."

Over-reliance on algorithms could lead decision makers down the wrong path. Brian Bergstein of MIT Technology Review suggests that growing reliance on big data analytics is even creating a corporate bubble of overconfidence.

[He] fears a future in which such "intuitive knowledge" about how to deploy resources is overruled by algorithms that can work only with hard data and can't, of course, account for the data they don't have ... While it might seem obvious that data, no matter how "big," cannot perfectly represent life in all its complexity, information technology produces so much information that it is easy to forget just how much is missing.

History is full of examples of the incomplete pictures data provides, versus human observations on the ground. The U.S. overreliance on data during the 1959-1975 Vietnam War is a classic example, Bergstein pointed out.

Cherry-picking data is a third area of risk that comes with big data analytics. With abundant information flowing in from so many sources, there is also a potential issue in relying on incomplete or misdirected results. Nassim Taleb cautions in an article in Wired that researchers and analysts working with big data run the risk of cherry-picking information:

"Big data means anyone can find fake statistical relationships, since the spurious rises to the surface. This is because in large data sets, large deviations are vastly more attributable to variance (or noise) than to information (or signal)."

In other words, big data analytics can hand you the results you want to see, rather than an accurate picture of the real-life situation.
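Taleb's point is easy to try out for yourself. The short Python sketch below is my own illustration, not anything from the Wired article: it compares one column of pure random noise against a growing number of other pure-noise columns and reports the strongest correlation it finds. The function names, the 30-observation sample size and the seed are all arbitrary assumptions chosen for the example. Since nothing here is related to anything else, any "relationship" that turns up is spurious by construction.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_spurious_correlation(n_vars, n_obs=30, seed=0):
    """Largest |correlation| between a noise target and n_vars noise columns.

    Every series is independent Gaussian noise, so there is no real
    signal to find. (Hypothetical helper written for this illustration.)
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    best = 0.0
    for _ in range(n_vars):
        noise = [rng.gauss(0, 1) for _ in range(n_obs)]
        best = max(best, abs(pearson(target, noise)))
    return best


if __name__ == "__main__":
    # The more unrelated variables we search through, the stronger the
    # best "finding" looks, even though it is noise every time.
    for n_vars in (10, 100, 1000):
        print(n_vars, round(strongest_spurious_correlation(n_vars), 2))
```

Because the same seed produces the same sequence of columns, a search across 1,000 variables includes everything the 10-variable search saw, so its best match can only be as strong or stronger. That is the cherry-picking risk in miniature: look at enough data and something will always appear to line up.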

Big data offers a lot of insights and opportunities that could never have been dreamed of before. But its users must carefully weigh what it tells them, and still keep human intelligence in charge of the effort.

(Photo credit: Joe McKendrick.)

By Joe McKendrick on February 26, 2013, 1:06 AM PST

Joe McKendrick

Contributing Editor

Joe McKendrick is an independent analyst who tracks the impact of information technology on management and markets. He is a co-author of the SOA Manifesto and has written for Forbes, ZDNet and Database Trends & Applications. He holds a degree from Temple University. He is based in Pennsylvania.