Never mind trying to tame “big data,” is this something that can even be measured? How big is the big data market anyway? Deloitte is attempting to do just that.
The first question is what, exactly, is “big data”? In a recent interview (video posted below), Duncan Stuart, director of research for TMT at Deloitte Canada, defined it as 5 petabytes or more.
Many in the industry define big data not just by volume, but also by velocity and variety. But 5 PB is a good, simple threshold for current measurements. It’s also a moving target: bear in mind that 5 PB may be what you find in a tablet computer three years from now.
Such large, multi-petabyte sites are likely proliferating. In a survey I helped conduct last fall as part of my work with Unisphere Research/Information Today Inc., nine percent of the companies participating reported data stores exceeding 1 petabyte. For comparison, a petabyte is 1,000 times bigger than those 1-terabyte databases that made news just a decade ago.
In its report, Deloitte spells out the challenges of sizing the big data market: definitions of big data vary, it is still early in the adoption cycle of big data technologies, and most of the companies doing big data do not disclose their spending.
Nevertheless, Deloitte pegs the size of the big data market at about $1.3-$1.5 billion in 2012. The consultancy also predicts that this year, we’ll see big data experience accelerating growth and market penetration:
“As recently as 2009 there were only a handful of big data projects and total industry revenues were under $100 million. By the end of 2012 more than 90 percent of the Fortune 500 will likely have at least some big data initiatives under way.”
But the industry is still in its infancy, Deloitte cautions. “Big data in 2012 will likely be dominated by pilot projects; there will probably be fewer than 50 full-scale big data projects (10 PBs and above) worldwide.”
There are compelling reasons for companies to pursue big data. “Big data can see through time, big data basically allows you to see everything all at once, and in much finer detail,” says Stuart. “Instead of looking at my customer’s behavior once a month, I can look at it every minute of every day. That kind of insight is very, very powerful. It allows me to serve my customer better. [...] either very large or very fast or both, requires the big data toolset.”
Challenges include the fact that solutions are still maturing: “software is still being written,” says Stuart. A looming skills shortage may also make big data projects difficult to bring to reality. Between 140,000 and 190,000 skilled big data professionals will be needed in the US alone over the next five years.
Big data will also require businesses to align workflows, processes and incentives to get the most out of it. Stuart adds that enterprises should not concentrate on big data at the expense of ‘current data,’ or business information as normal. “There is still a lot of value left to be extracted from the information inside their traditional databases,” he says. “Can I solve this using traditional relational database tools or traditional BI tools? Use the right tools where the need is.”
The rise of big data also has the full attention of the venture capital community, Deloitte notes in its report. “Big data companies are attracting funding rounds of over $50 million, big data venture funds are being created, and large existing software players are validating the markets by partnering with or acquiring outright early stage leaders in the space.”
Joe McKendrick is an independent analyst who tracks the impact of information technology on management and markets. He is the author of the SOA Manifesto and has written for Forbes, ZDNet and Database Trends & Applications. He holds a degree from Temple University. He is based in Pennsylvania.
Joe McKendrick is an independent consultant and editor. Joe has performed project work for the following companies in the IT marketspace: IBM, Systinet/HP, Teradata. He has performed project work for the following organizations in partnership with Unisphere Research (Unisphere Media): IBM, Oracle Corp., International Oracle Users Group, Oracle Applications Users Group, Professional Association for SQL Server, International DB2 Users Group, International Sybase Users Group.
He writes for SmartPlanet and is not an employee of CBS.
As far as defining "big data" goes, putting a static quantitative label on it is just short of nonsensical. Big data is merely data that's an order of magnitude bigger than you are accustomed to, Grasshopper!
--Doug Laney, VP Research, Gartner, @doug_laney
Updated - 21st Feb 2012
But is it???
Are they *really* talking about petabytes (10^15 bytes) or pebibytes (2^50 bytes)? I know it's pedantic, but at PB/PiB levels the difference per 'PB' is 125,899,906,842,624 bytes.
For those who don't understand the difference: 1 bit (a contraction of "binary digit") is either a zero or a one, and there are 8 bits in a byte. 1000 bytes is a kilobyte; 1000 is 10 to the power of 3, or 10x10x10, written 10^3. However, most computer hardware doesn't generally work in decimal ;o) and instead counts in powers of two, so two to the power of 10 (i.e. 2^10) = 1024. Here's a good example: change your display settings to 8-bit colour and you get 2^8 = 256 colours, while 16-bit colour gives 65,536 (2^16) colours. If you've used an old version of Excel, you may have noticed the 65,536-row limit, which came from row numbers being stored in 16 bits. Anyway, why is the above important? Because as you scale up, the gap between the two notations grows quickly.
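As a rough illustration of how that gap grows, the following Python sketch compares each decimal prefix with its binary counterpart. The helper name `prefix_gap` and the prefix table are illustrative choices only, not anything taken from the Deloitte report.

```python
# Decimal (SI) prefixes paired with their binary (IEC) counterparts.
# The table and the helper name are hypothetical, chosen for illustration.
PREFIXES = [("kB", "KiB"), ("MB", "MiB"), ("GB", "GiB"), ("TB", "TiB"), ("PB", "PiB")]


def prefix_gap(step):
    """Return (decimal size, binary size, difference) in bytes.

    step 1 is kilo/kibi, step 5 is peta/pebi.
    """
    decimal = 1000 ** step
    binary = 1024 ** step
    return decimal, binary, binary - decimal


for step, (si, iec) in enumerate(PREFIXES, start=1):
    decimal, binary, gap = prefix_gap(step)
    # Show the absolute gap and how much larger the binary unit is.
    print(f"1 {iec} - 1 {si} = {gap:,} bytes ({gap / decimal:.1%} larger)")
```

At the peta step the gap comes to 125,899,906,842,624 bytes, the figure quoted above.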
Have you got a 1 TB hard disk? Ever notice when you format it you only have ~931 GB available to Windows? Ever thought that it was because Windows reserves some of that space or something similar? Well, the reality is you bought a drive to one standard of notation (KB/MB/GB/TB) but Windows displays to a different notation (KiB/MiB/GiB/TiB).
Unfortunately, few places reference kibibytes/mebibytes/gibibytes/tebibytes, and instead just confuse matters by using the 'MB' or 'GB' terms for both.
Your 1 TB drive presents itself to Windows (accurately) as 1,000,000,000,000 bytes (10^12 bytes). However, as Windows uses binary notation, '1 GB' is actually 1,073,741,824 bytes. Therefore you only get ~931 lots of 1,073,741,824 bytes in your 1,000,000,000,000 bytes, or to put it another way, '10^12 bytes divided by 2^30 bytes'. Hence a '1 TB' drive will display in Windows as about '931 GB' (931 GiB).
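The division can be checked with a short Python sketch. The function name `displayed_gib` is made up for this example, and the sketch assumes the operating system simply divides the raw byte count by 2^30 when it reports "GB".

```python
GIB = 2 ** 30  # bytes in one gibibyte, the unit Windows labels "GB"


def displayed_gib(advertised_bytes):
    """Size, in GiB, that a binary-reporting OS would show for a byte count."""
    return advertised_bytes / GIB


one_tb = 10 ** 12  # a drive sold as "1 TB" (decimal notation)
print(f"{displayed_gib(one_tb):.2f} GiB")  # prints 931.32 GiB
```

Nothing is reserved or lost here: the same number of bytes is simply being counted in a larger unit.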
I'm guessing, as the terms petabytes and PB have been used, that we're talking 10^15 bytes, but just thought I'd check. I wouldn't want datacenter administrators to start crying that they seem to be missing storage because the OS they mount it in is displaying in pebibytes (PiB) :P
Thanks, Doug. You nailed it more than a decade ago, and the 3 Vs do represent the best way to size the big data inflow.
Your conclusions from that original report still resonate with today's challenges: "Attention to data management, particularly in a climate of e-commerce and greater need for collaboration, can enable enterprises to achieve greater returns on their information assets... IT organizations must look beyond traditional direct brute force physical approaches to data management."