"Big Data" has not only become one of the hottest terms in information technology today, it has also become one of the most confusing. This is partly because of the many definitions of Big Data floating around, and no real consensus as to which sources are authoritative. And of course, once the marketing masses get hold of a valuable term like this, the definition-stretching of a valuable search term gets leveraged to modernize the productizing of legacy solutions and adds to the perplexity, which is already happening a great deal.
Many people think Big Data refers strictly to very large amounts of data. While this is somewhat true, there is no real size threshold that would identify a company, a product, or a solution as Big Data. Big Data to one might not be Big Data to another. After all, large volumes of data have been around for a long time, so in and of itself, data volume is nothing new.
I think the easiest way to understand the meaning of Big Data and its implications, at least in the here and now of late 2012, is to think of it instead as an umbrella term for a collection of creatively inexpensive technologies and architectures for handling large, otherwise unwieldy datasets to which data is being added at very high rates, largely as a result of all the data being generated on the Internet.
Some historical context may help.
Up until several years ago, there were essentially two classes of applications: what are now known as PC-scale (single computer) applications and Enterprise-scale (across the company) networked applications. A standard relational database such as Oracle, MySQL, SQL Server, or even in some cases dBase or Microsoft Access, was adequate to meet even the heaviest data requirements of applications at these scales.
Then, as the Internet began to mature, a third class of applications came along, known as “Web-scale” applications. These include the various applications made available by Google, Facebook, Amazon, Netflix, and others, whose ongoing usage around the globe was generating far more data than any single enterprise had ever seen, essentially too much for any standard relational database to handle.
Even if one of the major commercial database vendors could handle it, the price tag would be in the millions of dollars, plus the cost of all the hardware needed to run it (probably another $10M). These heavy data and performance requirements, combined with the need to meet them at lower cost, spawned a completely new set of database technologies that could handle this scale of operational data without the otherwise absurd price tags.
Some of these architectures include “Hadoop/HBase,” “Cassandra,” “Bigtable,” “Dynamo,” and other distributed, text-based solutions that can, in some cases, run on clusters of standard PCs to handle these complex requirements. These solutions have been architected to handle the requirements of “Web-scale” applications, each with its own advantage (I/O-centric like Cassandra versus analytics-centric like Hadoop, for example), without having to invite the commercial database salesperson to visit. The umbrella term Big Data also covers more than the storage and retrieval of huge datasets: it includes capturing, searching, analyzing, and otherwise manipulating these gigantic streams of data, all, of course, in a reasonable, useful timeframe.
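To make the idea of spreading data across standard PCs a little more concrete, here is a minimal Python sketch of one common approach: hashing each record's key to decide which machine stores it. The node names and the `TinyCluster` class are hypothetical and exist only for illustration. This is not how any of the systems named above is actually implemented; real systems add replication, failure handling, and more sophisticated partitioning schemes.

```python
import hashlib
from collections import defaultdict

# Hypothetical node names standing in for a cluster of standard PCs.
NODES = ["pc-01", "pc-02", "pc-03", "pc-04"]


def node_for(key, nodes=NODES):
    """Pick the node that owns a key by hashing the key.

    Simplifying assumption: plain modulo hashing over a fixed node list.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


class TinyCluster:
    """A toy in-memory key-value store spread across several 'machines'."""

    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        # One dictionary per node stands in for that machine's local storage.
        self.storage = defaultdict(dict)

    def put(self, key, value):
        # Each write touches exactly one node, chosen by the key's hash.
        self.storage[node_for(key, self.nodes)][key] = value

    def get(self, key):
        # A read recomputes the same hash, so it goes to the same node.
        return self.storage[node_for(key, self.nodes)].get(key)


cluster = TinyCluster()
for i in range(1000):
    cluster.put(f"user:{i}", {"clicks": i})

# Show how many records each hypothetical machine ended up holding.
for name in NODES:
    print(name, len(cluster.storage[name]))
```

Because every key maps to exactly one machine, no single computer has to hold the whole dataset, which is the property that lets inexpensive hardware take on Web-scale volumes.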
Therefore, when the industry talks about “Big Data,” most of the experts are referring to these technology approaches, usually in the context of Web-scale data architectures. Companies that focus on these particular technology solutions are the ones that rightfully fall into the Big Data category. Unfortunately, many companies that handle some level of data volume are saying “Hey! We are doing Big Data, too!” This is somewhat like someone saying they are doing the Cloud because they have a website. You cannot stop them from saying it, but it helps to be able to recognize when it occurs.
Hopefully, thinking of Big Data in these technology terms helps clear up some of the mass confusion that seems to surround this emerging concept, enables you to distinguish the solutions and concepts that fit within this realm from those that would merely like to, and allows you to be a contributing member of the discussion.