Big Data Quality
"Big Data" is all the rage these days, and the Big Data marketing umbrella seems to be rapidly expanding as a result. The term is getting slapped on all kinds of product marketing narratives, including many kinds of data-oriented analysis products and products where data exists in any kind of volume (many of which are evolutions of data warehousing concepts in need of some newness). So, of course, the usual market confusion is present, as with any hot industry term.
As for me, I like to think of Big Data conceptually as referring to datasets so large that they fall outside the scalability and performance afforded by traditional table-driven, SQL-based data management approaches, and instead need a different way of thinking about and handling the tremendous amount of potential information that exists within them.
The term Big Data emerged as Web-scale companies such as Facebook, Twitter, Google, Amazon, and others started stretching the limits of traditional databases with their sheer data volumes and performance requirements, and began to realize they needed a data management approach more finely tuned to their massive data requirements.
As a result, technologies such as Hadoop, Cassandra, BigData, Dynamo, and others began to appear to help address these requirements. Analytics solutions focused on these massive data volumes have also begun to appear, as have storage and performance alternatives touted as ideal for Big Data. There is also a new class of operational metrics solutions that help to generate these volumes of data, including both software and hardware instrumentation.
However, one concept often seems to be missing from these excited conversations: data quality. While it is true that much of Big Data goes well beyond structured data, it is still data, and data always has the potential to be unusable or flat-out wrong. This omission, of course, creates opportunity for the astute and innovative. Many of the traditional data stewardship approaches are still applicable and necessary, and need to be implemented with Big Data characteristics in mind. Customer data quality, profiling, data standardization, consistency checks prior to analysis and integration, rules-based testing, and even non-technology-oriented quality initiatives (data completeness incentives, for example) need to be part of any Big Data strategy for anyone hoping to have any sort of success.
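To make a few of these ideas a little more concrete, here is a minimal Python sketch of profiling, standardization, and rules-based testing on a handful of records. Everything in it is an assumption made for illustration: the sample records, the field names (`id`, `email`, `country`), the alias table, and the rules themselves are hypothetical, and a real Big Data pipeline would run checks like these in a distributed fashion rather than over an in-memory list.

```python
from collections import Counter

# Hypothetical customer records; field names and values are invented.
records = [
    {"id": 1, "email": "ann@example.com", "country": "us"},
    {"id": 2, "email": "", "country": "US "},
    {"id": 3, "email": "bob-at-example.com", "country": "United States"},
    {"id": 3, "email": "cy@example.com", "country": "CA"},
]

# Hypothetical alias table used to standardize country values.
COUNTRY_ALIASES = {"UNITED STATES": "US", "CANADA": "CA"}


def standardize_country(value):
    """Trim whitespace, upper-case, and map known aliases to a two-letter code."""
    cleaned = value.strip().upper()
    return COUNTRY_ALIASES.get(cleaned, cleaned)


def profile(rows, field):
    """Profile one field: share of non-empty values and number of distinct values."""
    values = [row.get(field) for row in rows]
    filled = [v for v in values if v not in (None, "")]
    completeness = len(filled) / len(values) if values else 0.0
    return {"completeness": completeness, "distinct": len(set(filled))}


# Rules-based testing: each rule returns True when a record passes.
RULES = {
    "email_present": lambda r: bool(r["email"]),
    "email_has_at": lambda r: "@" in r["email"],
    "country_is_code": lambda r: len(standardize_country(r["country"])) == 2,
}


def run_rules(rows):
    """Count how many records fail each rule (rules with no failures are omitted)."""
    failures = Counter()
    for row in rows:
        for name, rule in RULES.items():
            if not rule(row):
                failures[name] += 1
    return dict(failures)


def duplicate_ids(rows):
    """Consistency check: return ids that appear on more than one record."""
    counts = Counter(row["id"] for row in rows)
    return sorted(key for key, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(profile(records, "email"))
    print(run_rules(records))
    print(duplicate_ids(records))
```

The point is less the specific rules than the habit: the checks are declared as data, so they can be reviewed by the people who own the data and run before any analysis or integration step.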
So as you embark on the path of massive data volumes, be sure that a data quality strategy exists as part of the larger Big Data strategy, and keep an eye on what happens in this space, as it's still in its formative period. After all, the last thing any organization wants is Big Bad Data.