StrikeIron Blog


Big Data Quality


"Big Data" is all the rage these days, and the Big Data marketing umbrella is expanding rapidly as a result. The term gets slapped onto all kinds of product marketing narratives, including many data-oriented analysis products and products where data exists in any kind of volume (much of it an evolution of data warehousing concepts in need of some newness). So, as with any hot industry term, the usual market confusion is present.

As for me, I like to think of Big Data conceptually as referring to datasets so large that they fall outside the scalability and performance afforded by traditional table-driven, SQL-based data management approaches, and instead require a different way of thinking about and handling the tremendous amount of potential information within them.

The term Big Data emerged as Web-scale companies such as Facebook, Twitter, Google, and Amazon stretched the limits of traditional databases with their sheer data volumes and performance requirements, and realized they needed a data management approach more finely tuned to those massive requirements.

As a result, technologies such as Hadoop, Cassandra, BigTable, Dynamo, and others began to appear to address these requirements. Analytics solutions focused on massive data volumes have also emerged, along with storage and performance alternatives positioned as ideal for Big Data. There is also a new class of operational metrics solutions, spanning both software and hardware instrumentation, that help generate these volumes of data in the first place.

However, one concept is often missing from these excited conversations: data quality. While much of Big Data goes well beyond structured data, it is still data, and data always has the potential to be unusable or flat-out wrong. That omission, of course, creates opportunity for the astute and innovative. Many traditional data stewardship approaches are still applicable and necessary; they simply need to be implemented with Big Data characteristics in mind. Customer data quality, profiling, data standardization, consistency checks prior to analysis and integration, rules-based testing, and even non-technology-oriented quality initiatives (data completeness incentives, for example) need to be part of any Big Data strategy with a hope of success.
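To make the standardization and rules-based-testing ideas above concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the field names ("email", "zip"), the rules, and the sample records are illustrative assumptions, not drawn from any particular product or dataset.

```python
import re

# Hypothetical record-level quality rules: each rule returns True when the
# record passes. Field names ("email", "zip") are illustrative only.
RULES = {
    "email_present": lambda r: bool(r.get("email")),
    "email_format": lambda r: bool(
        re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", r.get("email", ""))
    ),
    "zip_5_digits": lambda r: bool(re.fullmatch(r"\d{5}", r.get("zip", ""))),
}

def standardize(record):
    """Trim and lowercase string fields so later comparisons are consistent."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def profile(records):
    """Count rule failures across a batch -- a tiny profiling report."""
    failures = {name: 0 for name in RULES}
    for raw in records:
        rec = standardize(raw)
        for name, rule in RULES.items():
            if not rule(rec):
                failures[name] += 1
    return failures

batch = [
    {"email": "  Alice@Example.COM ", "zip": "27513"},
    {"email": "not-an-email", "zip": "275"},
    {"email": "", "zip": "10001"},
]
print(profile(batch))  # e.g. {'email_present': 1, 'email_format': 2, 'zip_5_digits': 1}
```

The point of the sketch is the shape, not the rules themselves: standardize first, then apply declarative checks, then report counts so stewardship efforts can be prioritized by failure rate.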

So as you embark on the path of massive data volumes, make sure a data quality strategy exists as part of the larger Big Data strategy, and keep your eye on what happens in this space, as it's still in its formative period. After all, the last thing any organization wants is Big Bad Data.



Profiling, matching, de-duping, handling missing values, and enhancement (the bread and butter of data quality) are all just as important when analyzing big data as any other kind of data. The only difference is that most software can't handle that level of data crunching; it's very compute-intensive. It's currently the biggest challenge for integration, data quality, and analytics software to solve. Governance and stewardship processes depend on the technology not falling over and dying when it gets hit with the massive volumes businesses are starting to have to deal with.
Posted @ Thursday, July 05, 2012 2:47 PM by Paige Roberts
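The matching and de-duping the comment above describes can be sketched as a single pass over records keyed by a normalized match key. This is a deliberately naive illustration with made-up names and fields; real entity matching uses fuzzier techniques (phonetic codes, edit distance, blocking) and is exactly the kind of compute-intensive work the comment is talking about.

```python
def dedup_key(record):
    """Naive match key: normalized name plus zip. A real matcher would be
    far more sophisticated -- this only catches trivial duplicates."""
    return (record["name"].strip().lower(), record["zip"].strip())

def dedupe(records):
    """Keep the first record seen for each match key (one linear pass)."""
    seen = {}
    for rec in records:
        seen.setdefault(dedup_key(rec), rec)
    return list(seen.values())

# Hypothetical customer records; the first two differ only in case/whitespace.
customers = [
    {"name": "Jane Doe ", "zip": "27513"},
    {"name": "jane doe", "zip": "27513"},
    {"name": "John Smith", "zip": "78701"},
]
print(len(dedupe(customers)))  # 2
```

At Big Data scale, even this linear pass becomes nontrivial, since the key table itself may not fit on one machine; that is why this work increasingly moves into distributed frameworks or into the collection process itself.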
That's a great point. This is why you are likely to see data quality mechanisms built into business processes where the data is collected, rather than after-the-fact data quality cleanups in a lot of Big Data scenarios - the concepts are the same, but often the approaches will need to be different to handle the volumes. It's a real opportunity.
Posted @ Thursday, July 05, 2012 9:47 PM by Bob Brauer