Data Quality in the Age of Big Data
Traditional data quality best practices and tool functions still apply to big data, but success depends on making the right adjustments and optimizations.
Whether data is big or small, old or new, traditional or modern, on premises or in the cloud, the need for data quality doesn’t change. Data professionals under pressure to get business value from big data and other new data assets can leverage existing skills, teams, and tools to ensure quality for big data. Even so, reuse alone is not enough: those existing techniques must be adapted to the requirements of the current times.
Data professionals must protect the quality of traditional enterprise data as they adjust, optimize, and extend data quality and related data management best practices to fit the business and technical requirements of big data and similar modern data sets. Unless an organization does both, it may fail to deliver the kind of trusted analytics, operational reporting, self-service functionality, business monitoring, and governance that are expected of all data assets.
Adjustments and Optimizations Make Data Quality Tasks Relevant to Big Data
The good news is that organizations can apply current data quality and other data management competencies to big data. The slightly bad news is that organizations need to understand and make certain adjustments and optimizations. Luckily, familiar data quality tasks and tool functions are highly relevant to big data and other valuable new data assets — from Web applications, social media, the digital supply chain, SaaS apps, and the Internet of Things — as seen in the following examples.
Standardization. A wide range of users expect to explore and work with big data, often in a self-service fashion that depends on SQL-based tools. Data quality’s standardization makes big data more conducive to ad hoc browsing, visualizing, and querying.
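The idea behind standardization can be sketched in a few lines. The snippet below is a minimal illustration, not a real data quality tool: it normalizes mixed date formats and inconsistent casing so that SQL-style filters and group-bys behave predictably. The field names and accepted formats are hypothetical assumptions.

```python
# Minimal standardization sketch: coerce mixed-format values into one
# canonical form so ad hoc SQL queries group and filter correctly.
from datetime import datetime

# Illustrative assumption: these are the formats seen in the source feeds.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y"]

def standardize_date(value: str) -> str:
    """Coerce a date string into ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")

def standardize_record(record: dict) -> dict:
    """Normalize casing and date fields (hypothetical field names)."""
    return {
        "country": record["country"].strip().upper(),
        "signup_date": standardize_date(record["signup_date"]),
    }

raw = {"country": " usa ", "signup_date": "07/04/2023"}
print(standardize_record(raw))  # {'country': 'USA', 'signup_date': '2023-07-04'}
```

Without this normalization, `' usa '`, `'USA'`, and `'usa'` would land in three different groups in a self-service query.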
Deduplication. Big data platforms almost invariably end up with the same data loaded multiple times. This skews analytics outcomes, makes metric calculations inaccurate, and wreaks havoc with operational processes. Data quality’s multiple approaches to matching and deduplication can remediate data redundancy.
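One of the matching approaches mentioned above, fuzzy string matching, can be sketched with the standard library. This is an illustrative toy, assuming a name field and a similarity threshold chosen for the example; production deduplication uses far richer matching and survivorship logic.

```python
# Toy fuzzy deduplication: keep the first occurrence of each cluster
# of near-identical names, using stdlib sequence similarity.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def deduplicate(names: list[str], threshold: float = 0.85) -> list[str]:
    """Drop names that closely match one already kept (threshold is illustrative)."""
    kept: list[str] = []
    for name in names:
        if all(similarity(name, k) < threshold for k in kept):
            kept.append(name)
    return kept

records = ["Acme Corp", "ACME Corp.", "Globex Inc"]
print(deduplicate(records))  # ['Acme Corp', 'Globex Inc']
```

Note the pairwise scan is quadratic; at big data scale, real tools first block or partition records (for example, by a phonetic key) so only plausible pairs are compared.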
Matching. Links between data sets can be hard to spot, especially when the data comes from a variety of source systems, both traditional and modern. Data quality’s data matching capabilities help validate diverse data and identify dependencies among data sets.
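A minimal sketch of cross-system matching follows, assuming hypothetical CRM and web-analytics records with no shared identifier: both sides are reduced to a normalized key so the link can be spotted despite formatting differences.

```python
# Sketch of record linkage across two source systems on a normalized key.
# Field names ("customer", "user") are hypothetical; real matching would
# combine several attributes rather than a single key.
import re

def match_key(name: str) -> str:
    """Reduce a name to a comparable key: lowercase, alphanumerics only."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

def link(crm_rows: list[dict], web_rows: list[dict]) -> list[tuple[dict, dict]]:
    """Pair each web record with the CRM record sharing its normalized key."""
    index = {match_key(r["customer"]): r for r in crm_rows}
    return [(index[match_key(w["user"])], w)
            for w in web_rows if match_key(w["user"]) in index]

crm = [{"customer": "Jane O'Neil", "tier": "gold"}]
web = [{"user": "jane oneil", "visits": 12}]
print(link(crm, web))  # one linked pair despite the formatting mismatch
```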
Profiling and monitoring. Many big data sources — such as e-commerce, Web applications, and the Internet of Things (IoT) — lack consistent standards and evolve their schema unpredictably without notification. Whether profiling big data in development or monitoring it in production, a data quality solution can reveal new schema and anomalies as they emerge. Data quality’s business rule engines and new smart algorithms can remediate these automatically at scale.
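The monitoring idea above can be illustrated with a small profiler for schemaless records. This is a sketch under assumed field names and metrics: it flags fields that were not in the expected schema (schema drift) and reports per-field null rates, two of the anomalies a production data quality solution would surface.

```python
# Minimal profiling sketch for schemaless event records: detect fields
# outside the expected schema and compute per-field null rates.
def profile(records: list[dict], known_fields: set[str]) -> dict:
    """Report unexpected fields and the fraction of null values per field."""
    seen: dict[str, int] = {}
    nulls: dict[str, int] = {}
    for rec in records:
        for field, value in rec.items():
            seen[field] = seen.get(field, 0) + 1
            if value is None:
                nulls[field] = nulls.get(field, 0) + 1
    return {
        "new_fields": sorted(set(seen) - known_fields),
        "null_rate": {f: nulls.get(f, 0) / seen[f] for f in seen},
    }

events = [
    {"id": 1, "price": 9.99},
    {"id": 2, "price": None, "coupon": "SAVE5"},  # drift: "coupon" is new
]
report = profile(events, known_fields={"id", "price"})
print(report["new_fields"])  # ['coupon']
```

In production, a rule engine would act on such a report automatically, for example quarantining records with the unexpected field or alerting when a null rate crosses a threshold.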
Customer data. As if maintaining the quality of traditional enterprise data about customers weren’t challenging enough, many organizations are now capturing customer data from smartphone apps, website visits, third-party data providers, social media, and a growing list of customer channels and touchpoints. For these organizations, customer data is the new big data. All mature data quality tools have functions designed for the customer domain, and most of these tools have been updated recently to support big data platforms and clouds to leverage their speed and scale.