Rob DuMoulin, Information Architect, Knowledge Based Solutions, Inc.

 

Originally published 10-Jan-2010

http://hubdesignsmagazine.com/2010/01/10/data-profiling-for-all-the-right-reasons-part-1/

Data Profiling For All The Right Reasons

 

Swiss psychologist Carl Gustav Jung founded the school of analytical psychology.  His theory of psychological types forms the basis of the Myers-Briggs Type Indicator assessment given today to high school students to identify personality traits and career aptitude.  Jung also pioneered word association testing, which builds a personality profile from how an individual's thoughts associate with various trigger phrases.  From the results, he could understand how individuals viewed the world around them and perceived their interactions with it.  Typically, the subject is asked to speak the first thought that enters their mind after hearing a trigger phrase.  For the following example, remember, there are no wrong answers.  If I say the words "Data Profiling", what is the first phrase that pops into your mind?

 

If you thought of food, cats, country music, CSI NY, or residential plumbing, it means you are either not in IT or are a Manager.  

 

If your first thought was "Quality Assurance", you align yourself with data quality professionals with potentially anti-social thoughts of failing test cases and sadistically reporting lazy developers for letting bugs slip into their code.  You scour test cases looking for evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.

 

If "Integration" comes first in your mind, past legacy integration projects have left you with a disdain for source system data quality levels.  You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations.  You investigate upstream sources to create detailed mappings and transformation rules.  A typical debugging session might consist of validating relationships to identify orphaned data, identifying attributes that contain overloaded columns (values for more than one distinct data element), or fixing format errors caused by implied decimals.

 

Some of you will respond with "Value Domain" or "Data Types", indicating you are an obsessive-compulsive data architect compelled to organize the world in a strict and orderly fashion with some degree of normalization, though you may not be normal yourself.  Your concerns lie in understanding and regulating naming conventions, relationships, and the existence of NULL or default values, and in understanding the meaning of each data element so you can accurately identify business constraints and recognize when two or more objects refer to the exact same business element.

 

Lastly, if "Debugging" is the first item cached in your mind, chances are you are a coder looking for a reason to defend why presumably good code is not working.  Your paranoia has taught you to assume nothing about data quality, so you must add tests to identify duplicates, validate relationships, track change data capture, provide substitute values, and even add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.

 

It is truly amazing how much we can conclude from the response to one simple phrase.

 

As stated before, there are no wrong answers here.  I did take an innocent jab at Managers and non-IT resources, but we all realize the benefits of information quality, and we absolutely need business involvement to help us understand the context and proper domains of information use throughout the business.  The meaning and activities of Data Profiling change not only by role but also by phase of a project.  Through it, we are able to identify the best sources of information, learn the proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future ones.

 

We need to identify what is important to profile, when and how profiling is done, and how to share our findings across business and project resources.  Done properly, profile information integrates with a master metadata repository whose values are periodically refreshed through an automated process.

 

This article provides a tool-agnostic approach to comprehensive data profiling.  The intent is to provoke thought in terms of information meaning and use, rather than limiting the discussion to a tool comparison.

 

Profiling Basics

 

Important goals of any data profiling strategy include the automation of metric collection and the publication of results to benefit the different objectives of a data-centric project.  With automated metric collection comes the decision of whether to archive results and how frequently collections should occur.  These are business decisions based primarily on how and when the profile results are used.  Early in a system development lifecycle, profiling helps define data storage and movement requirements.  As a product goes into production (or if profiling is added to an existing system for quality control purposes), profile sessions may occur daily or on demand to audit quality or identify business rule violations.  The frequency of collection and the effort you expend to automate your profile methods should be based on the ability of the organization to benefit from updated results.

 

For this discussion, consider a Profile Group to be a set of like information analyzed together because it shares a common purpose and management method.  Examples of profile groups are the tables within a single database schema, or a group of spreadsheets that share one format but each represent a different day of data.

 

The underlying System managing a set of information within the profile group may be a named relational database, a file system directory, or even a web site accessed through web services.  The reason we abstract information into Systems is to group it under the distinct governance methods common to the underlying information.  Relevant metadata and governance methods we track at the system level include: technical contacts, backup schedules, system descriptors, connection strings, business unit owners, and host operating systems.  System-level metadata common to a profile group helps us understand and troubleshoot future analyses.  This level of information also gives developers an understanding of the inherent restrictions (or freedoms) they may encounter when trying to use or integrate the information.

 

Entities within a profile group all belong to the same system, have a unique identifier, and exist under the same owner (for database entities).  Typically, entities are database tables, but they may also be similar files or spreadsheets that contain consistent attribute lists.  For entities, we track characteristics common to all the attributes they contain.  These include: row counts, entity-level descriptors, growth characteristics (size and frequency), last analyzed date, and various customized indicators such as active/inactive status, the existence of change data management attributes such as insert/update timestamps, or the existence of audit traceability attributes such as insert/update usernames.
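
To make this hierarchy concrete, here is a minimal sketch in Python of how system-, entity-, and group-level metadata might be recorded.  The class and field names are illustrative assumptions rather than a prescribed schema; adapt them to whatever repository design you settle on.

    from dataclasses import dataclass, field
    from datetime import date
    from typing import List, Optional

    @dataclass
    class System:
        # Governance metadata common to everything the system manages.
        # Field names are assumptions based on the metadata listed above.
        name: str
        technical_contact: str
        business_owner: str
        connection_string: str
        host_os: str
        backup_schedule: str
        description: str = ""

    @dataclass
    class Entity:
        # A table, file, or spreadsheet with a consistent attribute list.
        name: str
        owner: str
        row_count: int
        last_analyzed: Optional[date] = None
        is_active: bool = True
        has_change_capture: bool = False   # insert/update timestamps present?
        has_audit_trail: bool = False      # insert/update usernames present?

    @dataclass
    class ProfileGroup:
        # Like information analyzed together under one management method.
        name: str
        system: System
        entities: List[Entity] = field(default_factory=list)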

 

The combination of system and entity profiling supplies the foundation for the attribute-level information, which is where business information actually resides.  It also provides valuable metadata to classify information and allows for future correlation of like information that exists across systems.  Assembly and publication of entity- and system-level information benefits the various consumers of that information by providing a centralized "master" source of contact and context information.

 

Attribute Analyses

 

Alone, an attribute has no context.  A simple descriptor such as "Cost" tells very little about an attribute's true purpose and immediately summons the need for additional information, such as: units (hours, Dollars, Euros...), type (weighted, unit, gross...), or use (invoice, sum, average...).  Attributes therefore must be analyzed within the context of their business purpose.  Some characteristics require business knowledge to define; others can be determined through interrogation of existing values and the underlying rules of the storage medium.  It takes both analyses to get a complete picture of the information.  While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, data profiling will only provide you assumptions about data quality.

 

Analyses of values, domains, and constraints allow insight into the domain and use (or abuse) of an attribute.  The larger the sample size, the more confidence you gain in the results.  Without explicit proof of business rule enforcement, though, you cannot assume that because a value does not appear, it cannot exist.  Knowledge of business rule enforcement is found in active database constraints, data type/precision, and application-level checks or transformations.  The existence of automated validations like these allows you to accurately narrow a domain, but not necessarily to understand it completely.  Profiling of actual values gives you additional refinement in terms of: percentage of NULL values, percentage of distinct values, minimum, maximum, and average values, top x and bottom x recurring values along with their counts, and minimum, maximum, and average data lengths.
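
As a rough sketch of what this value-level collection might look like, the Python function below computes those refinements for a single attribute's values.  The function and metric names are my own assumptions, and it presumes the non-null values are mutually comparable.

    from collections import Counter

    def profile_values(values, top_x=10):
        """Collect basic value-domain metrics for one attribute's values."""
        total = len(values)
        non_null = [v for v in values if v is not None]
        counts = Counter(non_null)
        lengths = [len(str(v)) for v in non_null]
        # Averages only make sense for numeric attributes.
        numeric = [v for v in non_null if isinstance(v, (int, float))]
        return {
            "pct_null": 100.0 * (total - len(non_null)) / total if total else None,
            "pct_distinct": 100.0 * len(counts) / total if total else None,
            "min": min(non_null, default=None),
            "max": max(non_null, default=None),
            "avg": sum(numeric) / len(numeric) if numeric else None,
            "top_values": counts.most_common(top_x),
            "bottom_values": counts.most_common()[:-top_x - 1:-1],
            "min_len": min(lengths, default=None),
            "max_len": max(lengths, default=None),
            "avg_len": sum(lengths) / len(lengths) if lengths else None,
        }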

 

Some attributes serve specific purposes within a data set that are relevant to consumers of the profile results.  Attributes that, individually or in conjunction with others, define the uniqueness of the data set are relevant to understanding relationships between entities.  Uniqueness can be further classified as belonging either to a system-enforced primary key or to a business key (one outside the defined primary key).  System-enforced primary keys are relatively easy to identify within a database system by simply viewing the system catalog for assigned keys.  Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
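
A simple way to test a candidate business key is to check whether any combination of the candidate attributes repeats across the rows.  In this sketch, the column names are hypothetical and rows are assumed to arrive as Python dictionaries.

    def is_candidate_key(rows, columns):
        """True if the given column combination is unique across all rows."""
        seen = set()
        for row in rows:
            key = tuple(row[c] for c in columns)
            if key in seen:
                return False     # duplicate combination: not a key
            seen.add(key)
        return True

    # Example: test a two-attribute business key outside the primary key.
    rows = [
        {"order_id": 1, "store": "A", "receipt_no": 100},
        {"order_id": 2, "store": "A", "receipt_no": 101},
        {"order_id": 3, "store": "B", "receipt_no": 100},
    ]
    print(is_candidate_key(rows, ["store", "receipt_no"]))  # True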

 

Other types of attributes that may help in identifying relevancy are those that provide system-level auditing or change control.  Knowing which attributes fill these roles may either allow you to ignore them for profiling purposes or use them to help explain data anomalies.

 

Relationships and Patterns

 

Pattern grouping provides a mask of the distinct format patterns within an attribute's data set, along with a count of occurrences for each pattern.  Patterns give insight into the type of values found in an attribute.  For example, a numeric pattern analysis may show values such as 999.99999, 99, or -.9999.

 

Observing distinct patterns gives insight into the maximum digits and precision, and also into domains such as integer or real.  Patterns for a native database date or date-time type are unremarkably similar for all dates.  Because the database management system typically enforces the domain, pattern analysis of such dates provides no value and can be ignored.  If dates are stored in character format, however, patterns quickly show variations in date formatting.  Character patterns only have significance for a limited number of positions.  It makes no sense to pattern a description field of 200 or 2,000 characters.  Smaller code attributes of fewer than 10 characters, though, do provide value.  Ignore pattern profiling for character strings over 20 characters at first, then refine to shorter strings if the results do not add value.
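
One common way to implement pattern grouping is to translate each character into a class symbol (9 for digits, A for letters) and count the distinct masks.  This Python sketch applies the 20-character cutoff suggested above; the function names and symbol choices are illustrative assumptions.

    from collections import Counter

    def pattern_mask(value, max_len=20):
        """Mask digits as 9 and letters as A, keeping punctuation as-is."""
        s = str(value)
        if len(s) > max_len:
            return None  # skip long free-text fields; patterns add no value
        return "".join("9" if c.isdigit() else "A" if c.isalpha() else c
                       for c in s)

    def pattern_profile(values):
        """Count occurrences of each distinct format pattern."""
        masks = (pattern_mask(v) for v in values if v is not None)
        return Counter(m for m in masks if m is not None)

    # Example: character dates quickly reveal inconsistent formatting.
    print(pattern_profile(["2010-01-10", "01/10/2010", "2010-01-11"]))
    # Counter({'9999-99-99': 2, '99/99/9999': 1})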

 

In pure database theory, referential integrity (RI) is your friend.  In practice, designers and software vendors forgo RI to improve system performance on data inserts.  These designers place the data quality burden on the application and do not endorse data manipulation outside the application interfaces.  In the real world, though, data corruption occurs, and without RI or routine data quality checks, corruption may not be found for a long time, or at all.  Personally, I identified over $50,000 of recent orphaned sales at a retail client, the result of deliberately disabled RI.  These unreported sales never reached the ledger; the condition had been allowed to persist for performance reasons until I found it through simple profiling.  Enforcement of RI is a topic for another discussion, but it is mentioned here because it identifies a valid reason for data profiling.

 

Even in presumably good relational designs, some parent-child relationships are not enforced, for various reasons.  First, interrogate the RI listed in the system catalogs to identify all enforced relationships.  Reverse-engineering the system with a good modeling tool is probably the best way to do this.  A harder and more valuable analysis is to identify unenforced relationships and to determine the probability of a relationship when not all values match exactly.  Do this by counting the candidate child attribute values that exist within a known parent attribute table.  If all values match and the number of matches is non-trivial, there is a good probability of an unidentified relationship.  A small number of mismatches could indicate data quality issues.
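
That counting logic can be sketched in a few lines of Python; the match percentage stands in for the "probability" above, and the example values are hypothetical.  The same check can be run in the database as a single LEFT JOIN, shown in the trailing comment.

    def relationship_match(child_values, parent_values):
        """Percentage of non-null child values found in the parent key set."""
        parent = set(parent_values)
        child = [v for v in child_values if v is not None]
        if not child:
            return None
        matches = sum(1 for v in child if v in parent)
        return 100.0 * matches / len(child)

    # A high match rate with a few misses suggests an unenforced
    # relationship with possible orphaned (data quality) rows.
    parents = [1, 2, 3, 4]
    children = [1, 1, 2, 3, 3, 99]                # 99 is a potential orphan
    print(relationship_match(children, parents))  # 83.33...

    # Equivalent SQL sketch (table and column names hypothetical):
    #   SELECT COUNT(*) FROM child c
    #   LEFT JOIN parent p ON c.parent_id = p.id
    #   WHERE p.id IS NULL;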

 

The Payoff

 

If you have defined your corporate data profiling strategy along the lines of the methods discussed in this article, you'll have amassed a robust collection of metadata spanning the relevant systems across your business.  Although the systems may differ in type and location, the structured approach and common metrics you collected create a centralized repository of information that can be examined holistically.  Ideally, this information will exist in an open-source database repository with reports made available across the enterprise.  System and Entity information helps planners and developers organize information strategies.  Attribute-level domains, constraints, and business rules help data architects understand existing systems.  Relationships and value patterns are readily available to support validation of information-related hypotheses as needed.

 

If you plan to design your own repository, consider adding timestamps and indicators to help you manage and present the information.  To keep your repository relevant to business needs, design collection rules to be configurable.  This allows you to easily ignore superfluous information or enable tests only at certain critical times.  Allow initial system profiling efforts to gather a large set of metrics and store them as your baseline.  As you learn about the information, you will see which tests or data objects add no value.  We geeky DBA types who understand system-level catalogs have our own scripts to do much of what was described in this article.  Those less inclined may prefer to use a third-party tool for profiling.  Either way works, as long as the business needs are satisfied and the entire enterprise standardizes on one approach (and thus one integrated repository).
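
As one way to make collection rules configurable, individual checks can sit behind switches so superfluous tests are easy to disable, or enabled only at critical times.  This sketch reuses the profile_values and pattern_profile functions from the earlier examples; the configuration keys are my own assumptions.

    # A minimal sketch of configuration-driven collection.
    PROFILE_CONFIG = {
        "value_metrics": True,
        "pattern_masks": True,   # disable for long free-text attributes
        "top_x": 10,
    }

    def run_profile(values, config=PROFILE_CONFIG):
        """Run only the enabled checks for one attribute's values."""
        results = {}
        if config["value_metrics"]:
            results["values"] = profile_values(values, top_x=config["top_x"])
        if config["pattern_masks"]:
            results["patterns"] = pattern_profile(values)
        return results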

 

You will find that collecting and maintaining this level of detail has a definite cost.  Even if the collection is automated, interrogation of large data sets places an overhead on production systems that may not be practical.  Record and monitor profile execution metrics to identify bottlenecks or tuning opportunities.  Realize that the extent of data profiling is contingent on the project phase, the specific data elements, and most of all, business value.  Review profiling goals on a regular basis and eliminate unnecessary and redundant checks.

How much profile history to maintain is another consideration.  Even though disk space is relatively cheap, maintaining all historical entries in a live repository may not be necessary.  Consider the business needs and the value of historical profile information.  Also consider archiving at a summarized (or less frequent) level and keeping only a limited time window of statistics online.

 

This discussion on data profiling was intended to broaden perceptions of what it means to a business and the value it can bring if done in a sustainable way.  Use your imagination, but remember that no matter how cool it might be to collect and store some profile output, if it does not add business value to somebody, it might not be worth the overhead to continue recording it.