Rob DuMoulin, Information Architect, Knowledge Based Solutions, Inc.


Originally published 14-Nov-2011 at http://hubdesignsmagazine.com/2011/11/14/the-importance-of-context-and-explicit-values-part-1/

“I shot a cat wearing green pajamas.”

Before you report me for animal cruelty, did you paint a mental image of me or the cat in pajamas? How do you know I was not referring to the 1920s phrase for a person at the top of their game, "the cat's pajamas"? Would your opinion of me change if you had implicit knowledge that I was a professional photographer specializing in novelty pet shoots? How about if you knew I photographed vintage sports cars while wearing bedclothes, or made analogies as if cars were dressed? Maybe you implicitly know I am a caretaker who lives at a zoo and carries a tranquilizer gun for safety, or that I am a guide for a big-game safari hunting club? Do I need to go on?

A simple factual statement like this is common in the tabloid press because, without proper context or additional implicit information, it is easy to reach invalid conclusions. Computers are good at many things, but unless they are programmed to be learning machines, they are not good at inferring context. Moreover, computers need not infer context if system designers and data owners understand how ambiguity affects the effectiveness of their Master Data Management (MDM) and Data Governance (DG) efforts. Do your business users have all the information they need to reach valid conclusions in every situation? If you answered 'Yes' and truly understand what the term 'implicit values' means, you do not need to continue reading.

It looks like all of you are still here. For the record, I was not the one who was wearing pajamas.

In Data Architecture terms, an implicit value is one whose meaning is conveyed indirectly by applying information found elsewhere. Its antonym is an explicit value: one with a self-contained meaning that exists in a single domain (meaning that all of the values are of the same type and have the same business rules applied to them). Both are discussed further in this article. In practice, explicit definitions are preferred over implicit ones, but there are exceptions.

This article will present the scenarios, reasons, risks, and alternatives associated with certain types of implicit values in your design, and explain why there is no good reason to use some of them in MDM or BI architectures. Three classes of implicit values will be presented: coded values, overloading, and NULL/default values.

Using coded values is a widely accepted and valuable relational modeling practice in which a value within a single domain is classified by a separate code value. For example, a GL dollar amount can be assigned to one of many GL Line Types. The two columns, the GL dollar amount and the GL Line Type Code, are implicitly linked together because each has no context without the other. This is an acceptable design practice because each column exists in one and only one domain: GL dollar amounts or GL Line Type Codes.
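To make the pattern concrete, here is a minimal sketch using Python's sqlite3 module; the table and column names (gl_line_type, gl_entry, amount_usd) are illustrative, not taken from any particular GL package.

```python
# A minimal sketch of the coded-value pattern using Python's sqlite3 module.
# Table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
cur = conn.cursor()

# Parent code table: one domain, one mutually exclusive meaning per code.
cur.execute("""
    CREATE TABLE gl_line_type (
        line_type_code TEXT PRIMARY KEY,
        description    TEXT NOT NULL
    )
""")
cur.executemany(
    "INSERT INTO gl_line_type VALUES (?, ?)",
    [("REV", "Revenue"), ("EXP", "Expense"), ("TAX", "Tax")],
)

# Child table: amount_usd is always a dollar amount, line_type_code is always
# a line-type code.  Each column stays in exactly one domain; only the pair
# of columns carries full business meaning.
cur.execute("""
    CREATE TABLE gl_entry (
        entry_id       INTEGER PRIMARY KEY,
        line_type_code TEXT NOT NULL REFERENCES gl_line_type(line_type_code),
        amount_usd     NUMERIC NOT NULL
    )
""")
cur.execute("INSERT INTO gl_entry VALUES (1, 'REV', 1250.00)")
conn.commit()
```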

Overloaded columns or records, on the other hand, exist when one column can belong to a different domain depending on the value of another. If the GL amount from the previous example represented a dollar amount for one GL line type and an item count for a different GL line type, the GL amount column would be considered overloaded. This practice was common in the past with COBOL copybooks, which were built on the premise that a variable record type was an efficient way to model information in a 3GL coding environment. With the advent and advances of relational databases (and normalization standards), the copybook approach does not translate well into database tables; thank goodness.
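For readers who never had the pleasure, the hypothetical Python sketch below shows what a copybook-style variant record amounts to: the same character positions are reinterpreted based on a record-type indicator, and once such a record is flattened into a table, the payload becomes an overloaded column.

```python
# A Python sketch of a copybook-style variant record with a hypothetical
# layout: the same character positions are reinterpreted depending on the
# record-type indicator in position 0.
RECORDS = [
    "D000012500",   # type 'D': positions 1-9 hold a dollar amount in cents
    "C000000042",   # type 'C': the same positions hold an item count
]

def parse(record: str):
    record_type, payload = record[0], record[1:]
    if record_type == "D":
        return ("dollar_amount", int(payload) / 100)   # same bytes, read as money
    if record_type == "C":
        return ("item_count", int(payload))            # same bytes, read as a count
    raise ValueError(f"unknown record type: {record_type!r}")

for rec in RECORDS:
    print(parse(rec))
# ('dollar_amount', 125.0)
# ('item_count', 42)
```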

One disadvantage of using overloaded columns is that the characteristics of the column are dictated by the superset of all the domains the column represents. This removes one of the benefits of defining a domain: the ability to explicitly enforce data quality, integrity, and definition. For example, if an overloaded column implicitly stores either the nine digits of a Social Security Number or an account holder's 14-character alphanumeric Member ID, the column must be at least 14 characters long and allow both letters and digits. This makes it more difficult to explicitly enforce SSN formatting rules and display options, because some outside code or test is needed to put the value into context.
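A hypothetical Python sketch of the SSN/Member ID situation illustrates the problem: the overloaded column can only be validated or formatted by consulting an outside discriminator.

```python
# A hypothetical sketch: the column must be declared wide and loose enough
# (14 characters, alphanumeric) to hold either domain, so every rule depends
# on a companion type code.
import re

SSN_RULE       = re.compile(r"^\d{9}$")          # exactly 9 digits
MEMBER_ID_RULE = re.compile(r"^[A-Z0-9]{14}$")   # 14 alphanumeric characters

def validate_identifier(id_type_code: str, identifier: str) -> bool:
    # Nothing about the stored value says which rule applies; the outside
    # discriminator (id_type_code) has to supply the context.
    if id_type_code == "SSN":
        return bool(SSN_RULE.match(identifier))
    if id_type_code == "MBR":
        return bool(MEMBER_ID_RULE.match(identifier))
    return False

print(validate_identifier("SSN", "123456789"))       # True
print(validate_identifier("MBR", "AB12345678901X"))  # True
print(validate_identifier("SSN", "AB12345678901X"))  # False: wrong domain
```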

Overloading inherently restricts a record to referencing only one type of value, which on the surface may seem like a noble goal because it guarantees mutual exclusivity. In some business cases this may never change, but in others the intent proves short-sighted when business rules change. For instance, an overloaded column holding either a division code or a line-of-business code, created under the premise that a record would align to only one of these classifications, breaks if products realign to span multiple LOBs.

Whether this approach was taken to upgrade older systems cheaply, or because a designer believed at the time that it was an innovative solution to a problem, the results are the same. Overloading makes data management more difficult and data quality harder to enforce, adds the risk that data will be used incorrectly, and makes longer-term data changes more expensive. Despite the brow-beating (or possibly because of the lack of it), designers and developers still overload columns in ERP, CRM, and custom-purpose systems. Their existence in legacy systems may not be correctable without costly rework, but they have no place in good MDM or BI design. More on this later.

In the next part of this series, we'll discuss the use of NULLs and default values.

The Importance of Context and Explicit Values – Part 2 of 3


The first segment of this series defined implicit values as ones that require additional context in order to infer an unambiguous meaning. It went on to show how coded values are a valuable design practice that causes no concern when each component of the implicit relationship exists within a single domain. A variation on the coded-values concept called "overloading" was discussed as an undesirable practice in which one column or record spans more than one domain. In this segment, we will discuss the practices of implicit defaulting and of assigning implicit business meaning to blanks or NULL values beyond the business meaning of "uh, nobody put a value here."

From a user perspective, text fields containing a NULL or any number of blank spaces look the same. Although the casual business user cannot tell the difference between NULL and blanks, the computer treats them very differently and even changes retrieval performance based on the distinction.[1] Suppose a data source or entry screen for a Yes/No business attribute defines that a NULL has the implicit meaning of 'No' because the value was not explicitly set to 'Yes'. Since 'No' and NULL implicitly mean the same thing, the business cannot determine whether the value was intentionally meant to be 'No' or whether the user accidentally skipped data entry but intended to select 'Yes'. Unless somebody realizes and corrects the mistake, the value will be incorrect and nobody will ever know. Now suppose this column propagates downstream and ends up in an MDM system and the corporate data warehouse as NULL, where it is considered the best version of the truth. To accurately perform analyses using the column with this implied business rule, one must check for a 'Yes' or for the existence of both NULL and 'No'. Another option would be to check for a 'Yes' or a not-equal-to-'Yes' condition. In either case, a consumer of this column has to implicitly know that NULL and 'No' are the same, and even with this knowledge the consumer still has no confidence that the column was intentionally meant to be 'No'.
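The query gymnastics are easy to demonstrate with a small sqlite3 sketch; the subscription table and opted_in column are hypothetical. Note, too, that under SQL's three-valued logic a not-equal-to-'Yes' predicate also skips the NULL rows, so even that shortcut needs care.

```python
# A small sqlite3 sketch of the ambiguity above; table and column names are
# hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE subscription (cust_id INTEGER, opted_in TEXT)")
cur.executemany(
    "INSERT INTO subscription VALUES (?, ?)",
    [(1, "Yes"), (2, "No"), (3, None), (4, None)],  # NULL: deliberate 'No' or missed entry?
)

# Counting the explicit 'No' answers silently drops the NULL rows.
cur.execute("SELECT COUNT(*) FROM subscription WHERE opted_in = 'No'")
print(cur.fetchone()[0])   # 1

# So does "not equal to 'Yes'": NULL <> 'Yes' evaluates to UNKNOWN in SQL.
cur.execute("SELECT COUNT(*) FROM subscription WHERE opted_in <> 'Yes'")
print(cur.fetchone()[0])   # 1

# Only a query that hard-codes the implicit rule returns the "intended" count,
# and even then nobody can tell a deliberate 'No' from missed data entry.
cur.execute(
    "SELECT COUNT(*) FROM subscription WHERE opted_in = 'No' OR opted_in IS NULL"
)
print(cur.fetchone()[0])   # 3
```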

Expand the above discussion to consider a lookup code value. The practice of allowing NULL code values is referred to as creating an 'optional relationship' between a parent (code table) and a child (reference table). Code tables are, by definition, explicit domains of allowable codes, each having a mutually exclusive meaning for a record. Allowing NULL code values poses a similar ambiguity about the data's meaning. Did the user or source neglect to enter a value? Does the existence of NULL implicitly mean something else, like 'Not Applicable'?
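Here is a compact sketch of such an optional relationship, again with hypothetical discount_type and sale tables: the nullable code column is perfectly legal, yet a NULL row cannot answer any of those questions.

```python
# A compact sketch of an optional relationship (hypothetical tables).
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE discount_type (
        discount_code TEXT PRIMARY KEY,
        description   TEXT NOT NULL
    )
""")
cur.execute("""
    CREATE TABLE sale (
        sale_id       INTEGER PRIMARY KEY,
        amount_usd    NUMERIC NOT NULL,
        discount_code TEXT REFERENCES discount_type(discount_code)  -- nullable, i.e. optional
    )
""")
cur.execute("INSERT INTO sale VALUES (1, 19.99, NULL)")
# Is sale 1 undiscounted, exempt from discounting, or just missing its code?
```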

I can think of no positive business value in introducing ambiguity into data, ever. Moreover, there should be no compelling argument for intentionally injecting ambiguity into new designs, especially in MDM or BI systems, which are supposed to be reliable and pristine. When integrating data from multiple systems, such as when merging data into MDM or BI systems, the existence of NULLs and implicit values introduces several unnecessary challenges. MDM and BI systems are tasked with collecting information from multiple sources and conforming it to a single Enterprise view. Adding implicit values to that process guarantees that logic will be needed in the transformation layer to conform the data to the "Best Version of the Truth". Once the data is in the MDM hub, more logic is necessary to transform it back to its implicit form for each of the consuming systems.

“Oh what a tangled web we weave when first we try to cut corners and be implicit.”

In the last part of this three-part series, we will review the revelations thus far and discuss ways both to manage implicit legacy conditions and to implement policies that discourage creating any additional ones.

The Importance of Context and Explicit Values – Part 3 of 3


In the first two segments of this three-part series, we discussed situations where implicit values weave themselves into business processes or IT systems, unknowingly or on purpose. The examples provided may seem trivial on the surface if one assumes that everybody in the business has the implicit knowledge needed to make sense of the information. It is not so trivial when time passes, employees retire or leave, and implicit values are scattered around the Enterprise that nobody remembers anymore. Add to that the fact that implicit values are, by nature, not self-documenting and, in my experience, rarely documented externally. By now you should be able to identify implicit values lurking in overloaded columns, NULLs, or implied defaults within your organization. More importantly, you should understand the potential consequences of each situation and be able to intelligently challenge those who perpetuate these practices. This last segment provides alternatives to implicit designs and methods for mitigating risks within legacy systems.

Governance Policies are where many data quality issues get resolved or avoided. Without defined Governance, the status quo becomes the standard and data quality does not improve. Adding new policies may do little to directly combat legacy system shortcomings, but it does set the stage for preventing the proliferation of implicit values by mandating that new interfaces to legacy systems not accept them.

An involved and informed Data Governance team is the best defense for limiting the proliferation of implicit values. Introducing an Enterprise Standard that restricts the use of optional relationships between code (parent) tables and referring (child) tables is a start. As stated earlier, optional relationships are ones that allow NULL child-table values. For example, a sales table allows a discount type code of NULL if no discount applies. To satisfy such a mandate in this situation, discount code table entries of 'Not Provided' and 'No Discount' would be added. If a NULL is received, the relationship maps to the 'Not Provided' entry. The Point of Sale system could now default the discount to the 'No Discount' code value if none was given. This solution explicitly provides a discount value in all situations and offers an easy means of identifying data transformation issues if a 'Not Provided' value sneaks in. The approach leaves open the possibility of additional code table entries to represent a 'Not Applicable' business condition or an 'Invalid' situation, which could be set if a code value that does not match the domain of valid discount codes is detected while loading the sales table.
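Revisiting the hypothetical discount tables from Part 2, a sketch of this explicit approach might look like the following: the code table carries the 'No Discount', 'Not Provided', 'Not Applicable', and 'Invalid' entries, the child column is NOT NULL with a default, and the load logic maps NULLs and out-of-domain codes to visible, queryable values.

```python
# A sketch of the explicit alternative described above (hypothetical tables).
# Every sale row carries a real, valid code, and data problems surface as
# 'Not Provided' or 'Invalid' rows instead of hiding behind NULL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE discount_type (
        discount_code TEXT PRIMARY KEY,
        description   TEXT NOT NULL
    )
""")
cur.executemany(
    "INSERT INTO discount_type VALUES (?, ?)",
    [
        ("NONE", "No Discount"),     # explicit business meaning, POS default
        ("NP",   "Not Provided"),    # landing spot for NULLs from a source
        ("NA",   "Not Applicable"),  # optional extra business condition
        ("INV",  "Invalid"),         # source code outside the valid domain
        ("EMP",  "Employee Discount"),
    ],
)

# The relationship is now required: NOT NULL plus a default.
cur.execute("""
    CREATE TABLE sale (
        sale_id       INTEGER PRIMARY KEY,
        amount_usd    NUMERIC NOT NULL,
        discount_code TEXT NOT NULL DEFAULT 'NONE'
                      REFERENCES discount_type(discount_code)
    )
""")

def load_discount_code(source_value):
    """Map an incoming source value to an explicit code during the load."""
    if source_value is None:
        return "NP"      # flag the missing value instead of hiding it
    cur.execute(
        "SELECT 1 FROM discount_type WHERE discount_code = ?", (source_value,)
    )
    return source_value if cur.fetchone() else "INV"

for sale_id, amount, source_code in [(1, 19.99, None), (2, 5.00, "EMP"), (3, 7.50, "XX")]:
    cur.execute(
        "INSERT INTO sale VALUES (?, ?, ?)",
        (sale_id, amount, load_discount_code(source_code)),
    )
conn.commit()

# Analysts no longer need implicit knowledge; the codes speak for themselves.
for row in cur.execute(
    "SELECT s.sale_id, s.amount_usd, d.description "
    "FROM sale s JOIN discount_type d ON d.discount_code = s.discount_code"
):
    print(row)
# (1, 19.99, 'Not Provided')
# (2, 5.0, 'Employee Discount')
# (3, 7.5, 'Invalid')
```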

A Data Governance body worth its salt creates standards concerning the definition, collection, and dissemination of metadata. In the policy planning phase, the Governance body can define the required column-level metadata to include items such as default value, min/max values, domain, and nullability. If business users decide that NULL values are allowed, a follow-up discussion is required to determine the business meaning of a NULL value. Such a discussion is the opportune time to explore alternatives to NULL values. Forcing business data owners to define data elements to this degree provides the rules needed to eliminate several forms of undesired implicit values, and it makes for better business system designs.

A strong Data Architecture group plays a role in enforcing Data Governance policy. Data Architecture groups influence database and application designers to use NOT NULL and default-value constraints and to define relationships as 'Required' instead of 'Optional.' Mandatory peer design and code reviews, driven by checklists, ensure designs meet the enforced policies.

Overloading, as defined in the first segment of this series, is a costly practice that is easily avoided through proper data design techniques. Overloaded code values are avoided by creating separate child columns and separate parent code tables. Overloading of transaction metrics is handled similarly by splitting the transaction values into different attributes if they are at the same summary level, or into different tables if the transactions are at different summary levels. Overloaded records, like copybooks, simply have no place in relational database design. If somebody tries to justify them, you have my permission to mock them in public, but I am not responsible for any repercussions if the mock-ee is your boss.
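As a final sketch, again with hypothetical names, the remedies above boil down to one attribute per domain and one parent code table per code.

```python
# A sketch of the "split, don't overload" remedies: the amount/count metric
# becomes two explicit columns, and the division/line-of-business code
# becomes two columns, each with its own parent code table.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE division (
        division_code TEXT PRIMARY KEY,
        description   TEXT NOT NULL
    );
    CREATE TABLE line_of_business (
        lob_code    TEXT PRIMARY KEY,
        description TEXT NOT NULL
    );

    -- One attribute per domain: a transaction can carry a division AND an
    -- LOB, and each metric column means exactly one thing.
    CREATE TABLE gl_transaction (
        txn_id        INTEGER PRIMARY KEY,
        division_code TEXT NOT NULL REFERENCES division(division_code),
        lob_code      TEXT NOT NULL REFERENCES line_of_business(lob_code),
        amount_usd    NUMERIC,   -- dollar amounts only
        item_count    INTEGER    -- counts only
    );
""")
```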

The goal of this article was to demonstrate that using implicit values carries long-term consequences and that the practice should be scrutinized. There are compelling arguments for using implicit design methods, but the decision should be made knowing all of the options, consequences, risks, and potential long-term costs. Resist temptations for quick fixes when it comes to data design and focus instead on best practices.



[1] For example, Oracle does not include NULL values in a standard single-column B-tree index, so if a dataset is predominantly NULL for a certain column, a query for a non-NULL value will perform quickly using the index, whereas a query for a NULL value cannot use the index and results in a full table scan.