Archives For Data Governance

You may have business rule ‘messes’ in your transaction systems.  I’m referring to rules created by well-meaning IT folks and Business Analysts who are attempting to direct a complex decision with linear business rules.  This overzealous use of rules posing as data science is common, and often it is a ‘mess’.  Many times these rules are worse than doing nothing (guessing) when it comes to supporting a complex decision.  Worse, they lead to poor data (too much data entry is required to make them work as planned).  Worst of all, they may encode linear thinking and a bias toward ‘averages’ rather than distributions when it comes time to translate heuristics into data science.

The ‘mess’ may be a good place for a new data science project.  Likely you will need to rip out the rules altogether (not popular with IT).  However, assuming the data is semi-clean, the historical outcomes of the business rules may prove to be useful predictors in a multivariate model, as sketched below.  Not all business rules are a mess; some are the result of simple heuristics and have been maintained properly.  In any case, look for these pseudo-models as an opportunity to improve decision making with true analytics.
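A minimal sketch of that idea, assuming a flat extract where legacy rule outcomes already exist as 0/1 columns (the file name, column names, and target are all hypothetical):

```python
# Sketch: treating historical business-rule outcomes as candidate predictors.
# File name, column names, and the target column are illustrative assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("transactions.csv")

# Legacy rule outcomes sit alongside the raw attributes as indicator features.
features = ["amount", "customer_age", "rule_fired_A", "rule_fired_B"]
X = df[features].fillna(0)
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Coefficients hint at whether each legacy rule carries any real signal.
print(dict(zip(features, model.coef_[0])))
print("holdout accuracy:", model.score(X_test, y_test))
```

If the rule flags turn out to carry signal, that is evidence the heuristic captured something real; if not, that is one more argument for retiring the rules.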

Edward H. Vandenberg

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights – NYTimes.com.

A colleague sent this article to me and what follows is my response.

I read the article, thank you.  Everyone trying to understand analytics needs to understand this burden and what it means for projects and for producing results.

Unfortunately, the issue goes even deeper than the article describes.  Transaction systems were designed for accounting and contractual fulfillment, not for data science.  The designers of those systems weren’t particularly savvy about the way people actually work, so the data entry became corrupted by laziness, shortcuts, and some just plain sloppy validation and edits.  Now we’re in a state where the data we want to model, coming from these lousy data entry systems, has been loaded into data warehouses.  The ETL performed on that data was supposed to make reporting and analysis easier.  But the Transform logic just added another layer of poor hygiene and/or illogical transformations, and the Load logic was all about reporting, not data science.  So data warehouses are not great at facilitating data science either.

Data science is unwinding all of that, row by row and column by column, in a brute-force effort.  We even try to get inside the bugs by finding patterns in null values and unexpected 1s and 0s where there are supposed to be valid values entered.
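As a rough sketch of that kind of hunt (the file name and the list of sentinel values are assumptions; every source system hides different ones):

```python
# Sketch: profiling each column for nulls and suspicious placeholder values.
# The sentinel list and file name are assumptions, not a standard.
import pandas as pd

df = pd.read_csv("extract.csv")
sentinels = [0, 1, -1, 9999, "", "NA", "N/A", "UNKNOWN"]

for col in df.columns:
    null_rate = df[col].isna().mean()
    sentinel_rate = df[col].isin(sentinels).mean()
    if null_rate > 0.05 or sentinel_rate > 0.05:
        print(f"{col}: {null_rate:.1%} null, {sentinel_rate:.1%} suspect placeholder values")
```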

Data science projects simply run out of time to correct all of this and end up throwing out half the data originally thought to be interesting.  Also keep in mind that after the janitorial work, the data still has to be preprocessed for the specific algorithmic approaches being used: binning, log transformations, and a dozen other critical techniques to extract the signal and not get fooled by the noise.
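A minimal sketch of two of those steps, binning and a log transform, assuming a right-skewed numeric column (the file and column names are illustrative):

```python
# Sketch: common preprocessing after the cleanup - a log transform and binning.
# File name, column name, and bin count are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.read_csv("clean_extract.csv")

# log1p handles zeros and tames a right-skewed amount field.
df["amount_log"] = np.log1p(df["amount"].clip(lower=0))

# Equal-frequency binning turns a noisy continuous field into coarse ranks.
df["amount_bin"] = pd.qcut(df["amount"], q=5, labels=False, duplicates="drop")

print(df[["amount", "amount_log", "amount_bin"]].describe())
```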

I don’t believe there is an automated approach beyond what we already have, because the source systems vary so much in how the data collection was programmed, how the ETL was programmed, and how the data entry actually happens.  The first step is to perform a statistical evaluation to ‘smell’ the data.  These are pretty basic steps, but they need to be done on every column you are working with, sometimes hundreds or thousands of them.
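One way that per-column ‘smell test’ might look in practice, as a sketch (the thresholds and file name are arbitrary assumptions):

```python
# Sketch: a quick statistical "smell test" run over every column.
# Thresholds and file name are assumptions, not a standard.
import pandas as pd

df = pd.read_csv("extract.csv")

profile = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "null_pct": df.isna().mean(),
    "distinct": df.nunique(),
})

numeric = df.select_dtypes("number")
profile.loc[numeric.columns, "min"] = numeric.min()
profile.loc[numeric.columns, "max"] = numeric.max()
profile.loc[numeric.columns, "skew"] = numeric.skew()

# Flag columns that are nearly empty or nearly constant.
profile["suspect"] = (profile["null_pct"] > 0.5) | (profile["distinct"] <= 1)
print(profile.sort_values("null_pct", ascending=False))
```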