Archives For Methodology

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights – NYTimes.com.

A colleague sent this article to me and what follows is my response.

I read the article, thank you.  Everybody trying to understand analytics needs to understand this problem and the burden it puts on projects and on producing results.

Unfortunately, the issue goes even deeper than the article describes.  Transaction systems were designed for accounting and contractual fulfillment, not for data science.  The designers of those systems weren’t particularly savvy about the way people work, so the data entry became corrupted by laziness, shortcuts and some just crazy sloppy validation and edits.  Now we’re in a state where the data we want to model came out of these lousy data entry systems and got loaded into data warehouses.  The ETL performed on that data was supposed to make reporting and analysis easier, and maybe it made things a bit better.  But the Transform logic often just added another layer of poor hygiene and/or illogical transformations.  And the Load logic was all about reporting, not data science.  So data warehouses are not great at facilitating data science.

Data science is unwinding all of that row by row and column by column in a brute force effort. We even try to get inside the bugs by finding patterns in null values and in unexpected 1’s and 0’s where valid values are supposed to be entered.

Data science projects simply run out of time to correct all of this and end up throwing out half the data originally thought to be interesting.  Also keep in mind that after the janitorial work, the data has to be preprocessed for the specific algorithmic approaches being used: binning, log transformations, and a dozen other critical techniques to extract signal and not get fooled by the noise.
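To make two of those preprocessing steps concrete, here is a minimal sketch of a log transformation followed by equal-width binning. The column name `claim_amounts` and its values are invented for illustration, and `log_transform` and `equal_width_bins` are hypothetical helpers, not part of any particular tool.

```python
import math


def log_transform(values):
    """Compress a long right tail with log(1 + x); assumes non-negative values."""
    return [math.log1p(v) for v in values]


def equal_width_bins(values, n_bins):
    """Return a 0-based bin index for each value, using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has nothing to separate; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # min(...) keeps the maximum value in the last bin instead of overflowing.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical, heavily skewed column (one extreme value).
claim_amounts = [0, 12, 40, 95, 310, 1200, 48000]

bins_raw = equal_width_bins(claim_amounts, 4)
bins_logged = equal_width_bins(log_transform(claim_amounts), 4)

print("raw bins:   ", bins_raw)
print("logged bins:", bins_logged)
```

On the raw values the single extreme amount stretches the range so far that nearly every other row lands in the first bin. Binning the logged values spreads the rows across the bins, which is the kind of signal-versus-noise issue this step is meant to address.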

I don’t believe there is an automated approach beyond what we already have, because the source systems are so varied in the way the data collection was programmed, the ETL was programmed, and the data entry actually happens. The first step is to perform statistical evaluation to ‘smell’ the data.  These are pretty basic steps but need to be done on every column you are working with, sometimes hundreds or thousands.
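As an illustration of what that first ‘smell’ pass might look like, here is a rough sketch that profiles every column of a small table. The `MISSING` markers, the sample rows, and the helper names `profile_column` and `profile_table` are assumptions made for the example. A real source system will have its own conventions for missing and placeholder values.

```python
import statistics
from collections import Counter

# Assumed markers for "no value"; every source system has its own set.
MISSING = {None, "", "NULL", "N/A"}


def profile_column(values):
    """Basic per-column statistics: missing rate, cardinality, common values."""
    present = [v for v in values if v not in MISSING]
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    report = {
        "rows": len(values),
        "missing_rate": 1 - len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "top_values": Counter(present).most_common(3),
    }
    if numeric:
        report["min"] = min(numeric)
        report["max"] = max(numeric)
        report["median"] = statistics.median(numeric)
        # A high share of 0s and 1s in a field that should hold real
        # measurements often points at placeholder or default entries.
        report["zero_or_one_rate"] = sum(v in (0, 1) for v in numeric) / len(numeric)
    return report


def profile_table(rows):
    """Profile every column found in a list of dict records."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Hypothetical records with typical data-entry problems.
rows = [
    {"age": 34, "state": "OH"},
    {"age": 0, "state": "oh"},
    {"age": None, "state": ""},
    {"age": 1, "state": "OH"},
    {"age": 51, "state": "NULL"},
]

report = profile_table(rows)
for column, stats in report.items():
    print(column, stats)
```

Even on five rows the report surfaces the usual suspects: ages of 0 and 1 that are probably placeholders, inconsistent casing in `state`, and several different ways of saying "missing".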

Share things that can (must) be shared

  • Specialized Talent
  • Infrastructure
  • Some datasets
  • Tools

 

Focus services to deliver on demand

  • Domain knowledge
  • Data and systems expertise
  • Capacity for high demand customers

 

Standardize things that will help deliver consistent quality

  • Methodology
  • Project Practices
  • Role Descriptions

 

Give synergy to the effort

  • Complementary skills and knowledge
  • Knowledge sharing and imagination
  • Contrarian viewpoints

 

Control Risks Formally

  • Skills definition
  • Independent Quality Review of the models and interim work products
  • Checks on conflicts of interest and influence
  • Management accountability (project level)
  • Sign-off and approval process
  • Ethical Standards

 

Develop Enterprise Assets

  • Reusable datasets
  • Documented models

 

An Enterprise Identity and Voice

  • Organizational voice
  • A place for people with unique skills to belong
  • Promote identity, value and scope of the work
  • Tell the story to the enterprise

As the leader of an analytics business unit, you should plan to deliver a scope of methods and projects that includes:

  • Supervised and unsupervised modeling
  • Operations Research/Optimization
  • Design of Experiments
  • Statistical Quality Control
  • Simulation
  • Forecasting
  • Text Mining (flow verbatim and text corpus)
  • Link Analysis
  • Big Data techniques (MapReduce/Hadoop)
  • Heuristics (complex business rules)
  • Process Mining
  • Cognitive Decision Analysis
  • Visualizations (Mental Modeling)
  • Interpretation of dense signals (voice and image)
  • Interpretation of flow data (click streams, verbatim, dialogues and diaries)

You will have to work to uncover projects and understand the value proposition for these kinds of projects.  Even more challenging, you will have to explain why your operations managers need these types of analytics to improve their operating results.

Lastly, you will need to bring the talent, tools and processes together to perform this type of work for your organization.  An exciting and challenging prospect.

While consulting firms come with a cost that subtracts from their value to you, some of them also bring gains.  An established and experienced firm has built models around problem statements like yours before.  What they may know and you don’t is which derived predictors can dramatically improve a model’s performance. This is ‘golden’ knowledge but it comes at a price. The best case is that you get the details of the derived predictors and the transformations.

For those new to this subject, a derived predictor is a data element created by combining two or more features so that it is more predictive than the individual features.  In many models, the derived predictors create most of the lift. It is part of the art of analytics to understand how to create values that represent the target phenomenon more accurately than any raw data can.
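A toy sketch can show the idea. The records below are invented so that default depends on the ratio of debt to income rather than on either raw field, and a plain Pearson correlation stands in as a crude measure of predictiveness (it is not the same thing as model lift). The field names and the `pearson` helper are assumptions for the example only.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical records: (income, debt, defaulted).
records = [
    (100, 20, 0), (100, 80, 1),
    (40, 10, 0), (40, 30, 1),
    (200, 60, 0), (200, 140, 1),
    (60, 12, 0), (60, 48, 1),
]

income = [r[0] for r in records]
debt = [r[1] for r in records]
defaulted = [r[2] for r in records]

# Derived predictor: combine two raw features into one ratio.
debt_to_income = [d / i for i, d, _ in records]

r_income = pearson(income, defaulted)
r_debt = pearson(debt, defaulted)
r_ratio = pearson(debt_to_income, defaulted)

print(f"income vs default:         {r_income:.2f}")
print(f"debt vs default:           {r_debt:.2f}")
print(f"debt-to-income vs default: {r_ratio:.2f}")
```

In this contrived data, income alone tells you nothing about default and debt alone tells you only a little, while the ratio tracks the outcome closely. Real derived predictors are rarely this clean, but the mechanism is the same.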