For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

A colleague sent this article to me and what follows is my response.

I read the article; thank you.  Everyone trying to understand analytics needs to understand this issue and the burden it puts on projects and on producing results.

Unfortunately, the issue goes even deeper than the article describes.  Transaction systems were designed for accounting and contractual fulfillment, not for data science.  The designers of those systems weren’t particularly savvy about the way people actually work, so data entry became corrupted by laziness, shortcuts, and some frankly sloppy validation and edit checks.  Now we’re in a state where the data we want to model, coming from these lousy data entry systems, has been loaded into data warehouses.  The ETL performed on the data was supposed to make reporting and analysis easier, and maybe it was a bit better.  But the Transform logic just added another layer of poor hygiene and/or illogical transformations, and the Load logic was all about reporting, not data science.  So data warehouses are not well suited to facilitating data science.

Data science is unwinding all of that, row by row and column by column, in a brute-force effort.  We even try to get inside the bugs by finding patterns in null values and in unexpected 1s and 0s where valid values were supposed to be entered.

Data science projects simply run out of time to correct all of this and end up throwing out half the data originally thought to be interesting.  Also keep in mind that after the janitorial work, the data still has to be preprocessed for the specific algorithmic approaches being used: binning, log transformations, and a dozen other critical techniques to extract the signal and not get fooled by the noise.
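To make the preprocessing step concrete, here is a minimal sketch of two of the techniques named above, a log transform and equal-width binning, applied to an entirely hypothetical skewed feature (the values and bucket count are illustrative assumptions, not from any real project):

```python
import numpy as np

# Hypothetical skewed feature, e.g. claim amounts in dollars
amounts = np.array([12.0, 150.0, 90.0, 3000.0, 45.0, 7.5])

# Log transform compresses the long right tail so a few
# extreme values do not dominate the signal
log_amounts = np.log1p(amounts)  # log(1 + x) is safe for zeros

# Equal-width binning into 3 buckets on the log scale
edges = np.linspace(log_amounts.min(), log_amounts.max(), 4)
bucket = np.digitize(log_amounts, edges[1:-1])  # each value becomes 0, 1, or 2

print(bucket)  # → [0 1 1 2 0 0]
```

Note that the binning is done on the log scale rather than the raw dollars; on the raw scale the single $3,000 value would push everything else into one bucket.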

I don’t believe there is an automated approach beyond what we already have, because the source systems vary so much in how the data collection was programmed, how the ETL was programmed, and how the data entry actually happens.  The first step is to perform a statistical evaluation to ‘smell’ the data.  These are pretty basic steps, but they need to be done on every column you are working with, sometimes hundreds or thousands of them.
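A minimal sketch of what that per-column ‘smell test’ might look like, using pandas on a hypothetical extract (the column names, data, and chosen statistics are illustrative assumptions):

```python
import pandas as pd

def smell(df: pd.DataFrame) -> pd.DataFrame:
    """Basic per-column statistics for 'smelling' the data."""
    rows = []
    for col in df.columns:
        s = df[col]
        rows.append({
            "column": col,
            "dtype": str(s.dtype),
            "null_rate": s.isna().mean(),  # pattern of missing values
            "n_unique": s.nunique(),       # flags constant / near-constant columns
            "top_value": s.mode().iloc[0] if s.notna().any() else None,
        })
    return pd.DataFrame(rows)

# Hypothetical extract showing the defects described above:
# 0 used as a lazy data-entry default, and a column with no variation.
df = pd.DataFrame({
    "age":    [34, 0, 51, None, 0],
    "region": ["N", "N", "N", "N", "N"],
})
report = smell(df)
print(report)
```

Even a report this crude surfaces the suspects quickly: a high null rate, a most-frequent value of 0 in a field where 0 is implausible, or a column with one unique value and therefore no signal at all.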

Analytics executives should be reading Race Against the Machine (Brynjolfsson and McAfee, 2011).

I will quote from the book to raise the point that process re-engineering is critical to analytics return on investment.

“The most productive firms reinvented and reorganized rights, incentive systems, information flows, hiring systems, and other aspects of organizational capital to get the most from the technology… The intangible organizational assets are typically much harder to change, but they are also much more important to the success of the organization.”

This is partly why analytics needs to rise to the level of a corporate function, with staff level executive leadership, so as to be able to move the organization to re-engineer itself for the technology.

Edward H. Vandenberg

IoT is your job. Data Science has always been greedy for complex data and has a pretty good handle on how to process it for insights and predictions.

For most of us and most projects, the practicality of getting data to the model, and having it available to execute run-time algorithms, has been the barrier.  IoT data is meaningless without algorithms to process it and provide information, predictions, and optimizations from it.

IoT is exciting and will change the fundamentals of businesses and industries.  The technology is interesting and very dynamic.  All of this has implications for your analytic operation and practice.

The more interesting and challenging future of IoT (and also part of your job): what are the new processes, user roles, use cases, management scenarios, and business cases for IoT?  Who will manage the IoT function ‘X’ of the future, and what does that role look like?

The other important reason for you to pursue IoT is to keep your data scientists engaged and retained.  Many are still working on the same types of projects, methods, tools, etc. that have been around for ten-plus years.  Every project is interesting and challenging in its own way, but some of your scientists are getting bored.

This means research and discussions with your colleagues, sponsors and stakeholders (while you are still working in the pre-IoT world). Enjoy!

Edward H Vandenberg

Working outside of the IT org as we know it. This is a comment on the ability of traditional IT to support advanced analytics.

Edward H. Vandenberg

Random Research

Palantir Technologies disclosed selling $50 million worth of equity securities as part of a financing round.  The equity securities, which started to sell on November 26th, were bought by three investors.  The offering has no fundraising cap, and the company may elect to raise additional funds until the offering closes in November 2015.  The private placement’s expected net proceeds amount to $25 million, which excludes $25 million paid in brokerage fees.  A total of fifteen unregistered securities offerings closed by the company have raised an estimated $1.14 billion.

Morgan Stanley & Co and SF Sentry Securities acted as placement agents.

Palantir Technologies develops and markets big data analytics platforms.  The company’s Gotham product allows enterprises to integrate, manage, secure, analyze, and visualize all their data.  The Metropolis product is designed to integrate, enrich, model, and analyze any kind of quantitative data.

The company is headquartered in Palo Alto, CA.  Palantir Technologies elected to keep…


EMC IT Proven

By Dr. Lena Tenenboim-Chekina — Senior Data Scientist, EMC IT

Smart data visualization is proving to be an essential tool in maintaining increasingly complex Big Data systems in the cloud.

The adoption of Big Data tools and technology relies heavily on distributed, scaled-out computing.  One of the main differences in this setting is that it includes systems that operate as a whole on top of several independent hosts.  These hosts coordinate their actions with limited information, and as a result maintenance complexity increases significantly.  One way to overcome this challenge is smart data visualization, which helps IT experts and management pinpoint the source of problems quickly.

The need for smart visualization is not unique to this problem. Representing complex data as a concise picture which tells decision-makers a story is a key part of any data analytics or data science project. Valuable results of a rigorous analysis may…


Who has it in your org today? Will they listen to the data science?

Your stakeholders lack a shared understanding of the methods and practice of advanced analytics.  You start out with a trust deficit when explaining how the mathematics will improve business results.

To build trust, start in advance by building an ordinary business relationship with operations management.  Next, share stories of analytics successes and how they were achieved (ideally ones you have directed).  Then coach your stakeholders to interpret model results by simplifying the complex model-validation process.

Gradually build an Arena of shared understanding for how models can help operations arrive at a better performance state.  This is hard work, not the stuff of algorithms and data, but almost as important.

Look at the Johari Window for expanding the Arena of trust.

Edward H. Vandenberg