The Missing Link in Why You’re Not Getting Value From Your Data Science

by Robert Morris, Ph.D.

DECEMBER 28, 2016

Recently, Kalyan Veeramachaneni of MIT published an insightful article in the Harvard Business Review entitled “Why You’re Not Getting Value from Your Data Science.” The author argued that businesses struggle to see value from machine learning/data science solutions because most machine learning experts tend not to build and design models around business value. Rather, machine learning models are built around nuanced tuning and subtle, yet complex, performance enhancements. Further, experts tend to make broad assumptions about the data that will be used in such models (e.g., that data sources are consistent and clean). With these arguments, I couldn’t agree more.

WHY IS THERE A MISSING LINK?

At Predikto, I have overseen many deployments of our automated predictive analytics software across Industrial IoT (IIoT) verticals, including the transportation industry. In many cases, our initial presence at a customer is due in part to the limited short-term value gained from an internal (or consulting) human-driven data science effort whose focus had been on just what Kalyan mentioned: the “model,” rather than how to actually get business value from the results. Many companies aren’t seeing a return on their investment in human-driven data science.

There are many reasons why experts don’t cook business objectives into their analytics from the outset. This is largely due to a disjunction between academic expertise, habit, and operations management (not to mention the immense diversity of focus areas within the machine learning world, which is a separate topic altogether). This is particularly relevant for large industrial businesses striving to cut costs by preventing unplanned operational downtime. Unfortunately, one of the most difficult aspects of deploying machine learning solutions geared toward business value is actually delivering and demonstrating that value to customers.

WHAT IS THE MISSING LINK?

In the world of machine learning, over 80% of the work revolves around cleaning and preparing data for analysis, which comes before the sexy machine learning part (see this recent Forbes article for some survey results supporting this claim). The remaining 20% involves tuning and validating the results from one or more machine learning models. Unfortunately, this calculation fails to account for the most important element of the process: extracting value from the model output.

In business, the goal is to gain value from predictive model accuracy (another subjective topic area worthy of its own dialog). We have found that this is the most difficult aspect of deploying predictive analytics for industrial equipment. In my experience, the breakdown of effort required from beginning (data prep) to end (demonstrating business value) is really more like:

40% Cleaning/Preparing the Data

10% Creating/Validating a well-performing machine learning model(s)

50% Demonstrating Business Value by operationalizing the output of the model

The latter 50% is something that is rarely discussed in machine learning conversations (with the aforementioned exception). Veeramachaneni is right. It makes a lot of sense to keep models simple where you can, cast a wide net to explore more problems, avoid assuming you need all of the data, and automate as much as you can. Predikto is doing all of these things. But again, this is only half the battle. Once you have tackled each of the above elements, you still have to:

Provide an outlet for near-real-time performance auditing. In our market (heavy industry), customers want proof that the models work with their historical data, with their “not so perfect” data today, and with their data in the future. The right solution provides fully transparent and consistent access to detailed auditing data from top to bottom: what data are used, how models are developed, and how the output is being used. This is not only about trust; it is also about a continuous improvement process.
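As a minimal sketch of what such an audit trail might capture (the schema and field names are illustrative assumptions, not Predikto’s actual format):

import json
import datetime

def audit_record(model_id, model_version, inputs, prediction):
    # Tie each prediction back to the exact model version and the
    # input snapshot that produced it.
    return {
        "timestamp": datetime.datetime.utcnow().isoformat(),
        "model_id": model_id,
        "model_version": model_version,
        "inputs": inputs,            # the raw readings the model saw
        "prediction": prediction,    # the emitted probability
    }

# An append-only log provides consistent, end-to-end traceability.
with open("prediction_audit.log", "a") as log:
    record = audit_record("axle_bearing", "v12",
                          {"temp_c": 88.4, "vibration_g": 0.31}, 0.87)
    log.write(json.dumps(record) + "\n")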

Provide an interface for users to tune output to fit operational needs and appetites. Tuning the output (not the model) is everything. Users want to set their own thresholds for each output and have the option to return to a previous setting on the fly should operating conditions change. One person’s red alert is not the same as another’s, and all of this may be different tomorrow.
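A minimal sketch of per-user threshold tuning with an on-the-fly revert (the class and names are hypothetical, for illustration only):

class AlertThresholds:
    # Each user tunes the output, not the model: the predicted
    # probabilities stay the same; only the alerting cutoff moves.
    def __init__(self, default=0.80):
        self.current = default
        self._previous = []

    def set_threshold(self, value):
        self._previous.append(self.current)
        self.current = value

    def revert(self):
        # Return to the prior setting should conditions change.
        if self._previous:
            self.current = self._previous.pop()

    def is_red_alert(self, probability):
        return probability >= self.current

fleet_manager = AlertThresholds(default=0.90)  # conservative appetite
technician = AlertThresholds(default=0.70)     # wants earlier warnings
print(fleet_manager.is_red_alert(0.85), technician.is_red_alert(0.85))  # False True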

Provide a means for taking action from the model output (i.e., the predictions). The users of our predictive output are fleet managers and maintenance technicians. Even with highly precise, high-coverage machine learning models, the first thing they all ask is, “What do I do with this information?” They need an easy-to-use, configurable interface that lets them take a prediction notification, originating from a predicted probability, to business action in a single click. For us, that action is often the creation of an inspection work order in an effort to prevent a predicted equipment failure.
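As a sketch, the one-click path from prediction to action might look like this (the notification shape and work-order fields are assumptions, not our actual interface):

def create_inspection_work_order(asset_id, failure_mode, probability):
    # Stand-in for a call into the customer's maintenance system.
    return {
        "type": "inspection",
        "asset": asset_id,
        "reason": "Predicted %s (p=%.2f)" % (failure_mode, probability),
        "status": "open",
    }

def on_notification_click(notification):
    # The single click: a prediction notification in, a business action out.
    return create_inspection_work_order(notification["asset_id"],
                                        notification["failure_mode"],
                                        notification["probability"])

order = on_notification_click({"asset_id": "LOCO-4412",
                               "failure_mode": "traction motor failure",
                               "probability": 0.91})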

Predikto has learned by doing and by iterating. We understand how to get value from machine learning output, and it has been a big challenge. This understanding led us to create the Predikto Enterprise Platform®, Predikto MAX® [patent pending], and the Predikto Maintain® user interface. We scale across many potential use cases automatically (regardless of the type of equipment), we test countless model specifications on the fly, we give some control to the customer in terms of interfacing with the predictive output, and we provide an outlet for them to take action from their predictions and show value.

As for the missing 50% discussed above, we tackle it directly with Predikto Maintain®, and we believe this is why our customers are seeing value from our software.

Robert Morris, Ph.D. is Co-founder and Chief Science/Technology Officer at Predikto, Inc. (and a former Associate Professor at the University of Texas at Dallas).

What did the Coffee Pot say to the Toaster?

The Internet of Things (IoT) is at the peak of the Gartner Hype Cycle, and there is no shortage of “answers to everything” being promised. Many executives are just now finding their feet after the storm wave that was the transition from on-premise to cloud solutions, and they are now faced with an even faster-paced paradigm shift. The transformative tidal wave that is IoT is crashing through CEOs’, CTOs’, and CIOs’ offices, and they are frantically searching for something to float their IoT strategies on, but often they find only driftwood.

Dr. Timothy Chou’s latest book, Precision: Principles, Practices, and Solutions for the Internet of Things, is your shipwright. The framework presented by Dr. Chou cuts through the fog that surrounds IoT and provides a straightforward, jargon-free explanation of IoT and the power that can be harnessed from it. Dr. Chou then goes on to present a showcase of case studies: real-life, profitable IoT solutions from a variety of traditional and high-tech businesses.

One of the case studies Dr. Chou features is based on my work at New York Air Brake, where we utilized instrumented and connected locomotives to create the world’s most advanced train control system, which has saved the rail industry over a billion dollars in fuel, emissions, and other costs. It was this work that gave me a taste of the power of IoT, and the passion it sparked to make a bigger impact in the rail and transportation industries using IoT data is what led me to join the Predikto family.

What makes a successful UI for a startup?

No one really knows… until they have some real customers and real users.

Friends reach out for help when joining a new early-stage startup or when looking for ideas on what it would take to build a UI for a startup they are thinking about founding. This is some of the information I give them, so I figured it might help others too. Is this “best practices” or “industry standards”? No, but it might lead you down the path to the right answer.

From a UI developer’s perspective, let’s separate out the parts the UI is normally responsible for. The diagram from Asinthecity’s blog post, shown below, helps illustrate this.

[Diagram: UI vs. UX]

As you can see, the UI is the glue that connects the work everyone on the team has contributed: from the back-end team, which usually gets very little credit but all the blame when something is not working, to the stakeholders who have pushed this from an idea to a product.

Startups go through many stages; that’s a reality. Below are some typical stages startups go through:

[Diagram: typical startup stages]

What does this mean for UI Development?

  • START – An idea, from thought or paper, to something people can interact with using the core product
  • SEE WHAT STICKS – Enhancements to help sales close the deal (I won’t go into details, but expect a lot of throw-away development)
  • CLOSE DEAL – The first couple of customers have a say on additional product features (which probably will not have anything to do with your core product)
  • STABILIZE – Go back, clean up code, and implement some backlog features
  • FUTURE PROOF – Start transitioning to long-term plans for better maintenance and enhancements

Developers want to build a successful UI for their startup. They want potential customers to “want” to use the product, not “have” to use it. With that in mind, I am going to walk you through some scenarios and possible solutions to help make this happen.

Let’s put our UI Development hat on and work on a project from start to end.

Start Project

Fundamentals

If you have a UX designer on your team, good for you; skip this section, since this is their responsibility. If not, be mindful at this stage. These are the fundamentals that could cause a domino effect, delaying rollout and the transitions between the different product stages. How? At times, startups get desperate, shift quickly, and forget about the company’s core product or offering.

Before you even start talking colors, layout, etc., take a step back and make sure you have at least these questions answered.

[Checklist: fundamentals before starting UI development]

Will we be building a Public or an Enterprise type of app? (There are others, but I am sticking to these two examples for now.)

[Comparison: Public vs. Enterprise apps]

Let’s Get Started

So we join a startup that is revolutionizing the way products are sold and delivered. Additionally, the company collects a ton of information and claims it can analyze it and help with reports. For the moment, all of this will be displayed on a website.

  • Core Product = Custom software that makes it easier to sell everywhere, from Amazon and eBay to their own online stores
  • What problem does it solve = Greater sales, easier maintenance, robust tracking and analytics
  • Target Audience = Large retail companies (ENTERPRISE)

It is your first day on the job and they say, “We are doing a pitch to Toys ‘R’ Us at the end of the month. Do you think you can have something done by then?” That’s less than two weeks away. Welcome to the startup world!

Here is some additional information that was gathered:

  • They have been presenting PowerPoint slides with mockups
  • There is a lot of interest in the Core Product
  • Some potential customers like the widgets but prefer that they be organized differently
  • Some widgets have not even been finalized because the back end is still working on them

Not perfect, but you can work with this.

Based on the information gathered, the requirements consist of:

  • Get a fundamental UI in place that displays data decently
  • Many widgets (tables, charts, etc.) will be used throughout the UI; we are just not sure where they will sit in each view
  • The ability to roll out new features (widgets) quickly and add them to the different views
  • There is an existing API that can provide all the information needed

Technology Stack

Thought only the back end dealt with technology stacks? Guess again. Below are some front-end frameworks and tools used to help with development.

[Diagram: popular front-end technology stacks]

Chances are you will end up using three or more of the above for your development: a CSS framework, a JavaScript framework, and a build system.

Why? Many of the requirements and problems can be addressed by using a JavaScript framework. To start getting feedback ASAP, we utilize a CSS framework, and to help with development, a JavaScript build system.

Choose wisely. All of these tools are great, but each has its pros and cons.

Some friendly advice:

  • Choose tools that you either know or can pick up quickly
  • Make sure you have sources to get help
  • If you need to bring in outside help or hire new team members, make sure there are nearby sources of talent (user groups are a great source)

For the purposes of this exercise, we are going with:

[Image: our chosen stack – Angular, Bootstrap, Sass, Grunt, and Bower]

Why?

  • Angular, an HTML5/JS framework, for its modularity and, specifically for us, its directives (widgets), services (API calls), and user forums
  • Bootstrap, a CSS framework, because of my experience and comfort level with it
  • Sass, to set variables such as company colors and re-use them across the different files
  • Grunt as the build tool, local web server, and runner of miscellaneous tasks (deploy, build, concat)
  • Bower for dependency management

In the next blog post, we will start to write some code and create some views leveraging the tools we have chosen to go with.

How Predikto hired thousands of data scientists

GE’s Jeff Immelt was recently interviewed about the predictive analytics investments and overall initiatives that have been ongoing over the last five years within his walls. The transcript, available here, is an excellent read on how a legacy company is attempting to transform itself for the digital future, leveraging vast amounts of sensor data to predict failure in large machinery. This marks a pivotal moment in GE’s history, where turning around a Titanic-sized ship won’t be a trivial matter. The build-out began five years ago with a massive scaling of its data science and predictive analytics division.

Immelt seemed to drive one point home more than any other in this interview: the mass hiring of data scientists (and ancillary staff) to accomplish the goal of building out the Predix division.

We have probably hired, since we started this, a couple thousand data scientists and people like that. That’s going to continue to grow and multiply. What we’ve found is we’ve got to hire new product managers, different kinds of commercial people. It’s going to be in the thousands.

We also hired thousands of Data Scientists (although we didn’t hire any “people like that”), so I figured I would shed some light on why and how we accomplished this.

The Need for Data Scientists

Data scientists are the cornerstone of the machine learning world. Generally speaking, data scientists come from varied backgrounds: mechanical engineering, electrical engineering, and statistics, to name a few. Their function within a predictive analytics organization is (putting it simply) to make sense of the data and select the features that influence the predictive models. Feature selection goes hand-in-hand with making sense of the data, in that the data scientist analyzes large amounts of data, often with sophisticated software, to choose which sensor readings, external factors, and derivations or combinations of each truly impact whether some *thing* will fail or not. Data scientists are the tip of the spear in determining which features, readings, and factors matter and which predictive/mathematical models should be trained and applied to forecast events and probabilities of failure.

We faced the same crossroads as GE: data scientists are essential to getting things right, and you need a lot of them when analyzing machine data. We aren’t talking about a few terabytes of data here. No, you’re typically looking at hundreds of terabytes generated by a system in a month… every month… for years.

Scaling the Data Science Team

Big data demands a big data science team, and to that end we, like GE, had to employ thousands of data scientists.

Unlike GE, our data scientists don’t have names or desks. They don’t require ancillary staff nor coffee to stay awake.

Our data scientists work 24 hours a day, 7 days a week, 365 days a year and never tire or complain. Larger dataset? Our data science team clones itself to meet the demand elastically.

Predikto has a unique approach to machine learning and data science. Our data scientists are tiny workers operating on multi-core computers in a distributed environment, acting as one. Just like machines automated many of the mundane human tasks during the industrial revolution, Predikto has automated machine learning and the mundane tasks once accomplished by humans. Our feature selection? Automated. Feature scoring? Automated. Training models? Automated.
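As a toy illustration of the kind of steps being automated, here is a sketch using scikit-learn on synthetic data; it shows feature scoring, selection, and model training in miniature and is not Predikto’s actual implementation:

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(1000, 50)  # 50 candidate sensor-derived features
# Synthetic failures driven by two of the features, plus noise.
y = (X[:, 3] + 0.5 * X[:, 17] + rng.randn(1000) > 1).astype(int)

# Feature scoring/selection: rank candidates against the failure labels.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)

# Model training on the selected features only.
model = LogisticRegression().fit(selector.transform(X), y)

print("top features:", np.argsort(selector.scores_)[::-1][:5])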

I invite you to read the Immelt interview. It truly is a good read on one way to approach building a predictive analytics company. At Predikto, we chose a different path that we felt was innovative and scalable for our own growth plan.

Also a good read… Innovation Happens Elsewhere (http://dreamsongs.com/IHE/IHE-24.html#pgfId-955288)

Using the Spark Datasource API to access a Database

At Predikto, we’re big fans of in-memory distributed processing for large datasets. Much of our processing occurs inside of Spark (speed + scale), and now, with the recently released Datasource API with JDBC connectivity, integrating with any datasource has gotten a lot easier. The Spark documentation covers the basics of the API and Dataframes, but there is a lack of information out there on actually getting this feature to work.

TL;DR: Scroll to the bottom for the complete example.

In this example, I’ll cover PostgreSQL connectivity. Really, any JDBC-driver-supported datasource will work.

First, Spark needs to have the JDBC driver added to its classpath:

import os
os.environ['SPARK_CLASSPATH'] = "/path/to/driver/postgresql-9.3-1103.jdbc41.jar"

Once loaded, create your SparkContext as usual:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
 
sc = SparkContext("local[*]", '')
sqlctx = SQLContext(sc)

Now, we’re ready to load data using the DataSource API. If we don’t specify any criteria, the entire table is loaded into memory:

df = sqlctx.load(
  source="jdbc",
  url="jdbc:postgresql://<host>/<database>?user=<user>&password=<password>",
  dbtable="<schema>.<table>")

  1. source: “jdbc” specifies that we will be using the JDBC DataSource API.
  2. url: The database to connect to.
  3. dbtable: The JDBC table to read from, or possibly a subquery (more about this below).

Using the above code, the ‘load’ call will execute a ‘SELECT * FROM <schema>.<table>’ immediately.

In some cases, we didn’t want an entire DB table loaded into memory, so it took a bit of digging to understand how the new API handles “where” clauses. They really act more like subqueries, where anything valid in a ‘FROM’ clause will work.

query = "(SELECT email_address as email FROM schema.users WHERE user_id <= 1000) as <alias>"

df = sqlctx.load(
  source="jdbc",
  url="jdbc:postgresql://<host>/<database>?user=<user>&password=<password>",
  dbtable=query)

  1. query: This query contains our ‘WHERE’ clause. Note that you must specify an alias for the subquery.

Given the example above, Spark will consume a list of email addresses from our users table for all users with an id <= 1000. Once we have a Dataframe in hand, we can process the data using the API, converting it to an RDD or running SparkSQL queries over the data.
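For instance, either route works once the Dataframe is loaded (the temp table name here is arbitrary):

# Register the Dataframe and query it with SparkSQL...
df.registerTempTable("user_emails")
gmail = sqlctx.sql("SELECT email FROM user_emails WHERE email LIKE '%@gmail.com'")

# ...or drop down to the RDD API for arbitrary transformations.
domains = df.rdd.map(lambda row: row.email.split('@')[-1]).distinct()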

Complete example:
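The snippets above, assembled into one script; the connection details are placeholders and the subquery alias name is arbitrary:

import os
os.environ['SPARK_CLASSPATH'] = "/path/to/driver/postgresql-9.3-1103.jdbc41.jar"

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local[*]", 'jdbc-example')
sqlctx = SQLContext(sc)

# A subquery with a WHERE clause; note the required alias.
query = "(SELECT email_address as email FROM schema.users WHERE user_id <= 1000) as user_emails"

df = sqlctx.load(
    source="jdbc",
    url="jdbc:postgresql://<host>/<database>?user=<user>&password=<password>",
    dbtable=query)

print(df.count())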

Forget the Big Data technology chat and ask “How will I use this?”

The Industrial Internet of Things is going to revolutionize the way enterprises interact with their assets and equipment. Bill Ruh, VP of the Global Software Center at GE, says GE expects there to be 17 billion unique pieces of equipment by 2015, and only 10% of those devices are equipped with sensors today. Most of the equipment that does have sensors lacks the basic intelligence GE hopes to see in the future: the ability to report when something has gone very wrong.

Predikto Analytics is on the front lines, working with asset-intensive organizations whose equipment carries sensors, helping to provide actionable predictions and to prioritize preventative maintenance based on equipment condition. We continue to see great products and tools hit the market with the goal of helping companies “crunch” big data, but the issue is not so much crunching the data as having a business or operational purpose for the data crunching.

We encourage enterprises to ask, “How am I going to use this technology?” The actionable piece of the solution is sometimes left as an afterthought. At Predikto Analytics, we start the conversation with the action our customers will take based on the big data predictive analytics solutions we offer.

Here comes the flood-analytics, Manufacturing Industries!

Yes, we have heard about the magic of predictive analytics and how it has helped various companies and industries predict the rise and fall of corporate finances, social media, marketing, and the like. But could the same predictive analytics methods aid manufacturing industries today?

According to Bala Deshpande’s article, manufacturing industries are not at all oblivious to the idea of collecting data. As a matter of fact, you could call them the forefathers of data collection: manufacturers have been collecting data for years on their current operations and the quality of their products. However, the time has come for manufacturing industries to dig into these data sets more deeply to improve their operations, so much so that they could notably improve their production yield.

The benefits of engaging in predictive analytics appear when the production process becomes even more efficient and unnecessary costs (e.g., unexpected machine failure) are cut. Deshpande highlighted a small manufacturing company that has already started to engage in predictive analytics by installing overhead GPS sensors that record the number of workers on a particular project and whether that project requires assembly, in order to calculate how intensively a machine is being used and to predict machine failures.

Whether manufacturing companies like it or not, engaging in analytics is inevitable, especially if competing manufacturers are using the same predictive analytics to outperform them!