The Scalability of Data Science: Part 2 – The Reality of Deployment

To put my previous post into perspective, let me give you a for-instance. An organization wants to develop and deploy a predictive analytics solution for an entire class of commuter trains. Let’s be modest and go with 10 different instances from within the data (e.g., 1) predicting engine failure, 2) turbocharger pressure loss, 3) door malfunction, and so on). We’ll focus on just one.

Data from dozens of assets (i.e., trains) are streaming in every second or faster, and these data must be cleaned and aggregated with other data sources. It’s a big deal to get just this far. Next you have to become an expert in the data and begin cleaning and developing context-based features from the raw source data. This is where art comes into play, and this part is difficult and time-consuming for data scientists. Once a set of inputs has been established, then comes the easier part: applying one or more appropriate statistical models to predict something (e.g., event occurrence, time to failure, latent class), followed by validating and deploying the results. Oh yes, let’s not forget the oft-unspoken reality of threshold settings for the customer (i.e., the costs of TPs vs. FPs, etc.). To this point, we’re assuming that the solution has value, and it’s important to keep in mind that the data science team has probably never seen this sort of data before.
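To make the threshold point concrete, here is a minimal Python sketch of choosing an alert threshold from the value of a true positive versus the cost of a false positive. Everything in it is an assumption for illustration: the labels and scores are made up, and the function names and the dollar-like values (`value_tp`, `cost_fp`) are hypothetical, not anything from a real customer engagement.

```python
from typing import Sequence


def net_value(y_true: Sequence[int], scores: Sequence[float],
              threshold: float, value_tp: float, cost_fp: float) -> float:
    """Net value of alerting on every score at or above the threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    return tp * value_tp - fp * cost_fp


def best_threshold(y_true: Sequence[int], scores: Sequence[float],
                   value_tp: float, cost_fp: float) -> float:
    """Try each observed score (plus 'never alert') and keep the best one."""
    candidates = sorted(set(scores)) + [float("inf")]
    return max(candidates,
               key=lambda t: net_value(y_true, scores, t, value_tp, cost_fp))


# Made-up validation data: 1 = failure occurred, 0 = no failure.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

# Catching a failure is worth far more than a needless inspection costs.
alert_eagerly = best_threshold(y_true, scores, value_tp=10, cost_fp=1)

# Needless inspections are expensive relative to a caught failure.
alert_rarely = best_threshold(y_true, scores, value_tp=1, cost_fp=10)
```

The same model and the same scores give very different thresholds depending on the customer's economics, which is why this conversation can't be skipped, and why it has to be repeated for each instance.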

So on top of requiring computer programming skills, feature engineering prowess (which is art), an understanding of statistics/machine learning, and communication skills good enough both to learn from the customer about their data and to “sell” the solution, all of this must be accomplished in a reasonable amount of time. We’re talking about 1 instance to this point, remember? And we’re still not deployed. Do you have expertise in deploying this kind of solution for the customer? Now repeat this situation ten times and you’re closer to reality. Your team may have just filled up the next 12 months of work, and the utility of the solution is still unknown.