THINKING LIKE A DATA SCIENTIST

Author: Adam                            Updated: 24/05/2019

Data Science (DS) is a new, rapidly growing field consisting of a set of tools and techniques used to extract useful information from data. It is also an interdisciplinary, problem-solving oriented subject that applies scientific techniques to practical problems.

1. A Practitioner's Guide to the IoT Data Pipeline

AI technologies are yielding real business results for manufacturers, but successful AI requires robust data pipelines and a systematic approach. The point of data science is to solve problems and answer questions using data and technology, with the aim of improving future outcomes.

So, what is the data science process? First, we have to be clear about what we are trying to achieve with data science. Data science serves two important but distinct sets of goals: improving the products our customers use, and improving the decisions our business makes. Data products use data science and engineering to improve product performance, typically in the form of better search results, recommendations and automated decisions. Decision science uses data to analyze business metrics, such as growth, engagement, profitability drivers and user feedback, to inform strategy and key business decisions.

With the right asset data in place, analytics provides the capability to make better, more informed decisions. Data modelling is used to collect, store and cut the asset data in an efficient way. Visualisation is used to integrate, consolidate and present asset information in a meaningful way to the right people at the right time.

What can analytics predict?
- Asset degradation and exceedance
- Failure likelihood
- The impact of each intervention type

What can analytics optimise?
- Whole-life cost for the asset portfolio
- Simulation of asset performance based on known environmental conditions
- The long-term workbank

Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages.

CRISP-DM

Success in today's data-oriented business environment requires being able to think about how these fundamental concepts apply to particular business problems, that is, to think data-analytically. Formulating data mining solutions and evaluating the results involves thinking carefully about the context in which they will be used. Here, we can reference a framework that structures our thinking about analytics problems in seven steps.

The seven-step data science thinking process

It is also important to evaluate the insights generated from data science projects.

2. Best Practices When Starting And Working On A Data Science Project

A great data scientist is obsessed with solving problems, not with new tools. The core function of any data-driven role is solving problems by extracting knowledge from data. A great data scientist first strives to understand the problem at hand, then defines the requirements for the solution, and only then decides which tools and techniques are the best fit for the task. In most business cases, the stakeholders you interact with do not care about the tools; they only care about answering tough questions and solving problems. Knowing how to select, use and learn tools and techniques is a minimum requirement for becoming a data scientist. A great data scientist knows that understanding the underpinnings of the business case is paramount to the success of a data science project.

A great data scientist wants to find a solution and knows it will not be perfect. A great data scientist understands that there is almost never a perfect solution, and that a simple, imperfect solution delivered on time is much better than a hypothetically perfect one delivered late. In fact, the Agile software development methodology seeks to prevent analysis paralysis by employing adaptive evolutionary planning, early delivery and continuous improvement. The mindset of a great data scientist works in the same way: they think about solving their stakeholders' problems and know that those problems need to be redefined when new insights are uncovered. The main piece of advice is: don't overthink and over-analyse the problem. Instead, phase your analysis or modelling process into stages and get feedback from the problem owners. This way you ensure that the learning process is continuous and that it improves the decisions with each iteration.

There are best practices to follow when starting and working on a data science project. They cover how you consult, communicate and plan, as well as technical practice.

On the consulting side, the fact is that you are probably acting as a consultant to some degree. Recognize that you are in a position where you need to influence stakeholders to use your insights and take action.

For your work to be successful, you need to be able to communicate what you have done and why your audience should care.

There are almost unlimited ways to cut data and present it. This does not let you off the hook, however. Your stakeholders' time is limited (they have day jobs), so plan how you run your project accordingly.

And finally, for technical best practice (a minimal script sketch follows this list):
- Keep everything (Principle 1: space is cheap, confusion is expensive)
- Keep it simple (Principle 2: prefer simple, visual project structures and conventions)
- Automate (Principle 3: prefer automation with program code)
- Maintain data provenance (Principle 4: maintain a link between data on the file system, data in the analytics environment, and data in work products)
- Version control data and code (Principle 5: version control changes to data and analytics code)
- Consolidate (Principle 6: consolidate team knowledge in version-controlled builds)
- Think like a developer (Principle 7: prefer analytics code that runs from start to finish)
- Test models: always review the standard tests that accompany a model or algorithm, and run models against new data to make sure they perform.
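
As a rough illustration of Principles 3 and 7, the Python sketch below writes an analytics step as a script that runs from start to finish. The file paths and column names are placeholders, not part of any real project.

  # Minimal sketch of an analytics script that runs from start to finish.
  # File paths and column names are hypothetical, for illustration only.
  from pathlib import Path

  import pandas as pd

  RAW_DATA = Path("data/raw/sensor_readings.csv")    # hypothetical raw extract
  OUTPUT = Path("data/processed/features.csv")       # hypothetical work product

  def load_data(path: Path) -> pd.DataFrame:
      # Principle 1: keep the raw extract exactly as delivered and read it every run.
      return pd.read_csv(path)

  def clean(df: pd.DataFrame) -> pd.DataFrame:
      # Apply the same cleaning rules on every run so results stay reproducible.
      df = df.drop_duplicates()
      return df.dropna(subset=["asset_id", "timestamp"])   # placeholder columns

  def build_features(df: pd.DataFrame) -> pd.DataFrame:
      # Derive analysis-ready features from the cleaned data.
      df["timestamp"] = pd.to_datetime(df["timestamp"])
      return df.sort_values(["asset_id", "timestamp"])

  def main() -> None:
      # Principle 7: the whole pipeline runs end to end with one command.
      df = build_features(clean(load_data(RAW_DATA)))
      OUTPUT.parent.mkdir(parents=True, exist_ok=True)
      df.to_csv(OUTPUT, index=False)                       # provenance: raw in, processed out

  if __name__ == "__main__":
      main()

Keeping both the script and the locations of its inputs and outputs under version control is what ties Principles 4 and 5 together.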

3. Demo: Predictive Maintenance | Predicting the RUL (Remaining Useful Life) of a Railway Asset

How to generate insights from your rolling stock data

Understanding the Business

To fully understand the problem, we need to frame the problem, clarify the questions, and define how the outcome will be measured.

Predicting RUL is a central goal of predictive maintenance algorithms

The remaining useful life (RUL) of a component is the expected life or usage time remaining before the component requires repair or replacement. Predicting remaining useful life from system data is a central goal of predictive-maintenance algorithms. The term lifetime or usage time here refers to the life of the component defined in terms of whatever quantity you use to measure system life. Units of lifetime can be quantities such as distance travelled (miles), fuel consumed (gallons), repetition cycles performed, or time since the start of operation (days). Similarly, time evolution can mean the evolution of a value with any such quantity. Typically, you estimate the RUL of a system by developing a model that performs the estimation based on the time evolution or statistical properties of condition indicator values.

Predictions from such models are statistical estimates with associated uncertainty. They provide a probability distribution of the RUL of the test component.
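
To make this concrete, here is a minimal sketch, assuming a Weibull failure-time model and purely synthetic mileage data, of how an RUL estimate can be reported as a distribution (a median plus an interval) rather than a single number.

  # Minimal sketch: an RUL prediction is a probability distribution, not a point value.
  # The failure mileages below are synthetic and purely illustrative.
  import numpy as np
  from scipy.stats import weibull_min

  rng = np.random.default_rng(0)
  failure_km = rng.weibull(1.5, size=200) * 300_000   # synthetic mileage-at-failure data

  # Fit a two-parameter Weibull distribution to the observed failure mileages.
  shape, loc, scale = weibull_min.fit(failure_km, floc=0)
  dist = weibull_min(shape, loc, scale)

  def conditional_rul(frozen_dist, current_usage, survival_prob):
      # RUL r such that P(T > current_usage + r | T > current_usage) = survival_prob.
      return frozen_dist.isf(survival_prob * frozen_dist.sf(current_usage)) - current_usage

  current_km = 150_000                                # usage of the component under test
  median_rul = conditional_rul(dist, current_km, 0.5)
  low = conditional_rul(dist, current_km, 0.9)        # 10th percentile of RUL
  high = conditional_rul(dist, current_km, 0.1)       # 90th percentile of RUL
  print(f"Median RUL ~ {median_rul:,.0f} km (80% interval: {low:,.0f} to {high:,.0f} km)")

Reporting an interval alongside the median makes the uncertainty of the estimate visible to the maintenance planner.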

The Predictive Challenge

The main areas of application are as follows:
- Use survival analysis (and advanced probability distributions such as the Weibull) to predict the lifetime of railway components
- Build (non-)parametric models for failure-time data to explore the lifetime of components based on age and usage data
- Find the important features that contribute to the failure of a component using Cox's proportional hazards model
- Build component-risk models to identify, in advance, the components that get overused, so that they can be maintained properly

Survival analysis attempts to answer questions such as these (see the sketch after this list):
- What is the average time to failure?
- What is the probability at a particular time that failure will occur?
- How do survival rates compare between different groups?
- What impact do other features have on survival rates?
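
The sketch below, assuming the lifelines Python library and synthetic component data with placeholder column names, shows how several of these questions might be approached.

  # Minimal sketch of the survival-analysis questions above, using lifelines
  # on synthetic component data (column names are illustrative placeholders).
  import numpy as np
  import pandas as pd
  from lifelines import KaplanMeierFitter, CoxPHFitter

  rng = np.random.default_rng(1)
  n = 300
  df = pd.DataFrame({
      "duration_km": rng.weibull(1.5, n) * 250_000,   # usage until failure or censoring
      "failed": rng.integers(0, 2, n),                # 1 = failure observed, 0 = censored
      "supplier_b": rng.integers(0, 2, n),            # example covariate
      "avg_load": rng.normal(0, 1, n),                # example covariate
  })

  # Average (median) time to failure and survival probability at 100,000 km.
  kmf = KaplanMeierFitter()
  kmf.fit(df["duration_km"], event_observed=df["failed"])
  print("Median time to failure:", kmf.median_survival_time_)
  print("P(survive beyond 100,000 km):", float(kmf.survival_function_at_times(100_000).iloc[0]))

  # Impact of covariates on survival, via a Cox proportional hazards model.
  cph = CoxPHFitter()
  cph.fit(df, duration_col="duration_km", event_col="failed")
  print(cph.summary[["coef", "exp(coef)", "p"]])

Fitting separate Kaplan-Meier curves per group (for example, per supplier) is the usual way to compare survival rates between groups.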

RUL App: Remaining Useful Life Prediction Using the Weibull Proportional Hazard Model (WPHM) REST API

Analytics Approach

Express the problem in the context of statistical and machine learning techniques, for example (a sketch of the regression framing follows this list):
- Regression: "Predict the expected time to failure."
- Classification: "Does this machine have failure A or failure B, or is it normal?"
- Clustering: "Are there groups of diagnostic messages that always precede a failure?"
- Recommendation: "What action should I take if I see this error message?"
- Outlier detection
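
As one illustration of the regression framing, the sketch below uses scikit-learn on synthetic features with assumed names to predict an expected mileage to failure; it is not the demo's actual model.

  # Minimal sketch of the regression framing ("predict the expected time to failure")
  # on synthetic sensor features; feature meanings and the target are illustrative.
  import numpy as np
  from sklearn.ensemble import GradientBoostingRegressor
  from sklearn.metrics import mean_absolute_error
  from sklearn.model_selection import train_test_split

  rng = np.random.default_rng(2)
  n = 500
  X = rng.normal(size=(n, 4))                                       # e.g. vibration, temperature, current, load
  rul_km = 200_000 - 40_000 * X[:, 0] + rng.normal(0, 10_000, n)    # synthetic target

  X_train, X_test, y_train, y_test = train_test_split(X, rul_km, random_state=0)
  model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
  print("MAE (km):", mean_absolute_error(y_test, model.predict(X_test)))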

Models for predicting Remaining Useful Life

Degradation models extrapolate past behavior to predict the future condition.

This type of RUL calculation fits a model to the degradation profile of a condition indicator, given the degradation profiles in our ensemble. It then uses the degradation profile of the test component to statistically compute the remaining time until the indicator reaches some prescribed threshold. These models are most useful when there is a known value of the condition indicator that indicates failure. Two example degradation model types are listed below, followed by a simple extrapolation sketch:

  1. Linear degradation model — Describes the degradation behavior as a linear stochastic process with an offset term. Linear degradation models are useful when the system does not experience cumulative degradation.

  2. Exponential degradation model — Describes the degradation behavior as an exponential stochastic process with an offset term. Exponential degradation models are useful when the test component experiences cumulative degradation.
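
The following minimal sketch illustrates the core degradation-model idea in Python: fit a trend to a condition indicator and extrapolate to a failure threshold. The data, threshold and plain linear fit are illustrative assumptions, not the stochastic-process models described above.

  # Minimal sketch of degradation-based RUL: fit a linear trend to a condition
  # indicator and extrapolate to an assumed failure threshold (synthetic data).
  import numpy as np

  rng = np.random.default_rng(3)
  hours = np.arange(0, 500, 10.0)
  indicator = 0.02 * hours + rng.normal(0, 0.5, hours.size)   # synthetic degradation profile
  FAILURE_THRESHOLD = 15.0                                    # assumed failure level

  slope, offset = np.polyfit(hours, indicator, 1)             # linear degradation fit
  hours_at_threshold = (FAILURE_THRESHOLD - offset) / slope
  rul_hours = hours_at_threshold - hours[-1]
  print(f"Estimated RUL: {rul_hours:.0f} operating hours")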

Survival analysis is a statistical method used to model time-to-event data. It is useful when we do not have complete run-to-failure histories, but instead have:

Only data about the life spans of similar components. For example, we might know how many miles each engine in the ensemble ran before needing maintenance, or how many hours of operation each component in the ensemble ran before failure. In this case, using the historical information on the failure times of a fleet of similar components, the model estimates the probability distribution of the failure times. That distribution is then used to estimate the RUL of the test component.

Both life spans and some other variable data (covariates) that correlate with the RUL. Covariates, also called environmental variables or explanatory variables, comprise information such as the component provider, the regimes in which the component was used, or the manufacturing batch.
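
As a sketch of this covariate case, assuming the lifelines library and synthetic data with placeholder column names, a Weibull accelerated-failure-time model (closely related to the Weibull proportional hazards form mentioned earlier) can relate lifetimes to covariates and predict a median lifetime for a new component.

  # Minimal sketch: covariate-based Weibull survival model on synthetic data.
  import numpy as np
  import pandas as pd
  from lifelines import WeibullAFTFitter

  rng = np.random.default_rng(4)
  n = 400
  df = pd.DataFrame({
      "duration_km": rng.weibull(1.5, n) * 250_000,   # usage until failure or censoring
      "failed": rng.integers(0, 2, n),                # 1 = failure observed, 0 = censored
      "manufacturing_batch": rng.integers(0, 3, n),   # example covariate
      "duty_cycle": rng.normal(0, 1, n),              # example covariate
  })

  aft = WeibullAFTFitter()
  aft.fit(df, duration_col="duration_km", event_col="failed")

  # Predict a lifetime quantity for a new component, conditional on its covariates.
  new_component = df.drop(columns=["duration_km", "failed"]).iloc[[0]]
  print("Predicted median lifetime (km):", aft.predict_median(new_component).squeeze())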

Data Compilation

Data Preparation

Data preparation is arguably the most time-consuming step: by common estimates, 80% of the entire DS process is spent on data cleaning and preparation. Data preparation encompasses all activities needed to construct and clean the data set. Accelerate data preparation by automating the common steps.
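
The sketch below shows what automating a few common IoT preparation steps might look like with pandas; the raw frame is synthetic, and the column names and one-hour resolution are assumptions for illustration.

  # Minimal sketch of common IoT data-preparation steps, automated with pandas.
  import numpy as np
  import pandas as pd

  rng = np.random.default_rng(5)
  raw = pd.DataFrame({
      "asset_id": "unit_001",
      "timestamp": pd.date_range("2019-01-01", periods=500, freq="15min"),
      "vibration": rng.normal(1.0, 0.1, 500),
  })
  raw.loc[rng.integers(0, 500, 20), "vibration"] = np.nan   # simulate missing readings

  prepared = (
      raw.drop_duplicates(subset=["asset_id", "timestamp"])  # remove duplicate readings
         .set_index("timestamp")
         .groupby("asset_id")["vibration"]
         .resample("1h").mean()                              # align to a common time grid
         .interpolate()                                      # fill short gaps
         .reset_index()
  )

  # Rolling statistics are simple condition indicators for later modelling.
  prepared["vibration_rolling_mean"] = (
      prepared.groupby("asset_id")["vibration"]
              .transform(lambda s: s.rolling(6, min_periods=1).mean())
  )
  print(prepared.head())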

Modelling

Model Evaluation

Deployment and feedback

4. Final Thoughts

Getting the foundations in place, an integrated single source of accurate asset data, is key to delivering improved decision making. With the data in place, we deliver insight that supports key investment decisions through analytics. It is then important to clearly articulate the business outcomes and benefits that come from making better decisions. In order to deliver all the capabilities of a data analytics solution, a complete platform needs to be implemented, as shown below:

A complete data analytics platform

The output of modeling may be strikingly rich, but it’s valuable only if managers and, in many cases, frontline employees understand and use it.

And finally, increase the probability of success by involving key stakeholders from the beginning.