Data Science, Part I

24.01.22 06:03 PM By Ronald Stites

What is Data Science?

Among the hottest buzzwords in industry today are:

    • Data Science
    • Big Data
    • Data Mining
    • Statistics/Data Analytics
    • Modeling
    • Artificial Intelligence (AI)/Machine Learning

But what do they mean? Are they new, space-age discoveries that only a PhD in Statistics or Computer Science can understand?

Not really. Like so many things, the truth is hidden in a maze of jargon. The concepts are relatively simple and with the help of the personal computer, the calculations are much more manageable than in the past.


This little paper is designed to strip away much of the mystique and get down to the business utility of Data Science.


First – Some Background:


There are some things that have changed. The personal computer, the Internet, and the rise of many cheap methods for capturing data and using large databases have changed the landscape. It is now possible to capture incredibly detailed measurements about most processes, store them retrievably and analyze those data quickly and inexpensively. We have become obsessed with collecting data – mostly because we can. Data Science has recently come on the scene as an answer to the question, “Can we do anything useful with all these data?”


Data Science has been around since the Egyptians first started tracking the timing and duration of the Nile flood, along with a good deal of other miscellaneous data. They gathered information about the flood, crop practices, crop yields and the movements of the stars and planets. One of the first Data Scientists noticed that when the Star of Isis (Sirius) appeared to rise just before the sun, the Nile floods would soon start. Farmers planned their activities accordingly, and the Egyptians flourished even though they lived in the center of a vast, dry wasteland. History indicates that Data Science was around at least by 3285 BCE.


Now this may be a bit of a stretch, since the Egyptians wrapped lots of religious and cultic notions around their science. Data Science was clouded by the occult and superstition until the experimental, scientific revolution that began in the 15th century (or so). Galileo, around 1600 AD, is often cited as the first experimenter who concerned himself with data accuracy and precision in an objective way. It was the hard sciences of chemistry, physics and astronomy that pushed forward scientific data analysis. Later, botany, genetics and agronomy pushed data analysis into areas where understanding random, experimental “error” became critical.

Over time, it became clear that statistical analysis of experimental data was crucial for understanding the usefulness of the data. Initially, this randomness was attributed to unavoidable errors or limitations in measurement technologies. By the early 20th century, however, it had become clear that all systems displayed probabilistic behavior to a greater or lesser degree. This was especially true of sub-microscopic phenomena (e.g. photons, electrons, etc.) and complex biological phenomena (e.g. mutation/variations, psychological trends and preferences, etc.).


We will return to these probabilistic issues in a later blog, but for now, be aware that the mathematics of data management – especially statistics – was a well-developed field long before we were talking about Big Data. Data Science has borrowed much of its analytical approach from the physical, biological and social sciences.

Data Science is essentially the:

    • Collection of data,
    • Categorization of those data,
    • Creation of predictive models from the categorized data, and
    • Application of those models for useful purposes (a minimal sketch in code follows this list).
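In code, that four-step loop can be sketched in just a few lines. This is a minimal sketch only; the data, column names and model choice are all invented for illustration (Python with pandas and scikit-learn assumed):

    # A minimal sketch of the four steps; all data below are invented.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # 1. Collection: gather raw observations (invented here).
    raw = pd.DataFrame({
        "feature_a": [1.0, 2.1, 0.7, 3.3, 2.8, 0.5],
        "feature_b": [5.0, 3.2, 6.1, 1.0, 1.5, 6.5],
        "outcome":   [0, 1, 0, 1, 1, 0],
    })

    # 2. Categorizing: clean the data and separate features from labels.
    raw = raw.dropna()
    X, y = raw[["feature_a", "feature_b"]], raw["outcome"]

    # 3. Modeling: fit a predictive model to the categorized data.
    model = LogisticRegression().fit(X, y)

    # 4. Application: use the model to predict a new, unseen case.
    new = pd.DataFrame({"feature_a": [1.2], "feature_b": [3.4]})
    print(model.predict(new))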

The various buzzwords tend to focus on some aspect of this process and emphasize specific tools.


Some of the Buzzwords:


Two closely related terms, Big Data and Data Mining, began to emerge around 2000 and by 2010 had become the hot items in IT. The growth of data storage capabilities during the 1990s made it possible to access terabytes of data. By the early 2000s, developments in parallel processing and large, volatile memory chips had dramatically increased the speed of data searching and handling. It became practical to investigate gigabytes of data simultaneously in fast, volatile memory rather than combing through multiple magnetic or optical discs, and to do rapid statistical analysis on gigabytes of data, hunting for potential relationships between data elements and including many simultaneous variables in linear and non-linear models.


By around 2010, the collection, storage, retrieval and investigation of gigabytes and terabytes of data became known as Big Data. The methods for collecting, storing, searching and retrieving relevant data from existing data became known as Data Mining. When we talk about Data Mining, we are focusing on the challenges of retrieving useful data from a variety of independent sources. Frequently, these sources are “legacy” databases created by old hardware and software. Sometimes these databases have little structure and are of questionable quality. The Data Mining expert deals with all these potential issues when trying to sort through giga- and terabytes of data.

Two other buzzwords that are often used interchangeably are Statistics and Data Analytics. They are not the same. Statistics is a body of scientific, mathematical principles and theorems based on various types of data randomness. It first developed from studying games of chance (dice, cards, etc.), but became closely associated with the experimental sciences (especially agronomy, botany and chemistry).


Data Analytics uses Statistics extensively. Data and data trends are studied to see if “statistically valid relationships exist.” The question is, “Are there persistent relationships that cannot be explained simply by random chance?” The hope is that we can find relationships that are persistent enough to be used to predict the future. Ideally, the relationships are so persistent that they could be called “laws” that “cause” the results that we observe. We start by looking at “correlations” and then test them to see how “important” and persistent they are. In the hard sciences we can often run enough controlled experiments that we can start talking about “causality.” This is much less common in business and social sciences, where we are often working “in the wild” or with “secondary” data – that is, data we have “inherited” and not generated under conditions that we controlled.
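As an illustration of that first step, here is a minimal sketch of testing whether a correlation is stronger than random chance alone would explain. The numbers are invented; scipy’s pearsonr conveniently returns both the correlation and a p-value:

    # Is the relationship stronger than random chance? (invented data)
    import numpy as np
    from scipy.stats import pearsonr

    ad_spend = np.array([10, 12, 15, 18, 20, 25, 28, 30])    # hypothetical
    sales    = np.array([95, 98, 110, 112, 120, 133, 135, 142])

    r, p_value = pearsonr(ad_spend, sales)
    print(f"correlation r = {r:.3f}, p-value = {p_value:.4f}")
    # A small p-value suggests the correlation is unlikely to be pure
    # chance; it does NOT, by itself, establish that A causes B.

Remember the caveat above: a persistent correlation is where the work starts, not where it ends.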

Data Analytics is more than just statistical analysis, however. It includes evaluating the completeness and reliability of the data being used for statistical analysis. This has been one of the greatest challenges for Data Analysts. Historical data is never complete or without obvious and suspected errors. Data Analysts must decide what to do with such data without biasing the data set. There is constant tension between deciding which data are “good enough” and which data must be rejected as incomplete, inaccurate or irrelevant. The users of data should also be aware of the “judgements” that Data Analysts must make. Those “judgements” could make the analysis and conclusions biased or irrelevant.
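Those “judgements” show up directly in code. A minimal sketch with pandas (the numbers are invented): dropping incomplete rows and filling in gaps are both defensible choices, and they give different answers:

    # Two defensible judgement calls, two different answers (invented data).
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "price": [100, 105, np.nan, 120, 98],
        "sales": [50, 48, 51, np.nan, 55],
    })

    dropped = df.dropna()            # judgement 1: discard incomplete rows
    filled  = df.fillna(df.mean())   # judgement 2: fill gaps with averages

    print(dropped.mean())            # the two choices yield different
    print(filled.mean())             # summary statistics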

Anyone who thinks that Data Science is purely objective does not understand it at all. The Data Scientist must “prejudge” many aspects. We can only hope that this does not lead to intentional or even unintentional prejudice.


The goal of Data Science activities is to “learn” in the sense of building and validating models that use measurements of readily observable factors to predict important results. This notion is worth spending time pondering. When we “learn” we are creating “models” in our head that organize distinct observations and relate them to each other. We are hoping that A causes B with enough reliability that when we see A, we can get ready for B.


An example might be using data we have collected about buyer preferences. Assume we have done a survey of customer preferences that included satisfaction/dissatisfaction with how well our product lasted. Now suppose that our competition extended their warranty from ninety days to a year. We might wonder if we needed to extend our warranty program. We would be scrutinizing our data to see if it could be used to make a statistically valid model of how our customers will react to our competitor’s action. If we “knew” that, we might be able to react to minimize damage or even gain an advantage in the “exchange.” If we “knew” our customers were very satisfied with our product’s reliability, we might do something quite unexpected. We might run an ad campaign that emphasized our product longevity and make our competitor’s “need” for a warranty extension a negative in the minds of the consumer.
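To make one small piece of that example concrete, here is a minimal sketch of estimating how satisfied customers really are with product longevity, with a confidence interval around the survey result. The counts are invented, and statsmodels’ proportion_confint is assumed to be available:

    # How confident are we in our measured satisfaction rate? (invented)
    from statsmodels.stats.proportion import proportion_confint

    satisfied = 438   # hypothetical "satisfied with durability" answers
    surveyed  = 500   # hypothetical total survey responses

    low, high = proportion_confint(satisfied, surveyed, alpha=0.05)
    print(f"satisfaction: {satisfied / surveyed:.1%}, "
          f"95% CI: [{low:.1%}, {high:.1%}]")
    # If even the low end of the interval is high, advertising our
    # longevity may beat matching the competitor's warranty extension.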


Modeling is rife with challenges, especially when all data are historical rather than experimental. Experiments are designed to control all factors that seem to be important and to randomly manipulate a few factors to see how results of interest are impacted. Good experimental design can often build a very strong case for causal relationships between a small number of very important factors and results. Historical data is rarely so nice and neat.


Historical data is collected as it happens. It frequently does not follow an experimental design, nor is there much hope that all relevant factors were observed. There can be important factors that were not measured that “confound” the data. An example might be price and sales data that were taken during a period when the economy was slipping into a recession. Unless the analysis considered macro-economic factors, it could be useless. And finally, historical data is always plagued by errors and missing data points.
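The price-and-sales example can be made concrete with a short simulation. In this sketch (simulated data, with statsmodels assumed), prices happen to be cut during recessions; regressing sales on price alone then gets the price effect badly wrong, while adding the recession indicator recovers it:

    # Omitting a macro-economic factor distorts the price effect.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 200
    recession = rng.integers(0, 2, n)                   # 1 = recession
    price = 100 - 5 * recession + rng.normal(0, 2, n)   # cuts in recessions
    sales = 500 - 2 * price - 40 * recession + rng.normal(0, 5, n)

    naive = sm.OLS(sales, sm.add_constant(price)).fit()
    controlled = sm.OLS(
        sales, sm.add_constant(np.column_stack([price, recession]))).fit()

    print("price effect, no control:  ", naive.params[1])       # biased
    print("price effect, with control:", controlled.params[1])  # near -2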

The famous statistician Dr. George Box was noted for saying, “All models are wrong, but some models are useful.” Models are useful, numerical approximations of reality if they can reliably predict important factors from easily measured ones. They may be useful for a season and pass away into history, or they may capture very persistent relationships that seem to always hold reasonably true. Hence, the users of Models need to be careful to be neither too demanding of accuracy nor too certain of predictions. One of the arts in Modeling is for the Data Scientist to understand the data, the relationships and the data needs well enough to communicate to data users the reliability and limitations of any Model they create.


Pulling it all together – Artificial Intelligence (AI) and Machine Learning:


Perhaps the sexiest concepts in Data Science today are AI and Machine Learning. They are closely related: AI focuses more on the process of “thinking” itself, while Machine Learning focuses more on gathering data and putting that “thinking” to use. The concepts bring together much of what we have discussed so far. Let’s start with Machine Learning.

The clearest way to think about Machine Learning is to imagine an example – the Roomba. This amazing device uses Machine Learning to navigate your house and vacuum your floors. How does it do it?


The Roomba has multiple sensors for detecting when it runs into something and multiple pre-programmed actions to try to continue vacuuming. What it does differently from a “dumb machine” is that it remembers bumping into something and tries to map out a “better” performance next time. The Machine Learning part of Roomba is the sophisticated sensing equipment that is tied to a large memory and a fast, powerful processing system able to recall and use gigabytes of data. Roomba can “learn” to avoid obstacles by rote, but optimizing the time and efficiency of vacuuming requires a bit more. This is where we tend to talk more about AI.

AI is focused on optimizing the algorithms for doing a task. It requires knowing about the task, guessing at reasonable optimizing actions, attempting many different actions, measuring results and then selecting and refining the decision algorithms that work the best. These algorithms include a decision tree for taking action and a Model to help select optimal solutions.


A simple algorithm might be, “If you can’t go forward, then try going right, repeat until you are going forward again.” A more sophisticated algorithm will look at many factors before just deciding to go right. Current position in the room and a history of what has happened in the past at that position would come into play. Furthermore, the outcomes of previous decisions made at that position would be considered.
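Both levels can be sketched in a few lines of Python. Everything here is invented for illustration (the headings, the memory layout, the can_go_forward test); the point is the difference between the rote rule and one that consults a memory of past outcomes at the current position:

    # Sketch only: grid, headings and memory layout are invented.
    DIRECTIONS = ["N", "E", "S", "W"]        # turning right = next in list

    def turn_right(heading):
        return DIRECTIONS[(DIRECTIONS.index(heading) + 1) % 4]

    def simple_step(can_go_forward, heading):
        # "If you can't go forward, try going right; repeat."
        for _ in range(4):                   # at most one full rotation
            if can_go_forward(heading):
                return heading
            heading = turn_right(heading)
        return heading                       # boxed in; stay put

    def learned_step(can_go_forward, heading, position, memory):
        # Also consult outcomes of previous decisions at this position.
        blocked = memory.setdefault(position, set())
        for _ in range(4):
            if heading not in blocked:
                if can_go_forward(heading):
                    return heading
                blocked.add(heading)         # remember this bump
            heading = turn_right(heading)
        return heading                       # boxed in; stay put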


AI isn’t very impressive if it is nothing more than remembering not to repeat mistakes. It becomes much more useful when algorithms are developed that successfully navigate “changes” to conditions – that is, when good decisions are made even when a chair is arbitrarily placed in the room. “Good” decisions generally follow from certain decision patterns. If we “learn” these patterns, incorporate them into the decision algorithm(s) and then modify them with the newest data, we are creating a much more powerful AI application.

Machine Learning and AI can also be illustrated in a data management project. Suppose we wish to use AI to set asking prices for houses we are selling. We would certainly want a large database of asking and selling prices for “relevant” houses. That could mean houses in the neighborhood and of similar size, but a whole host of other features could also be relevant. Sadly, tastes, budgets, interest rates, consumer confidence and even our own financial goals are not static. If we could gather large chunks of data and measure how they interact to produce quick, profitable sales, we might create an AI algorithm that would give us a distinct advantage.

We would use Machine Learning to go out and constantly get the “relevant data” and update our algorithms based on results we achieved in the marketplace. We would try many different optimizing algorithms to see which ones gave us the best result. Then we would have to try to find decision trees to integrate with our Models that effectively respond to the many changing features of the real estate business. Once we did that and verified that it produced improved decision making, we would be well on our way to having an AI based real estate business.
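A minimal sketch of that loop might look like the following. Every name here is hypothetical (the features, the data source, the choice of a random forest); the point is the shape: fit a price model on past sales, use it to suggest asking prices, and refit as new market results arrive:

    # Sketch of the learn-and-update loop; all names are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    FEATURES = ["sqft", "bedrooms", "neighborhood_index", "interest_rate"]

    def fit_price_model(sales: pd.DataFrame) -> RandomForestRegressor:
        # "Relevant" houses are a judgement call made upstream of here.
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        model.fit(sales[FEATURES], sales["sale_price"])
        return model

    def suggest_asking_price(model, house: dict) -> float:
        return float(model.predict(pd.DataFrame([house])[FEATURES])[0])

    # The Machine Learning part: refit as new results arrive.
    # history = pd.read_csv("closed_sales.csv")   # hypothetical source
    # model = fit_price_model(history)
    # ask = suggest_asking_price(model, {"sqft": 1850, "bedrooms": 3,
    #                                    "neighborhood_index": 0.7,
    #                                    "interest_rate": 6.5})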


This all sounds easy enough until you look into the details. It is one thing to allow Roomba to stumble and fumble around learning how to vacuum a room, but a whole different matter to allow a new AI system to replace a real estate “expert” on setting house prices.

The history of AI and Machine Learning is full of spectacular failures. Nevertheless, there are many, less spectacular successes. We will talk about some of the failures and successes in future blogs. Like with any technology deployment, there are constraints and limitations that must be recognized and considered. We will talk about some of those in the next blog: What is Data Science? Part II


Stites & Associates, LLC (SALLC), is a technology development group founded by Ron Stites in 1996. The scientists and engineers at SALLC strive to help clients succeed at technology development and deployment by using Evidence-Based Decision-Making. Feel free to contact Ron Stites via email at ron@tek-dev.net.
