This article is based on the Master's essay titled “A Survey of Predictive Analytics in Data Mining with Big Data” (Lam, 2014).
Predictive analytics is an emerging term that grew out of the more established domains of Business Intelligence and Data Mining. It is a field derived from Data Mining and underpinned by the centuries-old disciplines of mathematics and statistics.
This article aims to provide a brief overview of the Predictive Analytics field and the common practices involved in this young but evolving discipline. The typical questions that arise are “What is a prediction?” and “How does prediction differ from estimation?”. To answer these questions, one must first understand the differences between explanatory modelling and predictive modelling. The goal of explanatory modelling is to provide descriptive context for a situation, aimed at discovering and explaining the causal links between cause and effect constructs (Shmueli, 2010). Causality plays a central role in explanatory modelling as shown in Figure 1. Predictive modelling, on the other hand, does not fixate on causality but focuses on the association relationships between measurable variables as shown in Figure 2. In other words, predictive modelling is grounded in the notion of measuring the direct and indirect effects of any causes. This is important because measuring the effects of aggregate causes provides a higher degree of prospective indication of future events than explaining the causal effect of any single pair of causal links.
The reason behind the shift of focus from causal links to association relationships is that accurate prediction demands large sets of data, even at the scale of Big Data. Since we are interested only in the association relationships between measurable variables, Big Data provides an abundant source of the data necessary for building a predictive model.
Simply put, the process of prediction is to infer an unknown outcome through the analysis of known outcomes. A definition of Predictive Analytics is shown below:
“To maximize the signal-to-noise ratio through the analysis of Big Data. To use the result of such analysis in combination of the advanced techniques of statistical modelling and the assistance of high performance computing devices, to derive meaningful information that provide a higher-than-guessing accuracy and precision. The derived information is capable of predicting trends and the validated result of each prediction will be used in updating the underlying statistical model continuously and perpetually.” (Lam, 2014)
From Classification To Prediction
Categorization is a very basic methodology that humans use to make sense of the world. Humans use categorization to divide complex situations into meaningful and digestible forms, and simplifying complications is the principal approach in general problem solving. Categorization sets the stage for classification, where concept groups begin to emerge and provide distinguishing classes for object classification. Classification is the foundation of Data Mining because it allows us to group objects that exhibit similarities based on prior knowledge, the knowledge that forms as categorization matures into classification. Which is to say, if a group of objects exhibits a certain degree of shared identifiable traits, and the group itself possesses enough characteristics to distinguish it from other similarly grouped objects, then the characteristics of the group's constituents become the differentiating property of a class.
Once we amass a sufficient amount of classified information and present it as discrete classes, the building blocks for prediction are created. The reason behind this classification-to-prediction transformation is the nature of historic recurrence: new events have a tendency to occur in the same fashion as previously occurred events. Thus, a repetitive pattern can naturally be observed given enough time. Such repetitive patterns serve as high-level predictors of future events, applying to everything from annual influenza virus strain prediction to financial engineering applications. To illustrate the relationship between classification and prediction, we use an insurance company as an example.
Insurance companies employ classification methodology to identify risk levels and to set a fair premium paid by their clients. A client’s attributes affect the premium to be paid in order to be insured. For health insurance, attributes such as age, gender, income level, occupation, marital status, and pre-existing or previous health problems are all examples of factors used in analyzing and classifying individuals into discrete classes with varying risk scores and pricing levels. We can say this is a form of risk-adjusted prediction by the insurance company in order to maintain positive revenue, because an individual who exhibits certain high-risk factors would belong to a class that demands a higher premium to offset the higher perceived risk. These high-risk factors are often determined empirically, which is to say, past experience demonstrated a pattern where individuals with certain attributes are predisposed to a certain degree of risk. Therefore, a prediction is made on the likelihood of certain health risks (e.g. cancer, terminal illness and premature death) when the individual is assigned to a class. The above example illustrates the basic principle of how classification can aid in making predictions.
We have discussed how classification relates to prediction, and many techniques exist that operationalize the classification methodology. The simplest and best known is the decision tree method. The basic form of a decision tree is derived from a binary tree structure with conditional nodes leading to branches, allowing a top-down walkthrough of the nodes for the purpose of classification. This is done by starting at the topmost node and descending the tree hierarchy, where small and incremental decisions are made at each node pertaining to the object attributes in question. When we arrive at one of the leaf nodes, we arrive at the class that the object belongs to. Of course, the above method is a very coarse-grained approach to prediction. The logic is as follows: if an object is classified by a set of predetermined attributes, this object is said to behave similarly to the ways that other objects belonging to the same class have behaved. This is the basis of the classification-to-prediction approach, a simple approach resting on deterministic thinking. However, a deeper dive into the details is important to transcend simple prediction and reach advanced Predictive Analytics.
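As a concrete illustration, the sketch below walks a small decision tree from attributes to a predicted class. It is a minimal example only, assuming scikit-learn is available; the attribute names, training records and labels are hypothetical and not taken from the essay.

```python
# Minimal sketch: classification-to-prediction with a decision tree.
# The attributes and records below are hypothetical; scikit-learn is assumed.
from sklearn.tree import DecisionTreeClassifier

# Each row: [age, smoker (0/1), pre_existing_condition (0/1)]
X_train = [
    [25, 0, 0],
    [30, 0, 1],
    [45, 1, 0],
    [52, 1, 1],
    [38, 0, 0],
    [60, 1, 1],
]
# Risk class assigned to each individual based on past experience.
y_train = ["low", "low", "high", "high", "low", "high"]

# Fit a small binary tree: each internal node tests one attribute, and a
# top-down walk from the root to a leaf assigns the object to a class.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_train, y_train)

# A new individual is predicted to behave like the class it falls into.
print(tree.predict([[48, 1, 0]]))
```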
A Probabilistic Approach
There can never be absolute certainty in prediction, as the basis of prediction rests on statistical inference from what has been known and recorded. As such, a non-deterministic means of thinking about and measuring predictions provides an actionable and pragmatic approach to prediction, that is, stochastic predictive methods: a probabilistic approach to answering questions that rest on the concept of chance.
Suppose a person who uses tobacco products has a 40% probability of developing lung cancer. The 40% figure portrays a non-deterministic outcome of a given cause; it conveys the message that four in ten people would develop lung cancer as a result of consuming tobacco products. Further, it tells us that the 40% figure is based on an aggregate measure of historical events from a sample dataset, a statistical reference derived from statistical analysis. However, what the figure does not convey is why six in ten people will not develop lung cancer while the other four will, for the same reason (i.e. tobacco). This is because we cannot determine with certainty the causality of all possible factors involved. While tobacco use correlates with lung cancer incidence, many factors can contribute to the same effect, such as heredity, prolonged exposure to harmful chemicals, emotional stress and diet. Each individual has a certain degree of susceptibility to developing lung cancer for different reasons and different combinations of reasons, with and without environmental factors. There is no certainty in this measure and, as such, we present the 40% reference figure as a statistical and probabilistic measure based on prior incidents.
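The sketch below illustrates where such a figure comes from: an aggregate frequency computed over historical records. The records here are invented purely for illustration.

```python
# Minimal sketch: a probabilistic prediction derived from historical records.
# The records and the resulting figure are hypothetical and purely illustrative.
records = [
    # (used_tobacco, developed_lung_cancer)
    (True, True), (True, False), (True, True), (True, False), (True, False),
    (False, False), (False, True), (False, False), (False, False), (False, False),
]

outcomes_for_smokers = [cancer for used, cancer in records if used]
p_cancer_given_tobacco = sum(outcomes_for_smokers) / len(outcomes_for_smokers)

# The result is a statistical reference, not a certainty: it says nothing about
# which particular individual will develop the disease, only about the
# aggregate rate observed in the sample.
print(f"P(lung cancer | tobacco use) = {p_cancer_given_tobacco:.0%}")
```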
Augmenting Predictions with Big Data
The factors used for prediction are called predictors, and having more predictors available generally produces more accurate predictions. Yet even a small number of predictors results in an overwhelming number of permutations, arrangements and combinations of factors, as illustrated below. Big Data amplifies this problem: as the number of predictors grows, the association relationships that exist among them begin to multiply.
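A quick back-of-the-envelope count, using only the Python standard library, shows how fast the number of predictor combinations grows; the predictor counts chosen here are arbitrary.

```python
# Minimal sketch: combinatorial growth of predictor interactions.
from math import comb

for n_predictors in (5, 10, 20, 50):
    # Distinct pairwise and three-way combinations whose association
    # relationships could, in principle, be examined.
    pairs = comb(n_predictors, 2)
    triples = comb(n_predictors, 3)
    print(f"{n_predictors:>3} predictors: {pairs:>5} pairs, {triples:>7} triples")
```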
Big Data is known for its three Vs: volume, variety and velocity (Gartner.com, 2014). Each of these properties increases the number of data dimensions available for correlating factors and analyzing relationships. This is both beneficial and detrimental to our ability to sift through the massive amount of data: more data comes with more noise in addition to the signal that would provide us with deeper context. The added data dimensions provided by Big Data embed the contextual information necessary for deep machine learning. For instance, suppose we are to diagnose a patient’s predisposition to Alzheimer’s disease. Genetic and health information are the core factors, while behavioural information is supplementary to the diagnosis. Behavioural information such as dietary habits, the level of physical and cognitive exercise, the size of the patient’s social group, and even the patient’s personal interests could be correlated in many different ways. Not to mention that the biometric data produced by health monitoring devices continues to supplement the overall personal Big Data in real time.
More data means more context, and thus Big Data provides the much-needed context in any predictive modeling endeavor. More is not just more, more is different (Anderson, 1972). Building a context-aware predictive model requires consuming a high degree of multi-dimensional data, contextual information deduced from data outside the set of core factors. Thus, Big Data, together with the criticality of real-time data, enables us to extract the vital information embedded within the many combined sources of data and to discover the underlying correlations and relationships.
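The sketch below contrasts a model trained on core factors alone with one trained on core plus contextual features. The synthetic data, the feature names and the use of scikit-learn's LogisticRegression are illustrative assumptions, not part of the original essay.

```python
# Minimal sketch: augmenting core factors with contextual features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

core = rng.normal(size=(n, 2))      # e.g. genetic and health indicators
context = rng.normal(size=(n, 3))   # e.g. diet, exercise, social activity
# Hypothetical outcome influenced by both core and contextual factors.
signal = core[:, 0] + 0.5 * context[:, 0] + 0.5 * context[:, 1]
y = (signal + rng.normal(scale=0.5, size=n) > 0).astype(int)

core_only = cross_val_score(LogisticRegression(), core, y, cv=5).mean()
augmented = cross_val_score(LogisticRegression(), np.hstack([core, context]), y, cv=5).mean()
print(f"core factors only:      {core_only:.3f}")
print(f"core + contextual data: {augmented:.3f}")
```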
Issues and Trends
The issue with Predictive Analytics and the complementary role of Big Data goes back to the definition at the beginning of this article, where the goal of Predictive Analytics is to maximize the signal-to-noise ratio within any given set of data. The objective reveals the ongoing challenge of maximizing our ability to separate the signal from the noise, in which we have made great advances through improvements in data modeling and algorithms. To that end, the study by Niculescu-Mizil and Caruana (2005) suggested a performance plateau among various commonly used techniques through extensive benchmarking and model calibration. The results indicated that no individual machine learning method provides a significant gain over the others; which is to say, a calibrated SVM has approximately the same predictive performance as an artificial neural network or a decision tree when given the same dataset.
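The sketch below sets up the kind of like-for-like comparison described above: several models, one of them explicitly calibrated, scored on the same dataset. The synthetic dataset, model settings and scoring choices are illustrative assumptions, not the original benchmark.

```python
# Minimal sketch: comparing a calibrated SVM, a neural network and a decision
# tree on the same dataset. Dataset and settings are illustrative only.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "calibrated SVM": CalibratedClassifierCV(SVC(), method="sigmoid", cv=3),
    "neural network": MLPClassifier(max_iter=1000, random_state=0),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
}

# On the same data, well-tuned methods often land in a similar performance band.
for name, model in models.items():
    print(f"{name:>15}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```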
The results of such findings gave rise to the ensemble approach. The basic premise of the ensemble approach is to embrace diversity in aggregate, where a form of collective intelligence can be achieved. Ensemble methods are gaining momentum because the underpinning meta-learning idea of combining weak learners to make a strong learner means existing model performance can be improved by aggregation. The trend of ensemble modeling is exemplified by the winning ensemble of the BellKor’s Pragmatic Chaos team in the Netflix Prize competition (Koren, 2009).
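As a simple illustration of aggregation improving on individual models, the sketch below combines several diverse learners with a soft-voting ensemble; the dataset and member models are arbitrary choices, not the Netflix Prize blend.

```python
# Minimal sketch: aggregating diverse models into a single ensemble predictor.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

members = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
    ("logreg", LogisticRegression(max_iter=1000)),
    ("naive bayes", GaussianNB()),
]
# Soft voting averages the members' predicted class probabilities.
ensemble = VotingClassifier(estimators=members, voting="soft")

for name, model in members + [("ensemble", ensemble)]:
    print(f"{name:>12}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```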
Concept drift is another ongoing challenge that many researchers have faced (Venkatesan, Krishnan, & Panchanathan, 2010). Concept drift refers to changes in the predictive nature of the independent variables underlying the data and model: the data begins to exhibit a shift in the relationships among variables compared with the training data used during the model’s supervised learning process. In other words, the training data used to train a model no longer represents the current data being processed by the model. Concept drift can be detected and handled by different techniques, and AdaBoost is one of the most commonly employed methods.
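One simple way to cope with drift, sketched below, is to monitor the model's error on incoming data and retrain an AdaBoost model on a recent window once the error rises. The synthetic stream, the error threshold and the window size are illustrative assumptions, not the cost-sensitive boosting method of the cited paper.

```python
# Minimal sketch: handling concept drift by monitoring error and retraining
# an AdaBoost model on recent data. Stream, threshold and window are invented.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)

def make_batch(n, drifted):
    X = rng.normal(size=(n, 2))
    # After the drift, the relationship between the variables and the label flips.
    y = (X[:, 0] > 0).astype(int) if not drifted else (X[:, 0] < 0).astype(int)
    return X, y

window_X, window_y = make_batch(300, drifted=False)
model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(window_X, window_y)

for step in range(10):
    X_new, y_new = make_batch(100, drifted=(step >= 5))  # drift begins at step 5
    error = 1.0 - model.score(X_new, y_new)
    print(f"step {step}: error before update = {error:.2f}")
    if error > 0.3:  # training data no longer represents the current stream
        window_X, window_y = X_new, y_new                # keep only recent data
        model = AdaBoostClassifier(n_estimators=50, random_state=0).fit(window_X, window_y)
```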
Anderson, P. W. (1972). More Is Different. Retrieved from http://www.ph.utexas.edu/~wktse/Welcome_files/More_Is_Different_Phil_Anderson.pdf
Gartner.com. (2014, 06 30). Gartner IT Glossary – Big Data. Retrieved from Gartner.com: http://www.gartner.com/it-glossary/big-data/
Koren, Y. (2009). The BellKor Solution to the Netflix Grand Prize. Retrieved from http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
Lam, D. W. (2014). A Survey of Predictive Analytics in Data Mining with Big Data. Edmonton, Alberta, Canada: Athabasca University. Retrieved from https://www.academia.edu/8825157/A_Survey_of_Predictive_Analytics_in_Data_Mining_with_Big_Data
Niculescu-Mizil, A., & Caruana, R. (2005). Predicting Good Probabilities With Supervised Learning. Proceedings of the 22nd International Conference on Machine Learning. Retrieved from http://machinelearning.wustl.edu/mlpapers/paper_files/icml2005_Niculescu-MizilC05.pdf
Shmueli, G. (2010). To Explain or To Predict? Statistical Science. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1351252
Venkatesan, A., Krishnan, N. C., & Panchanathan, S. (2010). Cost-sensitive Boosting for Concept Drift.