Semi-automated variable selection in predictive analysis
- Intro
In predictive analysis it is common to be faced with a large number of variables, any of which could be useful, irrelevant, or even counterproductive if included in the final predictive model. A good variable selection simplifies the model, improves generalization for further use, and makes model generation faster and more cost-effective. In practice a data set to be analyzed can consist of several thousand variables or more. Performing a careful manual selection on data sets of this kind costs a substantial number of work hours and is prone to varying degrees of human error. To tackle this problem efficiently, we decided to build on the capabilities of Oracle R Enterprise and arrive at a reliable and efficient approach. The idea is to automate the whole process of grading variable usability and preselecting variables, which are then presented to the user for a final selection. This solution was developed and implemented while working with a customer from the telecommunications sector, but it can be used for variable selection in any other setting. The customer requested a churn model based on data consisting of tables with more than 200 variables. Our semi-automated selection approach resulted in a substantially smaller number of variables while preserving the model accuracy. By using Oracle R Enterprise we also surpass the limitations of R itself and are able to generate results for tables of up to several thousand variables in a matter of minutes.
- The solution
Our solution is iterative and starts with the single variable that has the largest attribute importance. The remaining variables are then considered one by one. Depending on the user's input, variables with attribute importance equal to 0 are either used or discarded. For each selection of variables a predictive model of the preselected type is built and tested on the build set, and subsequently cross-validated on a specified cross-validation set. Oracle R Enterprise lets us use in-database predictive model generation (ore.odmGLM for GLM, ore.odmNB for Naive Bayes, ore.odmSVM for Support Vector Machine and ore.odmDT for Decision Tree), which significantly speeds up the process and handles situations more complex than R could manage on its own. The resulting solution presents the user with a clearly outlined usability of all the available variables with non-zero attribute importance, as well as a selection of variables graded as potentially usable. Minimum usability is defined as ΔAUC larger than 0, where ΔAUC is the AUC of the predictive model including the selected variable minus the AUC of the model without it. The results are made available in the form of an ORE data frame, which can be used with the helper functions m.VarSelectPlot and/or m.VarSelectXLSX to present them in graphical or spreadsheet form for maximal usability and possible further analysis. With this in mind the user can seamlessly choose how many and which variables to use in order to maximize the predictive capability of the resulting model while minimizing the number of variables. With only a few parameters to consider, the whole process of semi-automated variable selection can be done in a matter of minutes; a sketch of the underlying grading loop follows the list below. The user only has to keep a couple of things in mind, mainly:
1) What kind of model the user wishes to build (GLM, Decision tree, SVM or Naive Bayes)
2) Should variables with attribute importance equal to 0 be taken into account
3) Is a cross-validation set provided, or should it be taken from the build set
4) The type of sampling for obtaining the cross-validation set, if it is to be taken from the build set
5) Whether to observe all the available variables or only the ones that show a positive change in model AUC (area under the curve) when added to a predictive model
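To make the grading criterion concrete, the sketch below shows a minimal, client-side R version of the forward-selection loop. It uses a plain glm() as a stand-in for the in-database ore.odm* algorithms, a simple rank-based AUC helper, and hypothetical inputs (build, valid, ranked_vars, a 0/1 target), so it only illustrates the ΔAUC > 0 rule rather than the actual m.VarSelect implementation:

# Rank-based (Mann-Whitney) AUC for a 0/1 label
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Illustrative forward selection: 'ranked_vars' is assumed to be ordered by
# decreasing attribute importance; a variable is kept only if adding it
# increases the AUC on the validation set (ΔAUC > 0).
select_vars <- function(build, valid, tgt, ranked_vars) {
  selected <- character(0)
  best_auc <- 0.5                                   # assumed baseline for the empty model
  for (v in ranked_vars) {
    fml  <- reformulate(c(selected, v), response = tgt)
    fit  <- glm(fml, data = build, family = binomial)   # stand-in for ore.odmGLM
    pred <- predict(fit, newdata = valid, type = "response")
    new_auc <- auc(pred, valid[[tgt]])
    if (new_auc - best_auc > 0) {                   # ΔAUC > 0: keep the variable
      selected <- c(selected, v)
      best_auc <- new_auc
    }
  }
  selected
}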
- Input
The current version of our variable selection approach is implemented in the function m.VarSelect, supported by the functions m.VarSelectPlot and m.VarSelectXLSX, and is meant to be run in embedded execution mode. A typical call of the m.VarSelect function is presented below:
R> results <- ore.doEval(FUN.NAME = "m.VarSelect", model = "GLM", build = "TABLE", attribute_importance = TRUE, tgt = "TGT", columns_to_skip = c("USER_ID", "SERIAL_NUMBER"), sample_type = "ST", c_per = 0.3, ore.connect = TRUE)
More precisely, the arguments passed to m.VarSelect are:
build | A set on which to build the models that are needed to grade the usefulness of individual variables |
model | Choice of the type of model: GLM (generalized linear model), DT (decision tree), NB (Naive Bayes) or SVM (support vector machine) |
tgt | Name of the target column |
cross_valid | Name of the cross-validation set (if not to be extracted from the build set) on which the models created with the selected variables are to be tested |
attribute_importance | Whether to use only columns with positive attribute importance (TRUE/FALSE) |
columns_to_skip | Columns from the build set to skip, if any |
c_per | The fraction of the build set to be taken for cross-validation if a cross-validation set has not been specified (0 <= c_per <= 1). |
sample_type | The kind of sampling to use when/if taking a cross-validation set out of the build set: ST (stratified) or R (random sampling). |
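As an illustration of the c_per and sample_type arguments, the following local R sketch splits off a cross-validation set either at random or stratified by the target; the real function performs this step in-database, so the helper name split_cv and its details are ours, not part of m.VarSelect:

# Illustrative split of a build set into build/cross-validation parts.
# 'data' is a data frame, 'tgt' the target column name, 'c_per' the fraction
# to hold out, 'sample_type' either "R" (random) or "ST" (stratified).
split_cv <- function(data, tgt, c_per = 0.3, sample_type = c("ST", "R")) {
  sample_type <- match.arg(sample_type)
  n <- nrow(data)
  if (sample_type == "R") {
    idx <- sample(n, size = floor(c_per * n))        # simple random sample
  } else {
    # stratified: sample c_per of the rows within each target class
    idx <- unlist(lapply(split(seq_len(n), data[[tgt]]), function(rows)
      sample(rows, size = floor(c_per * length(rows)))))
  }
  list(build = data[-idx, , drop = FALSE],
       cross_valid = data[idx, , drop = FALSE])
}

# Hypothetical usage: sets <- split_cv(churn_data, tgt = "TGT", c_per = 0.3, sample_type = "ST")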
- Output
The output is an ORE data frame that allows for further use with R and Oracle R Enterprise, or for visualization and analysis through the provided addon functions m.VarSelectPlot and m.VarSelectXLSX. It contains the columns listed below:
sample_size | Size of the sample BUILD set |
var_number | Number of variables used for the current predictive model, counting only the variables automatically determined to be useful (ΔAUC > 0, attribute importance > 0) |
var_number_full | Number of variables used for the current predictive model under the attribute importance > 0 restriction only |
added_variable | Name of the variable added when building the current model (ΔAUC > 0, attribute importance > 0) |
added_variable_full | Name of the variable added when building the current model under the attribute importance > 0 restriction only |
R2_build | R-squared of the model on the build set |
R2_test | R-squared of the model on the cross-validation set |
auc_build | Area under the curve for the current model on the build set (ΔAUC > 0, attribute importance > 0) |
auc_build_full | Area under the curve for the current model on the build set under the attribute importance > 0 restriction only |
auc_test | Area under the curve for the current model on the cross-validation set |
m_start | Start time of building the predictive model |
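As a usage sketch, the result of the ore.doEval call can be pulled to the client and the per-variable contribution on the cross-validation set derived from the columns above; the baseline AUC of 0.5 for the empty model and the cut-off of 0.001 are illustrative assumptions, not values used by m.VarSelect:

# Pull the embedded execution result to the client as a local data frame
res <- ore.pull(results)

# Derive the ΔAUC contributed by each added variable on the cross-validation set,
# assuming one row per added variable ordered by var_number and an AUC of 0.5
# for the empty model
res <- res[order(res$var_number), ]
res$delta_auc_test <- c(res$auc_test[1] - 0.5, diff(res$auc_test))

# Illustrative cut: keep only variables whose test-set contribution exceeds 0.001
keep <- res$added_variable[res$delta_auc_test > 0.001]
print(keep)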
- Addon functions
The functions m.VarSelectPlot and m.VarSelectXLSX are also useful; they allow the user to present the information from the resulting ORE data frame as a line graph (in the case of m.VarSelectPlot) or a spreadsheet (in the case of m.VarSelectXLSX). This is achieved by using the R packages "ggplot2" and "xlsx" directly with Oracle R Enterprise.
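For illustration only, a line graph similar to the one produced by m.VarSelectPlot can be built directly from the pulled results with ggplot2; the column names follow the output table above, while the styling and exact content of m.VarSelectPlot may differ:

library(ggplot2)

# 'res' is the pulled results data frame from the previous sketch.
# Plot the model error (1 - AUC) on the build and cross-validation sets
# against the number of selected variables.
ggplot(res, aes(x = var_number)) +
  geom_line(aes(y = 1 - auc_build, colour = "build")) +
  geom_line(aes(y = 1 - auc_test,  colour = "cross-validation")) +
  labs(x = "Number of variables", y = "Error (1 - AUC)", colour = "Data set")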
- Example
It is interesting to observe the variable selection process in the case of our customer from the telecommunications sector, since it was tackled with both the manual and the semi-automated approach. The customer presented more than 200 variables to be used in a predictive churn model. Manual selection resulted in 75 variables being used. With the semi-automated selection, after the first pass of m.VarSelect, which was computationally the most demanding and took under 5 minutes, we were left with a model of only 31 variables. More importantly, compared to the model with 75 manually selected variables this model showed no noticeable loss of predictive power, and the plotted results suggested that the number of variables could be reduced further. On the line graph one can observe the impact of each added variable on the predictive power of the model in the form of the error (1 - AUC).
Further dismissal of variables that showed no significant impact on the model resulted in a selection of 18 variables, as picked from the plot of the second pass of the m.VarSelect function:
A third and final pass showed the possibility of using a predictive model with 12 or even fewer variables while incurring only a minor loss in predictive power:
When considering these options, it was decided to proceed with the model consisting of 18 variables. This model was then put to compete against the manually selected 75-variable model on the churn data sets over the following 4 months. The results conclusively showed no significant differences in predictive power between these two models.
- Conclusion
In conclusion, considering the time lost to manual selection, the cost-effectiveness of model generation, the simpler model interpretation, and the better generalization and thus reduced risk of overfitting, it seems better to use semi-automated variable selection, if not as the primary means of variable selection then at least as a fast and effective means of reviewing a manual variable selection.
This approach can also be supplemented with our other Oracle R Enterprise solutions, such as polynomial level selection with m.PolynomeSelect and/or build sample size selection with m.SampleSelect, for further improvements to the predictive model building process as well as to the predictive model itself.