The aim of the Allstate Claim Severity project is twofold: first, to predict the severity of insurance claims, and second, to help the company evaluate potential clients based on the importance of the predictors.
The data are provided by the company: a training data set with 188,318 rows and a testing data set with 125,546 rows. The training data set has 131 variables, which include 72 binary categorical variables, 43 non-binary categorical variables with 3-326 levels, 14 continuous variables, and the outcome variable, "loss". Since all the predictor variables are anonymized, no specific information about them is disclosed. There are no missing values.
To begin with, I checked the continuous variables for normality and found that the outcome variable, "loss", is right-skewed, with most of its values falling between 0 and 15,000.
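One way to put a number on "right-skewed" is the moment-based skewness coefficient, which is positive when the long tail is on the right. The sketch below is a plain-Python illustration, not the code used in this project; the `skewness` helper and the synthetic samples are assumptions made for the example and are not drawn from the Allstate data.

```python
import random

def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic stand-ins (assumed for illustration, not the real "loss" column):
# log-normal draws have a long right tail, Gaussian draws are symmetric.
rng = random.Random(0)
skewed = [rng.lognormvariate(7.5, 0.8) for _ in range(10_000)]
symmetric = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

print(round(skewness(skewed), 2))     # clearly positive: right-skewed
print(round(skewness(symmetric), 2))  # near zero: roughly symmetric
```

A value well above zero, as for the first sample, is the kind of signal that would motivate transforming the variable before modeling.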
Next, I mapped the correlations among the continuous variables. Although some of the variables are highly correlated, PCA did not identify any variable that fails to help explain the variation in the data set (in this case, to differentiate between groups of the original explanatory variables), so I kept all the continuous variables.
To prepare the training data set for analysis and modeling, I first applied a log transformation to "loss" to reduce its right skew and bring its distribution closer to normal. Then I dummified (one-hot encoded) the categorical variables and removed the near-zero-variance variables, using a more conservative method to avoid removing important variables. Finally, I split the training data set into training and testing sets at a ratio of 8:2.
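The preparation steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the Caret code used in the project: the ten rows are made up, the helper names `one_hot` and `near_zero_variance` are hypothetical, and the "share of the most common value" rule with a 0.85 cutoff is a stand-in for whatever near-zero-variance criterion was actually applied.

```python
import math
import random

# Ten made-up rows standing in for the anonymized data (not real records).
cat1 = ["A"] * 9 + ["B"]                                    # almost constant
cat2 = ["A", "B", "A", "A", "B", "A", "B", "A", "A", "B"]   # 6 A, 4 B
loss = [500.0, 1200.0, 800.0, 2500.0, 1500.0,
        700.0, 15000.0, 3000.0, 950.0, 4000.0]
rows = [{"cat1": a, "cat2": b, "loss": y} for a, b, y in zip(cat1, cat2, loss)]

def one_hot(rows, column):
    """Return {"column.LEVEL": [0/1, ...]} indicator columns for one categorical column."""
    levels = sorted({r[column] for r in rows})
    return {f"{column}.{lv}": [int(r[column] == lv) for r in rows] for lv in levels}

def near_zero_variance(values, max_share):
    """True when the most common value covers more than max_share of the rows."""
    top = max(values.count(v) for v in set(values))
    return top / len(values) > max_share

# 1. Log-transform the outcome; math.exp() maps predictions back to the loss scale.
log_loss = [math.log(r["loss"]) for r in rows]

# 2. Dummify the categorical columns.
dummies = {}
for col in ("cat1", "cat2"):
    dummies.update(one_hot(rows, col))

# 3. Drop near-constant indicator columns (a higher cutoff removes fewer columns).
kept = {name: vals for name, vals in dummies.items()
        if not near_zero_variance(vals, max_share=0.85)}

# 4. Shuffle the row indices and split them 80/20.
idx = list(range(len(rows)))
random.Random(42).shuffle(idx)
cut = int(0.8 * len(idx))
train_idx, test_idx = idx[:cut], idx[cut:]

print(sorted(kept))                    # the cat1 indicators are filtered out
print(len(train_idx), len(test_idx))
```

Raising the cutoff toward 1.0 makes the filter more conservative, in the sense that only columns that are almost entirely one value get dropped.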
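The outcome variable, "loss", is continuous, so this is a regression problem. I chose to use RMSE to evaluate model accuracy. RMSE, or Root Mean Squared Error, is the square root of the mean squared difference between the predictions and the observations; it can be read as a typical deviation of the predictions from the observations, with large errors weighted more heavily than small ones. It is useful for getting a rough idea of how well (or not) an algorithm is doing.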
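To make the definition concrete, here is a minimal sketch of the RMSE calculation in plain Python, with the mean absolute error shown next to it for comparison. The numbers are made up for the example, and the function names are illustrative rather than taken from any library.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error: sqrt of the mean squared difference."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(observed, predicted):
    """Mean absolute error, for comparison with RMSE."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

# Made-up values; the errors are 1, 0, -2 and -1.
observed = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

print(rmse(observed, predicted))  # sqrt(6 / 4), about 1.2247
print(mae(observed, predicted))   # 1.0
```

The RMSE is never smaller than the mean absolute error, because squaring gives the single error of 2 more weight. Note also that if a model is trained on log-transformed "loss", its RMSE is on the log scale unless the predictions are exponentiated first.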
The most commonly used algorithms for regression problems are linear regression, ridge regression, lasso regression, random forest, and boosting. The Caret package is one of the best tools for tuning algorithm parameters and getting the best performance out of the algorithms. Its default metric for regression is RMSE. Since the data set is not too small, I used cross-validation with 3 repeats via Caret's control function (trainControl).
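Repeated cross-validation simply reruns k-fold cross-validation several times with a fresh shuffle each time, so every candidate parameter setting is scored on more resamples. The sketch below shows the index bookkeeping in plain Python; it is not Caret's implementation, the `repeated_kfold` helper is hypothetical, and the choice of 5 folds is an assumption (only the 3 repeats come from the text above).

```python
import random

def repeated_kfold(n_rows, k, repeats, seed=0):
    """Yield (repeat, fold, train_idx, valid_idx) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(n_rows))
        rng.shuffle(idx)                         # fresh shuffle for each repeat
        folds = [idx[i::k] for i in range(k)]    # k near-equal, disjoint folds
        for f, valid in enumerate(folds):
            # Training rows are every fold except the one held out.
            train = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield rep, f, train, valid

# Assumed sizes for illustration: 20 rows, 5 folds, 3 repeats.
splits = list(repeated_kfold(n_rows=20, k=5, repeats=3))
print(len(splits))  # 15 resamples: 5 folds x 3 repeats
```

Each setting would then be fit on every `train` index set and scored with RMSE on the matching `valid` set, and the scores averaged across all resamples.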
The following are two examples of the training setup for gradient boosting and extreme gradient boosting models in Caret:
Results and Summary
As of this writing, Caret is still tuning the random forest model. Based on the results aggregated so far and the selected tuning parameters, xgbLinear looks to be the best fit for the data set because it has the smallest RMSE on the test set.
Using the feature importance function, I found that cat80.D, cat79.D, cat12.A, cat80.B and cont14 are the most important predictor variables in the model.
I'll let Caret keep tuning the random forest model, and when it completes the job, I'll combine the predictions of the different Caret models using the caretEnsemble package. Although this will increase the computational time and complexity of building the models, it can also improve predictive accuracy and help the company better predict loss severity and evaluate potential clients.
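The intuition behind combining models can be shown with a simple weighted average of predictions. This is a plain-Python sketch with made-up numbers, not the procedure caretEnsemble itself follows; the `blend` helper and the two prediction vectors are hypothetical and were chosen so that the models' errors partly cancel.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def blend(prediction_sets, weights=None):
    """Weighted average of several models' predictions (equal weights by default)."""
    n_models = len(prediction_sets)
    weights = weights or [1.0 / n_models] * n_models
    return [sum(w * p for w, p in zip(weights, preds)) for preds in zip(*prediction_sets)]

# Made-up log-scale values; the two models tend to err in opposite directions.
observed = [7.0, 8.0, 6.5, 9.0]
model_a = [7.4, 7.7, 6.9, 8.6]
model_b = [6.8, 8.4, 6.2, 9.3]
blended = blend([model_a, model_b])

print(rmse(observed, model_a))
print(rmse(observed, model_b))
print(rmse(observed, blended))  # lower than either model in this example
```

The gain shown here depends on the models making different mistakes; averaging models whose errors are strongly correlated helps much less.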