My First Machine Learning Project
Posted by Austin Cheng
Updated: Dec 16, 2019
The Beginning of My Machine Learning Journey
In this blog I will walk through how my teammates (Aron, Ashish, Gabriel) and I approached our very first machine learning project. The purpose of this blog is mostly record-keeping: tracking my journey as an aspiring data scientist, and noting down the thought process and reasoning behind the steps taken to arrive at our predictive models. I will keep the reasoning as general as possible because the intention is to establish a generalized workflow that I can build on. The ultimate goal is to someday return to this data set, apply better predictive models, see what I would have done differently, and measure my own growth as a data scientist.
The Data Set
The data set was taken from kaggle.com. It consists of 79 features describing practically everything about houses in Ames, Iowa. The data set is meant as a toy example for aspiring machine-learning practitioners to play with. The main lesson to be learnt from it is that simple linear models can be very powerful and, in the right scenarios, can easily out-perform high-complexity models. In the following, I will describe the workflow we followed to tackle this data set, and verify that linear models should indeed always be in one's arsenal.
Workflow for Data Pre-Processing
Data Pre-Processing and Transformation
We adhered to the advice we were given right away: transform the target variable (sale price) into one that follows a normal distribution, and remove outliers. The former is important because it ensures that the residuals of the target variable will be normally distributed (the underlying assumption of linear inference models), and the latter ensures that our model result does not get skewed (or wrongly biased) by anomalous observations, particularly those with high influence and leverage. Below we illustrate the log transformation (our manual Box-Cox transformation):
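The transformation can be sketched as follows. This is a minimal illustration on synthetic prices (the real data uses the `SalePrice` column); `log1p` is the log transform we applied, equivalent to a Box-Cox transform with lambda of zero.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Toy stand-in for sale prices: right-skewed, like real house prices.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=1000))

# log(1 + x): our manual Box-Cox transformation of the target.
log_prices = np.log1p(prices)

# The transformed target is far less skewed than the original.
print(f"skew before: {skew(prices):.2f}, after: {skew(log_prices):.2f}")
```

At prediction time the transform is inverted with `np.expm1` so the submitted values are back in dollars.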
One step we avoided was transforming the features so that they, too, would become normally distributed. Machine learning models can potentially benefit from normally distributed features, but this would compromise the interpretability of the resulting model. For this reason we chose not to pursue it and instead moved on to treating outliers. Below, we show the effect of removing outliers for a particular variable:
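As a sketch of the outlier treatment, here is a hypothetical filter on a toy frame. The column names match the Ames data, but the rows and the exact thresholds are illustrative, not the ones we actually used:

```python
import pandas as pd

# Toy frame: two houses with huge living area but anomalously low price
# are the kind of high-leverage observations we removed.
df = pd.DataFrame({
    "GrLivArea": [1500, 2100, 4676, 5642, 1800],
    "SalePrice": [200000, 280000, 184750, 160000, 230000],
})

# Flag high-leverage points (hypothetical thresholds) and drop them.
mask = (df["GrLivArea"] > 4000) & (df["SalePrice"] < 300000)
df_clean = df.loc[~mask].reset_index(drop=True)
print(len(df_clean))  # 3 rows survive
```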
Missingness and Imputation
The second step, where a large portion of the time was spent, was treating missingness. Imputation is tricky because it requires deeper insight into each feature. Whether to impute with the mean, median, mode, zero, or "none", or simply to remove the observation or feature itself, depended on guidelines we predetermined and thought acceptable. This is where a lot of human intuition is used. Below we show a qualitative summary of the missingness.
I will not delve into the specifics of how we treated the missingness of each variable (the reader can refer to our code posted on GitHub for the exact treatment) but will instead briefly go over the general idea. First, any variable with over 95% definite missingness can in principle be safely discarded, but we must take caution before doing so because the missingness may not be actual missingness. Drawing on the correlation between the missingness of different variables, we could deduce what some of it meant. For instance, the highly correlated missingness across the garage variables speaks to the likelihood that these houses simply do not have a garage. A variable that is almost completely missing is pool area; here we take missing pool-area information to mean that the house has no pool. These examples give a flavor of how we treated variables with significant levels of missingness (in general we picked the conservative option, which was to keep any information we could). For variables where missingness was relatively low (less than 5% of observations), we chose to impute with the mean if the variable was continuous (or ordinal) and with the mode if the variable was categorical. The reasoning behind mean imputation is that the imputed data does not alter the fitted slopes and hence does not bias the model result. As for the mode or median (for categorical or numeric variables respectively), there is no particularly good reason except to assume that the observations belong to the most representative groups. There may be flaws in this choice, but sometimes convenience outweighs precision, especially when the amount of missingness is small (the counts for these features were in the low tens). To be more precise in this imputation process, I would impute based on k-nearest neighbors or other machine learning models.
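The three imputation cases above can be sketched on a tiny frame. The column names are real Ames columns, but the rows are made up for illustration; see our GitHub code for the exact per-variable treatment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GarageType": ["Attchd", np.nan, "Detchd", np.nan],  # categorical
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],           # continuous, low missingness
    "PoolArea": [np.nan, np.nan, 512.0, np.nan],         # almost entirely missing
})

# Correlated garage missingness -> "no garage", encoded as its own level.
df["GarageType"] = df["GarageType"].fillna("None")

# Low missingness on a continuous variable -> mean imputation,
# which leaves the fitted slopes unchanged.
df["LotFrontage"] = df["LotFrontage"].fillna(df["LotFrontage"].mean())

# Missing pool area taken to mean "no pool" -> impute zero.
df["PoolArea"] = df["PoolArea"].fillna(0)

assert df.isna().sum().sum() == 0  # no missingness remains
```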
Another widely accepted imputation method is to impute with a grossly outlying number such as -999 (when all observations are real positive numbers). However, this imputation does not work with inference models, where an analytical equation is fitted. Because of this, we avoided the -999 imputation.
First Round of Feature Selection
The curse of dimensionality is often preached to us. High dimensionality likely means collinear variables, which cause inaccuracies in the fitted coefficients as well as high variance. High dimensionality may also mean sparsity in the data, or an unnecessary number of features, which can cause overfitting. Both of these are highly undesirable because they lead to a poorly performing model.
Correlation Investigation: Getting Rid of Multi-Collinearity
The first attempt in feature selection is driven by the need to reduce multi-collinearity within the system. The methodology is to perform correlation investigation while either combining or removing features. Below we show the correlation plots before and after the treatment of multi-collinearity:
One can see that the correlation (represented by dark blue) is drastically reduced. This was achieved through the removal and/or combination of features. A guide that helped us judge whether we were making the right decisions was the constant evaluation of the R-squared of the features:
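One simple version of this pruning can be sketched as a correlation-threshold filter. This is not our exact procedure (we also combined features by hand), and the 0.9 cutoff and synthetic data are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data where GarageArea is nearly a multiple of GarageCars.
rng = np.random.default_rng(0)
n = 500
garage_cars = rng.integers(0, 4, n).astype(float)
df = pd.DataFrame({
    "GarageCars": garage_cars,
    "GarageArea": garage_cars * 250 + rng.normal(0, 30, n),  # collinear
    "GrLivArea": rng.normal(1500, 400, n),                   # independent
})

# Drop one feature from every pair whose |correlation| exceeds 0.9.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # the redundant garage feature is removed
```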
There are categorical variables whose sub-categories can be clustered together. Below we show an example:
In this plot we can see that all the irregular sub-categories (IR1 to IR3) have means very close to each other but far from regular (Reg). This is a hint that maybe we should cluster the IR's together to reduce dimensionality after dummification.
In this particular example, it may be beneficial to group all the irregulars (IR1 to IR3) into one big subcategory: after dummification, the feature space stays relatively small compared to leaving the subcategories unclustered. The clustering was not done manually but with K-means (despite it being an unsupervised method), clustering the subcategories according to a variable correlated with the target (in this data set we used Gr living area).
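A minimal sketch of this idea, assuming we cluster on the mean Gr living area of each subcategory (the per-level means below are made-up numbers shaped like the plot above):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical mean GrLivArea per LotShape level: the three irregular
# levels sit close together, far from the regular level.
sub_means = pd.Series({"Reg": 1400.0, "IR1": 1750.0, "IR2": 1780.0, "IR3": 1820.0})

# K-means on the 1-D means groups similar subcategories together.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(sub_means.to_numpy().reshape(-1, 1))
grouping = pd.Series(labels, index=sub_means.index)
# All IR levels land in one cluster, "Reg" in the other; the IR levels
# can then be merged before dummification.
```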
Note on Feature Engineering: Which Arithmetic Operation to Use?
Feature engineering can be done through interaction. An interaction can be expressed as some arithmetic operation on two or more features. What we learnt is that one must take heed in choosing the type of operation: multiplication and addition, for instance, can make a drastic difference in the final model result. A good guideline we arrived at is that one must always obey the natural physical units of the variables. For instance, garage count and garage area, if combined, should be combined through multiplication and not addition. Addition in this case would not make physical sense, and indeed a test of the two operations showed that multiplication resulted in a drastic decrease in VIF whereas addition did not.
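The garage example can be sketched as follows; the combined column name is hypothetical, and the rows are toy values:

```python
import pandas as pd

df = pd.DataFrame({"GarageCars": [1, 2, 2, 3],
                   "GarageArea": [280, 520, 480, 820]})

# Combine by multiplication, respecting the variables' units: a count
# times an area reads as a total garage "capacity". Adding a count to
# an area mixes units and, in our tests, did not reduce the VIF.
df["GarageCapacity"] = df["GarageCars"] * df["GarageArea"]
df = df.drop(columns=["GarageCars", "GarageArea"])  # replace the pair
```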
Another example that deserves description is the behavior of each neighborhood.
Looking at the neighborhood plots, we can see that each neighborhood behaves distinctly and follows a very well-defined trend. The neighborhoods warrant their own models. To achieve this, we created a switch-like interaction parameter by multiplying the dummified neighborhood categories by Gr living area. This way, instead of being a simple intercept shifter (which is what categorical variables are in generalized linear models), each neighborhood can have its own set of coefficients-- its own equation. Implementing this feature engineering led to an improvement (a drop) in our Kaggle ranking.
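The switch-like interaction can be sketched in a few lines (toy rows; the `Nbhd_` prefix is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "Neighborhood": ["CollgCr", "OldTown", "CollgCr"],
    "GrLivArea": [1710.0, 1262.0, 1786.0],
})

# One dummy column per neighborhood, each multiplied by living area.
# The dummy acts as a switch: the slope on each column is active only
# for rows in that neighborhood, giving each neighborhood its own slope.
dummies = pd.get_dummies(df["Neighborhood"], prefix="Nbhd")
interactions = dummies.multiply(df["GrLivArea"], axis=0)
```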
Our pipeline can be summarized as follows:
The data set is split into a train and test set where the train set is then sent into five models: three linear (Lasso, Ridge, Elastic Net) and two nonlinear (Random Forest, Gradient Boosting). An extensive grid search is performed for each model where the best hyperparameters are chosen. With the best hyperparameters, we use the models to predict on the test set and compare the test scores. The following shows a summary of the initial feature engineering performed using the pipeline outlined above:
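The pipeline above can be sketched as follows. The grids here are deliberately tiny stand-ins for the extensive search we actually ran, and the synthetic data replaces the housing features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the processed housing data.
X, y = make_regression(n_samples=300, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three linear and two nonlinear models, each with a (toy) grid search.
searches = {
    "lasso": GridSearchCV(Lasso(), {"alpha": [0.001, 0.01, 0.1]}),
    "ridge": GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "enet": GridSearchCV(ElasticNet(), {"alpha": [0.001, 0.01],
                                        "l1_ratio": [0.2, 0.8]}),
    "rf": GridSearchCV(RandomForestRegressor(random_state=0),
                       {"n_estimators": [50, 100]}),
    "gbm": GridSearchCV(GradientBoostingRegressor(random_state=0),
                        {"learning_rate": [0.05, 0.1]}),
}

# Fit each search on the train split, then score on the held-out test split.
test_scores = {}
for name, search in searches.items():
    search.fit(X_train, y_train)
    test_scores[name] = search.score(X_test, y_test)  # R^2 on test set
```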
Many types of feature engineering and selection were tried, but apart from the ones shown above (up to data set C, where the feature engineering steps are performed sequentially on top of each other, starting with A), all yielded a worse Kaggle ranking, though our own test MSE score was not always consistent with the Kaggle ranking. Below we show the results for data sets A through D:
We can see here that the elastic net has a slight edge over all other models. The linear models all perform much better than the nonlinear tree models. This is a good verification of the statement from the beginning, that linear models will always have their place. In this particular data set, the target variable behaves largely linearly with the features, which gives the linear models good reason to outperform the nonlinear ones. All that being said, even with the linear models, our Kaggle and MSE scores could certainly still be improved. We know this from the plots below:
The plots above show that a huge discrepancy exists between the test and train MSE scores. For a tree model this may make sense, because tree models tend to overfit (though of course the point of random forest is precisely to mitigate this problem); however, the penalized linear models should have mitigated this problem... and they did not. This means that we can definitely improve on our feature selection and engineering. However, we tried numerous feature selection actions, also taking into account the suggestions from the feature importances shown below, and they all gave negative feedback.
Seeing the futility of manual feature engineering, we chose to improve our models by brute force, recursively eliminating features. The idea is demonstrated below:
The optimal number of features is indicated by the position where the test error suddenly jumps. With this recursive method, we were able to further improve our MSE score:
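scikit-learn automates this recursive elimination; a minimal sketch with `RFECV` is below. Note this picks the feature count by cross-validation rather than by eyeballing the jump in test error as we did, and the synthetic data is a stand-in:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Ridge

# 5 informative features buried among 20: RFECV repeatedly drops the
# feature with the weakest coefficient and keeps the subset whose
# cross-validated score is best.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5, random_state=0)
selector = RFECV(Ridge(), step=1, cv=5).fit(X, y)
print(selector.n_features_)  # number of features retained
```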
Finally, we chose to put everything together by ensembling all the different models. We did it as follows:
The ensembling is simply a linear combination of the predicted values of the different models. The weights of the different models are chosen as the set of weights that minimizes the test error score. Submitting our final result to Kaggle, we got a final score of 0.1214.
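The weight search can be sketched as a small constrained optimization. The held-out predictions below are synthetic stand-ins for our five models' outputs, and restricting to non-negative weights summing to one is an assumption on our part:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical held-out predictions from three models (one per column)
# and the true targets, on the log-price scale.
rng = np.random.default_rng(0)
y_true = rng.normal(12, 0.4, 200)
preds = np.column_stack([y_true + rng.normal(0, s, 200)
                         for s in (0.10, 0.12, 0.20)])

def ensemble_mse(w):
    """Test MSE of the weighted linear combination of model predictions."""
    return np.mean((preds @ w - y_true) ** 2)

# Minimize test MSE over weights constrained to a convex combination.
n = preds.shape[1]
res = minimize(ensemble_mse, x0=np.full(n, 1 / n),
               bounds=[(0, 1)] * n,
               constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
weights = res.x  # the ensemble never does worse than the best single model
```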
New Things to Try and My Conclusion
This being our first machine learning project, we certainly learnt a lot. First and foremost, we saw with our own eyes the power of linear models. That was a fact we saw coming. The second and tougher lesson was the limitation of human intuition. The many hours of futile feature engineering are a memorable lesson for us. In these machine learning problems, there should always be a balance between reliance on human intuition and reliance on the machinery. We wasted too much time being faithful to the data set, trying to figure out what was statistically significant or not, and being too hesitant to drop features. These actions could have been good... had we been decisive, but the conclusions from EDA and statistical tests are never black and white-- they seldom lead to actionable responses. What we should have done is look more quickly into the feature importances given by the linear and nonlinear models, comparing each importance against a random dummy variable. At the same time, we should have spent more time performing PCA on subsets of associated features: we clearly still suffered from multi-collinearity at the end, despite all the manual feature engineering. We needed to be cleverer with the machine learning techniques. And so the lesson is clear to us. Certainly, we will be much better next time around.
Austin is an experienced researcher with a PhD in applied physics from Harvard University. His most notable work is engineering the first single electronic guided mode and explaining it with computational simulation. He is passionate about the growing field of artificial intelligence and is currently looking to contribute to data-driven companies within the technology sector.