The objective of the following research is to predict the number of inquiries a new listing receives in NYC based on the dataset provided by RentHop. Identifying the level of interest using multiple features for each listing would assist RentHop in the attainment of the following business targets:
- Optimize the way RentHop handles fraud control
- Identify potential listing quality issues more easily
- Allow owners and agents to better understand renters’ needs and preferences.
RentHop data comprises 49,352 observations in the training dataset and 74,659 in the official test dataset, covering rental listings in New York City from April to June 2016. Each listing has 14 explanatory variables describing characteristics of the property, as presented below:
- 3 float type variables: number of Bathrooms, Latitude and Longitude
- 3 integer type variables: number of Bedrooms, listing_id and listing price
- 6 string type variables: building_id, date listing created, description, display address, street address and manager_id
- 2 list type variables: features and photos.
The response variable, interest level, is categorical with three classes: "high interest", "medium interest" and "low interest". As visualized below, "low interest" is the most frequent class in the training set, accounting for almost 70% of listings.
The primary evaluation metric is multi-class log-loss; the formula is shown below. Secondary measures were used depending on the model tested. Multi-class log-loss measures the difference between the distribution of actual labels and the probabilities predicted by the classifier. A perfect classifier that assigns probability 1 to the true class of every observation has a log-loss of 0, while a classifier that assigns a uniform probability of 1/k to each of the k labels has a log-loss of -log(1/k), which equals log(k).
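$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij})$$
Where $N$ is the number of listings in the test set, $M$ is the number of class labels (3 classes), $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.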
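As a rough illustration of the metric, the sketch below computes multi-class log-loss in plain Python and checks the uniform-probability baseline described above. The function name, the integer encoding of the classes and the clipping constant `eps` are assumptions made for this example, not part of the original analysis.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices (0..M-1), one per observation.
    probs:  list of per-observation probability lists, each of length M.
    Because the true labels are one-hot, only the true-class term of the
    inner sum over classes is non-zero.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model is confidently wrong.
        p = min(max(row[label], eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Toy labels: 0 = low, 1 = medium, 2 = high (an encoding assumed here).
y_true = [0, 0, 1, 2]

# A classifier that spreads probability evenly over the three classes.
uniform = [[1 / 3] * 3 for _ in y_true]
baseline = multiclass_log_loss(y_true, uniform)

# A classifier that puts all probability on the correct class.
perfect = [[1.0 if j == label else 0.0 for j in range(3)] for label in y_true]
best = multiclass_log_loss(y_true, perfect)
```

With three classes the uniform baseline equals log(3), roughly 1.0986, so any useful model should score below that value.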
Two points are worth considering at this early stage with regard to maximizing prediction accuracy and minimizing generalization error:
- The predictive model must, from the outset, minimize the misclassification error on the "low interest" class.
- The predictive model must identify the nuances between "high interest" and "medium interest" listings in order to gain an edge in prediction accuracy.
New York Apartments for Rent: A Brief Industry Overview
- Listing Type Information: A key competitive aspect of the NYC apartment rental market is the role of licensed real estate agents, also known as brokers, who collect commissions called broker fees. While some listings are no-fee apartments, for the majority of rentals the tenant pays the broker fee. Finally, many rental brokers show open listings, meaning they do not have exclusive rights to them, so the same apartment may also be in the inventory of a competing broker. Regrettably, the data provided by RentHop does not include a feature describing the type of agent who posted the listing, which would undoubtedly have added significant value to our predictive model.
- Seasonality: The apartment rental market normally softens toward the end of the season (around September), with rental prices trending downward. However, the New York City rental market differs from those of other big cities due to a structural supply deficit that mitigates this seasonal effect, even though it does not eliminate it completely. The data provided by RentHop only spans April to June 2016, which precluded assessing the seasonal impact on listings' interest levels.
- Summer Season: Since the subprime crisis of 2007, personal income growth has lagged rental rates, eroding tenants' affordability. The cumulative effect built up to a state where vacancy rates rose every month from July to September of 2015, a pattern that repeated in the 2016 summer season. This summer effect has become more significant of late, yet the lack of data for this time span prevented the team from testing it.
EDA: Highlights and New Features Engineering
The team's Exploratory Data Analysis (EDA) derived new features from the raw data in order to test specific ex-ante hypotheses. Some of these new variables proved really useful in the models, while others failed to explain differences in interest level:
Bedrooms: Intuitively, this variable should carry some predictive value in situations of apparent mispricing given a certain number of bedrooms. The number of bedrooms played an integral role in some of the other engineered features, namely as a normalizing variable in the price vs. median variable (see below).
Drilling down further, the number of bedrooms was almost evenly distributed across the three interest classes. That said, low interest apartments were more heavily weighted towards one-bedroom listings, while medium and high interest listings were more likely to have two bedrooms. However, when the number of bedrooms was included in various models, either as the raw variable or as a boolean by-product, it did not provide large gains in predictive power. Nonetheless, it proved to be a good ingredient for the creation of new features.
Price: Unsurprisingly, one of the most important features in the dataset was the listing price. The price frequency distribution deviated from a pure Gaussian shape, with a marked right skew due to the inherent structural supply deficit in the NYC apartment rental market. Moreover, the price histogram below showed high kurtosis due to the presence of several outliers in the right tail. Several transformations were tested, with price per room and natural log price emerging as the best options to smooth the raw variable and attain a more normally distributed shape without losing explanatory power to discriminate between interest level classes.
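The two price transformations mentioned above could look roughly like the following sketch. The exact definition of "price per room" is not given in the text, so the room count used here (bedrooms plus one, so that studios count as a single room) is an assumption, as are the function names and the toy listings.

```python
import math

def log_price(price):
    """Natural log of the listing price; compresses the long right tail."""
    return math.log(price)

def price_per_room(price, bedrooms):
    """Price divided by a room count.

    Assumption: rooms = bedrooms + 1, so a studio (0 bedrooms) counts as
    one room and the division is always defined.
    """
    return price / (bedrooms + 1)

# Toy listings, invented for illustration only.
listings = [
    {"price": 2400, "bedrooms": 1},
    {"price": 3000, "bedrooms": 0},
    {"price": 45000, "bedrooms": 4},  # a right-tail outlier
]
for row in listings:
    row["log_price"] = log_price(row["price"])
    row["price_per_room"] = price_per_room(row["price"], row["bedrooms"])
```

Both transformations pull the outlier closer to the bulk of the data: the raw prices span a factor of almost 19, while the log prices differ by less than 3.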
Photos: As the histogram below highlights, most listings at RentHop display between 3 and 8 pictures. The "number of photos" variable was created from the original data set and follows a fairly normal distribution. Breaking it down by interest level confirmed one of the ex-ante hypotheses: listings with no pictures were more prone to belong to the "low interest" class. In fact, the probability that a listing is classified as "low" interest is higher for those with no pictures (95.2%) than for those with at least one image available (67.5%).
Given the strong effect of photos, a new variable named "no_photo" proved useful when added to linear models such as Logit and Linear Discriminant Analysis (LDA), or to Bayesian methods like Bernoulli Naive Bayes (BNB). On the other hand, "number of pictures" gave better results when applied to tree-based classification models like Random Forest or Gradient Boosting. The lesson here is that feature engineering is necessary to extract value from an original dataset; however, the marginal value of a newly engineered variable hinges on the type of model used.
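A minimal sketch of how the two photo variables and the per-group share of "low" listings might be derived is shown below. The helper names and the toy rows are hypothetical; the percentages quoted above come from the real data and are not reproduced by this example.

```python
def photo_features(photos):
    """Return (num_photos, no_photo) for a listing's list of photo URLs."""
    n = len(photos)
    return n, int(n == 0)

def share_low(rows, no_photo_flag):
    """Fraction of 'low' listings among rows with the given no_photo flag."""
    group = [r for r in rows if photo_features(r["photos"])[1] == no_photo_flag]
    return sum(r["interest_level"] == "low" for r in group) / len(group)

# Toy rows, invented for illustration only.
rows = [
    {"photos": [], "interest_level": "low"},
    {"photos": [], "interest_level": "low"},
    {"photos": ["a.jpg"], "interest_level": "low"},
    {"photos": ["a.jpg", "b.jpg"], "interest_level": "medium"},
    {"photos": ["a.jpg", "b.jpg", "c.jpg"], "interest_level": "high"},
    {"photos": ["a.jpg"], "interest_level": "low"},
]

low_without_photos = share_low(rows, 1)
low_with_photos = share_low(rows, 0)
```

The binary flag suits models that work with linear or boolean inputs, while the raw count leaves a tree-based model free to choose its own split points.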
Listing features: Another useful raw variable was "features", which proved important for the team in terms of feature engineering and model performance. The word cloud chart below highlights the most frequent key words used in listing advertisements, with "Doorman" and "Elevator" among the most important. Three new variables were created using input from the "features" field:
- Number of Features: it simply counts the number of features per listing.
- Number of Key Features: it counts only the key words identified as such by their frequency in the word cloud.
- Number of Key Features Score: EDA showed that the effect of the number of features on interest levels soars as the number of key features per listing rises above five; thus a scoring system was built on "number of key features" in order to capture this threshold effect.
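The three variables above can be sketched as follows. Only "Doorman" and "Elevator" are named in the text; the remaining entries of `KEY_FEATURES` are hypothetical placeholders, and the binary score is one assumed way of encoding the "above five" threshold, since the team's actual scoring system is not described.

```python
# "doorman" and "elevator" come from the text; the rest are hypothetical
# stand-ins for other frequent words in the word cloud.
KEY_FEATURES = {
    "doorman", "elevator", "dishwasher", "hardwood floors",
    "fitness center", "roof deck", "laundry in building",
}

def feature_counts(features, key_features=KEY_FEATURES, threshold=5):
    """Return (num_features, num_key_features, key_features_score)."""
    normalized = [f.strip().lower() for f in features]
    num_features = len(normalized)
    num_key = sum(f in key_features for f in normalized)
    # Assumed scoring rule: 1 once the key-feature count exceeds the threshold.
    key_score = int(num_key > threshold)
    return num_features, num_key, key_score

basic = feature_counts(["Doorman", "Elevator", "Cats Allowed"])
rich = feature_counts([
    "Doorman", "Elevator", "Dishwasher",
    "Hardwood Floors", "Fitness Center", "Roof Deck",
])
```

A graded score (for example, one that keeps increasing beyond the threshold) would be an equally plausible reading of the description.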
5-in-1 new feature: RentHop website information suggests that listing interest may be related to a variable that measures the price of an apartment relative to nearby apartments:
- For each listing page, RentHop compares the price of an apartment to the median price of apartments in the same neighborhood with the same number of bedrooms and bathrooms.
- RentHop has a map search that allows users to find an apartment using a map.
To develop a variable that captures this effect, the team compared the price of an apartment to the median price of the thirty nearest apartments that had the same number of bedrooms and bathrooms. By using the thirty nearest points, each apartment had its own unique neighborhood based on the listings around it. This avoids an issue with fixed neighborhoods, where apartments only a few blocks away but on opposite sides of a boundary line would not be compared. Since this calculation was computationally expensive, the RentHop data was loaded into MongoDB and a geographic index was set up to optimize performance. This feature offered valuable information not contained in other variables and was one of the top features in the team's prediction models.
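The idea can be sketched with a brute-force search in plain Python. This is a stand-in for the MongoDB geographic index the team used, not a reproduction of it. Expressing the comparison as a ratio, excluding the listing itself from its own neighborhood, and the field names are all assumptions made for the example.

```python
import math
from statistics import median

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def price_vs_median(target, listings, k=30):
    """Ratio of the target's price to the median price of its k nearest
    listings with the same number of bedrooms and bathrooms."""
    peers = [
        other for other in listings
        if other is not target
        and other["bedrooms"] == target["bedrooms"]
        and other["bathrooms"] == target["bathrooms"]
    ]
    if not peers:
        return None  # no comparable listings at all
    peers.sort(key=lambda other: haversine_km(
        target["latitude"], target["longitude"],
        other["latitude"], other["longitude"],
    ))
    return target["price"] / median(other["price"] for other in peers[:k])

# Toy listings, invented for illustration only.
listings = [
    {"latitude": 40.750, "longitude": -73.98, "bedrooms": 1, "bathrooms": 1.0, "price": 3000},
    {"latitude": 40.751, "longitude": -73.98, "bedrooms": 1, "bathrooms": 1.0, "price": 2000},
    {"latitude": 40.752, "longitude": -73.98, "bedrooms": 1, "bathrooms": 1.0, "price": 2500},
    {"latitude": 40.800, "longitude": -73.98, "bedrooms": 1, "bathrooms": 1.0, "price": 9000},
    {"latitude": 40.7501, "longitude": -73.98, "bedrooms": 2, "bathrooms": 1.0, "price": 5000},
]
ratio = price_vs_median(listings[0], listings, k=2)
```

The brute-force version compares every pair of listings, which is why a spatial index becomes attractive once the full training and test sets are involved.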
Model Engineering: Validation and Ensemble
First, multiple predictive models were built, fine-tuned and tested using raw data variables and the significant new features. In this first stage, accuracy (bias) and precision (variance) were studied on a stand-alone basis. During the second stage, the most powerful models of each type were combined using Python's Brew library, a comprehensive tool to ensemble and stack predictive models in order to enhance their stand-alone predictive ability. Regrettably, the ensemble's performance was not as encouraging as initially expected, with minimal gains in terms of accuracy and variance.
The table below summarizes the best models per model family and highlights how tree-based models proved far superior, not only on the training sets and during validation but also in the acid test on the test data. K-Nearest Neighbors and Support Vector Machines were also tested, but with disappointing results that forced us to exclude them from the group below and focus our efforts on more promising models:
The most important models per family are discussed below in more detail:
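To give a feel for what combining models means at the level of predicted probabilities, the sketch below blends the class probabilities of several models with a weighted average. This is a generic illustration written in plain Python; it does not use or reproduce the Brew API, and the function name, weights and toy probabilities are assumptions.

```python
def blend_probabilities(model_probs, weights=None):
    """Weighted average of class-probability rows from several models.

    model_probs: one entry per model, each a list of probability rows
                 (one row per listing, one column per class).
    weights:     optional per-model weights; equal weights by default.
    """
    n_models = len(model_probs)
    if weights is None:
        weights = [1.0] * n_models
    total = sum(weights)
    blended = []
    for rows in zip(*model_probs):  # the same listing across all models
        n_classes = len(rows[0])
        blended.append([
            sum(w * row[j] for w, row in zip(weights, rows)) / total
            for j in range(n_classes)
        ])
    return blended

# Toy probabilities for two listings from two hypothetical models.
rf_probs = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]
gb_probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
blended = blend_probabilities([rf_probs, gb_probs])
```

Averaging tends to help most when the individual models make different kinds of errors; when strong models are highly correlated, the gain can be small.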
Linear and Discriminant Analysis Models
After trials of maximum-likelihood models such as Logit and multiple linear discriminant models, the results led the team to consider LDA the best linear modeling option. LDA seeks a linear combination of features that characterizes or separates two or more classes of objects or events. The model assumes that the predictors follow a normal distribution within each class, with a different mean per class for each feature but a covariance matrix (variances and covariances) shared across classes.
The RentHop training sample contains a few significant variables with non-normal characteristics such as excess kurtosis or significant skew. Price was one of the most significant variables in the aforementioned models, so transforming this variable not only enhanced its predictive ability but also smoothed its non-Gaussian shape. Last but not least, analysis of the variance and correlation metrics yields positive conclusions, as the constraint of equal variance and covariance across classes appears sensible for the chosen LDA predictors, as the tables below showcase:
Another family of models analyzed was Naive Bayes, namely Multinomial Naive Bayes and BNB. For the former, a series of frequency-like new predictors was generated, whilst for BNB only boolean features like "no_photo" were considered. Regrettably, the main conclusion after reviewing the results from both linear and Naive Bayes models was that their accuracy and generalization error for the problem at hand were worse than those of non-parametric tree models such as Random Forest or Gradient Boosting.
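A minimal sketch of fitting LDA with scikit-learn is shown below, assuming numpy and scikit-learn are installed. The data is synthetic and generated to satisfy the LDA assumptions (different class means, common covariance); the two columns merely stand in for predictors such as log price and number of photos and are not the team's actual inputs.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Three classes with different means but the same spread, as LDA assumes.
means = np.array([[0.0, 0.0], [1.5, 0.5], [3.0, 1.0]])
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(200, 2)) for m in means])
y = np.repeat([0, 1, 2], 200)  # 0 = low, 1 = medium, 2 = high (assumed encoding)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# One row of class probabilities per observation, which is the input
# a log-loss evaluation needs.
probs = lda.predict_proba(X)
```

The fitted `lda.means_` attribute holds one mean vector per class, which makes it easy to inspect how far apart the classes sit on each predictor.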
Tree Models: Random Forest, Gradient Boosting and Extreme Gradient Boosting
With respect to model selection, tree-based models proved to be the best choice for explaining and predicting RentHop interest levels. They gave the flexibility needed to deal with the number of raw features in the original data set and were also able to handle the non-linear decision boundaries needed. Python's sklearn library makes it easy to tune their parameters and to manage the considerable computational effort demanded by this sort of model.
Finally, the most important aspect of tree-based models was that, for the most part, they remain explainable when answering client questions. Three tree-based models were effective for predicting our response while maintaining a competitive log-loss score: random forest, gradient boosting and extreme gradient boosting (XGBoost).
Unsurprisingly, XGBoost performed best, but there were still important reasons to favor random forest. Its output was more easily interpretable and helped highlight which features were performing well, while allowing us to maintain a high level of transparency and rigor during the parameter fine-tuning process. Its accuracy was almost as good as that of the XGBoost model, and its results remained reproducible and explainable to users without much technical knowledge.
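The interpretability argument can be illustrated with a short scikit-learn sketch, assuming numpy and scikit-learn are installed. The data is synthetic, the feature names are hypothetical, and the hyperparameters are arbitrary rather than the tuned values used by the team.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
feature_names = ["log_price", "num_photos", "noise"]  # hypothetical names
X = rng.normal(size=(600, 3))

# Synthetic labels driven mostly by the first column and not at all by the
# third, so the importance ranking has a known answer.
score = X[:, 0] + 0.3 * X[:, 1]
y = np.digitize(score, bins=[-0.5, 0.5])  # three classes: 0, 1, 2

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_tr, y_tr)

# Validation log-loss, the same metric used throughout the analysis.
val_loss = log_loss(y_va, rf.predict_proba(X_va), labels=[0, 1, 2])

# Impurity-based importances sum to 1 and give a quick feature ranking.
ranking = sorted(
    zip(feature_names, rf.feature_importances_), key=lambda pair: -pair[1]
)
```

A ranking like this is what makes it straightforward to explain to a non-technical audience which variables the model leans on.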