Instagram “likes” Pose a Problem.
An internal Instagram study showed that teens delete up to half of all their Instagram posts because the posts do not receive enough likes. In fact, Instagram’s new “Instagram story” feature is in part an effort to counter the vanity imposed by the “likes” metric (Business Insider).
Web Scraping Instagram Posts
Gathering the data for this project was the largest challenge. In recent years, Instagram has adopted an increasingly stringent API policy, requiring developers to complete a number of approval steps before being given a key. As such, we needed to scrape the data.
Instagram, like many other large organizations, has many built-in tools to limit and trap web scrapers before they collect too much data. However, there are a number of Instagram web viewers that display Instagram content without the same restrictions. For our project we used two sources: Instagim and Instaliga.
Instagim - Posts by Tag
Instagim displays photos by tag on a single page, showing the username, likes, comments, caption, filter and some relevant hashtags. Due to the scope and timing of this project, we focused solely on photos under the “nature” tag.
Instaliga - User Information
While we had each photo and the likes it received, we were still missing a key metric: followers. We needed to find a website that could be easily crawled for follower and following counts, since they would be important features in predicting the number of likes for a photo. Instaliga was easy to scrape since it allowed us to append usernames to the end of a URL, and then crawl for this information using Scrapy.
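The URL-appending step can be sketched as follows. This is a minimal illustration rather than the crawler we ran: the base URL `https://viewer.example/user/` and the function name `profile_urls` are hypothetical placeholders, and the resulting list is the kind of value one would hand to a Scrapy spider as its `start_urls`.

```python
from urllib.parse import quote

# Hypothetical base URL; the real viewer's URL pattern is not reproduced here.
BASE_URL = "https://viewer.example/user/"


def profile_urls(usernames, base_url=BASE_URL):
    """Build one profile URL per unique username, preserving first-seen order."""
    seen = set()
    urls = []
    for name in usernames:
        name = name.strip().lstrip("@")
        if not name or name in seen:
            continue
        seen.add(name)
        # quote() escapes characters that are not safe in a URL path segment
        urls.append(base_url + quote(name, safe=""))
    return urls


urls = profile_urls(["nature_lover", "@hiker.jane", "nature_lover"])
```

Deduplicating the usernames first keeps the crawl from requesting the same profile once per scraped photo.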
EDA & Feature Engineering
However, many other features have more complex relationships. It seems fair to assume that followers would be correlated with likes, but although there is a relationship present, it is subject to a large amount of variance. These fluctuations are likely to be accounted for by a number of other factors, hence our interest in the "previous likes" metrics.
Some features, such as the filter applied to the photo, seemed to have even less of an effect than originally thought. These features would likely matter more within particular photo types: for example, selfies may commonly use a filter to make a person’s skin look healthier. However, for our “nature” photos there seemed to be little correlation.
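One way to build "previous likes" metrics is to summarize the likes on a user's most recent posts. The sketch below is an assumption about how such features could be computed, not our exact definition; the function name and the output keys are hypothetical, and the default window of 20 matches the number of previous posts our app collects.

```python
from statistics import mean, median


def previous_likes_features(likes_history, window=20):
    """Summarize the likes on a user's most recent posts.

    likes_history is ordered oldest to newest; only the last `window`
    entries are used.
    """
    recent = likes_history[-window:]
    if not recent:
        return {"prev_mean": 0.0, "prev_median": 0.0, "prev_count": 0}
    return {
        "prev_mean": mean(recent),
        "prev_median": median(recent),
        "prev_count": len(recent),
    }


# Made-up like counts for illustration only
features = previous_likes_features([12, 40, 33, 25], window=3)
```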
Photo Features - Extraction
Our plan was to supplement all of the user data with information from the photos to achieve a more accurate prediction. The logic is that if a user’s followers and general account popularity define a range within which a photo’s likes should fall, the features in that photo will help assign a more precise prediction for “likes”. After turning the photos into arrays, the PIL library was used to extract summary statistics for each color band. Other features such as luminance were easy to calculate given this data.
However, we also wanted to extract more complex features. Using the OpenCV library, we were able to use a pretrained face detection model and record the number of faces found in each photo. We also measured the blur of each photo.
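The per-band statistics can be illustrated with a small standard-library sketch. It assumes the pixels are already available as (R, G, B) tuples; Pillow's `ImageStat.Stat` exposes comparable per-band values directly. The luminance weights are the common Rec. 601 luma coefficients, which is our assumption here about what "luminance" means, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def band_stats(pixels):
    """Return mean and standard deviation per RGB band, plus mean luminance.

    `pixels` is a sequence of (R, G, B) tuples, for example taken from an
    image that has been converted to RGB.
    """
    stats = {}
    for index, band in enumerate(("r", "g", "b")):
        values = [p[index] for p in pixels]
        stats[band + "_mean"] = mean(values)
        stats[band + "_std"] = pstdev(values)
    # Rec. 601 luma weights (assumed definition of luminance)
    stats["luminance"] = (
        0.299 * stats["r_mean"]
        + 0.587 * stats["g_mean"]
        + 0.114 * stats["b_mean"]
    )
    return stats


# Two made-up pixels for illustration only
stats = band_stats([(10, 20, 30), (30, 20, 10)])
```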
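A common way to score blur is the variance of the Laplacian: sharp images have strong edges and therefore a high variance, while blurry ones score low. The pure-Python sketch below illustrates that idea under the assumption of a grayscale image stored as a list of rows; it is not necessarily the measure we used, and with OpenCV the same quantity is usually computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over the interior of an image.

    `gray` is a list of equal-length rows of intensities. Lower values
    suggest a blurrier image; border pixels are skipped for simplicity.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses) if responses else 0.0


# Synthetic images: a flat gray patch and a high-contrast checkerboard
flat = [[128] * 5 for _ in range(5)]
checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]
flat_score = laplacian_variance(flat)
checker_score = laplacian_variance(checker)
```

The flat patch has no edges at all, so it scores zero, while the checkerboard scores far higher.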
Photo Features - EDA
Unfortunately, few of these photo features seemed key in finding a correlation for likes in our subset of data. We compared many of the features against both raw likes and the likes/followers ratio (to normalize likes by audience size), but the correlations still seemed somewhat weak.
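The comparison described above can be sketched as a Pearson correlation of one photo feature against raw likes and against the likes/followers ratio. The numbers below are made up purely for illustration and are not data from the project; the function names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def likes_per_follower(likes, followers):
    """Normalize likes by audience size; accounts with no followers map to 0."""
    return [n_likes / n_fol if n_fol else 0.0
            for n_likes, n_fol in zip(likes, followers)]


# Made-up illustrative values, not data from the project
likes = [10, 20, 30, 40]
followers = [100, 100, 300, 400]
brightness = [1.0, 2.0, 3.0, 4.0]

ratio = likes_per_follower(likes, followers)
r_raw = pearson(brightness, likes)
r_norm = pearson(brightness, ratio)
```

In this toy data the feature tracks raw likes perfectly, but the relationship disappears once likes are divided by followers, which is why it helps to check both.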
In the future, we plan to extract even more features, and scrape data from multiple accounts to see if photo features matter more in the realm of a single user. Comparing some of these photo attributes against the mean and median likes for users may also have been beneficial.
Model Tuning and Selection
Neural Network
We began the modeling process by constructing a basic Multi-Layer Perceptron with a single layer of input nodes and a single output node, using the Rectified Linear Unit (ReLU) as our activation function. After setting up this basic model using the Keras API with a TensorFlow backend, we began hyper-parameter tuning. Once the model had been tuned extensively, we constructed other models for comparison.
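To make the architecture concrete, here is a minimal pure-Python sketch of a forward pass through a small perceptron with ReLU units and a single linear output node. The layer sizes and weights are made up for illustration and are not our trained model; in Keras, the equivalent structure would be built from `Dense` layers with `activation="relu"`.

```python
def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp the rest to 0."""
    return x if x > 0 else 0.0


def mlp_predict(features, hidden_weights, hidden_biases,
                output_weights, output_bias):
    """Forward pass: one ReLU layer followed by a single linear output node."""
    hidden = [
        relu(sum(w * f for w, f in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sum(w * h for w, h in zip(output_weights, hidden)) + output_bias


# Made-up weights for a 2-input, 2-unit example
prediction = mlp_predict(
    features=[1.0, 2.0],
    hidden_weights=[[1.0, -1.0], [0.5, 0.5]],
    hidden_biases=[0.0, 0.0],
    output_weights=[2.0, 2.0],
    output_bias=1.0,
)
```

Here the first unit receives a negative input and is clamped to zero by the ReLU, so only the second unit contributes to the output.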
Gradient Boosting Regressor
Given the results of the Multi-Layer Perceptron, we sought to compare it with a less complex model. Using the GradientBoostingRegressor in sklearn, we obtained much better results in both our cross-validation and our predictions on the test data.
Ultimately, we compared our predictions to the actual number of likes each post in our test set received and found that 95% of the predictions were within 30 likes.
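A workflow of this kind might look like the sketch below. The data is synthetic, the two feature columns are stand-ins, and the hyper-parameters are simply scikit-learn's defaults written out explicitly, so none of it reflects our tuned model or our results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data: likes loosely driven by followers and previous likes
rng = np.random.default_rng(0)
followers = rng.integers(50, 5000, size=300)
prev_mean_likes = followers * 0.05 + rng.normal(0, 5, size=300)
X = np.column_stack([followers, prev_mean_likes])
y = prev_mean_likes + rng.normal(0, 3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)

# 5-fold cross-validation on the training split, scored by mean absolute error
# (scikit-learn reports it negated so that higher is always better)
cv_scores = cross_val_score(
    model, X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)

model.fit(X_train, y_train)
predictions = model.predict(X_test)
```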
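A figure such as "95% of predictions within 30 likes" is the share of test posts whose absolute prediction error is at most 30. The helper below is a small sketch of that calculation; the function name is hypothetical and the sample numbers are made up for illustration.

```python
def share_within(predicted, actual, tolerance=30):
    """Fraction of predictions whose absolute error is at most `tolerance`."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)


# Made-up values for illustration only: absolute errors are 10, 50, 5 and 20
share = share_within([100, 250, 40, 500], [110, 200, 45, 520])
```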
likePredict: Flask app for predicting the likes of a post
Finally, we created a front-end application using Flask. This library lets you easily link back-end Python code with HTML templates to build interactive web apps. Once launched, it would give users a simple interface to upload an image and type in their Instagram handle to receive a prediction. It’s important to note that we are not web designers, so the temporary UX/UI leaves something to be desired.
Upon image upload, the back-end model collects, analyzes, and exports a data frame. Upon handle input, the Scrapy spider automatically collects the previous 20 posts and other relevant data.
These data frames are combined to match our model’s training columns. The model is then applied to the resulting data frame and an output prediction is generated.
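The column-matching step matters because a fitted model expects features in the same order it was trained on. The sketch below shows the idea with plain dictionaries; the column names and the helper are hypothetical, and with pandas a call such as `DataFrame.reindex(columns=training_columns)` plays a similar role.

```python
def build_model_row(photo_features, user_features, training_columns,
                    fill_value=0.0):
    """Merge two feature dicts into a list ordered like the training columns.

    Features the model was not trained on are dropped; training columns
    missing from both dicts are filled with `fill_value`.
    """
    merged = {**photo_features, **user_features}
    return [merged.get(column, fill_value) for column in training_columns]


# Hypothetical column names and made-up values
training_columns = ["followers", "prev_mean", "r_mean", "blur"]
row = build_model_row(
    {"r_mean": 120.5, "blur": 310.0, "unused": 1},
    {"followers": 840, "prev_mean": 52.3},
    training_columns,
)
```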
Future Directions
In the future, we’d like to get access to the Instagram API, since a Scrapy-based web application is not a model of stability or scalability. API access would alleviate the majority of issues currently present in our process and allow us to expand our models to different categories of photos. Currently, we are limited exclusively to public accounts due to data access. Additional features such as time of the week, follower involvement/network analysis, and a greater variety of image analysis would further reduce the error of our predictions within a given subset of photos. With the right amount of data and computational power, this applied model could solve some inherent issues with Instagram’s “likes” metric, and even expand to other platforms.