*(The dataset described in this post is currently publicly available on Kaggle.)*

# 1 Question

How can we tell the greatness of a movie before it is released in cinemas? This question puzzled me for a long time, since there is no universal way to judge the quality of a movie. Many people rely on critics to gauge the quality of a film, while others use their instincts. But it takes time to accumulate a reasonable number of critic reviews after a movie is released, and human instinct is sometimes unreliable.

Given that thousands of movies are produced each year, is there a better way for us to tell the greatness of a movie without relying on critics or our own instincts?

# 2 Scraping 5000+ movies from IMDB

The tool I used for scraping all 5000+ movies is a Python library called "scrapy". Below are the brief steps. The source code and documentation can be found on the GitHub page here.

- Use scrapy in Python to obtain a list of 5043 movie titles from "the-numbers" website. (code)
- Save the titles into a JSON file
- Search for those titles on the IMDB website to get the real IMDB movie links (code)
- Send an HTTP request to each movie page using the links, then scrape the page to get all the data (code)
- Perform face detection on all posters to get the number of faces (python code)
- Parse the aggregated data, clean it, and reformat it into a CSV file (python code).
- The final CSV file can be found here.
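As a rough illustration of the last step only (this is not the code used for this post), the sketch below cleans a few scraped records and writes them out as CSV using the Python standard library. The sample records, the `movie_title` field name, and the helper names are assumptions made for the example.

```python
import csv
import io
import json

# Hypothetical scraped records; titles and values are placeholders, not real data.
RAW_JSON = """
[
  {"movie_title": " Movie A ", "title_year": "2009", "duration": "120", "imdb_score": "7.5"},
  {"movie_title": "Movie B", "title_year": "", "duration": "n/a", "imdb_score": "6.1"}
]
"""

FIELDS = ["movie_title", "title_year", "duration", "imdb_score"]


def to_number(text):
    """Parse a float, returning None when the value is missing or malformed."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def clean_record(record):
    """Strip whitespace from the title and convert the numeric fields."""
    cleaned = {"movie_title": str(record.get("movie_title", "")).strip()}
    for name in FIELDS[1:]:
        cleaned[name] = to_number(record.get(name))
    return cleaned


def records_to_csv(records):
    """Serialise cleaned records to CSV text; missing values become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = clean_record(record)
        writer.writerow({key: ("" if value is None else value) for key, value in row.items()})
    return buffer.getvalue()


csv_text = records_to_csv(json.loads(RAW_JSON))
print(csv_text)
```

Leaving missing values as empty cells, rather than filling them with zeros, keeps later statistics from being skewed by placeholder numbers.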

Many important pieces of movie information were considered and scraped from the IMDB website, for example, movie title, director name, cast list, genres, etc.

The scraping process took 2 hours to finish. Scraping the movie posters took a little longer than scraping pure text data. In the end, I was able to obtain all the needed variables for 5043 movies and 4096 posters. Overall, they span 100 years and 66 countries. There are 2399 unique director names and 30K actors/actresses.

The image below shows all 28 variables that I scraped. Roughly speaking, half of the variables are directly related to the movies themselves, such as title, year, duration, etc. The other half are related to the people involved in the production of the movies, e.g., director names, director facebook popularity, movie rating from critics, etc.

# 3 Face detection from movie posters

I am especially interested in knowing the answer to this question: Does the number of human faces on a movie poster correlate with the movie rating?

A movie poster is an important way to make the public aware of a movie before its release, and it is quite common to see faces on movie posters. It should be pointed out that most movies have more than one poster. Some may argue that it is unreliable to detect faces from only one poster, and that is indeed true. However, just as a great book usually has a single cover, I believe a great movie needs to have a "main" poster, the one that the director likes most or that viewers remember longest. I have no way to tell which posters are the "main" posters, so I assume that the poster I scraped from the main IMDB page of a movie is the "main" poster.

Below are the movie posters from 8 great movies (IMDB rating scores are above 7.5). They all have only one human face.

Below are the movie posters from 8 movies that are not so great in terms of IMDB rating score (below 5). They tend to have many faces.

It should be pointed out that it is unfair to rate a movie solely on the number of human faces on its poster, because there are great movies whose posters have many faces. For example, the poster of the movie "(500) Days of Summer" has 43 faces, all from the same actress.

But based on my findings, it is uncommon for a movie to have a large number of faces (> 10) on its poster and to be a great movie at the same time.

Interestingly, many posters made my face detection algorithm fail, such as:

Overall, nearly 95% of the 4096 posters have fewer than 5 faces. In addition:

- Great movies tend to have fewer faces in posters
- If a poster has one or no human faces, we cannot tell if the movie is great simply from poster
- If a poster has more than 5 faces, the likelihood of the movie being great is low
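Observations like the ones above can be checked with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: the `(face_count, imdb_score)` pairs are made-up placeholders, the bucket boundaries follow the wording of the list above, and "great" is taken to mean a score above 7.5, as in the poster examples earlier in this section.

```python
def share_with_fewer_faces(face_counts, limit=5):
    """Fraction of posters whose detected face count is below `limit`."""
    return sum(1 for count in face_counts if count < limit) / len(face_counts)


def bucket_for(face_count):
    """Assign a poster to one of three face-count buckets (boundaries are assumptions)."""
    if face_count <= 1:
        return "0-1 faces"
    if face_count <= 5:
        return "2-5 faces"
    return "more than 5 faces"


def great_share_by_bucket(movies, great_threshold=7.5):
    """For each bucket, return the fraction of movies rated above the threshold."""
    totals = {}
    greats = {}
    for face_count, score in movies:
        key = bucket_for(face_count)
        totals[key] = totals.get(key, 0) + 1
        greats[key] = greats.get(key, 0) + (1 if score > great_threshold else 0)
    return {key: greats[key] / totals[key] for key in totals}


# Hypothetical (face_count, imdb_score) pairs, for illustration only.
sample = [(1, 8.1), (0, 5.2), (3, 7.7), (8, 4.9)]
print(share_with_fewer_faces([count for count, _ in sample]))
print(great_share_by_bucket(sample))
```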

# 4 IMDB rating score

Out of the 28 variables, I am especially interested in knowing how the IMDB rating score correlates with the other variables. From the 3D gross-country-rating plot below, we can see that the United States produced the largest number of movies over the period covered (1905-2015), dwarfing all other countries. The points at the top corner of the plot denote the movies with the highest gross in movie history. Many countries produced great movies, but there were still quite a few bad movies.

Movies with a rating higher than 8.0 are listed in the IMDB top 250, and they are truly great movies from many perspectives. Movies with a rating from 7.0 to 8.0 are probably still good movies; viewers can gain something from them. Movies with a rating from 1 to 5 are sometimes considered ones that "suck", in one way or another. One should avoid those movies unless one has to. Life is short.

## 4.1 IMDB score VS country

The USA and the UK are the two countries that produced the largest number of movies in the past century, including a large number of bad movies. The median IMDB scores for the USA and the UK are, however, not the highest among all countries. Some developing countries, such as Libya, Iran, Brazil, and Afghanistan, produced a small number of movies with high median IMDB scores.

## 4.2 IMDB score VS movie year

It seems that the number of movies produced annually has increased substantially since 1960. This is understandable, since the development of the film industry goes hand in hand with the development of science and technology. But we should be aware that the boom of the movie industry since 2000 has also brought many movies with low IMDB scores.

## 4.3 IMDB score VS movie facebook popularity

Social networks are a good way to estimate the popularity of certain phenomena. It is therefore interesting to know how the IMDB score correlates with a movie's popularity on social networks. From the scatter plot below, we can see that, overall, the movies with very high numbers of facebook likes tend to be the ones with IMDB scores around 8.0. As we know, movies with IMDB scores higher than 8.0 are considered the greatest movies and appear in the IMDB top 250 list. It is interesting to see that those greatest movies do not have the highest facebook popularity.

I highlighted several movies to illustrate this finding. The movies "Mad Max" and "Batman vs Superman" both have very high numbers of facebook likes, but their IMDB scores are slightly above 8.0. The movie "The Godfather" is deemed one of the greatest movies, but its facebook popularity is hugely dwarfed by that of "Interstellar".

## 4.4 IMDB score VS director facebook popularity

It is plausible to believe that the greatness of a movie is highly affected by its director. How do the IMDB scores of movies compare with the facebook popularity of their directors? From the plot below, it can be seen that directors of movies rated higher than 6.0 tend to have more facebook popularity than directors of movies rated lower than 6.0. I also listed the four directors with the highest facebook popularity (Christopher Nolan, David Fincher, Martin Scorsese, and Quentin Tarantino), along with four of their representative movies.

## 4.5 IMDB score VS top 3 actors/actresses facebook popularity

Great actors/actresses make a movie great. They are the souls of movies. What does their facebook popularity look like?

For a given movie, I scraped all the available cast members from the IMDB movie page. After retrieving the number of facebook likes for all cast members, I ranked the numbers in descending order and picked the top 3 actors/actresses. This is based on a simple assumption: a leading actor/actress tends to have more facebook popularity than a supporting actor/actress, and no matter how great a movie is, there will be no more than 2 leading actors/actresses. For notation purposes, I named the facebook popularity of the top 3 actors/actresses "actor_1_facebook_likes", "actor_2_facebook_likes", and "actor_3_facebook_likes". Note also that the variable "cast_total_facebook_likes" is calculated by summing up the facebook popularity of all the available cast members.

The assumption indeed matches the graph plotted below. The top-ranked actor/actress has the highest facebook popularity, while the second and the third have much lower popularity. But the graph also shows that high facebook popularity of the leading actor/actress does not mean that a movie is highly rated.
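The ranking step described above can be sketched as follows. This is a minimal illustration, not the original scraping code: the cast names and like counts are placeholders, and padding with zeros when a movie lists fewer than three cast members is an assumption of this example.

```python
def top_three_cast_likes(cast_likes):
    """Derive the three actor features and the cast total from a {name: likes} mapping."""
    ranked = sorted(cast_likes.values(), reverse=True)
    # Assumption: pad with zeros when fewer than three cast members are available.
    top = (ranked + [0, 0, 0])[:3]
    return {
        "actor_1_facebook_likes": top[0],
        "actor_2_facebook_likes": top[1],
        "actor_3_facebook_likes": top[2],
        "cast_total_facebook_likes": sum(ranked),
    }


# Hypothetical cast with made-up like counts.
cast = {"Actor A": 1200, "Actor B": 27000, "Actor C": 640, "Actor D": 9800}
features = top_three_cast_likes(cast)
print(features)
```

Note that the total is taken over every available cast member, not just the top three, which is why it can correlate strongly with the first actor's likes when one cast member dominates.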

# 5 Movie rating prediction

The prediction of movie ratings in this article is based on the following assumptions:

- The IMDB score reflects the greatness of movies. The higher, the better.
- Watching good movies is preferable to bad ones for many people.

With those 28 variables available for all scraped movies, can we predict movie rating? Before we begin, it is necessary to investigate the correlation of those variables.

## 5.1 Correlation analysis

Choosing 15 continuous variables, I plotted the correlation matrix below. Note that "imdb_score" in the matrix denotes the IMDB rating score of a movie. The matrix reveals that:

- The "cast_total_facebook_likes" has a strong positive correlation with the "actor_1_facebook_likes", and a smaller positive correlation with both "actor_2_facebook_likes" and "actor_3_facebook_likes"
- The "movie_facebook_likes" has a strong correlation with "num_critic_for_reviews", suggesting that the popularity of a movie on social networks may be closely tied to the attention it gets from critics
- The "movie_facebook_likes" has a relatively large correlation with the "num_voted_users"
- The movie "gross" has a strong positive correlation with the "num_voted_users"

Surprisingly, there are some pairwise correlations that are perhaps counter-intuitive:

- The "imdb_score" has a very small but positive correlation with the "director_facebook_likes", meaning that a popular director does not necessarily direct great movies.
- The "imdb_score" has a very small but positive correlation with the "actor_1_facebook_likes", meaning that a leading actor who is popular on social networks does not guarantee a highly rated movie. The same holds for supporting actors.
- The "imdb_score" has a small but positive correlation with "duration". Long movies tend to have higher ratings.
- The "imdb_score" has a small but negative correlation with "facenumber_in_poster". It is perhaps not a good idea to put many faces on the poster if a movie wants to be great.
- The "imdb_score" has almost no correlation with "budget". Throwing money at a movie will not necessarily make it great.
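For readers who want to reproduce this kind of pairwise analysis, here is a minimal sketch of the Pearson correlation coefficient and a small correlation matrix in plain Python. The numbers are made-up placeholders rather than values from the dataset, and the function names are my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / math.sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Return {name: {other_name: r}} for every pair of named columns."""
    return {
        a: {b: pearson(columns[a], columns[b]) for b in columns}
        for a in columns
    }


# Hypothetical values for three of the variables, for illustration only.
columns = {
    "imdb_score": [6.1, 7.4, 5.0, 8.2, 6.8],
    "duration": [95, 130, 88, 150, 110],
    "facenumber_in_poster": [3, 1, 6, 0, 2],
}
matrix = correlation_matrix(columns)
print(round(matrix["imdb_score"]["duration"], 3))
```

A coefficient near zero, as reported for "budget", says only that there is no linear relationship; it does not rule out a non-linear one.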

## 5.2 Dimensionality reduction

The three-dimensional PCA plot shown below reveals more information than the correlation matrix. For the 15 continuous variables, we can see their relationship with the three principal components in space. The colorful points denote the movies. We can see that some variable vectors cluster and point in similar directions, meaning that there is multicollinearity between some pairs of the 15 variables. This may lead to problems when we fit a linear regression model to predict movie rating.

## 5.3 Multiple linear regression

Although I initially scraped 28 variables from the IMDB website, many of them are not suitable for predicting movie rating. I will therefore select only several critical variables.

Both the correlation matrix and the 3D PCA plot show that multicollinearity exists in the 15 continuous variables. When fitting a multiple linear regression model to predict movie rating, we need to further remove some variables to reduce multicollinearity. Therefore, I remove the following variables: "gross", "cast_total_facebook_likes", "num_critic_for_reviews", "num_voted_users", and "movie_facebook_likes". Some variables are not applicable for prediction, such as "num_voted_users" and "movie_facebook_likes", because these numbers will be unavailable before a movie is released.

The plot of the fitted multiple linear regression is illustrated below. From the "Normal Q-Q" plot, we find that the normality assumption of regression is somewhat violated.

Thus, I applied the Box-Cox transformation and refit the model. Although the model became harder to interpret, the assumptions of multiple linear regression, namely no multicollinearity, normality, constant variability, and independence, are well satisfied.

From the detailed information of the fitted model, we find that the model is significant, since the p-value of 2.2e-16 is very small. The "title_year" and "facenumber_in_poster" variables have negative weights. The "actor_3_facebook_likes" variable was not included in the model at all, meaning that the social network popularity of the third actor in the cast is not significant for predicting the movie rating. This model has a multiple R-squared of 0.201, meaning that around 20% of the variability in the response can be explained by this model.
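For reference, the Box-Cox transformation maps a positive response y to (y^λ - 1) / λ when λ is non-zero, and to ln(y) when λ is zero. The sketch below applies it for a given λ. The value λ = 0.5 and the sample ratings are placeholders: the post does not state which λ was selected, and in practice λ is usually chosen by maximum likelihood.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transformation with parameter `lam` to positive values."""
    if any(value <= 0 for value in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(value) for value in values]
    return [(value ** lam - 1) / lam for value in values]


# Hypothetical ratings and a placeholder lambda, for illustration only.
ratings = [1.0, 4.0, 6.25, 9.0]
print(box_cox(ratings, 0.5))
```

Because the model is then fitted to the transformed response, its coefficients describe changes on the transformed scale, which is why interpretation becomes less direct.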

## 5.4 Random Forest regression

A Random Forest model was fitted to predict movie rating using the following variables ("imdb_score" is the response, and the rest are predictors):

- imdb_score
- director_facebook_likes
- duration
- actor_1_facebook_likes
- actor_2_facebook_likes
- actor_3_facebook_likes
- facenumber_in_poster
- budget

The movie dataset was divided into two parts: 80% of the movies were used as the training set, and the remaining 20% formed the testing set. Up to 4000 trees were generated to fit the random forest. The number of variables tried at each split of a decision tree is 2. The mean of squared residuals is **0.89023**, and the percentage of variance explained is 27.21%, better than that of the multiple linear regression.

From the fitted random forest model, the variable importance can be seen in the graph below. It is interesting to see that duration is the most important variable, followed by the budget and the director facebook popularity. Unlike in the multiple linear regression model above, "actor_3_facebook_likes" is considered an important variable, even slightly more important than "actor_1_facebook_likes".
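The split and the two reported metrics can be sketched in plain Python as below. This does not fit a random forest; it only illustrates an 80/20 split, the mean of squared residuals, and the percentage of variance explained, taken here as 100 × (1 − MSE / variance of the observed scores). That definition, the fixed seed, and the helper names are assumptions of this example, and the exact numbers reported by a given random forest package may be computed slightly differently (for example, from out-of-bag predictions).

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle rows reproducibly and split them into training and testing parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def mean_squared_error(actual, predicted):
    """Mean of squared residuals between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def percent_variance_explained(actual, predicted):
    """100 * (1 - MSE / variance of the observed values)."""
    mean_actual = sum(actual) / len(actual)
    variance = sum((a - mean_actual) ** 2 for a in actual) / len(actual)
    return 100.0 * (1.0 - mean_squared_error(actual, predicted) / variance)


# Hypothetical row ids standing in for movies.
train_rows, test_rows = train_test_split(range(10))
print(len(train_rows), len(test_rows))
```

Under this definition, a model that always predicts the mean score explains 0% of the variance, which gives a useful baseline for reading figures such as 27.21%.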

# 6 Insights

Since the fitted Random Forest model explains more variability than the multiple linear regression, I will use the results from the Random Forest to explain the insights found so far:

- The most important factor that affects movie rating is the duration. Longer movies tend to have higher ratings.
- Budget is important, although there is no strong correlation between budget and movie rating.
- The facebook popularity of the director is an important factor in a movie's rating.
- The facebook popularity of the top 3 actors/actresses is important.
- The number of faces on the movie poster has a non-negligible effect on the movie rating.

## 10 Comments

What tool is used to draw the figures in your post? Thanks

I mostly used "plotly" to generate the graph.

Hi Chuan Sun,

A brilliant piece of analysis and a lot of creative feature engineering. The dataset provided is very rich in information. Currently, I am writing a book on Machine Learning using R. Does this dataset have any copyright or licensing terms if I would like to use the dataset and cite your name in the references? Please let me know. You can contact me on my email.

Thank you! The dataset is under "Open Database License". Visit here for both the dataset and the licence information: https://www.kaggle.com/deepmatrix/imdb-5000-movie-dataset

Thanks Chuan !

Hi Chuan,

Excellent work. One issue I found, however, is with the Facebook analysis - using IMDB's Facebook likes is flawed, since they only count likes that have been "thumbed up" through the link on their website. This excludes any Facebook likes made outside of the IMDB links.

For example, let's look at the number of likes for The Godfather (1972) - on IMDB it is showing 44k likes, but on Facebook's verified movie page, it is over 9.2 million likes. https://www.facebook.com/thegodfather/

Just something to think about 🙂

Best Regards,

Ariella Katz

Student

Hi Ariella,

Thank you for your interest in the project! I totally agree with you that using only IMDB's facebook likes is not enough. But there were several practical considerations:

1. It is not easy to collect the verified movie pages on Facebook for all 5000 movies. Particularly, many unofficial pages were created by fans.

2. The number of likes for some extremely popular movies like "The Godfather" could largely dwarf that of less popular but still great movies. Getting unbiased facebook likes for movies is very challenging.

3. If we follow this direction, then how about all the directors/actors/actresses in Facebook?

Using only IMDB's facebook likes is simple and easy to implement, but still meaningful. It gives us a baseline to begin with. If in the future we find it is really necessary to dig deeper into the precise relationship between social networks and movie popularity, we can use a more systematic approach, such as graph analysis, i.e., building a big graph consisting of nodes (directors, actors/actresses, and movies) and edges (if actor A shows up in movie B, then we have an edge from A to B). Many insights could be distilled from there.

Thanks!

Thanks for the reply. Interesting to think about.

Best,

Ariella

Hi Chuan Sun,

We actually used your IMDB dataset for an Advanced Data Mining class at Rockhurst University in Kansas City, MO. We love the data set and we really appreciate the time it took to create it. However, we believe we found a small flaw in the data. Not all of the IMDB movie budget numbers are in US dollars. For example, the South Korean movie "The Host" has its budget numbers in S. Korean Won (Korean currency), but there is no data in the dataset that tells you the currency. The existence of foreign currencies skews the budget data for foreign films, particularly for currencies with extreme exchange rates when compared to USD. For instance, many could assume the data set shows "The Host" cost $12 billion to make when it truthfully cost only 12 billion Won, but the dataset doesn't make the distinction. It is not just an issue with Korean movies; we found Turkish and Japanese movies with the same issue.

Quinton

Hi Quinton,

Thank you very much for your suggestion! You are right. When I parsed the currency, I didn't take the Korean currency into consideration due to limited time. I will post your valuable comment on the Kaggle dataset page so that other users will be aware of it.

Thanks!