Analysis of a Podcast: The Joe Rogan Experience

Posted by Michael Dollar

Updated: Jul 14, 2019

The Joe Rogan Experience has appeared frequently on podcast ranking lists over the last few years. It has earned numerous awards and mentions, both for its overall popularity and as a comedy podcast. While The Joe Rogan Experience invariably has some element of comedy in every episode, the topics of conversation vary considerably. It is this versatility, in combination with the conversational format of the show, that piques my interest in collecting data about this particular podcast. What are the primary topics discussed on JRE? Is it possible to label JRE with any one of its topics of conversation? How does the popularity of the show vary with the topics discussed during each year of its run so far?


Data Acquisition and Cleaning

The data were scraped from a third-party website using Scrapy and then cleaned and organized using a combination of Python and R. The variables collected were episode title, air date, runtime, likes, dislikes, and ratio. The episode title included the name of the guest(s), while the likes and dislikes were obtained from YouTube data by the third party. The ratio is simply the ratio of likes to dislikes. The runtime is the length of the episode in hh:mm:ss format.
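As a minimal sketch of the cleaning step, the two derived quantities described above (runtime in hh:mm:ss and the like-to-dislike ratio) could be computed as follows. The function names are hypothetical, and returning None when there are no dislikes is my own assumption, not something stated about the original data set.

```python
from typing import Optional


def runtime_to_minutes(runtime: str) -> float:
    """Convert a runtime string in 'hh:mm:ss' format into minutes."""
    hours, minutes, seconds = (int(part) for part in runtime.split(":"))
    return hours * 60 + minutes + seconds / 60


def like_ratio(likes: int, dislikes: int) -> Optional[float]:
    """Ratio of likes to dislikes.

    Assumption: an episode with zero dislikes has no defined ratio,
    so None is returned instead of dividing by zero.
    """
    if dislikes == 0:
        return None
    return likes / dislikes
```

Converting the runtime to a single number of minutes makes it easy to filter and plot later on.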

A sample of the data set without tags.

A second scrape yielded a data set with a new column, 'tag'. This column gives the category that the guest falls into.

Sample of the data set with 'tag' column.

There were many "Best of..." episodes that were snippets of full-length episodes, and because they were redundant, they were dropped from the data set. Also dropped were episodes shorter than 55 minutes, since they were mainly snippets whose titles lacked the easily sorted "Best of" prefix. Lastly, any episode that didn't explicitly name the guest in the title was dropped for simplicity, since such episodes are a small minority of the data set.
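The three drop rules above can be sketched as a single filter. This is an illustration, not the code actually used: the dictionary keys 'title', 'runtime_min', and 'guest' are hypothetical, and I assume the guest name has already been parsed out of the title (empty when no guest is named).

```python
def keep_episode(episode: dict, min_minutes: float = 55) -> bool:
    """Return True if an episode survives the three drop rules."""
    title = episode["title"].strip()
    # Rule 1: "Best of..." episodes are snippets of full-length episodes.
    if title.lower().startswith("best of"):
        return False
    # Rule 2: episodes shorter than 55 minutes are mostly unlabeled snippets.
    if episode["runtime_min"] < min_minutes:
        return False
    # Rule 3: the guest must be explicitly named (assumed pre-parsed).
    if not episode.get("guest"):
        return False
    return True


def clean_episodes(episodes: list) -> list:
    """Apply keep_episode to every row."""
    return [ep for ep in episodes if keep_episode(ep)]
```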


An initial look at the time series data for number of views, likes, and dislikes shows very tall spikes. I believe these spikes reflect viewers who tuned in specifically to see a certain guest. The guests that correspond to those spikes will be addressed later, but before I begin further analysis, I will remove outliers.

After removal of data points more than three standard deviations from the mean, the time series looks a little more stable.
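A three-standard-deviation cutoff can be sketched as below. I am assuming the distance is measured from the mean in both directions and that the sample standard deviation is used; the post does not say which variant was applied, so treat this as one reasonable reading.

```python
from statistics import mean, stdev


def split_outliers(values, n_sigma=3.0):
    """Split values into (kept, outliers) using a mean +/- n_sigma rule.

    Assumption: sample standard deviation, two-sided cutoff.
    """
    mu = mean(values)
    sigma = stdev(values)
    kept, outliers = [], []
    for value in values:
        if sigma > 0 and abs(value - mu) > n_sigma * sigma:
            outliers.append(value)
        else:
            kept.append(value)
    return kept, outliers
```

Returning the outliers alongside the kept values makes it easy to come back later and look up which episodes they belong to.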

Views are scarce between 2013 and the middle of 2015 and then begin a steady increase. I think it can be inferred that something significant happened around mid-2015 to start that increase, but I am not sure what it was.

The next few plots compare several variables across the tags that have at least 30 episodes associated with them. The variables are number of views, likes, dislikes, runtime, and ratio.

All box plots are heavily skewed to the top.

All categories in this plot start at zero, probably because of the low numbers found in the early days of the podcast. Because of this, I would like to compare across categories for the last two years only. To accomplish this, I eliminated from the comparison all tags that had fewer than thirty observations.
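The restriction to recent episodes and to well-populated tags might look like the sketch below. The keys 'tag' and 'air_date' and the cutoff argument are hypothetical, and I assume the thirty-observation threshold is counted within the recent window rather than over the whole run.

```python
from collections import Counter
from datetime import date


def recent_with_common_tags(episodes, cutoff: date, min_count: int = 30):
    """Keep episodes aired on or after cutoff whose tag has enough rows."""
    recent = [ep for ep in episodes if ep["air_date"] >= cutoff]
    # Count episodes per tag inside the window (assumption, see lead-in).
    counts = Counter(ep["tag"] for ep in recent)
    return [ep for ep in recent if counts[ep["tag"]] >= min_count]
```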

Here, the authors category has the highest minimum number of views.
The number of likes varies the most within comedians and miscellaneous, but at least a quarter of the podcasts in each category have more than 23,000 likes.
Athletes-fighters is the category with the tightest spread of dislikes and the lowest median.
The ratio for authors has the highest variance, while writers seems to have the lowest.
The duration of the episode seems to be consistent among categories, but writers episodes cluster more sharply around 160-175 minutes.
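The statements above are read off box plots, so they map onto a five-number summary per category. For example, "at least a quarter have more than 23,000 likes" corresponds to the third quartile sitting above 23,000. A minimal sketch follows; the function name is hypothetical, and the quartiles use the default method of Python's statistics.quantiles, which may differ slightly from what the plotting library computed.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, quartiles, and maximum of a list of numbers."""
    q1, median, q3 = quantiles(values, n=4)
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }
```

Comparing the 'min' entries across categories gives the "highest minimum" reading, and q3 minus q1 gives the spread of the box.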

The Outliers

From the first graph that included the outliers in the data set, I was curious to know who those guests were.  Below are lists of episodes that were outliers for their respective variables.  The length of each list represents the number of outliers in that particular variable.

That large spike in the time series data is Elon Musk.
Elon Musk also tops the chart with number of likes.
Jack Dorsey has an overwhelming lead in dislikes for this list.

There was only one outlier for the variable of runtime:  an episode with Alex Jones that ran approximately 280 minutes. 

Comparison by Category:  Episode Count and Average Number of Views

From the side-by-side plots above, you can see that the categories of comedians and athletes-fighters have the highest episode counts, while the categories of politics and business command the most attention. In fact, the category of comedians doesn't even show up in the plot on the right. This led me to create a new table, grouped and indexed by category (the column labeled 'tag'), with a new column for the average number of views per episode. I removed the categories that had fewer than 30 episodes.
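The grouped table described above can be sketched in plain Python as follows. The keys 'tag' and 'views' are assumptions about how the rows are stored; in practice a pandas or R group-by would do the same job.

```python
from collections import defaultdict


def summarize_by_tag(episodes, min_count: int = 30):
    """Episode count and average views per tag, dropping small tags."""
    views_by_tag = defaultdict(list)
    for ep in episodes:
        views_by_tag[ep["tag"]].append(ep["views"])

    summary = {}
    for tag, views in views_by_tag.items():
        # Categories with fewer than min_count episodes are removed.
        if len(views) >= min_count:
            summary[tag] = {
                "episodes": len(views),
                "avg_views": sum(views) / len(views),
            }
    return summary
```

Sorting the result once by 'episodes' and once by 'avg_views' reproduces the two rankings compared in the side-by-side plots.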



While the variables do not vary much among categories, there is certainly a disconnect between the number of episodes per category and the number of views per category. While the majority of podcasts are tagged with the category of comedians, the category of politics gets substantially higher numbers of views. This suggests that JRE gets more attention when the topic of conversation or category of guest deviates from comedians and fighters. This is especially true when the deviation is toward politics or business.
