Contributed by Michael Todisco. He is currently in the NYC Data Science Academy 12-week full-time Data Science Bootcamp program taking place from January 11th to April 1st, 2016. This post is based on his third class project, Web Scraping and Shiny (due in the 6th week of the program).
Michael Todisco
Is there anything better than going to a baseball game? Provided it's during the day, sunny, above seventy degrees, on a Saturday, and there's a free hat giveaway.
The majority of the data was very easy to obtain using Baseball-Reference's Play Index tool. The site has robust data that dates back to the 1800s. However, one aspect of the project that I was adamant about including was promotional data. This refers to a team's pre-game promotions, such as a hat or bobblehead giveaway.
The promotional data I needed was in the form of calendars. I decided to use Python and Beautiful Soup to do the scraping, and after inspecting the elements of the website, it didn't seem like it would be too much trouble to gather the information.
However, I ran into an immediate wall when I realized the calendars were interactive and rendered with JavaScript, so the text I wanted was not in the raw HTML. After a lot of searching for solutions, I found Dryscrape, which renders the page in a headless browser and allowed me to access the text. Once I had that in place, it only took a few simple 'for' loops that appended to lists. Finally, I wrote the scraped data to a .csv file.
Once I had my scraped promotional data and the main dataset from Baseball-Reference, I decided to load them into R.
nyy_scraped = read.csv('nyy.csv')
yankees = read.csv('Yankees.csv')
There were a few column manipulations and additions that I made. Below is one example, which adds in the opponent's league column.
### Adding Column For Opponent's League ###
AL = c('Baltimore Orioles', 'Boston Red Sox', 'Chicago White Sox',
       'Cleveland Indians', 'Detroit Tigers', 'Houston Astros',
       'Kansas City Royals', 'Los Angeles Angels', 'Minnesota Twins',
       'New York Yankees', 'Oakland Athletics', 'Seattle Mariners',
       'Tampa Bay Rays', 'Texas Rangers', 'Toronto Blue Jays')
# Returns the league name for a single team name
league_func = function(x){
  if (x %in% AL) 'American League' else 'National League'
}
yankees$Opp_league = sapply(yankees$Opp, league_func)
The main step was to merge the scraped dataset with the data I downloaded from Baseball-Reference. I did this using 'Date' as the key, with all.x = TRUE so that every game is kept even when it has no matching row in the scraped promotional data.
yankees = merge(yankees, nyy_scraped, by = 'Date', all.x = TRUE)
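To illustrate what all.x = TRUE does, here is a minimal sketch on made-up data. The dates, attendance figures, and the `Has_Promo` column are invented for illustration and are not taken from the real files. Every game from the left table is kept, and games without a matching promotion get `NA`.

```r
# Toy stand-ins for the two data sets; all values are invented for illustration.
games <- data.frame(
  Date = c("2015-04-06", "2015-04-08", "2015-04-09"),
  Attendance = c(40000, 30000, 35000),
  stringsAsFactors = FALSE
)
promos <- data.frame(
  Date = c("2015-04-06", "2015-04-09"),
  Promo = c("Hat", "Bobblehead"),
  stringsAsFactors = FALSE
)

# all.x = TRUE keeps every row of `games` (a left join);
# dates with no promotion get NA in the Promo column.
merged <- merge(games, promos, by = "Date", all.x = TRUE)

# Optional (hypothetical) step: turn the NA into an explicit 0/1 flag.
merged$Has_Promo <- as.integer(!is.na(merged$Promo))
print(merged)
```

Without all.x = TRUE, `merge()` would perform an inner join and silently drop every game that had no promotion, which would bias any attendance comparison.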
Here's a shot of some of the data in the app.
With the data collected and prepared, my next step was to build the interactive tool using Shiny. With Shiny, it is necessary to create a front-end UI.R file along with a back-end Server.R file. The two working together provide the interactive functionality. Below is a small snapshot of each; the full code can be found here.
UI.R Server.R
Here is a shot of the main tab/graph of the Shiny App.
The functionality and features that I built into the Shiny App were the following:
The graphs and value boxes - average, maximum and minimum attendance - update as each filter is changed.
The best thing about Shiny and its interactive functionality is that there is an almost endless number of filter combinations available within the tool. In addition, a user can freely gather insights on overarching trends or get information as granular as a single game. The average attendance for games against the Orioles, on Tuesday nights, with a promotion running and temperatures in the mid-50s, is only a few clicks away.
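The logic behind such a filter is simple to sketch outside of Shiny. The function below is a hypothetical helper, not the app's actual server code: it applies optional filters to a data frame and returns the three value-box statistics. The function name, the column names, and the rows are all assumed for illustration.

```r
# Hypothetical helper: filter games, then compute the value-box statistics.
# Column names (Opp, DOW, Promo, Attendance) are assumptions for this sketch.
attendance_summary <- function(df, opp = NULL, dow = NULL, promo = NULL) {
  keep <- rep(TRUE, nrow(df))
  if (!is.null(opp))   keep <- keep & df$Opp %in% opp
  if (!is.null(dow))   keep <- keep & df$DOW %in% dow
  if (!is.null(promo)) keep <- keep & df$Promo == promo
  att <- df$Attendance[keep]
  # No matching games: report a count of zero and NA for the statistics.
  if (length(att) == 0) {
    return(c(games = 0, average = NA, maximum = NA, minimum = NA))
  }
  c(games = length(att), average = mean(att),
    maximum = max(att), minimum = min(att))
}

# Invented rows, purely to exercise the function.
toy <- data.frame(
  Opp = c("Baltimore Orioles", "Baltimore Orioles",
          "Boston Red Sox", "Baltimore Orioles"),
  DOW = c("Tuesday", "Tuesday", "Saturday", "Saturday"),
  Promo = c(TRUE, TRUE, TRUE, FALSE),
  Attendance = c(30000, 34000, 47000, 41000),
  stringsAsFactors = FALSE
)

result <- attendance_summary(toy, opp = "Baltimore Orioles",
                             dow = "Tuesday", promo = TRUE)
print(result)
```

In a Shiny app, a function like this would typically be called inside a reactive expression so that the statistics recompute whenever an input changes.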
Here are a few of the high-level insights that I was able to take away. Some are fairly straightforward and common sense, but others are quite interesting.
The 2016 MLB schedule is already out, and the Yankees have also released their promotional calendar for the upcoming season. Using this information and the historical data behind the Shiny App, I fit a model to predict attendance for the 2016 season.
I used the simple multiple linear regression model below to regress attendance onto six variables.
train.model = lm(Attendance ~ Opening_Day + Month + DOW + DayNight + Opp + Promo, data = yankees.train)
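As a rough sketch of how such a fit turns into predictions, the block below fits the same kind of model on synthetic data with a reduced set of predictors and then calls `predict()` on a hypothetical future schedule. The data, the effect sizes, and the `schedule_2016` name are all invented; only the `lm()` and `predict()` pattern carries over.

```r
set.seed(1)
n <- 120
# Synthetic games: every value here is simulated, not real attendance data.
sim <- data.frame(
  DOW = factor(sample(c("Tuesday", "Friday", "Saturday"), n, replace = TRUE)),
  DayNight = factor(sample(c("D", "N"), n, replace = TRUE)),
  Promo = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$Attendance <- 35000 +
  5000 * (sim$DOW == "Saturday") +
  3000 * (sim$Promo == "Yes") +
  rnorm(n, sd = 2000)

# lm() expands each factor into dummy variables automatically.
fit <- lm(Attendance ~ DOW + DayNight + Promo, data = sim)

# Hypothetical upcoming games; factor levels must match those seen in training.
schedule_2016 <- data.frame(
  DOW = factor(c("Saturday", "Tuesday"), levels = levels(sim$DOW)),
  DayNight = factor(c("D", "N"), levels = levels(sim$DayNight)),
  Promo = factor(c("Yes", "No"), levels = levels(sim$Promo))
)
predicted <- predict(fit, newdata = schedule_2016)
print(round(predicted))
```

One practical caveat with categorical predictors: `predict()` stops with an error if the new data contains a factor level that never appeared in the training data, so rare categories need attention before splitting or forecasting.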
Because this is a linear regression model, it was necessary to check graphically that it does not violate the assumptions of linear regression: linearity, normality of the residuals, constant variance, and independent errors. I also checked for outliers and influential points.
The graphs aren't perfect and there are definitely some areas of concern, but for the most part I can accept this model as adhering to the assumptions of linear regression.
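For readers who want to reproduce this kind of check, here is a minimal sketch using base R's built-in diagnostic plots on simulated data. The data and the 4/n cutoff for Cook's distance are assumptions of this sketch (the cutoff is a common rule of thumb, not necessarily the one used for this project).

```r
set.seed(2)
n <- 150
# Simulated data for illustration only.
sim <- data.frame(
  DOW = factor(sample(c("Tuesday", "Friday", "Saturday"), n, replace = TRUE)),
  Promo = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$Attendance <- 35000 + 5000 * (sim$DOW == "Saturday") +
  3000 * (sim$Promo == "Yes") + rnorm(n, sd = 2000)
fit <- lm(Attendance ~ DOW + Promo, data = sim)

# Assumption: draw to a null PDF device so the sketch also runs headless;
# drop the pdf()/dev.off() lines to see the plots in an interactive session.
pdf(file = NULL)
par(mfrow = c(2, 2))
# Residuals vs Fitted (linearity), Normal Q-Q (normality),
# Scale-Location (constant variance), Residuals vs Leverage (influence).
plot(fit)
invisible(dev.off())

# Cook's distance flags influential games; 4/n is a common rule of thumb.
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)
print(influential)
```

Independence of errors is not covered by these four plots; for game-by-game data, one option is to plot the residuals in date order and look for runs or trends.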
The model was validated with a random two-thirds/one-third split into training and test sets (a simple holdout rather than full k-fold cross-validation), using the following code:
# Train and test sets: label about two-thirds of the games TRAIN and the rest TEST
set.seed(0)
training_test = c(rep(1, length.out = trunc((2/3) * nrow(yankees))),
                  rep(2, length.out = (nrow(yankees) - trunc((2/3) * nrow(yankees)))))
yankees$training_test = sample(training_test) # random permutation
yankees$training_test = factor(yankees$training_test,
                               levels = c(1,2), labels = c("TRAIN", "TEST"))
yankees.train = subset(yankees, training_test == 'TRAIN')
# March games are excluded from the test set
yankees.test = subset(yankees, training_test == 'TEST' & Month != 'March')
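Once the split exists, one way to quantify accuracy is to compare prediction error on the two sets. The sketch below does this on simulated data with root mean squared error (RMSE); the metric, the data, and the `rmse` helper are choices made for this illustration, not something taken from the project.

```r
set.seed(0)
n <- 150
# Simulated data for illustration only.
sim <- data.frame(
  DOW = factor(sample(c("Tuesday", "Friday", "Saturday"), n, replace = TRUE)),
  Promo = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
sim$Attendance <- 35000 + 5000 * (sim$DOW == "Saturday") +
  3000 * (sim$Promo == "Yes") + rnorm(n, sd = 2000)

# Random two-thirds / one-third split, mirroring the approach above.
train_idx <- sample(n, size = trunc((2 / 3) * n))
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Fit on the training rows only.
fit <- lm(Attendance ~ DOW + Promo, data = train)

# Root mean squared error: the typical size of a prediction miss, in fans.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse_train <- rmse(train$Attendance, predict(fit, newdata = train))
rmse_test <- rmse(test$Attendance, predict(fit, newdata = test))
print(c(train = rmse_train, test = rmse_test))
```

A test error far above the training error would suggest overfitting; similar values suggest the model generalizes about as well as it fits.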
Here is a visual of the accuracy of the model on the training and test sets:
I graphed the model's response variable (Attendance) on a season-long plot, on its own tab in the Shiny App. A user can select a specific game or a range of games to view their attendance.
There is a lot about the model and predicted attendance that looks promising. The games with the highest predicted attendance for the Yankees are Saturday day games against the rival Boston Red Sox and Baltimore Orioles. Games with low predicted attendance fall on Tuesday nights with no promotion. However, there will undoubtedly be error in the regression, and it will need to be tuned moving forward.
I was very pleased with how this project turned out. I combined many aspects of data science that I have learned in this bootcamp: R, Python, web scraping, machine learning, and Shiny. Shiny is a really useful and well-designed tool that I will continue to use in my projects and work moving forward.
Next steps for this project include scraping the rest of the MLB promotional calendars so that I can include every team and not just the Yankees. I also want to train more accurate and more complex models to predict the attendance numbers for the 2016 season. While multiple linear regression is easy to understand and interpret, I believe a more sophisticated supervised learning technique will yield better results.