Learning R: Analyzing the English Premier League (I)

Contributed by Bryan Valentini. Bryan took the R003 class (Data Science by R, Intensive Beginner Level) with Vivian Zhang in Mar-Apr 2014 and did great in class. This post is based on his week 4 homework submission.


Introduction

Good morning, football (soccer) fans, and welcome to my inaugural post about learning R. Aren't you excited?! Gosh, analyzing data just wakes me up in the morning. This series of posts is a guide to doing some basic data manipulation and plotting in R.

[Figure: Performance Chart]

Gunners and Data

I've broken all the steps down to shed some light on my thinking, so the first few posts will be painstakingly explicit and less about Arsenal or the English Premier League itself. But, but! I welcome all kinds of relevant football analysis questions or comments, as they could be the seeds of further posts.

I am going to assume you have a working copy of the R language and the following packages installed: XML, RCurl, stringr, and plyr. Also, some basic knowledge of R and vectorized operations is beneficial when reading this post.
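If any of those packages are missing, a one-time install from CRAN will take care of it:

# one-time setup: install the packages used in this series
install.packages(c("XML", "RCurl", "stringr", "plyr"))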

Side note: An Arsenal supporter is affectionately called a "Gooner" [Wikipedia].

Getting the Data

<brrrreeeeeeep>

That was the first-half whistle, so let's get started. The lovely people at ESPNFC have a great deal of semi-structured data about football, and in particular we want to look at the team fixture results, a.k.a. the results of individual games. We want this type of data in order to do some opponent analysis later on. Let's take a look at a sample:

Source: 2001-02 Arsenal Fixtures & Results

2001/02 Premier League Fixtures

Date    Status  Home            Score  Away              Attendance      Competition
Aug 18  FT      Middlesbrough   0-4    Arsenal           31,557          Premier League
Aug 21  FT      Arsenal         1-2    Leeds United      38,062          Premier League
Aug 25  FT      Arsenal         4-0    Leicester City    37,909          Premier League
Sep 22  FT      Arsenal         1-1    Bolton Wanderers  38,014          Premier League
Sep 29  FT      Derby County    0-2    Arsenal           29,200          Premier League
Oct 13  FT      Southampton     0-2    Arsenal           29,759          Premier League
Jan 1   Postp   Leicester City  P - P  Arsenal           Filbert Street  Premier League
Jan 13  FT      Arsenal         1-1    Liverpool         38,132          Premier League

This is just a small subset of one season's worth of games to give us a feel for the original data set. If you click the source URL above, you'll see this subset is pulled from the 2001-2002 season, focusing on just the regular-season 'Premier League' competition. Luckily, most of the data is in an HTML table, which will make the parsing tasks easier. On the other hand, we can already see some issues with the raw table: first, the scores and home/away statuses are conflated, since whether Arsenal played at home is encoded only by which side of the score the team name sits on. Second, some rows list postponed games (note the venue name where the attendance should be), which we will probably want to remove later on.
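To see why the HTML table helps, here is a minimal sketch of lifting the tables out of that sample page with the XML package. It assumes readHTMLTable can fetch the URL directly and that the fixtures are one of the tables it returns; we'll do the real parsing in the next post:

# sketch: pull the HTML tables out of one season's fixtures page
library('XML')
url <- "http://espnfc.com/team/fixtures/_/id/359/season/2001/league/eng.1/arsenal"
tables <- readHTMLTable(url, stringsAsFactors=FALSE)
length(tables)  # the fixtures should be in one of these data frames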

For my particular interest, I want to pull all the Arsenal-related data from ESPNFC's site for every complete season from 2001 through 2012. I have tried to structure most of the code so that you could pull the data for any team in the EPL with just a few minor edits. Let's get to the code!

# import our libraries
library('RCurl')
library('stringr')

# create some constants for later use
team <- "Arsenal"
baseurl <- "http://espnfc.com/team/fixtures/_/id/359/season" #/2001/league/eng.1/arsenal
seasons <- 2001:2012

I want to pull all the data from ESPNFC in an automated fashion, using the parameters we defined above to take advantage of the URL scheme that the site has defined (i.e. baseurl + season + '/league/eng.1/' + modified team name). If your favorite team's name has multiple words, you'll need the additional step of converting it to lowercase and replacing spaces with dashes, e.g. "Queens Park Rangers" becomes "queens-park-rangers".
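As a quick check, here is that conversion applied to a multi-word team name, using the same stringr call we'll use below:

> str_replace_all(tolower("Queens Park Rangers"), " ", "-")
[1] "queens-park-rangers"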

For this step, we will define all the URLs we need to fetch and use the RCurl package to do all the heavy lifting:

# build all the URLs and fetch each season's web page
mod_name <- str_replace_all(tolower(team), " ", "-")
urls <- sapply(seasons, function(s) { return (paste(baseurl, s,"league/eng.1", mod_name, sep="/")); })
raw_data <- sapply(urls, getURL, .encoding="UTF-8")

# optional: write raw HTML data to local working directory
io <- file("rawdata_arsenal.txt")
cat(raw_data, file=io)
close(io)

# cleanup
rm("raw_data", "io")

The first three lines are where all the URL construction and HTML fetching happen. Let's tear these lines apart a bit. The seasons symbol is the vector we defined above. Using the sapply function, we map an anonymous function over that vector of years, which yields each of the URLs we want to fetch data from. You can try executing the code below to see each part:

> baseurl <- "http://espnfc.com/team/fixtures/_/id/359/season"
> mod_name <- str_replace_all(tolower(team), " ", "-")
> (seasons <- 2001:2012)
[1] 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
> (urls <- sapply(seasons, function(s) { return (paste(baseurl, s, "league/eng.1", mod_name, sep="/")); }))  
[1] "http://espnfc.com/team/fixtures/_/id/359/season/2001/league/eng.1/arsenal" 
[2] "http://espnfc.com/team/fixtures/_/id/359/season/2002/league/eng.1/arsenal" 
[...] 
[12] "http://espnfc.com/team/fixtures/_/id/359/season/2012/league/eng.1/arsenal"

The anonymous function simply pastes together all the subparts we need, using the '/' character as the separator, and returns the whole URL. Do this once for each number in the source vector and we have the 12 URLs we need. Alternatively, you could have used the str_c function from the stringr package to stay consistent with the rest of our string handling, but that function just wraps paste anyway (a quick sketch follows below). As a sanity check, we can test one of the URLs in a browser, for example #12 above, and expect to see fixtures from the 2012-13 season.
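For the curious, here is what the str_c version could look like; it should build exactly the same vector:

> urls2 <- sapply(seasons, function(s) { str_c(baseurl, s, "league/eng.1", mod_name, sep="/") })
> identical(urls, urls2)
[1] TRUE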

At this point, we could use the getURL function from the RCurl package to fetch one page at a time and process each returned HTML document to extract the fixtures table. Sometimes that is a good way to go, in order to test a single sample page. But I took an alternative approach and downloaded all the data at once, using the same vector technique, sapply, to get a vector of (large) strings:

> raw_data <- sapply(urls, getURL, .encoding="UTF-8")

Some caveats: we know that we don't expect a lot of data, so it's OK to load the entire set into memory. This is not always the case, and other techniques might be more appropriate for larger data sets. Second, I included the .encoding argument to ensure that all the data is treated as UTF-8. Most major sites send back their responses in UTF-8, an international character-encoding standard, but not always. You don't need to know much about this issue, but character encoding problems can crop up unexpectedly, and it's better to be safe than frustrated by garbled data. A full treatment of character sets is outside the scope of this post, but you can get some idea of it here.
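Before committing to the full download, it can pay to fetch a single season and sanity-check what came back. A minimal sketch (the exact numbers will vary with the site's markup):

# sanity check: fetch one season's page and inspect it before looping
one_page <- getURL(urls[1], .encoding="UTF-8")
nchar(one_page)     # roughly how many characters of HTML came back?
Encoding(one_page)  # should report "UTF-8" (or "unknown" for pure ASCII)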

I took the next step to store the entire raw_data blob to our local working directory, using the cat function:

# optional: write raw HTML data to local working directory
io <- file("rawdata_arsenal.txt")
cat(raw_data, file=io)
close(io)

Why did I do this? Mainly to avoid having to re-fetch the original data from its source and suffering through network latency over and over as I test my code. You will see in the next step that I got lucky: I didn't have to separate each season into its own data file, which is another concern when doing these data-cleanup tasks. I verified that the 'rawdata_arsenal.txt' file contains a bunch of HTML and JavaScript; it should be about 500KB worth of character data.
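If you'd rather verify the file from R than from your file browser, something like this works (the exact byte count will differ from run to run):

# sanity check: how big is the saved blob, and can we read it back?
file.info("rawdata_arsenal.txt")$size  # should be roughly 500KB
blob <- readChar("rawdata_arsenal.txt", file.info("rawdata_arsenal.txt")$size)
# note: cat() glued all 12 pages together with spaces, so this is one
# long string rather than the original 12-element vector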

<brrrreeeeeeep>

Well, that half went by fast. Let's recap the steps we took:

  1. Identified our data source, namely, static data from a website,
  2. Mapped out which page of the website had the data of interest for a particular season,
  3. Downloaded all the data for parsing, using a safe character encoding set, and...
  4. Stored all the data to disk.

In the process, we took advantage of sapply to loop through the above steps for each season in only a few lines of code. Pretty cool! The next post will focus on cleaning up the data to make it easier to analyze, and then plotting it.

Code: All code can be found on my Learning R Github page; specifically, the code for this post is here. Find me on Twitter @ElDonEsteban.
