Scraping Kickstarter

Contributed by Gordon. Gordon took NYC Data Science Academy's 12-week full-time Data Science Bootcamp program from Sept 23 to Dec 18, 2015. This post is based on his third class project (due in the 6th week of the program).

The Problem

"Kickstarter is the world's largest funding platform for creative projects," says the first line of the description on the company's website. Creators post projects on Kickstarter hoping for their work to be crowd funded by interested parties. If the project's goal is met before the projects expiry date, the money promised becomes money to spend. If not, the pledges go unfulfilled.

Kickstarter has a private API for accessing data from these projects, and several people have written their own APIs as wrappers over this hidden conduit. From a scraping perspective, though, getting data is a bit harder. Python's famous web scraping library, Beautiful Soup, is powerless against the JavaScript foundations upon which Kickstarter's website is built. Thus, my first hurdle was to find a library powerful enough to glean data from websites built on JavaScript.

I first looked at a library called grab, but its poor English-language documentation proved too large a barrier to overcome. Next to enter my gaze was Scrapy, but it was also discarded due to its perceived over-complexity. I finally settled on Selenium as my tool.

Methodology

Selenium's tagline is terse: it automates browsers. Out of the box, Selenium allows one to open a web browser, go to a page, and do any action a human could do (clicking buttons, filling in forms, etc.), in addition to the base task of parsing HTML for information. One can partner Selenium with PhantomJS to do this surfing without opening a browser, but, for some reason, that proved to be slower on my machine.
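
For reference, switching to headless browsing is a one-line change. Here is a minimal sketch, assuming the phantomjs binary is installed and on the PATH:

from selenium import webdriver

# Headless alternative: PhantomJS instead of Firefox.
# Assumes the phantomjs binary is installed and on the PATH.
browser = webdriver.PhantomJS()
browser.get('https://www.kickstarter.com/discover?ref=nav')
print(browser.title)
browser.quit()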

The code below activates Selenium, navigates to Kickstarter's website, and then stores all the project categories and their urls.

from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://www.kickstarter.com/discover?ref=nav')
categories = browser.find_elements_by_class_name('category-container')

category_links = []
for category_link in categories:
    # Each item in the list is a tuple of the category's name and its link.
    category_links.append((str(category_link.find_element_by_class_name('h3').text),
                           category_link.find_element_by_class_name('bg-white').get_attribute('href')))

For each category I use its URL to navigate to its page. The default is to show the first 20 active projects in a given category. Using Selenium, I expand the results to show the first 20 of all the projects ever submitted for that category.

for category in category_links:
    browser.get(category[1])
    browser.find_element_by_class_name('sentence-open').click()
    browser.find_element_by_id('category_filter').click()

I then went to each project's page and scraped the data I wanted. This included the project's name, funding goal, current money garnered, and description. There was some branching in the code to account for different project states: funded and finished, funded and not finished, etc.
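
A condensed sketch of that per-project scrape is below. The class names ('project-title', 'goal', 'pledged', 'project-description') are illustrative placeholders rather than Kickstarter's actual markup, and the branching on project state is omitted:

def scrape_project(browser, url):
    # Placeholder selectors; the real code branched on project state
    # (funded and finished, funded and not finished, etc.).
    browser.get(url)
    return {'name': browser.find_element_by_class_name('project-title').text,
            'goal': browser.find_element_by_class_name('goal').text,
            'pledged': browser.find_element_by_class_name('pledged').text,
            'description': browser.find_element_by_class_name('project-description').text}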

The eagle-eyed reader will notice a key omission here. Due to time restrictions, I did indeed scrape only the first 20 projects in each of the 15 categories. My conservative estimate put the time to scrape data on all of Kickstarter's 200,000-plus projects at four days. The difference between scraping data on 600 projects instead of 200,000+ was five lines of code.

from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        browser.find_element_by_class_name('load_more').click()
    except NoSuchElementException:
        # The "Load More" button is gone: every project is loaded.
        break

This snippet clicks the "Load More" button at the bottom of a category's page until every project is loaded; the data for each project can then be scraped.

Next Steps: Short and Long Term

Once I have the full data, I intend to do some extensive Machine Learning to try to build a predictive model that tells whether or not a Kickstarter project will be funded. Finally, I will build a web app that allows users to input the description of their Kickstarter project and receive a prediction of whether or not it will be funded.

That's still a long way off, though. In the meantime, I decided to do some basic Machine Learning of the type I want to do later.

My process involved separating the data into two sets: one with numeric data and the other with textual data. With the numeric data I used vanilla Logistic Regression on the entire dataset to achieve an 83% accuracy rate, a 23% increase over the baseline accuracy. Next I used Natural Language Processing to build a model on training data, and then tried to predict the test set. Surprisingly, the accuracy was 97%.
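
A minimal sketch of both models is below, assuming the scraped data sits in a pandas DataFrame with a binary 'funded' column. The file and column names are placeholders, and a bag-of-words Logistic Regression stands in for the NLP pipeline:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
# In older sklearn versions this lives in sklearn.cross_validation.
from sklearn.model_selection import train_test_split

df = pd.read_csv('kickstarter_projects.csv')  # hypothetical file

# Numeric model: vanilla Logistic Regression on placeholder numeric columns.
X_num = df[['goal', 'pledged', 'backers']]
num_clf = LogisticRegression()
num_clf.fit(X_num, df['funded'])
print(num_clf.score(X_num, df['funded']))

# Text model: bag-of-words over the descriptions, fit on a training split
# and scored on a held-out test split.
X_train, X_test, y_train, y_test = train_test_split(
    df['description'], df['funded'], random_state=0)
vec = CountVectorizer()
text_clf = LogisticRegression()
text_clf.fit(vec.fit_transform(X_train), y_train)
print(text_clf.score(vec.transform(X_test), y_test))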

I'm excited to see what the more robust process will produce.

Links

Slides: https://slides.com/gfleetwood/kickstarter-project/
GitHub: https://github.com/gfleetwood/nyc-data-science-academy/tree/master/scraping_kickstarter

Gordon Fleetwood
Gordon has a B.A. in Pure Mathematics and an M.A. in Applied Mathematics from CUNY Queens College. He briefly worked for an early-stage startup, where he was involved in building an algorithm to analyze financial data. However, most of his time has been spent working in various roles in academia, the latest being as an Adjunct Mathematics Lecturer. He is equally comfortable in both the Python and R Data Science stacks, but is strictly Python for Software Engineering. Outside of traditional Data Science, he is also interested in Soccer Analytics and the Open Data movement. With regard to the latter, he has recently become involved with Beta NYC.
