Blog - NYC Data Science Academy

What's for Dinner? - a web scraping project

Written by Lu Yu | May 24, 2020 6:57:08 PM

To do data analysis, we first need data. A vast amount of information is floating around the internet that, if gathered, could be structured and analyzed to extract insight.

Goal

With today's fast-paced lifestyle, people may not want to spend much time cooking dinner and shopping for groceries, especially on weekdays. Many recipe websites have collected a great number of recipes and neatly categorized them so that users can find the dishes they want to cook by main ingredient type (beef, chicken, vegetarian, etc.). However, few of them, if any, allow users to choose a recipe based on the time needed to get the dish ready and/or the ingredients at hand. I scraped the simplyrecipes.com website for its dinner recipes and focused on the following:

  1. perform exploratory analysis of the distribution and correlation of time (preparing and cooking) and ingredients across dish types (beef, chicken, pork, seafood, lamb, and vegetarian)
  2. write a script that recommends recipes to the user based on one or more criteria:
    1. dish type desired
    2. time intended for preparing & cooking
    3. ingredients at hand

Analysis

Among the dinner recipes that are labeled with their main ingredient (dish type), chicken dishes rank first, followed by vegetarian, beef, and pork. Seafood, pasta, and lamb are less common. However, about 30% of the recipes were not clearly labeled. Some of these have mixed main ingredients, but the majority could have been categorized into existing dish types. As a note to the website, missing labels may cause those recipes to fall through the cracks when users search by label.

In general, prep time for a dish averages around 17 minutes, and cook time averages around 50 minutes. The prep time and cook time of a recipe are not correlated (Pearson coefficient 0.0384). However, an ANOVA test revealed that prep/cook time does differ between dish types. By t-test, the prep/cook times can be roughly separated into 3 tiers: seafood is the fastest, followed by pasta, vegetarian, and chicken, while beef, pork, and lamb generally take the longest to prepare and cook.
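To make the two statistics concrete, here is a minimal pure-Python sketch of a Pearson coefficient and a one-way ANOVA F statistic. The numbers are made up for illustration and are not the scraped data, the function names `pearson_r` and `anova_f` are hypothetical, and the original analysis most likely relied on a statistics package rather than hand-written formulas. The sketch stops at the F statistic and does not compute a p-value.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each value around its own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up prep and cook times in minutes (illustration only).
prep = [10, 20, 10, 20]
cook = [30, 30, 60, 60]
r = pearson_r(prep, cook)

# Made-up cook times per dish type (illustration only).
cook_by_type = {
    "seafood": [15, 20, 25, 20],
    "chicken": [35, 40, 45, 40],
    "beef": [80, 90, 100, 90],
}
f_stat = anova_f(list(cook_by_type.values()))
```

A coefficient near zero, as in the toy `prep`/`cook` lists, means that knowing the prep time says little about the cook time. A large F statistic means the group means differ by much more than the spread within each group.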

Natural language processing packages were used to analyze the ingredients section of each recipe. Non-nouns (adjectives, verbs, etc.) were identified by part of speech and removed. Measurement units, spices, and sauces were removed manually by adding them to the stop words list. As a result, the ingredients section mainly reflects the ingredient items themselves, which can then be analyzed with a word cloud.
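The cleaning step can be illustrated with a simplified sketch. Note the substitution: instead of a part-of-speech tagger, this version uses only a hand-written stop list, so it runs without any NLP package. The stop list, the sample lines, and the function name `extract_ingredients` are all assumptions made for the example, not the project's actual code.

```python
import re

# Hypothetical stop words: measurement units, descriptors, spices, and sauces.
# The original project removed non-nouns with a part-of-speech tagger; this
# sketch replaces that step with a plain word list.
STOP_WORDS = {
    "cup", "cups", "tablespoon", "tablespoons", "teaspoon", "teaspoons",
    "pound", "pounds", "ounce", "ounces", "large", "small", "fresh",
    "chopped", "diced", "minced", "sliced", "ground", "salt", "paprika",
    "sauce", "soy", "of", "and", "or", "to", "taste",
}


def extract_ingredients(line):
    """Return the lowercase words of an ingredient line that are not stop words."""
    # Keeping only alphabetic runs also drops quantities and punctuation.
    words = re.findall(r"[a-z]+", line.lower())
    return [w for w in words if w not in STOP_WORDS]


# Made-up ingredient lines (illustration only).
lines = [
    "2 cups chopped onion",
    "1 tablespoon soy sauce",
    "1 pound ground beef",
    "3 large tomatoes, diced",
]
cleaned = [extract_ingredients(line) for line in lines]
```

This sketch does not merge plurals ("tomatoes" and "tomato" stay separate), which a real pipeline would probably handle with lemmatization.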

Word cloud analysis of the auxiliary ingredients (that is, ingredients other than beef for a beef dish, with spices and sauces excluded) reveals that onion, pepper, tomato, cheese, and egg dominate across all dish types. This information could serve as a reference from several perspectives: a) for customers, keeping those ingredients on hand maximizes the number of recipes you can make; b) for stores, consider stocking up on those ingredients and placing them somewhere easily accessible, and consider bundling some less popular ingredients with the most popular ones as a sale package; c) for the recipe website, this could reflect American dietary preferences, or it might be a sign of a lack of diversity in the recipes collected.

The dominating auxiliary ingredients continue to appear at the top of the list for each dish type, although their ranks shift. Lamb appears to be cooked most often with tomato, beef with onion, and seafood with lemon and onion. So if you don’t have access to any recipes at some point, go with those general rules and you probably won’t stray too far.
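A per-dish-type ranking like the one described above could be computed with a simple frequency count. The sketch below uses made-up recipes and a hypothetical helper `top_auxiliary`. It also assumes that the dish type name doubles as the main ingredient word, which works for "beef" or "lamb" but would need a lookup table for types such as seafood or vegetarian.

```python
from collections import Counter

# Made-up recipes (illustration only): (dish type, cleaned ingredient list).
recipes = [
    ("beef", ["beef", "onion", "tomato"]),
    ("beef", ["beef", "onion", "cheese"]),
    ("lamb", ["lamb", "tomato", "onion"]),
    ("lamb", ["lamb", "tomato"]),
]


def top_auxiliary(recipes, dish_type, n=3):
    """Most common ingredients for a dish type, excluding the main ingredient."""
    counts = Counter()
    for kind, ingredients in recipes:
        if kind == dish_type:
            # Count each ingredient once per recipe and skip the main one.
            # Assumption: the dish type name is also the main ingredient word.
            counts.update(set(ingredients) - {dish_type})
    return counts.most_common(n)


beef_top = top_auxiliary(recipes, "beef")
lamb_top = top_auxiliary(recipes, "lamb")
```

The same counts are what a word cloud visualizes, with font size standing in for frequency.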

Results

Finally, I wrote an interactive Python script to help the user choose recipes based on the criteria provided. A user can choose the type of dish (beef, chicken, pork, seafood, lamb, or vegetarian), the maximum prep and cook time in minutes, and the ingredients the user has at hand. Any step can be skipped if it is not a concern. For the ingredients, the script considers a recipe a match if half of the ingredients it requires are matched. After all requirements are entered, the script outputs a table of recipes that meet the user’s needs, filtered from over 700 recipes in the original database.
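The filtering logic described above can be sketched as follows. This is not the original script: the record layout, the function name `recommend`, and the sample recipes are assumptions, and the matching rule is read here as "at least half of the recipe's ingredients are on hand". The interactive prompts are left out so that only the filter itself is shown.

```python
def recommend(recipes, dish_type=None, max_minutes=None, on_hand=None):
    """Filter recipes; any criterion left as None is skipped."""
    matches = []
    for recipe in recipes:
        if dish_type is not None and recipe["dish_type"] != dish_type:
            continue
        total = recipe["prep_minutes"] + recipe["cook_minutes"]
        if max_minutes is not None and total > max_minutes:
            continue
        if on_hand is not None and recipe["ingredients"]:
            have = len(set(recipe["ingredients"]) & set(on_hand))
            # Assumed rule: at least half of the required ingredients match.
            if have * 2 < len(set(recipe["ingredients"])):
                continue
        matches.append(recipe)
    return matches


# Made-up recipes (illustration only).
recipes = [
    {"name": "Beef stew", "dish_type": "beef",
     "prep_minutes": 30, "cook_minutes": 120,
     "ingredients": ["beef", "onion", "carrot", "potato"]},
    {"name": "Chicken tacos", "dish_type": "chicken",
     "prep_minutes": 10, "cook_minutes": 20,
     "ingredients": ["chicken", "onion", "tomato", "cheese"]},
    {"name": "Lemon shrimp", "dish_type": "seafood",
     "prep_minutes": 10, "cook_minutes": 10,
     "ingredients": ["shrimp", "lemon", "garlic"]},
]

picks = recommend(recipes, max_minutes=45, on_hand=["onion", "tomato", "lemon"])
pick_names = [recipe["name"] for recipe in picks]
```

With these toy inputs, only the chicken recipe passes both the time limit and the half-match rule. Calling `recommend(recipes)` with no criteria returns every recipe, mirroring the option to skip each step.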

Future Developments

  1. The database could be expanded to include other recipe websites. For websites that provide nutrition and calorie information, those parameters could be added to the filtering script as well.
  2. Export the cleaned database to SQL so other users can run queries.
  3. Based on analysis of recipe contents and descriptions, a recipe-writing algorithm could be developed. Newly created recipes could be validated by chefs before being announced to users.
  4. The filtering system could be integrated with the recipe website and an online grocery shopping app, e.g., Amazon Fresh. The user could select next week’s recipes, and the app could calculate the kinds and amounts of ingredients needed and add them to the shopping list on the recipe website. That list would connect to the shopping cart of, say, Amazon Fresh and place the order, so all ingredients for next week’s dinners would be delivered in 2 hours.
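As a rough idea of what item 2 might look like, the sketch below loads a few made-up rows into SQLite using Python's built-in `sqlite3` module and runs one query. The table name, the column names, and the rows are hypothetical. A shared database would use a file path or a server rather than an in-memory connection.

```python
import sqlite3

# Made-up rows (illustration only): name, dish type, prep and cook minutes.
rows = [
    ("Chicken tacos", "chicken", 10, 20),
    ("Beef stew", "beef", 30, 120),
    ("Lemon shrimp", "seafood", 10, 10),
]

conn = sqlite3.connect(":memory:")  # pass a file path to persist the data
conn.execute(
    "CREATE TABLE recipes ("
    "name TEXT, dish_type TEXT, prep_minutes INTEGER, cook_minutes INTEGER)"
)
conn.executemany("INSERT INTO recipes VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Example query: recipes that take 30 minutes or less in total.
quick = conn.execute(
    "SELECT name FROM recipes "
    "WHERE prep_minutes + cook_minutes <= ? ORDER BY name",
    (30,),
).fetchall()
```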