- Keyfood and C-Town "on sale" prices are approximately the same in the Harlem neighborhood.
- freshdirect is 138% more expensive than C-Town on a basket of 18 common grocery items.
- freshdirect is 146% more expensive than Keyfood on a basket of 10 common grocery items.
Please note that only the same items on sale from Keyfood and C-Town were compared with the fresh direct regular prices.
--------------------------------------------------------------------------------------------------------------------------------------
- Author: XAVIER CAPDEPON
- Xavier was a student of the Data Science Bootcamp#2 (B002) - Data Science, Data Mining and Machine Learning - from June 1st to August 24th 2015. Teachers: Andrew, Bryan, Jason, Vivian.
- The post is based on his "Web scraping" (Python) project final submission.
--------------------------------------------------------------------------------------------------------------------------------------
A few years ago, I moved further North in Manhattan and ended up living in the heart of Harlem on 145th street. Unlike Upper West Side where I was living with roommates a few years earlier, it quickly appeared to me that all the grocery supermarkets were attracting customers based on hypothetical large discounts that they would advertise with large letters on their front windows or with paper weekly circulars that are distributed inside and outside the stores.
After learning the weekly pattern of the discounts in different stores during the first few months, I had memorized a list of prices - that could vary over time - for my most regular grocery items and tried to optimize my spending by shopping for the lowest price items in up to 5 closest supermarkets in my neighborhood. The process sounded quite rewarding to me in terms of spending but quite energy consuming, time consuming and sometimes frustrating.
For the first Python project, I realized that I could compare more quantitatively the prices in different grocery store using the web scraping tool to download all the available data. This first project gave me several technical and theoretical problems to solve that I will explain in this blog post.
During this grocery shopping experiment, I observed some curious coincidences:
1. The first one was a curious coordination in terms of items on sale between the different supermaket brands and some similarities in term of prices and period of sales.
2. The second coincidence that I noticed two years ago is the fact that some heavily discounted such as the Tropicana Orange Juice in the standard size (59oz) and all the organic milk brands were never on sale at the same time.
Prior to coding, I made some research and I realized that most of the large brands such as Fairway Market would not advertise their prices online at all, or, like Associated Supermarket, the weekly circular discount offerings will be published using a basic scanned picture making the web scraping impossible.
Among the possible choices for the supermarket brands, I selected two brands which are independent and competing in North Manhattan, Keyfood and C-Town, and that are known to attract client with weekly discounts:
- C-Town is "a chain of independently owned and operated supermarkets operating in the northeastern United States" (source Wikipedia) which most of the stores are located in New York, NJ, PA, CT and MA.
- Keyfood is a "cooperative of independently owned supermarkets" (source Wikipedia) which most of the stores are located in the tri-state area.
- freshdirect is an online grocery store which are mainly operating in New York area.
The Fresh Direct scraping wasn't part of the original plan. Surprised by the data scraped on the Keyfood and C-Town website and despite the time left before delivering the project, I decided to scrap Fresh Direct in order to get more data to compare to even if their business plan is slightly different: Fresh Direct hasn't any store and doesn't attract customers by using weekly discount.
The expected data to be captured for this analysis was all available information about grocery items:
- full item name
- full item description
- item price
- item weight/volume
- diverse: picture, price by weight/volume
At first, I scraped the Keyfood website, then worked on the C-Town website and finished on the Fresh Direct Website.
All 3 websites had some similarities in term of structures with the existence of a secondary website uncovered during the web scraping coding.
The solution retained was scraping the index page URL and modifying it about 90 times to download all the grocery items. To extract the data, Regular Expression and Beautiful Soup were used with respectively 20%-80% .
C-Town:
Using the secondary Index URL modified 4 times, an embedded Java script code that contained all the targeted text containing all the grocery items was downloaded using Beautiful Soup. Given the nature of the text, I had to extract all the data Regular Expression only. The website regexr.com
freshdirect
With the experience acquired by scraping Keyfood and C-Town websites, the fresh direct website was slightly easier than the previous two websites. The preliminary strategy to download the data was again through the index page and amend the index category 400 times into the URL.
Initialization:
## initiate packages needed for web scraping import pandas as pd import numpy as np from bs4 import BeautifulSoup, Tag import requests import re # define fonction to extract raw text from website using Beautiful Soup def extract_text(urlv, headersv): # Request by url and headers demande = requests.get(urlv, headers=headers).text # Request status code statut_demande = demande.status_code if statut_demande/100 in [4,5]: return 'error on requests with error: ', statut_demande return BeautifulSoup(demande)
Phase 1: Download the text from the Home page
# Home page URL of the Fresh Direct website url = 'https://www.freshdirect.com/index.jsp' # headers used to launch the "requests" fonction from the requests package headers = { 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language':'en-US,en;q=0.8', 'Cache-Control':'max-age=0', 'Connection':'keep-alive', # cookie 'Cookie':'FDUser=1AIA7NG8IVEIH1SRL09ASVBXEK1CUNBK6PQA77J; cmTPSet=Y; CoreID6=80004000370614348352657&ci=90391309; RMID=0a3704045585d940; CM_DDX=pla14=1434835274869; JSESSIONID=tp0MVGQJVh7YTZHNJvrp1GVzjLj9pT1m31vwJw7ZTz5pcw2n8mp1!-1687515818!839018455; _gat=1; _ga=GA1.2.1044717069.1434835264; 90391309_clogin=v=1&l=1434845216&e=1434847301314; RES_TRACKINGID=345160034252330; RMFS=011Z6Su5U10451; RES_SESSIONID=373535453341901', 'Host':'www.freshdirect.com', 'Referer':'https://www.freshdirect.com/browse.jsp?pageType=browse&id=pate&pageSize=30&all=true&activePage=1&sortBy=Sort_PopularityUp&orderAsc=true&activeTab=product', 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36' } #Using my own fonction defined earlier, the text of home page is extracted in a var named textextracted textextracted = extract_text(url, headers) print textextracted.prettify
The text downloaded contained all the grocery item categories that will be used later to get the targeted data about the grocery items.
Phase 2: Preparation of the extraction of the grocery item categories
## Selection of text containing pertinent information content0 = [i.__str__() for i in textextracted.find_all('script')] #print content0 #1. Extraction of Category String using Regex (using regexr.com as starter) print type(content0) #### Extraction of full strings cat = re.findall('("id":"[A-Za-z_]+","name":"[A-Za-z0-9s&é,-']+","image":"/media/images/navigation/department/[0-9a-zA-Z]+)', content0[20]) #### Delete unuseful part of strings cat3 = [re.sub('"id":"','',x) for x in cat] cat4 = [re.sub('","name":"[A-Za-z0-9s&é,-']+","image":"/media/images/navigation/department/[0-9a-zA-Z]+','',x) for x in cat3] print 'cat4:', cat4 cat5 = list(set(cat4)) print type(cat5), len(cat5) #2. Extraction of Category Full Name #### Extraction of full strings catname = re.findall('("id":"[A-Za-z_]+","name":"[A-Za-z0-9s&é,-']+","image":"/media/images/navigation/department/[0-9a-zA-Z]+)', content0[20]) #### Delete unuseful part of strings catname4 = [re.sub('"id":"[A-Za-z_]+","name":"[A-Za-z0-9s&é,-']+","image":"/media/images/navigation/department/','',x) for x in catname] print catname4 catname5= list(set(catname4)) print catname5, len(catname5) ## Simplified list of category index: cat5 = list(set(cat4))
Note: the above code doesn't catch all the pet food categories.
Phase 3: run tests using the item category "jarsause" that contains all the pasta sauces available at fresh direct.
## 'url2' is a generic URL used as a basis: ## - the page size is modified for 'pageSize=500' and ## - the category index is contained in the string 'id=jarsauce' url2 = 'https://www.freshdirect.com/browse.jsp?pageType=browse&id=jarsauce&pageSize=500&all=true&activePage=1&sortBy=Sort_PopularityUp&orderAsc=true&activeTab=product' headers2 = { 'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', 'Accept-Encoding':'gzip, deflate, sdch', 'Accept-Language':'en-US,en;q=0.8', 'Cache-Control':'max-age=0', 'Connection':'keep-alive', 'Cookie':'FDUser=1AIA7NG8IVEIH1SRL09ASVBXEK1CUNBK6PQA77J; cmTPSet=Y; CoreID6=80004000370614348352657&ci=90391309; RMID=0a3704045585d940; CM_DDX=pla14=1434835274869; JSESSIONID=qyjWVGYQBJPQ2cL1SkSn4TcmLQTX78zlJHvGJP9pJdQSjnJyQ5qh!1352730708!-1129265067; RMFS=011Z6UNu; _ga=GA1.2.1044717069.1434835264; 90391309_clogin=v=1&l=1434851351&e=1434853650199; RES_TRACKINGID=345160034252330', 'Host':'www.freshdirect.com', 'Referer':'https://www.freshdirect.com/browse.jsp?pageType=browse&id=pate&pageSize=30&all=true&activePage=1&sortBy=Sort_PopularityUp&orderAsc=true&activeTab=product', 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.81 Safari/537.36' } textextracted2 = extract_text_keyfood(url2, headers2) #print textkeyfood2.prettify
Note: the standard pageSize=30 string in the URL is modified pageSize=500 to get all the information available at once.
Phase 4: tests and preliminary loops to extract the grocery items data
#### tests of data extraction content00 = textkeyfood2.find('div', class_="browse-sections transactional") #print 'len c00:', len(content00), content00 content001 = content00.find_all('li') print 'len c001:', len(content001), #content001 #print content00[0], 'n', content00[1], 'n', content00[2] content01 = [x.find('a', class_="portrait-item-header-name").get_text() for x in content001] #content02 = 0 #content001.find_all('div', class_="configDescr") #content03 = 0 #content001.find_all('div', class_="portrait-item-price") #### actual loop listitemFullName = [] listitemPrice = [] listitemDescription = [] i=0 for x in content001: Fullname = x.find('a', class_="portrait-item-header-name") Price = x.find('div', class_="portrait-item-price") Descr = x.find('div', class_="configDescr") if Fullname==None: listitemFullName.append('NA') else: listitemFullName.append(Fullname.get_text()) if Price==None: listitemPrice.append('NA') else: listitemPrice.append(Price.get_text()) if Descr==None: listitemDescription.append('NA') else: listitemDescription.append(Descr.get_text()) print 'name', len(listitemName) print 'name', len(listitemFullName) print 'price', len(listitemPrice) print 'descr', len(listitemDescription) print 'name', listitemName print 'fullname', listitemFullName print 'price', listitemPrice print 'desc', listitemDescription
Note 1: the code extract a list of redundant item categorical strings that is reduced to 347 item categorical strings that will be used to modify the generic URL (https://www.freshdirect.com/browse.jsp?pageType=browse&id=jarsauce&pageSize=500&all=true&activePage=1 &sortBy=Sort_PopularityUp&orderAsc=true&activeTab=product') by replacing the "jarsauce" string
Note 2: not all the pages have the same architectural structure under the "class_="browse-sections transactional"". In order to show and understand the details of the architecture, I used the following code which confirmed that a few pages had a different structure:
## Since the findAll function doesn't provide the expected data, ## an analysis of the architecture of the targeted class "browse-sections transactional" is needed ## Bryan code to draw the tree architecture: var: maxdepth horizontal, maxbreadth vertical and attrs,keys / items() content00 = textextracted2.find('div', class_="browse-sections transactional") def take(n, iterator): '''Take the first n items from a generator''' return [x for _, x in zip(range(n), iterator)] #hacky? def treeWalker(bsobj, level = 0, maxdepth = 5, maxbreadth = 20, indent = ' '): '''Pretty printer with limits on depth of printing''' if level >= maxdepth: print indent * level, 'reached max depth' return if hasattr(bsobj, 'children'): children = take(maxbreadth, bsobj.children) for item in children: if isinstance(item, Tag): print indent * level, item.name, if len(item.attrs) > 0: print item.attrs.items() ### change keys() or items() else: print '' treeWalker(item, level + 1, maxdepth, maxbreadth) if len(children) == maxbreadth: print indent * level, '[', len(bsobj.contents) - maxbreadth, 'tags not printed ]' treeWalker(content00)
for i in range(0,len(content00)): if content00.find_all('li', class_="portrait-item-header-name")<>None: listitemFullName.append(content00.find_all('a', class_="portrait-item-header-name").get_text()) else: listitemFullName.append('NA') if content00.find_all('li', class_="configDescr")<>None: listitemDescription.append(content00.find_all('div', class_="configDescr").get_text()) else: listitemDescription.append('NA') if content00.find_all('li', class_="portrait-item-price")<>None: listitemPrice.append(content00.find_all('div', class_="portrait-item-price").get_text()) else: listitemPrice.append('NA')
Phase 5: Generate the 354 URL corresponding to all the grocery item categories:
url3 = 'https://www.freshdirect.com/browse.jsp?pageType=browse&id=jarsauce&pageSize=1200&all=true&activePage=1&sortBy=Sort_PopularityUp&orderAsc=true&activeTab=product' print url3 urllist = [] for i in range(0, len(cat5)): category = 'id='+ cat5[i] ### defined above on phase 1 #print category new_url = re.sub('id=jarsauce',category,url3) urllist.append(new_url) print urllist print len(urllist), len(urllist)/10
Phase 6: finalization of the code to scrap the entire inventory of grocery items:
%%time ####### Complete download for Fresh Direct based on single address amended 400+ times ############## listitemName = [] listitemFullName = [] listitemPrice = [] listitemDescription = [] listitemCategory = [] for l in range(0, len(urllist)): if l in[10,50,100,200,300,400,500]: print l else: False textkeyfood2 = extract_text_keyfood(urllist[l], headers2) content00 = textkeyfood2.find('div', class_="browse-sections transactional") #print 'len c00:', len(content00), content00 if content00!=None: content001 = content00.find_all('li') if content001!=None: #print 'len c001:', len(content001), #content001 for x in content001: Fullname = x.find('a', class_="portrait-item-header-name") Price = x.find('div', class_="portrait-item-price") Descr = x.find('div', class_="configDescr") if Fullname==None: listitemFullName.append('NA') listitemCategory.append('NA') else: listitemFullName.append(Fullname.get_text()) listitemCategory.append(cat5[l]) if Price==None: listitemPrice.append('NA') else: listitemPrice.append(Price.get_text()) if Descr==None: listitemDescription.append('NA') else: listitemDescription.append(Descr.get_text()) else: False else: False print listitemName print listitemFullName print listitemPrice print listitemDescription print listitemCategory
Note: the code runs for about 15 min and generates all the grocery items description, prices and additional information
Phase 7: the data is transferred into a data frame and saved on the disk:
## creation of new dataframe of lists of caracteristiques created above import pandas as pd import numpy as np fdirect_data = pd.DataFrame({'name': listitemFullName, 'desc': listitemDescription, 'price': listitemPrice, 'category': listitemCategory }) fdirect_cat_data = pd.DataFrame({'item_name': cat4, 'category': catname4})
#listeCtownCat = pd.DataFrame({'name': listlinksCategories2}) fdirect_data.replace(u'’', u"'") fdirect_data.to_csv('./fdirect_data3.csv',sep=',',encoding='utf-8')
#listeCtownCat = pd.DataFrame({'name': listlinksCategories2}) fdirect_cat_data.replace(u'’', u"'") fdirect_cat_data.to_csv('./fdirectcatdata3.csv',sep=',',encoding='utf-8')
All the output files and the code for the 3 web scraping including the output file for the freshdirect web scraping can be found here.
Given the time frame to deliver the project and my background, I, unfortunately, used Excel to clean and match the data (heavy Vlookup combined with Paste value) that makes the cleaning and matching not completely reproducible.
The Cleaning consisted in homogenize the item names ,cleaning the prices in dollars versus the ones in cents.
The preliminary analysis consisted in finding exact matching between the three data set obtained by web scraping.
fig. 1 code:
data = pd.DataFrame({'Supermarkets' : ['Ctown','Keyfood','Fresh Direct'],'references' : [308,411,8370]}) plt.rcParams['figure.figsize'] = 12,9 data.plot('Supermarkets','references',color=['y', 'r', 'g'],kind='bar') plt.title('Numbers of product references n obtained by webscraping n') plt.text(-.02,500,'308') plt.text(.98,600,'411') plt.text(1.98,8600,'8,370')
fig.1:
fig. 2 code:
df = pd.concat([ctown.priceFinalCT,keyfood.PriceKFF,fdirect.priceFDt1], axis=1) df.columns=[ 'C Town (Weekly Circular)', 'Key Food (Weekly Circular)','Fresh Direct (Online)'] moyenne1 = df.mean() fdirectcatgroup = fdirect.groupby('categoryFDS') keyfoodcatgroup = keyfood.groupby('CategoryKF') ctowncatgroup = ctown.groupby('categoriesCT') plt.rcParams['figure.figsize'] = 12, 8 plt.subplot(221) df.boxplot(vert=False) plt.xlim([-2,25]) plt.xlabel('Price ($)') plt.title('Grocery Supermarket Prices n (week of June 23 2015)n') plt.scatter(moyenne1,[1,2,3], marker='+') plt.subplot(222) fdirectcatgroup.count().nameFD.plot(kind='bar') plt.xlabel('') plt.ylabel('number of items') plt.title('Fresh Direct Categories') plt.subplot(223) ctowncatgroup.count().nameCT.plot(kind='bar') plt.xlabel('') plt.ylabel('number of items') plt.title('C Town Categories') plt.subplot(224) keyfoodcatgroup.count().nameKF.plot(kind='bar') plt.xlabel('') plt.ylabel('number of items') plt.title('Key Food Categories') plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=.5) left = 0.125 # the left side of the subplots of the figure right = 0.9 # the right side of the subplots of the figure bottom = 0.1 # the bottom of the subplots of the figure top = 0.9 # the top of the subplots of the figure wspace = 0.2 # the amount of width reserved for blank space between subplots hspace = 2 # the amount of height reserved for white space between subplots
fig. 2:
Above (fig. 3): the graph shows that the price range of the items on sale at Keyfood and CTown are similar (about $5) whereas the average of the regular prices at freshdirect are twice higher (about $10). Looking at the fresh direct prices data, we can observe that there are a lot of items from the catering department that push the average price up. Despite my expectations, it is really difficult to do an analysis by grocery categories given the different distribution for each brand.
fig. 4-5-6 code:
#### plt.rcParams['figure.figsize'] = 12,3 plt.subplot(131) toto2 = kffd.Category.groupby(kffd.Category).count() toto2.plot( kind='pie') plt.title('Match Categories n') plt.subplot(132) kffd.boxplot() plt.scatter([1,2],[kffdm1,kffdm2], marker='+') plt.ylim([0,8]) plt.title('Avg, Median prices by Supermarket n') ax = plt.subplot(133) plt.bar([1,2],[kffds1,kffds2],0.4,label=('Key Food','FreshDirect'),align='center') ax.set_xticks([1,2]) ax.set_xticklabels( ('Key Food','FreshDirect') ) plt.text(.90,kffds1-5,'$32.43', color='w') plt.text(1.90,kffds2-5,'$47.12', color='w') plt.tick_params( axis='x', # changes apply to the x-axis which='both', # both major and minor ticks are affected bottom='on', # ticks along the bottom edge are off top='off', # ticks along the top edge are off labelbottom='on') plt.title('Basket Price by Supermarket n') #help(plt.bar) #plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=.5) left = None # the left side of the subplots of the figure right = None # the right side of the subplots of the figure bottom = None # the bottom of the subplots of the figure top = None # the top of the subplots of the figure wspace = .5 # the amount of width reserved for blank space between subplots hspace = None # the amount of height reserved for white space between subplots plt.subplots_adjust(left, bottom, right, top, wspace, hspace)
fig. 4-5-6:
Calculation are based on a basket of 10 identical items identify at both Keyfood and freshdirect
Calculation are based on a basket of 18 identical items identify at both Keyfood and freshdirect
Calculation are based on a basket of 17 identical items identify at both Keyfood and freshdirect
- Keyfood and C-Town "on sale" prices are approximately the same in the Harlem neighborhood, which confirmed my preliminary observation regarding the similarity of prices between the two brands.
- freshdirect is 138% more expensive than C-Town on a basket of 18 common grocery items.
- freshdirect is 146% more expensive than Keyfood on a basket of 10 common grocery items.
Please note that items on sale from Keyfood and C-Town are compared to the same items at the regular price at freshdirect.
It appears that the prices differences between Keyfood or Ctown are 70% of the regular prices at freshdirect and that a basket of 10 identical items at Keyfood and Ctown is similar in terms of price.
However, it was not possible to compare the full prices of the items available in the Keyfood or Ctown as the two brands would not disclose the data. After a few visits at Keyfood and CTown, it also appeared that the average price of the same item on sale and at the full price may be equivalent to the regular price at freshdirect (example: the price observed on the Tropicana Orange Juice (see presentation)). This would need to be confirmed by more observations.
The data are also showing the two coincidences mentioned in the introduction:
1. There is a curious coordination in terms of items on sale between the different supermaket brands and some similarities in term of prices and period of sales: the period of sales are similar and the data are showing similar prices on similar items.
2. Some heavily discounted such as the Tropicana Orange Juice in the standard size (59oz) and all the organic milk brands are never on sale at the same time: it was the case during the week of the web scraping.
Finally, the data raised the question about the margins on the different products by the different brands, their advertisement practices and their inter-dependence in term of prices.
It is disappointing to find only 10 to 20 exact matches between the three brands but a better coding algorithm than the excel vlookup may provide more matching.
It may be interesting to study the prices by zip code for the different brands.
the fresh direct website contains all the information about nutrition, so it would be possible to analyse the grocery item prices versus their nutriments.