In this notebook, I'll scrape data about chocolate bars from the web with Beautiful Soup, clean it with Pandas, and visualize/analyze it with Matplotlib and Pandas.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# import re
url='https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html'
webpage = requests.get(url)
soup = BeautifulSoup(webpage.content,'html.parser')
chocolate_dict = {}
# Original approach of scraping column names for dataframe
# col_names = []
# for table_data in table_rows[0].find_all('td'):
#     col_name = re.sub(r'[ \n\xa0]+',' ',table_data.get_text()).rstrip()
#     chocolate_dict[col_name] = []
#     col_names.append(col_name)
# Declaring the column names rather than scraping them is much simpler and gives more succinct column names.
col_names = ['Company','Specific Bean Origin','REF','Review Year','Percent Cocoa','Company Location','Rating','Bean Type','Broad Bean Origin']
for col_name in col_names:
    chocolate_dict[col_name] = []
table_rows = soup.select('#cacaoTable')[0].find_all('tr')
for table_row in table_rows[1:]:
    for i,table_data in enumerate(table_row.find_all('td')):
        chocolate_dict[col_names[i]].append(table_data.get_text())
chocolate_df = pd.DataFrame.from_dict(chocolate_dict)
chocolate_df
Convert each column to an appropriate type, split the combined Company field into separate Company and Maker columns, and replace blank values with NaN. Then view the cleaned data.
chocolate_df['Review Year'] = chocolate_df['Review Year'].astype(int)
chocolate_df['Percent Cocoa'] = chocolate_df['Percent Cocoa'].str.replace('%','').astype(float)
chocolate_df['Maker'] = chocolate_df['Company'].str.extract(r'((?<=\().*(?=\)))',expand=False)
chocolate_df['Company'] = chocolate_df['Company'].str.extract(r'(.*(?= \(.*)|.*)',expand=False)
chocolate_df['Rating'] = chocolate_df['Rating'].astype(float)
chocolate_df['REF'] = chocolate_df['REF'].astype(int)
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
chocolate_df.replace('\xa0',np.nan,inplace=True)
chocolate_df
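To make the two `str.extract` patterns above easier to follow, here is a minimal sketch that applies the same regular expressions with Python's `re` module. The helper `split_company` and the sample strings are made up for illustration and are not taken from the scraped table. I'm assuming `re.search` is a fair stand-in for `str.extract` here, since `str.extract` takes its groups from the first match in each string.

```python
import re

# Same patterns as in the cleaning cell above
MAKER_PATTERN = re.compile(r'((?<=\().*(?=\)))')    # text between "(" and ")"
COMPANY_PATTERN = re.compile(r'(.*(?= \(.*)|.*)')   # text before " (", else the whole string


def split_company(raw):
    """Hypothetical helper: return (company, maker) for one raw Company string."""
    maker_match = MAKER_PATTERN.search(raw)
    company_match = COMPANY_PATTERN.search(raw)
    # No parentheses means no maker; pandas would give NaN here
    maker = maker_match.group(1) if maker_match else None
    return company_match.group(1), maker


# Made-up sample strings, not rows from the scraped data
samples = ['Acme Estate (Acme Maker)', 'Plain Co']
for sample in samples:
    print(split_company(sample))
```

The second alternative in the Company pattern (`|.*`) is what keeps names without parentheses intact instead of turning them into missing values.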
plt.title('Distribution of ratings for all chocolate bars')
plt.hist(chocolate_df['Rating'], bins=20, range=(0,5))
plt.show()
chocolate_df[chocolate_df['Rating']>=4].Rating.count()
It appears that over half of the chocolates are rated between 3 and 4, and only 100 of the 1795 bars (5.6%) are rated 4 or higher.
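The shares quoted above can be computed directly rather than read off the histogram. This is a minimal sketch on a made-up `toy_ratings` Series, with a hypothetical helper `share_at_least`. It relies on the mean of a boolean mask being the fraction of `True` values.

```python
import pandas as pd


def share_at_least(ratings, threshold):
    """Hypothetical helper: fraction of ratings that are >= threshold."""
    return (ratings >= threshold).mean()


# Made-up ratings for illustration only, not the scraped data
toy_ratings = pd.Series([2.5, 3.0, 3.5, 4.0])

share_high = share_at_least(toy_ratings, 4)
# Series.between is inclusive on both ends by default
share_mid = toy_ratings.between(3, 4).mean()

print(share_high, share_mid)

# The percentage quoted in the text: 100 bars out of 1795
print(f"{100 / 1795:.1%}")
```

In the notebook itself, the same calls would take `chocolate_df['Rating']` in place of `toy_ratings`.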
average_rating_by_company = chocolate_df[['Company','Rating']].groupby('Company',as_index=False).Rating.mean()
average_rating_by_company = average_rating_by_company.rename(columns={'Rating':'Average Rating'}).sort_values('Average Rating',ascending=False).reset_index(drop=True)
average_rating_by_company[0:10]
It appears that Tobago chocolates are the best rated by far, with an average rating 0.125 higher than the closest competitor's, while every other average rating listed is within 0.035 of a competitor's.
plt.title('Percent Cocoa vs. Rating')
plt.xlabel('Percentage of cocoa in chocolate')
plt.ylabel('Rating')
plt.scatter(chocolate_df['Percent Cocoa'],chocolate_df['Rating'])
plt.show()
The scatter plot shows no obvious relationship between a chocolate's rating and its percentage of cocoa.
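A scatter plot only gives a visual impression, so one way to put a number on it is the Pearson correlation coefficient. The sketch below uses a made-up `toy_df` whose values were chosen so that the two columns are uncorrelated. It is not the scraped data and says nothing about the real result.

```python
import pandas as pd

# Made-up values for illustration only; constructed to be uncorrelated
toy_df = pd.DataFrame({
    'Percent Cocoa': [60.0, 70.0, 80.0, 90.0],
    'Rating': [3.0, 3.5, 3.5, 3.0],
})

# Series.corr computes the Pearson correlation by default
r = toy_df['Percent Cocoa'].corr(toy_df['Rating'])
print(round(r, 3))
```

Running `chocolate_df['Percent Cocoa'].corr(chocolate_df['Rating'])` on the cleaned data would give the corresponding figure for the real ratings. A value near 0 would support the reading of the plot, though Pearson correlation only measures linear relationships.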