Chocolate Soup

In this notebook, I'll scrape data about chocolate bars from the web with Beautiful Soup, clean it with Pandas, and visualize/analyze it with Matplotlib and Pandas.

Import necessary modules.
In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# import re
Set up the soup.
In [2]:
url='https://s3.amazonaws.com/codecademy-content/courses/beautifulsoup/cacao/index.html'
webpage = requests.get(url)
soup = BeautifulSoup(webpage.content,'html.parser')
Create dictionary, fill it with scraped data.
In [3]:
chocolate_dict = {}

# Original approach of scraping column names for dataframe
# col_names = []
# for table_data in table_rows[0].find_all('td'):
#   col_name = re.sub(r'[ \n\xa0]+',' ',table_data.get_text()).rstrip()
#   chocolate_dict[col_name] = []
#   col_names.append(col_name)

# Declaring column names rather than scraping them is much simpler and gives more succinct column names.
col_names = ['Company','Specific Bean Origin','REF','Review Year','Percent Cocoa','Company Location','Rating','Bean Type','Broad Bean Origin']
for col_name in col_names:
  chocolate_dict[col_name] = []

table_rows = soup.select('#cacaoTable')[0].find_all('tr')
for table_row in table_rows[1:]:
  for i,table_data in enumerate(table_row.find_all('td')):
    chocolate_dict[col_names[i]].append(table_data.get_text())
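The row-to-column loop above can be tried offline on a small table. This is only a sketch: the HTML string and the `sample_*` names below are made up for illustration and are not taken from the real page. It pairs each cell with its column name using `zip` instead of indexing with `enumerate`, which is a stylistic choice rather than a required change.

```python
from bs4 import BeautifulSoup

# Made-up two-row sample standing in for the real page (assumption).
sample_html = """
<table id="cacaoTable">
  <tr><td>Company</td><td>Rating</td></tr>
  <tr><td>A. Morin</td><td>3.75</td></tr>
  <tr><td>Acalli</td><td>3.5</td></tr>
</table>
"""

sample_cols = ['Company', 'Rating']
sample_dict = {name: [] for name in sample_cols}

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = sample_soup.select('#cacaoTable')[0].find_all('tr')
for row in sample_rows[1:]:  # skip the header row
    # Pair each cell with the column it belongs to, in order.
    for name, cell in zip(sample_cols, row.find_all('td')):
        sample_dict[name].append(cell.get_text())

print(sample_dict)
```

Every value is still a string at this point, which is why the numeric columns need converting later.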
Read dictionary into DataFrame and view.
In [4]:
chocolate_df = pd.DataFrame.from_dict(chocolate_dict)
chocolate_df
Out[4]:
Company Specific Bean Origin REF Review Year Percent Cocoa Company Location Rating Bean Type Broad Bean Origin
0 A. Morin Agua Grande 1876 2016 63% France 3.75 Sao Tome
1 A. Morin Kpime 1676 2015 70% France 2.75 Togo
2 A. Morin Atsane 1676 2015 70% France 3 Togo
3 A. Morin Akata 1680 2015 70% France 3.5 Togo
4 A. Morin Quilla 1704 2015 70% France 3.5 Peru
5 A. Morin Carenero 1315 2014 70% France 2.75 Criollo Venezuela
6 A. Morin Cuba 1315 2014 70% France 3.5 Cuba
7 A. Morin Sur del Lago 1315 2014 70% France 3.5 Criollo Venezuela
8 A. Morin Puerto Cabello 1319 2014 70% France 3.75 Criollo Venezuela
9 A. Morin Pablino 1319 2014 70% France 4 Peru
10 A. Morin Panama 1011 2013 70% France 2.75 Panama
11 A. Morin Madagascar 1011 2013 70% France 3 Criollo Madagascar
12 A. Morin Brazil 1011 2013 70% France 3.25 Brazil
13 A. Morin Equateur 1011 2013 70% France 3.75 Ecuador
14 A. Morin Colombie 1015 2013 70% France 2.75 Colombia
15 A. Morin Birmanie 1015 2013 70% France 3 Burma
16 A. Morin Papua New Guinea 1015 2013 70% France 3.25 Papua New Guinea
17 A. Morin Chuao 1015 2013 70% France 4 Trinitario Venezuela
18 A. Morin Piura 1019 2013 70% France 3.25 Peru
19 A. Morin Chanchamayo Province 1019 2013 70% France 3.5 Peru
20 A. Morin Chanchamayo Province 1019 2013 63% France 4 Peru
21 A. Morin Bolivia 797 2012 70% France 3.5 Bolivia
22 A. Morin Peru 797 2012 63% France 3.75 Peru
23 Acalli Chulucanas, El Platanal 1462 2015 70% U.S.A. 3.75 Peru
24 Acalli Tumbes, Norandino 1470 2015 70% U.S.A. 3.75 Criollo Peru
25 Adi Vanua Levu 705 2011 60% Fiji 2.75 Trinitario Fiji
26 Adi Vanua Levu, Toto-A 705 2011 80% Fiji 3.25 Trinitario Fiji
27 Adi Vanua Levu 705 2011 88% Fiji 3.5 Trinitario Fiji
28 Adi Vanua Levu, Ami-Ami-CA 705 2011 72% Fiji 3.5 Trinitario Fiji
29 Aequare (Gianduja) Los Rios, Quevedo, Arriba 370 2009 55% Ecuador 2.75 Forastero (Arriba) Ecuador
... ... ... ... ... ... ... ... ... ...
1765 Zak's Belize, Batch 2 1578 2015 70% U.S.A. 3.5 Trinitario Belize
1766 Zak's House Blend, Batch 2 1582 2015 60% U.S.A. 3
1767 Zart Pralinen Millot P., Ambanja 1820 2016 70% Austria 3.5 Criollo, Trinitario Madagascar
1768 Zart Pralinen UNOCACE 1824 2016 70% Austria 2.75 Nacional (Arriba) Ecuador
1769 Zart Pralinen San Juan Estate 1824 2016 85% Austria 2.75 Trinitario Trinidad
1770 Zart Pralinen Kakao Kamili, Kilombero Valley 1824 2016 85% Austria 3 Criollo, Trinitario Tanzania
1771 Zart Pralinen Kakao Kamili, Kilombero Valley 1824 2016 70% Austria 3.5 Criollo, Trinitario Tanzania
1772 Zart Pralinen San Juan Estate, Gran Couva 1880 2016 78% Austria 3.5 Trinitario Trinidad
1773 Zokoko Guadalcanal 1716 2016 78% Australia 3.75 Solomon Islands
1774 Zokoko Goddess Blend 1780 2016 65% Australia 3.25
1775 Zokoko Alto Beni 697 2011 68% Australia 3.5 Bolivia
1776 Zokoko Tokiala 701 2011 66% Australia 3.5 Trinitario Papua New Guinea
1777 Zokoko Tranquilidad, Baures 701 2011 72% Australia 3.75 Bolivia
1778 Zotter Raw 1205 2014 80% Austria 2.75
1779 Zotter Bocas del Toro, Cocabo Co-op 801 2012 72% Austria 3.5 Panama
1780 Zotter Amazonas Frucht 801 2012 65% Austria 3.5
1781 Zotter Satipo Pangoa region, 16hr conche 875 2012 70% Austria 3 Criollo (Amarru) Peru
1782 Zotter Satipo Pangoa region, 20hr conche 875 2012 70% Austria 3.5 Criollo (Amarru) Peru
1783 Zotter Loma Los Pinos, Yacao region, D.R. 875 2012 62% Austria 3.75 Dominican Republic
1784 Zotter El Oro 879 2012 75% Austria 3 Forastero (Nacional) Ecuador
1785 Zotter Huiwani Coop 879 2012 75% Austria 3 Criollo, Trinitario Papua New Guinea
1786 Zotter El Ceibo Coop 879 2012 90% Austria 3.25 Bolivia
1787 Zotter Santo Domingo 879 2012 70% Austria 3.75 Dominican Republic
1788 Zotter Kongo, Highlands 883 2012 68% Austria 3.25 Forastero Congo
1789 Zotter Indianer, Raw 883 2012 58% Austria 3.5
1790 Zotter Peru 647 2011 70% Austria 3.75 Peru
1791 Zotter Congo 749 2011 65% Austria 3 Forastero Congo
1792 Zotter Kerala State 749 2011 65% Austria 3.5 Forastero India
1793 Zotter Kerala State 781 2011 62% Austria 3.25 India
1794 Zotter Brazil, Mitzi Blue 486 2010 65% Austria 3 Brazil

1795 rows × 9 columns

Data Cleaning

Convert each column to an appropriate type, split the parenthesized maker name out of the Company column into its own Maker column, and replace blank values with NaN. View the cleaned data.

In [5]:
chocolate_df['Review Year'] = chocolate_df['Review Year'].astype(int)
chocolate_df['Percent Cocoa'] = chocolate_df['Percent Cocoa'].str.replace('%','').astype(float)
chocolate_df['Maker'] = chocolate_df['Company'].str.extract(r'((?<=\().*(?=\)))',expand=False)
chocolate_df['Company'] = chocolate_df['Company'].str.extract(r'(.*(?= \(.*)|.*)',expand=False)
chocolate_df['Rating'] = chocolate_df['Rating'].astype(float)
chocolate_df['REF'] = chocolate_df['REF'].astype(int)
# np.NaN was removed in NumPy 2.0; np.nan works in every version.
chocolate_df.replace('\xa0',np.nan,inplace=True)
chocolate_df
Out[5]:
Company Specific Bean Origin REF Review Year Percent Cocoa Company Location Rating Bean Type Broad Bean Origin Maker
0 A. Morin Agua Grande 1876 2016 63.0 France 3.75 NaN Sao Tome NaN
1 A. Morin Kpime 1676 2015 70.0 France 2.75 NaN Togo NaN
2 A. Morin Atsane 1676 2015 70.0 France 3.00 NaN Togo NaN
3 A. Morin Akata 1680 2015 70.0 France 3.50 NaN Togo NaN
4 A. Morin Quilla 1704 2015 70.0 France 3.50 NaN Peru NaN
5 A. Morin Carenero 1315 2014 70.0 France 2.75 Criollo Venezuela NaN
6 A. Morin Cuba 1315 2014 70.0 France 3.50 NaN Cuba NaN
7 A. Morin Sur del Lago 1315 2014 70.0 France 3.50 Criollo Venezuela NaN
8 A. Morin Puerto Cabello 1319 2014 70.0 France 3.75 Criollo Venezuela NaN
9 A. Morin Pablino 1319 2014 70.0 France 4.00 NaN Peru NaN
10 A. Morin Panama 1011 2013 70.0 France 2.75 NaN Panama NaN
11 A. Morin Madagascar 1011 2013 70.0 France 3.00 Criollo Madagascar NaN
12 A. Morin Brazil 1011 2013 70.0 France 3.25 NaN Brazil NaN
13 A. Morin Equateur 1011 2013 70.0 France 3.75 NaN Ecuador NaN
14 A. Morin Colombie 1015 2013 70.0 France 2.75 NaN Colombia NaN
15 A. Morin Birmanie 1015 2013 70.0 France 3.00 NaN Burma NaN
16 A. Morin Papua New Guinea 1015 2013 70.0 France 3.25 NaN Papua New Guinea NaN
17 A. Morin Chuao 1015 2013 70.0 France 4.00 Trinitario Venezuela NaN
18 A. Morin Piura 1019 2013 70.0 France 3.25 NaN Peru NaN
19 A. Morin Chanchamayo Province 1019 2013 70.0 France 3.50 NaN Peru NaN
20 A. Morin Chanchamayo Province 1019 2013 63.0 France 4.00 NaN Peru NaN
21 A. Morin Bolivia 797 2012 70.0 France 3.50 NaN Bolivia NaN
22 A. Morin Peru 797 2012 63.0 France 3.75 NaN Peru NaN
23 Acalli Chulucanas, El Platanal 1462 2015 70.0 U.S.A. 3.75 NaN Peru NaN
24 Acalli Tumbes, Norandino 1470 2015 70.0 U.S.A. 3.75 Criollo Peru NaN
25 Adi Vanua Levu 705 2011 60.0 Fiji 2.75 Trinitario Fiji NaN
26 Adi Vanua Levu, Toto-A 705 2011 80.0 Fiji 3.25 Trinitario Fiji NaN
27 Adi Vanua Levu 705 2011 88.0 Fiji 3.50 Trinitario Fiji NaN
28 Adi Vanua Levu, Ami-Ami-CA 705 2011 72.0 Fiji 3.50 Trinitario Fiji NaN
29 Aequare Los Rios, Quevedo, Arriba 370 2009 55.0 Ecuador 2.75 Forastero (Arriba) Ecuador Gianduja
... ... ... ... ... ... ... ... ... ... ...
1765 Zak's Belize, Batch 2 1578 2015 70.0 U.S.A. 3.50 Trinitario Belize NaN
1766 Zak's House Blend, Batch 2 1582 2015 60.0 U.S.A. 3.00 NaN NaN NaN
1767 Zart Pralinen Millot P., Ambanja 1820 2016 70.0 Austria 3.50 Criollo, Trinitario Madagascar NaN
1768 Zart Pralinen UNOCACE 1824 2016 70.0 Austria 2.75 Nacional (Arriba) Ecuador NaN
1769 Zart Pralinen San Juan Estate 1824 2016 85.0 Austria 2.75 Trinitario Trinidad NaN
1770 Zart Pralinen Kakao Kamili, Kilombero Valley 1824 2016 85.0 Austria 3.00 Criollo, Trinitario Tanzania NaN
1771 Zart Pralinen Kakao Kamili, Kilombero Valley 1824 2016 70.0 Austria 3.50 Criollo, Trinitario Tanzania NaN
1772 Zart Pralinen San Juan Estate, Gran Couva 1880 2016 78.0 Austria 3.50 Trinitario Trinidad NaN
1773 Zokoko Guadalcanal 1716 2016 78.0 Australia 3.75 NaN Solomon Islands NaN
1774 Zokoko Goddess Blend 1780 2016 65.0 Australia 3.25 NaN NaN NaN
1775 Zokoko Alto Beni 697 2011 68.0 Australia 3.50 NaN Bolivia NaN
1776 Zokoko Tokiala 701 2011 66.0 Australia 3.50 Trinitario Papua New Guinea NaN
1777 Zokoko Tranquilidad, Baures 701 2011 72.0 Australia 3.75 NaN Bolivia NaN
1778 Zotter Raw 1205 2014 80.0 Austria 2.75 NaN NaN NaN
1779 Zotter Bocas del Toro, Cocabo Co-op 801 2012 72.0 Austria 3.50 NaN Panama NaN
1780 Zotter Amazonas Frucht 801 2012 65.0 Austria 3.50 NaN NaN NaN
1781 Zotter Satipo Pangoa region, 16hr conche 875 2012 70.0 Austria 3.00 Criollo (Amarru) Peru NaN
1782 Zotter Satipo Pangoa region, 20hr conche 875 2012 70.0 Austria 3.50 Criollo (Amarru) Peru NaN
1783 Zotter Loma Los Pinos, Yacao region, D.R. 875 2012 62.0 Austria 3.75 NaN Dominican Republic NaN
1784 Zotter El Oro 879 2012 75.0 Austria 3.00 Forastero (Nacional) Ecuador NaN
1785 Zotter Huiwani Coop 879 2012 75.0 Austria 3.00 Criollo, Trinitario Papua New Guinea NaN
1786 Zotter El Ceibo Coop 879 2012 90.0 Austria 3.25 NaN Bolivia NaN
1787 Zotter Santo Domingo 879 2012 70.0 Austria 3.75 NaN Dominican Republic NaN
1788 Zotter Kongo, Highlands 883 2012 68.0 Austria 3.25 Forastero Congo NaN
1789 Zotter Indianer, Raw 883 2012 58.0 Austria 3.50 NaN NaN NaN
1790 Zotter Peru 647 2011 70.0 Austria 3.75 NaN Peru NaN
1791 Zotter Congo 749 2011 65.0 Austria 3.00 Forastero Congo NaN
1792 Zotter Kerala State 749 2011 65.0 Austria 3.50 Forastero India NaN
1793 Zotter Kerala State 781 2011 62.0 Austria 3.25 NaN India NaN
1794 Zotter Brazil, Mitzi Blue 486 2010 65.0 Austria 3.00 NaN Brazil NaN

1795 rows × 10 columns
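The cleaning steps can be checked on a tiny hand-made frame. This is a sketch under assumptions: the two sample rows and the `sample_df` name are illustrative, and the regular expressions here are simpler alternatives to the lookaround patterns used in the cell above, not the notebook's own patterns. Note that Maker has to be extracted before Company is overwritten, because the parenthesized text is gone afterwards.

```python
import numpy as np
import pandas as pd

# Hand-made rows mimicking the scraped strings (assumption).
sample_df = pd.DataFrame({
    'Company': ['A. Morin', 'Aequare (Gianduja)'],
    'Percent Cocoa': ['63%', '55%'],
    'Bean Type': ['\xa0', 'Forastero (Arriba)'],
})

# Strip the percent sign literally, then convert to float.
sample_df['Percent Cocoa'] = (
    sample_df['Percent Cocoa'].str.replace('%', '', regex=False).astype(float)
)

# Capture the text inside parentheses; rows without parentheses become NaN.
sample_df['Maker'] = sample_df['Company'].str.extract(r'\((.*)\)', expand=False)

# Remove the trailing " (...)" so only the company name remains.
sample_df['Company'] = sample_df['Company'].str.replace(r' \(.*\)$', '', regex=True)

# Cells holding only a non-breaking space were blank on the page.
sample_df = sample_df.replace('\xa0', np.nan)

print(sample_df)
```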

Examine the distribution of ratings given.

In [6]:
plt.title('Distribution of ratings for all chocolate bars')
plt.xlabel('Rating')
plt.ylabel('Number of bars')
plt.hist(chocolate_df['Rating'], bins=20, range=(0,5))
plt.show()
In [7]:
chocolate_df[chocolate_df['Rating']>=4].Rating.count()
Out[7]:
100

It appears that over half of the chocolates are rated between 3 and 4, with only 100 out of 1795 (5.6%) rated at least 4.
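Shares like these can be computed directly instead of being read off the histogram. The sketch below uses a made-up `sample_ratings` series rather than the scraped data, and it assumes "between 3 and 4" means at least 3 and strictly below 4. The mean of a boolean mask is the fraction of rows that satisfy it.

```python
import pandas as pd

# Made-up ratings for illustration only (assumption).
sample_ratings = pd.Series([2.75, 3.0, 3.5, 3.75, 4.0, 3.25, 2.5, 3.5])

# Fraction of ratings in [3, 4) and fraction at 4 or above.
share_3_to_4 = ((sample_ratings >= 3) & (sample_ratings < 4)).mean()
share_at_least_4 = (sample_ratings >= 4).mean()

print(f'{share_3_to_4:.1%} rated 3 to under 4')
print(f'{share_at_least_4:.1%} rated at least 4')
```

On the real data, the same two expressions applied to `chocolate_df['Rating']` would give the corresponding shares.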

What are the highest rated chocolate companies?

In [8]:
average_rating_by_company = chocolate_df[['Company','Rating']].groupby('Company',as_index=False).Rating.mean()
average_rating_by_company = average_rating_by_company.rename(columns={'Rating':'Average Rating'}).sort_values('Average Rating',ascending=False).reset_index(drop=True)
average_rating_by_company[0:10]
Out[8]:
Company Average Rating
0 Tobago Estate 4.000000
1 Ocelot 3.875000
2 Amedei 3.846154
3 Matale 3.812500
4 Patric 3.791667
5 Idilio 3.775000
6 Un Dimanche A Paris 3.750000
7 Obolo 3.750000
8 Dole 3.750000
9 Chocola'te 3.750000

It appears that Tobago Estate chocolates are the best rated by a clear margin: their average rating is 0.125 higher than the closest competitor's, while every other average in the top ten is within 0.035 of a neighboring company's.
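An average says little without the number of bars behind it, so it can help to report both. The sketch below is one way to do that with named aggregation. The companies, ratings, and the two-bar minimum are all made up for illustration and do not reflect the scraped data.

```python
import pandas as pd

# Made-up reviews: company X has a single bar, Y and Z have several (assumption).
sample_reviews = pd.DataFrame({
    'Company': ['X', 'Y', 'Y', 'Y', 'Z', 'Z'],
    'Rating': [4.0, 3.75, 4.0, 3.5, 3.0, 3.5],
})

# Mean rating and number of reviewed bars per company, best first.
summary = (
    sample_reviews.groupby('Company')['Rating']
    .agg(average_rating='mean', bars_reviewed='count')
    .sort_values('average_rating', ascending=False)
)

# Keep only companies with at least two reviewed bars (threshold is arbitrary).
filtered = summary[summary['bars_reviewed'] >= 2]

print(summary)
print(filtered)
```

In this toy sample the top-ranked company has only one bar, so it drops out once the minimum is applied.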

Is the percentage of cocoa correlated with the rating?

In [9]:
plt.title('Percent Cocoa vs. Rating')
plt.xlabel('Percentage of cocoa in chocolate')
plt.ylabel('Rating')
plt.scatter(chocolate_df['Percent Cocoa'],chocolate_df['Rating'])
plt.show()

Judging by the scatter plot, there appears to be no significant correlation between a chocolate's rating and its percentage of cocoa.
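A visual impression like this can be backed by a number. A minimal sketch, assuming the Pearson coefficient is an acceptable measure here: `Series.corr` returns a value between -1 and 1, with values near 0 indicating little linear relationship. The sample values and the `sample_*` names are made up and are not the scraped data.

```python
import numpy as np
import pandas as pd

# Made-up cocoa percentages and ratings for illustration (assumption).
sample_cocoa = pd.Series([60.0, 70.0, 70.0, 75.0, 85.0, 100.0])
sample_rating = pd.Series([3.0, 3.5, 3.25, 3.5, 3.0, 2.0])

# Pearson correlation coefficient (the default method of Series.corr).
r = sample_cocoa.corr(sample_rating)

# Cross-check against NumPy's correlation matrix.
r_numpy = np.corrcoef(sample_cocoa, sample_rating)[0, 1]

print(f'Pearson r = {r:.3f}')
```

On the real data, the equivalent call would be `chocolate_df['Percent Cocoa'].corr(chocolate_df['Rating'])`.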