https://github.com/davidhintelmann/yelp_investigation

Investigating Yelp Dataset
Host: GitHub
URL: https://github.com/davidhintelmann/yelp_investigation
Owner: davidhintelmann
Created: 2020-10-01T04:38:28.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2020-10-11T03:49:41.000Z (over 5 years ago)
Last Synced: 2025-02-03T20:03:28.600Z (over 1 year ago)
Language: Jupyter Notebook
Size: 1.42 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Yelp Dataset

This dataset is a small portion of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the dataset you'll find information about businesses across 11 metropolitan areas in two countries.

More information can be found at [Yelp](https://www.yelp.com/dataset/documentation/main).  

Some dataset examples can be found on their [GitHub](https://github.com/Yelp/dataset-examples) page.

Currently, the metropolitan areas centered on Montreal, Calgary, Toronto, Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland, are included in the dataset.  

Summary for contents of each JSON file:  

**business.json**  

Contains business data, including location data, attributes, and categories.

**review.json**  

Contains full review text data including the user_id that wrote the review and the business_id the review is written for.

**user.json**  

User data including the user's friend mapping and all the metadata associated with the user.

**checkin.json**  

Checkins on a business.

**tip.json**  

Tips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.

**photo.json**  

Contains photo data including the caption and classification (one of "food", "drink", "menu", "inside" or "outside").

# MongoDB 

This notebook uses the Yelp Dataset from [Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset). The data is first downloaded as JSON files and then inserted into a MongoDB database on my local hard drive. This step can be skipped and the JSON files analysed directly using `pandas.read_json(JSON_file, lines=True, nrows=int)`. Since these files are large, it is recommended to use `nrows`.
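As a minimal sketch of reading only part of a large JSON Lines file, the example below writes a tiny stand-in file and reads it back with `nrows` and, alternatively, in chunks with `chunksize`. The records and the name `sample_path` are made up for illustration; the real Yelp files are much larger.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical stand-in records; the Yelp files also hold one JSON object per line
records = [{"business_id": f"b{i}", "stars": float(i % 5 + 1)} for i in range(10)]

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.json")
with open(sample_path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read only the first three lines of the file
head_df = pd.read_json(sample_path, lines=True, nrows=3)

# Alternatively, iterate over the file in chunks of four lines each
chunk_sizes = [len(chunk) for chunk in pd.read_json(sample_path, lines=True, chunksize=4)]

print(head_df)
print(chunk_sizes)
```

Reading in chunks keeps only one chunk in memory at a time, which may be preferable when the whole file is needed but does not fit in memory.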

This is a great chance to upload these files into MongoDB since it uses JSON-like documents.  

Below is a shortened version of the SQL to MongoDB mapping chart from the official [docs](https://docs.mongodb.com/manual/reference/sql-comparison/):

| SQL Terms/Concepts | MongoDB Terms/Concepts |
|:----------:|:-------------:|
| database | database |
| table | collection |
| row | document or BSON object |
| column | field |
| index | index |
| table joins | $lookup |

We start by importing libraries and then set up the MongoDB database below using the [pymongo](https://pymongo.readthedocs.io/en/stable/) driver.

# Import Libraries

```python

import json

import math

from pymongo import MongoClient, ASCENDING

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

from mpl_toolkits.basemap import Basemap

from collections import Counter

```

# JSON data

Download the data from [Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset) and read the JSON files into memory using the `json` library.

```python

data_path0 = 'yelp-data/yelp_academic_dataset_business.json'

data_path1 = 'yelp-data/yelp_academic_dataset_checkin.json'

data_path2 = 'yelp-data/yelp_academic_dataset_review.json'

data_path3 = 'yelp-data/yelp_academic_dataset_tip.json'

data_path4 = 'yelp-data/yelp_academic_dataset_user.json'

```

```python

data_paths = [data_path0, data_path1, data_path2, data_path3, data_path4]

json_data = [[],[],[],[],[]]

for i, file in enumerate(data_paths,0):

    with open(file) as f:

        for line in f:

            json_data[i].append(json.loads(line))

```
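Loading every file into a list keeps the whole dataset in memory at once. As a lighter-weight alternative, the sketch below reads a JSON Lines file lazily with a generator. `iter_json_lines` is a hypothetical helper name, and the temporary file holds stand-in data.

```python
import json
import os
import tempfile


def iter_json_lines(path):
    """Yield one parsed JSON object per non-empty line of the file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in data written to a temporary file
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.json")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('{"business_id": "a", "stars": 4.0}\n')
    f.write('{"business_id": "b", "stars": 2.5}\n')

docs = list(iter_json_lines(demo_path))
print(docs)
```

The generator can be passed to any consumer that accepts an iterable, so documents are parsed only as they are needed.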

```python

#pd.read_json(data_path0, lines=True, nrows=5)

```

# MongoDB

## Creating Collections and inserting documents

Create the `Yelp` database below, add a new collection called `business`, and use the `insert_many()` method to insert `json_data[0]` from above into this collection.

```python

client = MongoClient('mongodb://localhost:27017/')

db = client['Yelp']

business = db['business']

business.insert_many(json_data[0])

```

    

```python

business.find_one()

```

    {'_id': ObjectId('5f73c43926a5b6392883c222'),

     'business_id': 'f9NumwFMBDn751xgFiRbNA',

     'name': 'The Range At Lake Norman',

     'address': '10913 Bailey Rd',

     'city': 'Cornelius',

     'state': 'NC',

     'postal_code': '28031',

     'latitude': 35.4627242,

     'longitude': -80.8526119,

     'stars': 3.5,

     'review_count': 36,

     'is_open': 1,

     'attributes': {'BusinessAcceptsCreditCards': 'True',

      'BikeParking': 'True',

      'GoodForKids': 'False',

      'BusinessParking': "{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}",

      'ByAppointmentOnly': 'False',

      'RestaurantsPriceRange2': '3'},

     'categories': 'Active Life, Gun/Rifle Ranges, Guns & Ammo, Shopping',

     'hours': {'Monday': '10:0-18:0',

      'Tuesday': '11:0-20:0',

      'Wednesday': '10:0-18:0',

      'Thursday': '11:0-20:0',

      'Friday': '11:0-20:0',

      'Saturday': '11:0-20:0',

      'Sunday': '13:0-18:0'}}

---

Repeat the process to create collections for the other JSON data. In the end there will be 5 collections.

```python

checkin = db['checkin']

checkin.insert_many(json_data[1])

```

    

```python

checkin.find_one()

```

    {'_id': ObjectId('5f73c97226a5b6392886f413'),

     'business_id': '--1UhMGODdWsrMastO9DZw',

     'date': '2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02, 2019-03-19 22:04:48'}

---

The review JSON file is 6.33 GB in size, so a single `insert_many()` call is not a good approach because of memory limitations. Looping through the data and inserting it in batches works better.

```python

review = db['review']

```

```python

batches = 100 #number of batches

batch_length = math.ceil(len(json_data[2])/batches) #number of items in each batch

tmp_ = 0 #keep track for number of items in a batch

for i in range(batches):

    data = json_data[2][tmp_:tmp_+batch_length]

    review.insert_many(data)

    tmp_ += batch_length

review.find_one()

```

    {'_id': ObjectId('5f73c9a626a5b6392889a066'),

     'review_id': 'xQY8N_XvtGbearJ5X4QryQ',

     'user_id': 'OwjRMXRC0KyPrIlcjaXeFQ',

     'business_id': '-MhfebM0QIsKt87iDN-FNw',

     'stars': 2.0,

     'useful': 5,

     'funny': 0,

     'cool': 0,

     'text': 'As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\n\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It\'s what real estate agents would call "cozy" or "charming" - basically any euphemism for small.\n\nThat being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask? Let me tell you:\n\n* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\n* it\'s not kid friendly at all. Seriously, don\'t bring them.\n* the security is not trained properly for the show. When the curating and design teams collaborate for exhibitions, there is a definite flow. That means visitors should view the art in a certain sequence, whether it be by historical period or cultural significance (this is how audio guides are usually developed). When I arrived in the gallery I could not tell where to start, and security was certainly not helpful. I was told to "just look around" and "do whatever." \n\nAt such a *fine* institution, I find the lack of knowledge and respect for the art appalling.',

     'date': '2015-04-15 05:21:16'}
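The batching loop above can also be written as a small helper that cuts any iterable into fixed-size batches and never passes an empty list to `insert_many()`, which pymongo rejects. This is only a sketch: `insert_in_batches` and `FakeCollection` are hypothetical names, and the fake collection merely stands in for a real pymongo collection so the example runs without a database.

```python
from itertools import islice


def insert_in_batches(collection, documents, batch_size=1000):
    """Insert documents batch_size at a time and return how many were inserted."""
    iterator = iter(documents)
    inserted = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # nothing left; avoid calling insert_many with an empty list
            break
        collection.insert_many(batch)
        inserted += len(batch)
    return inserted


class FakeCollection:
    """Stand-in for a pymongo collection that records the size of each batch."""

    def __init__(self):
        self.batch_sizes = []

    def insert_many(self, docs):
        self.batch_sizes.append(len(docs))


fake = FakeCollection()
total = insert_in_batches(fake, ({"n": i} for i in range(10)), batch_size=4)
print(total, fake.batch_sizes)
```

With a real collection the call would look like `insert_in_batches(review, json_data[2], batch_size=80000)`, where the batch size is an illustrative value.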

---

The collection 'tip' below will be created and then the JSON data will be inserted into it.

```python

tip = db['tip']

tip.insert_many(json_data[3])

tip.find_one()

```

    {'_id': ObjectId('5f73de0226a5b639280404e8'),

     'user_id': 'hf27xTME3EiCp6NL6VtWZQ',

     'business_id': 'UYX5zL_Xj9WEc_Wp-FrqHw',

     'text': 'Here for a quick mtg',

     'date': '2013-11-26 18:20:08',

     'compliment_count': 0}

Like the review file above, the user JSON file is large (3.27 GB), so a single `insert_many()` call would take some time. The solution is the same: insert the data in batches.

```python

user = db['user'] #create collection

batches = 100

batch_length = math.ceil(len(json_data[4])/batches) #number of items in each batch

tmp_ = 0 #keep track for number of items in a batch

for i in range(batches):

    data = json_data[4][tmp_:tmp_+batch_length]

    user.insert_many(data)

    tmp_ += batch_length

user.find_one()

```

    {'_id': ObjectId('5f73e31226a5b63928182c21'),

     'user_id': 'ntlvfPzc8eglqvk92iDIAw',

     'name': 'Rafael',

     'review_count': 553,

     'yelping_since': '2007-07-06 03:27:11',

     'useful': 628,

     'funny': 225,

     'cool': 227,

     'elite': '',

     'friends': 'oeMvJh94PiGQnx_6GlndPQ, wm1z1PaJKvHgSDRKfwhfDg, IkRib6Xs91PPW7pon7VVig, A8Aq8f0-XvLBcyMk2GJdJQ, eEZM1kogR7eL4GOBZyPvBA, e1o1LN7ez5ckCpQeAab4iw, _HrJVzFaRFUhPva8cwBjpQ, pZeGZGzX-ROT_D5lam5uNg, 0S6EI51ej5J7dgYz3-O0lA, woDt8raW-AorxQM_tIE2eA, hWUnSE5gKXNe7bDc8uAG9A, c_3LDSO2RHwZ94_Q6j_O7w, -uv1wDiaplY6eXXS0VwQiA, QFjqxXn3acDC7hckFGUKMg, ErOqapICmHPTN8YobZIcfQ, mJLRvqLOKhqEdkgt9iEaCQ, VKX7jlScJSA-ja5hYRw12Q, ijIC9w5PRcj3dWVlanjZeg, CIZGlEw-Bp0rmkP8M6yQ9Q, OC6fT5WZ8EU7tEVJ3bzPBQ, UZSDGTDpycDzrlfUlyw2dQ, deL6e_z9xqZTIODKqnvRXQ, 5mG2ENw2PylIWElqHSMGqg, Uh5Kug2fvDd51RYmsNZkGg, 4dI4uoShugD9z84fYupelQ, EQpFHqGT9Tk6YSwORTtwpg, o4EGL2-ICGmRJzJ3GxB-vw, s8gK7sdVzJcYKcPv2dkZXw, vOYVZgb_GVe-kdtjQwSUHw, wBbjgHsrKr7BsPBrQwJf2w, p59u2EC_qcmCmLeX1jCi5Q, VSAZI1eHDrOPRWMK4Q2DIQ, efMfeI_dkhpeGykaRJqxfQ, x6qYcQ8_i0mMDzSLsFCbZg, K_zSmtNGw1fu-vmxyTVfCQ, 5IM6YPQCK-NABkXmHhlRGQ, U_w8ZMD26vnkeeS1sD7s4Q, AbfS_oXF8H6HJb5jFqhrLw, hbcjX4_D4KIfonNnwrH-cg, UKf66_MPz0zHCP70mF6p1g, hK2gYbxZRTqcqlSiQQcrtQ, 2Q45w_Twx_T9dXqlE16xtQ, BwRn8qcKSeA77HLaOTbfiQ, jouOn4VS_DtFPtMR2w8VDA, ESteyJabbfvqas6CEDs3pQ',

     'fans': 14,

     'average_stars': 3.57,

     'compliment_hot': 3,

     'compliment_more': 2,

     'compliment_profile': 1,

     'compliment_cute': 0,

     'compliment_list': 1,

     'compliment_note': 11,

     'compliment_plain': 15,

     'compliment_cool': 22,

     'compliment_funny': 22,

     'compliment_writer': 10,

     'compliment_photos': 0}

```python

db.command("dbstats")

```

    {'db': 'Yelp',

     'collections': 3,

     'views': 0,

     'objects': 8405702,

     'avgObjSize': 852.5322018315662,

     'dataSize': 7166131634.0,

     'storageSize': 4791033856.0,

     'numExtents': 0,

     'indexes': 3,

     'indexSize': 84541440.0,

     'scaleFactor': 1.0,

     'fsUsedSize': 460941709312.0,

     'fsTotalSize': 500068036608.0,

     'ok': 1.0}

```python

tmp = db.command( { 'collStats': 'business', 'scale': 1024000 } )

tmp['size'],tmp['count']

```

    (156, 209393)

```python

tmp = db.command( { 'collStats': 'checkin', 'scale': 1024000 } )

tmp['size'],tmp['count']

```

    (442, 175187)

```python

tmp = db.command( { 'collStats': 'review', 'scale': 1024000 } )

tmp['size'],tmp['count']

```

    (6398, 8021122)

```python

tmp = db.command( { 'collStats': 'tip', 'scale': 1024000 } )

tmp['size'],tmp['count']

```

    (289, 1320761)

```python

tmp = db.command( { 'collStats': 'user', 'scale': 1024000 } )

tmp['size'],tmp['count']

```

    (3272, 1968703)

## MongoDB Initialization

```python

client = MongoClient('mongodb://localhost:27017/')

db = client['Yelp']

business = db['business']

checkin = db['checkin']

review = db['review']

tip = db['tip']

user = db['user']

```

```python

pipeline = [

    {'$lookup':{'from' : 'tip',

                'localField' : 'business_id',

                'foreignField' : 'business_id',

                'as' : 'buis_tip'}},

    {'$replaceRoot':{'newRoot':{'$mergeObjects':[{'$arrayElemAt':["$buis_tip",0]}, "$$ROOT"]}}},

    {'$project': {'buis_tip':0, 'compliment_count':0, 'address':0, 'postal_code':0, 'latitude':0, 'longitude':0,'attributes':0, 'categories':0, 'hours':0}}

]

#x = business.aggregate(pipeline)

#df_agg = pd.DataFrame(list(x))

```
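For intuition, the pipeline above behaves roughly like a left join that attaches the first matching tip to each business. The sketch below shows a pandas analogue on made-up rows. It is an approximation for illustration, not the output of the MongoDB aggregation.

```python
import pandas as pd

# Made-up rows standing in for the business and tip collections
business_df = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "name": ["Cafe A", "Salon B", "Diner C"],
    "stars": [4.0, 3.5, 2.0],
})
tip_df = pd.DataFrame({
    "business_id": ["b1", "b1", "b3"],
    "user_id": ["u1", "u2", "u3"],
    "text": ["Great espresso", "Busy at noon", "Cash only"],
})

# Keep the first tip per business, mirroring {'$arrayElemAt': ['$buis_tip', 0]}
first_tip = tip_df.drop_duplicates(subset="business_id", keep="first")

# A left join keeps businesses that have no tips, as $lookup does
joined = business_df.merge(first_tip, on="business_id", how="left")
print(joined)
```

Businesses without a tip keep their own fields and get missing values for the tip columns.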

From now on we will read the JSON data directly from the files: they are large, and loading them into dataframes from the MongoDB database with `pd.DataFrame(list(x))` takes a long time.

```python

df = pd.read_json(data_path0, lines=True)

```

```python

df.head(2)

```



| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f9NumwFMBDn751xgFiRbNA | The Range At Lake Norman | 10913 Bailey Rd | Cornelius | NC | 28031 | 35.462724 | -80.852612 | 3.5 | 36 | 1 | {'BusinessAcceptsCreditCards': 'True', 'BikePa... | Active Life, Gun/Rifle Ranges, Guns & Ammo, Sh... | {'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'... |
| 1 | Yzvjg0SayhoZgCljUJRF9Q | Carlos Santo, NMD | 8880 E Via Linda, Ste 107 | Scottsdale | AZ | 85258 | 33.569404 | -111.890264 | 5.0 | 4 | 1 | {'GoodForKids': 'True', 'ByAppointmentOnly': '... | Health & Medical, Fitness & Instruction, Yoga,... | None |

    

  



# EDA & Visualizations

Let's take a look at where most of the businesses in this dataset are located by plotting them onto a globe.

```python

plt.figure(figsize=(20,10));

map = Basemap(projection='ortho',lat_0=25,lon_0=-100,resolution='l')

map.bluemarble()

# draw coastlines, country boundaries, fill continents.

map.drawcoastlines(linewidth=0.25)

map.drawcountries(linewidth=0.25)

map.fillcontinents(color='green',lake_color='blue')

# draw the edge of the map projection region (the projection limb)

map.drawmapboundary(fill_color='blue')

long_lat = map(df['longitude'].tolist(),df['latitude'].tolist())

map.scatter(long_lat[0], long_lat[1], s=3, c="orange", lw=3, alpha=1, zorder=5)

plt.title("World-wide Yelp Reviews")

plt.show();

```

    Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).

![png](img/output_42_1.png)

```python

print(df['longitude'].min())

print(df['longitude'].max())

```

    -158.0255252123

    -72.80655

The longitudes are all negative, ranging from about -158 to -72.8. Together with the map above, this shows that all the businesses in the Yelp dataset are in North America, in particular Canada and the United States.

Let's take a closer look at the Greater Toronto Area (GTA), which is the most populous metropolitan area in Canada.

## Greater Toronto Area (GTA)

```python
lon_min, lon_max = -80, -78.8
lat_min, lat_max = 43.2, 44.2

# Boolean mask selecting the businesses inside the GTA bounding box
TOR = ((df["longitude"] > lon_min) & (df["longitude"] < lon_max) &
       (df["latitude"] > lat_min) & (df["latitude"] < lat_max))

TOR_business = df[TOR].copy()

# The five most-reviewed businesses inside the box
TOR_business.sort_values('review_count', ascending=False).head()
```

| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 65694 | r_BrIgzYcwo1NAuG9dLbpg | Pai Northern Thai Kitchen | 18 Duncan Street | Toronto | ON | M5H 3G8 | 43.647866 | -79.388685 | 4.5 | 2758 | 1 | {'RestaurantsTableService': 'True', 'BikeParki... | Restaurants, Thai, Specialty Food, Food, Ethni... | {'Monday': '11:30-22:0', 'Tuesday': '11:30-22:... |
| 183740 | RtUvSWO_UZ8V3Wpj0n077w | KINKA IZAKAYA ORIGINAL | 398 Church St | Toronto | ON | M5B 2A2 | 43.660430 | -79.378927 | 4.0 | 1592 | 1 | {'RestaurantsAttire': 'u'casual'', 'BusinessPa... | Restaurants, Tapas/Small Plates, Japanese, Bar... | {'Monday': '17:0-0:0', 'Tuesday': '17:0-0:0', ... |
| 143305 | aLcFhMe6DDJ430zelCpd2A | Khao San Road | 11 Charlotte Street | Toronto | ON | M5V 2H5 | 43.646411 | -79.393480 | 4.0 | 1542 | 1 | {'WiFi': 'u'no'', 'RestaurantsTakeOut': 'True'... | Thai, Restaurants | {'Monday': '17:0-22:0', 'Tuesday': '17:0-22:0'... |
| 133959 | iGEvDk6hsizigmXhDKs2Vg | Seven Lives Tacos Y Mariscos | 69 Kensington Avenue | Toronto | ON | M5T 2K2 | 43.654341 | -79.400480 | 4.5 | 1285 | 1 | {'RestaurantsGoodForGroups': 'False', 'Alcohol... | Restaurants, Seafood, Mexican | {'Monday': '11:0-19:0', 'Tuesday': '11:0-19:0'... |
| 172038 | N93EYZy9R0sdlEvubu94ig | Banh Mi Boys | 392 Queen Street W | Toronto | ON | M5V 2A9 | 43.648827 | -79.396970 | 4.5 | 1097 | 1 | {'Alcohol': 'u'none'', 'BikeParking': 'True', ... | Sandwiches, Restaurants, Food, Vietnamese, Asi... | {'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'... |
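The bounding-box filter for the GTA can be wrapped in a small reusable helper, which makes it easy to repeat the analysis for another metropolitan area. The sketch below uses a hypothetical function name, `in_bounding_box`, and made-up coordinates.

```python
import pandas as pd


def in_bounding_box(frame, lon_range, lat_range):
    """Return a boolean mask for rows strictly inside the longitude/latitude box."""
    lon_min, lon_max = lon_range
    lat_min, lat_max = lat_range
    return (
        (frame["longitude"] > lon_min) & (frame["longitude"] < lon_max)
        & (frame["latitude"] > lat_min) & (frame["latitude"] < lat_max)
    )


# Made-up points: two inside the GTA box, one near Phoenix
points = pd.DataFrame({
    "name": ["Downtown Toronto", "Mississauga", "Scottsdale"],
    "latitude": [43.65, 43.59, 33.57],
    "longitude": [-79.38, -79.64, -111.89],
})

mask = in_bounding_box(points, lon_range=(-80, -78.8), lat_range=(43.2, 44.2))
gta_points = points[mask].copy()
print(gta_points["name"].tolist())
```

Taking a `.copy()` of the filtered rows avoids pandas warnings when columns of the subset are modified later.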

    

  

```python

TOR_business['review_count'].describe()

```

    count    36636.000000

    mean        24.111830

    std         54.647122

    min          3.000000

    25%          4.000000

    50%          8.000000

    75%         21.000000

    max       2758.000000

    Name: review_count, dtype: float64

```python

TOR_business['review_count'].hist();

```

![png](img/output_50_0.png)

The `review_count` column is the number of reviews a business has received. The histogram above shows that a few businesses have thousands of reviews, but 75% of the businesses in the GTA subset have 21 reviews or fewer.

The histogram below shows only businesses in Toronto with fewer than 100 reviews.

```python

t_ = TOR_business[TOR_business['review_count']<100]['review_count'].hist()

```

![png](img/output_52_0.png)

```python

fig, ax = plt.subplots(figsize=(20,10))

ax.spines['top'].set_visible(False)

ax.spines['right'].set_visible(False)

ax.tick_params(bottom=False, left=True)

ax.set_axisbelow(True)

ax.yaxis.grid(True)

ax.xaxis.grid(False)

plt.title('Top 10 Type of Businesses in Toronto')

plt.ylabel('Number of Business with category listed')

plt.xticks(np.arange(10), TOR_business['categories'].value_counts().head(10), rotation=45)

bars = ax.bar(

    x=np.arange(10),

    height=TOR_business['categories'].value_counts().head(10),

    color='teal',

    tick_label=TOR_business['categories'].value_counts().head(10).index

)

for bar in bars:

    plt.text(

      bar.get_x() + bar.get_width() / 2,

      bar.get_height() + 4,

      round(bar.get_height(), 1),

      horizontalalignment='center',

      color='teal',

      weight='bold',

      size=15

    )

```

![png](img/output_53_0.png)

The figure above shows that the top ten categories are effectively counted twice, because the same categories appear in a different order. For example, the most common value is 'Coffee & Tea, Food' and the second most common is 'Food, Coffee & Tea'. This needs to be addressed by combining the category strings that list the same categories in a different order.

```python

TOR_business['categories'].value_counts().head(10)

```

    Coffee & Tea, Food            306

    Food, Coffee & Tea            303

    Restaurants, Chinese          300

    Chinese, Restaurants          280

    Hair Salons, Beauty & Spas    244

    Beauty & Spas, Hair Salons    243

    Pizza, Restaurants            203

    Restaurants, Pizza            199

    Nail Salons, Beauty & Spas    173

    Grocery, Food                 169

    Name: categories, dtype: int64

```python

len(TOR_business['categories'].unique())

```

    19661

```python

TOR_business.loc[TOR_business['categories'] == 'Food, Coffee & Tea', 'categories'] = 'Coffee & Tea, Food'

TOR_business.loc[TOR_business['categories'] == 'Chinese, Restaurants', 'categories'] = 'Restaurants, Chinese'

TOR_business.loc[TOR_business['categories'] == 'Beauty & Spas, Hair Salons', 'categories'] = 'Hair Salons, Beauty & Spas'

TOR_business.loc[TOR_business['categories'] == 'Restaurants, Pizza', 'categories'] = 'Pizza, Restaurants'

TOR_business.loc[TOR_business['categories'] == 'Beauty & Spas, Nail Salons', 'categories'] = 'Nail Salons, Beauty & Spas'

TOR_business.loc[TOR_business['categories'] == 'Restaurants, Italian', 'categories'] = 'Italian, Restaurants'

TOR_business.loc[TOR_business['categories'] == 'Food, Grocery', 'categories'] = 'Grocery, Food'

TOR_business.loc[TOR_business['categories'] == 'Food, Bakeries', 'categories'] = 'Bakeries, Food'

TOR_business.loc[TOR_business['categories'] == 'Indian, Restaurants', 'categories'] = 'Restaurants, Indian'

TOR_business.loc[TOR_business['categories'] == 'Restaurants, Vietnamese', 'categories'] = 'Vietnamese, Restaurants'

TOR_business.loc[TOR_business['categories'] == 'Restaurants, Japanese', 'categories'] = 'Japanese, Restaurants'

TOR_business.loc[TOR_business['categories'] == 'Thai, Restaurants', 'categories'] = 'Restaurants, Thai'

```

```python

TOR_business['categories'].value_counts().head(10)

```

    Coffee & Tea, Food            609

    Restaurants, Chinese          580

    Hair Salons, Beauty & Spas    487

    Pizza, Restaurants            402

    Nail Salons, Beauty & Spas    340

    Grocery, Food                 323

    Italian, Restaurants          282

    Bakeries, Food                256

    Restaurants, Indian           239

    Japanese, Restaurants         212

    Name: categories, dtype: int64
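The replacements above handle only the listed pairs. A more general sketch, assuming categories are always separated by commas, sorts the categories within each string so that every ordering maps to one canonical form. `normalize_categories` is a hypothetical helper and the sample values are made up. The canonical (alphabetical) order also differs from some of the labels chosen manually above.

```python
import pandas as pd


def normalize_categories(value):
    """Sort the comma-separated categories so that their order no longer matters."""
    if value is None or pd.isna(value):
        return value  # leave missing values untouched
    return ", ".join(sorted(part.strip() for part in value.split(",")))


sample = pd.Series([
    "Coffee & Tea, Food",
    "Food, Coffee & Tea",
    "Chinese, Restaurants",
    "Restaurants, Chinese",
    None,
])

normalized = sample.map(normalize_categories)
counts = normalized.value_counts()
print(counts)
```

Applied to the real data, this would be `TOR_business['categories'].map(normalize_categories)`.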

---

Now we can plot the top 10 categories of businesses in Toronto:

```python

fig, ax = plt.subplots(figsize=(20,10))

ax.spines['top'].set_visible(False)

ax.spines['right'].set_visible(False)

ax.tick_params(bottom=False, left=True)

ax.set_axisbelow(True)

ax.yaxis.grid(True)

ax.xaxis.grid(False)

plt.title('Top 10 Type of Businesses in Toronto')

plt.ylabel('Number of Business with category listed')

plt.xticks(np.arange(10), TOR_business['categories'].value_counts().head(10), rotation=45)

bars = ax.bar(

    x=np.arange(10),

    height=TOR_business['categories'].value_counts().head(10),

    color='teal',

    tick_label=TOR_business['categories'].value_counts().head(10).index

)

for bar in bars:

    plt.text(

      bar.get_x() + bar.get_width() / 2,

      bar.get_height() + 4,

      round(bar.get_height(), 1),

      horizontalalignment='center',

      color='teal',

      weight='bold',

      size=15

    )

```

![png](img/output_61_0.png)

Below is a table of the average stars (1-5) for the top ten business categories above.

```python
top_ten_list = TOR_business['categories'].value_counts().head(10).index.tolist()

# Select the numeric 'stars' column before aggregating so non-numeric columns are not averaged
TOR_business.groupby('categories')['stars'].agg(['mean']).loc[top_ten_list].sort_values(by='mean', ascending=False)
```



| categories | mean |
|:---|---:|
| Bakeries, Food | 3.705078 |
| Italian, Restaurants | 3.489362 |
| Hair Salons, Beauty & Spas | 3.434292 |
| Coffee & Tea, Food | 3.428571 |
| Restaurants, Indian | 3.424686 |
| Japanese, Restaurants | 3.318396 |
| Restaurants, Chinese | 3.192241 |
| Pizza, Restaurants | 3.136816 |
| Grocery, Food | 3.082043 |
| Nail Salons, Beauty & Spas | 2.867647 |

    

  



Another approach to the problem above is to split the string in each row of the `categories` column on `, ` (comma and space) and use `Counter()` from the `collections` library to increase the count for each individual category encountered.

```python

Tor_catlist = TOR_business['categories'].tolist()

```

```python

c = Counter()

for n in Tor_catlist:

    if n is not None:

        cat_list = n.split(', ')

        for cat in cat_list:

            c[cat] += 1

    else:

        c['N/A'] += 1

```

```python

c.most_common(10)

```

    [('Restaurants', 16227),

     ('Food', 7979),

     ('Shopping', 5596),

     ('Beauty & Spas', 3610),

     ('Nightlife', 2743),

     ('Coffee & Tea', 2456),

     ('Bars', 2452),

     ('Health & Medical', 1891),

     ('Chinese', 1817),

     ('Event Planning & Services', 1795)]
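The same per-category counts can be obtained within pandas. The sketch below, on made-up category strings, splits each string, expands the pieces into rows with `explode()`, and counts them. Unlike the loop above, it drops missing values instead of counting them under 'N/A'.

```python
import pandas as pd

# Made-up category strings, including one missing value
categories = pd.Series([
    "Restaurants, Thai",
    "Thai, Restaurants",
    "Food, Coffee & Tea",
    None,
])

token_counts = (
    categories.dropna()   # ignore businesses without categories
    .str.split(", ")      # each string becomes a list of categories
    .explode()            # one row per individual category
    .value_counts()
)
print(token_counts)
```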

```python

TORcat = pd.DataFrame.from_dict(c, orient='index')

```

```python

TORcat.sort_values(by=0,ascending=False).head(10)

```



| | 0 |
|:---|---:|
| Restaurants | 16227 |
| Food | 7979 |
| Shopping | 5596 |
| Beauty & Spas | 3610 |
| Nightlife | 2743 |
| Coffee & Tea | 2456 |
| Bars | 2452 |
| Health & Medical | 1891 |
| Chinese | 1817 |
| Event Planning & Services | 1795 |

    

  



```python

tmp_ = TORcat.sort_values(by=0,ascending=False).head(10)

fig, ax = plt.subplots(figsize=(20,10))

ax.spines['top'].set_visible(False)

ax.spines['right'].set_visible(False)

ax.tick_params(bottom=False, left=True)

ax.set_axisbelow(True)

ax.yaxis.grid(True)

ax.xaxis.grid(False)

plt.title('Top 10 Type of Businesses in Toronto')

plt.ylabel('Number of Business with category listed')

plt.xticks(np.arange(10), tmp_.value_counts(), rotation=45)

bars = ax.bar(

    x=np.arange(10),

    height=tmp_[0].values,

    color='teal',

    tick_label=tmp_.index

)

for bar in bars:

    plt.text(

      bar.get_x() + bar.get_width() / 2,

      bar.get_height() + 100,

      round(bar.get_height(), 1),

      horizontalalignment='center',

      color='teal',

      weight='bold',

      size=15

    )

```

![png](img/output_70_0.png)

## More Top 10 Business Categories Analysis for GTA

The average ratings of the ten most common business categories, ranked by the number of businesses in each category, are shown above in descending order.

```python

plt.figure(figsize=(12,6))

TOR_business['stars'].hist(bins=17)

plt.title('Histogram of 1-5 Star Reviews on Yelp within Toronto')

plt.xlabel('Stars')

plt.ylabel('Number of reviews');

```

![png](img/output_73_0.png)

```python
top_ten_list = TOR_business['categories'].value_counts().head(10).index.tolist()

# Select the numeric 'stars' column before aggregating so non-numeric columns are not averaged
TOR_business.groupby('categories')['stars'].agg(['mean']).loc[top_ten_list].sort_values(by='mean', ascending=False)
```



| categories | mean |
|:---|---:|
| Bakeries, Food | 3.705078 |
| Italian, Restaurants | 3.489362 |
| Hair Salons, Beauty & Spas | 3.434292 |
| Coffee & Tea, Food | 3.428571 |
| Restaurants, Indian | 3.424686 |
| Japanese, Restaurants | 3.318396 |
| Restaurants, Chinese | 3.192241 |
| Pizza, Restaurants | 3.136816 |
| Grocery, Food | 3.082043 |
| Nail Salons, Beauty & Spas | 2.867647 |

    

  



```python

plt.figure(figsize=(12,6))

TOR_business[TOR_business['categories'] == 'Bakeries, Food']['stars'].plot.kde();

plt.title('KDE of Reviews for Bakeries within Toronto')

plt.xlabel('Number of Stars')

plt.ylabel('Number of reviews');

```

![png](img/output_75_0.png)

```python

TOR_business[TOR_business['categories'] == 'Bakeries, Food'][['stars']].describe()

```



| | stars |
|:---|---:|
| count | 256.000000 |
| mean | 3.705078 |
| std | 0.779601 |
| min | 1.000000 |
| 25% | 3.000000 |
| 50% | 4.000000 |
| 75% | 4.000000 |
| max | 5.000000 |

    

  



```python

plt.figure(figsize=(12,6))

TOR_business[TOR_business['categories'] == 'Italian, Restaurants']['stars'].plot.kde();

plt.title('KDE of Reviews for Italian Restaurants within Toronto')

plt.xlabel('Number of Stars')

plt.ylabel('Number of reviews');

```

![png](img/output_77_0.png)

```python

TOR_business[TOR_business['categories'] == 'Italian, Restaurants'][['stars']].describe()

```



| | stars |
|:---|---:|
| count | 282.000000 |
| mean | 3.489362 |
| std | 0.678779 |
| min | 1.500000 |
| 25% | 3.000000 |
| 50% | 3.500000 |
| 75% | 4.000000 |
| max | 5.000000 |

    

  



```python

plt.figure(figsize=(12,6))

TOR_business[TOR_business['categories'] == 'Nail Salons, Beauty & Spas']['stars'].plot.kde();

plt.title('KDE of Reviews for Nail Salons, Beauty & Spas within Toronto')

plt.xlabel('Number of Stars')

plt.ylabel('Number of reviews');

```

![png](img/output_79_0.png)

```python

TOR_business[TOR_business['categories'] == 'Nail Salons, Beauty & Spas'][['stars']].describe()

```



| | stars |
|:---|---:|
| count | 340.000000 |
| mean | 2.867647 |
| std | 0.916972 |
| min | 1.000000 |
| 25% | 2.000000 |
| 50% | 3.000000 |
| 75% | 3.500000 |
| max | 5.000000 |

    

  



## Review.json

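```python
df_ = pd.read_json(data_path2, lines=True)

# TOR is a boolean mask aligned to the business dataframe, so it cannot index the reviews;
# keep the reviews whose business_id belongs to a GTA business instead
TOR_review = df_[df_['business_id'].isin(TOR_business['business_id'])].copy()
```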

```python

TOR_review.head(2)

```
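Because the review file is several gigabytes, reading it in a single call may not fit in memory. As a sketch, the file can be read in chunks and filtered down to the businesses of interest before concatenating. `filter_reviews` is a hypothetical helper and the temporary file holds made-up reviews.

```python
import json
import os
import tempfile

import pandas as pd


def filter_reviews(path, business_ids, chunksize=100_000):
    """Read a JSON Lines review file in chunks, keeping rows for the given businesses."""
    wanted = set(business_ids)
    kept = []
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        kept.append(chunk[chunk["business_id"].isin(wanted)])
    return pd.concat(kept, ignore_index=True)


# Made-up reviews written to a temporary JSON Lines file
reviews = [
    {"review_id": "r1", "business_id": "b1", "stars": 5},
    {"review_id": "r2", "business_id": "b2", "stars": 1},
    {"review_id": "r3", "business_id": "b1", "stars": 4},
    {"review_id": "r4", "business_id": "b3", "stars": 3},
]
tmp_dir = tempfile.mkdtemp()
reviews_path = os.path.join(tmp_dir, "reviews.json")
with open(reviews_path, "w") as f:
    for review in reviews:
        f.write(json.dumps(review) + "\n")

toronto_reviews = filter_reviews(reviews_path, ["b1", "b3"], chunksize=2)
print(toronto_reviews)
```

With the data used in this notebook the call could look like `filter_reviews(data_path2, TOR_business['business_id'])`.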
