{"id":22647749,"url":"https://github.com/davidhintelmann/yelp_investigation","last_synced_at":"2025-03-29T06:48:18.817Z","repository":{"id":156030365,"uuid":"300144546","full_name":"davidhintelmann/Yelp_Investigation","owner":"davidhintelmann","description":"Investigating Yelp Dataset","archived":false,"fork":false,"pushed_at":"2020-10-11T03:49:41.000Z","size":1489,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T20:03:28.600Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidhintelmann.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-01T04:38:28.000Z","updated_at":"2020-10-11T03:49:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"975eefe5-9498-452e-9c01-77b184fa5267","html_url":"https://github.com/davidhintelmann/Yelp_Investigation","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FYelp_Investigation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FYelp_Investigation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FYelp_Investigation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidhintelmann%2FYelp_Investigation/manifests","owner_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/owners/davidhintelmann","download_url":"https://codeload.github.com/davidhintelmann/Yelp_Investigation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246150409,"owners_count":20731419,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-09T07:34:29.991Z","updated_at":"2025-03-29T06:48:18.796Z","avatar_url":"https://github.com/davidhintelmann.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Yelp Dataset\n\nThis dataset is a small portion of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the dataset you'll find information about businesses across 11 metropolitan areas in two countries.\n\nMore information can be found at [Yelp](https://www.yelp.com/dataset/documentation/main).  \nSome dataset examples can be found on their [GitHub](https://github.com/Yelp/dataset-examples) page.\n\nCurrently, the metropolitan areas centered on Montreal, Calgary, Toronto, Pittsburgh, Charlotte, Urbana-Champaign, Phoenix, Las Vegas, Madison, and Cleveland, are included in the dataset.  
\n\nSummary of the contents of each JSON file:  \n\n**business.json**  \nContains business data including location data, attributes, and categories.\n\n**review.json**  \nContains full review text data including the user_id that wrote the review and the business_id the review is written for.\n\n**user.json**  \nUser data including the user's friend mapping and all the metadata associated with the user.\n\n**checkin.json**  \nCheckins on a business.\n\n**tip.json**  \nTips written by a user on a business. Tips are shorter than reviews and tend to convey quick suggestions.\n\n**photo.json**  \nContains photo data including the caption and classification (one of \"food\", \"drink\", \"menu\", \"inside\" or \"outside\").\n\n\n# MongoDB \n\nThis notebook uses the Yelp Dataset from [Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset), which is first downloaded as JSON files and then inserted into a MongoDB database on my local hard drive. This step can be skipped and the JSON files analysed directly using\n\n```pandas.read_json(JSON_file, lines=True, nrows=int)``` Since these files are large, it is recommended to set `nrows`.\n\nThis is a good opportunity to load these files into MongoDB, since it stores JSON-like documents.  
\n\nBelow is a shortened version of the SQL to MongoDB mapping chart from the official [docs](https://docs.mongodb.com/manual/reference/sql-comparison/)  \n\n| SQL Terms/Concepts|MongoDB Terms/Concepts|\n|:----------:|:-------------:|\n| database |  database |\n| table |    collection   |\n| row | document or BSON object |\n| column | field |\n| index | index |\n| table joins | $lookup |\n\nWe start by importing libraries and then set up the MongoDB database below using the [pymongo](https://pymongo.readthedocs.io/en/stable/) driver.\n\n# Import Libraries\n\n\n```python\nimport json\nimport math\nfrom pymongo import MongoClient, ASCENDING\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom mpl_toolkits.basemap import Basemap\nfrom collections import Counter\n```\n\n# JSON data\nDownload the data from [Kaggle](https://www.kaggle.com/yelp-dataset/yelp-dataset) and read the JSON files into memory using the `json` library.\n\n\n```python\ndata_path0 = 'yelp-data/yelp_academic_dataset_business.json'\ndata_path1 = 'yelp-data/yelp_academic_dataset_checkin.json'\ndata_path2 = 'yelp-data/yelp_academic_dataset_review.json'\ndata_path3 = 'yelp-data/yelp_academic_dataset_tip.json'\ndata_path4 = 'yelp-data/yelp_academic_dataset_user.json'\n```\n\n\n```python\ndata_paths = [data_path0, data_path1, data_path2, data_path3, data_path4]\njson_data = [[],[],[],[],[]]\n\nfor i, file in enumerate(data_paths,0):\n    with open(file) as f:\n        for line in f:\n            json_data[i].append(json.loads(line))\n```\n\n\n```python\n#pd.read_json(data_path0, lines=True, nrows=5)\n```\n\n# MongoDB\n\n## Creating Collections and inserting documents\n\nCreate the Yelp database below, add a new collection called `business`, and insert `json_data[0]` from above into it using the `insert_many()` method.\n\n\n```python\nclient = MongoClient('mongodb://localhost:27017/')\ndb = client['Yelp']\nbusiness = db['business']\nbusiness.insert_many(json_data[0])\n```\n\n\n\n\n    
\u003cpymongo.results.InsertManyResult at 0x5c2660700\u003e\n\n\n\n\n```python\nbusiness.find_one()\n```\n\n\n\n\n    {'_id': ObjectId('5f73c43926a5b6392883c222'),\n     'business_id': 'f9NumwFMBDn751xgFiRbNA',\n     'name': 'The Range At Lake Norman',\n     'address': '10913 Bailey Rd',\n     'city': 'Cornelius',\n     'state': 'NC',\n     'postal_code': '28031',\n     'latitude': 35.4627242,\n     'longitude': -80.8526119,\n     'stars': 3.5,\n     'review_count': 36,\n     'is_open': 1,\n     'attributes': {'BusinessAcceptsCreditCards': 'True',\n      'BikeParking': 'True',\n      'GoodForKids': 'False',\n      'BusinessParking': \"{'garage': False, 'street': False, 'validated': False, 'lot': True, 'valet': False}\",\n      'ByAppointmentOnly': 'False',\n      'RestaurantsPriceRange2': '3'},\n     'categories': 'Active Life, Gun/Rifle Ranges, Guns \u0026 Ammo, Shopping',\n     'hours': {'Monday': '10:0-18:0',\n      'Tuesday': '11:0-20:0',\n      'Wednesday': '10:0-18:0',\n      'Thursday': '11:0-20:0',\n      'Friday': '11:0-20:0',\n      'Saturday': '11:0-20:0',\n      'Sunday': '13:0-18:0'}}\n\n\n\n---\n\nRepeat the process to create collections for the other JSON data.\n\nIn the end there will be 5 collections.\n\n\n```python\ncheckin = db['checkin']\ncheckin.insert_many(json_data[1])\n```\n\n\n\n\n    \u003cpymongo.results.InsertManyResult at 0x5c3a3a240\u003e\n\n\n\n\n```python\ncheckin.find_one()\n```\n\n\n\n\n    {'_id': ObjectId('5f73c97226a5b6392886f413'),\n     'business_id': '--1UhMGODdWsrMastO9DZw',\n     'date': '2016-04-26 19:49:16, 2016-08-30 18:36:57, 2016-10-15 02:45:18, 2016-11-18 01:54:50, 2017-04-20 18:39:06, 2017-05-03 17:58:02, 2019-03-19 22:04:48'}\n\n\n\n---\n\nThe review JSON file is 6.33 GB in size, so a single `insert_many()` call is not a good approach because of memory limitations. Looping through the data and inserting it in batches works better.\n\n\n```python\nreview = db['review']\n```\n\n\n```python\nbatches = 100 #number of batches\nbatch_length 
= math.ceil(len(json_data[2])/batches) #number of items in each batch\ntmp_ = 0 #keep track for number of items in a batch\n\nfor i in range(batches):\n    data = json_data[2][tmp_:tmp_+batch_length]\n    review.insert_many(data)\n    tmp_ += batch_length\n\nreview.find_one()\n```\n\n\n\n\n    {'_id': ObjectId('5f73c9a626a5b6392889a066'),\n     'review_id': 'xQY8N_XvtGbearJ5X4QryQ',\n     'user_id': 'OwjRMXRC0KyPrIlcjaXeFQ',\n     'business_id': '-MhfebM0QIsKt87iDN-FNw',\n     'stars': 2.0,\n     'useful': 5,\n     'funny': 0,\n     'cool': 0,\n     'text': 'As someone who has worked with many museums, I was eager to visit this gallery on my most recent trip to Las Vegas. When I saw they would be showing infamous eggs of the House of Faberge from the Virginia Museum of Fine Arts (VMFA), I knew I had to go!\\n\\nTucked away near the gelateria and the garden, the Gallery is pretty much hidden from view. It\\'s what real estate agents would call \"cozy\" or \"charming\" - basically any euphemism for small.\\n\\nThat being said, you can still see wonderful art at a gallery of any size, so why the two *s you ask? Let me tell you:\\n\\n* pricing for this, while relatively inexpensive for a Las Vegas attraction, is completely over the top. For the space and the amount of art you can fit in there, it is a bit much.\\n* it\\'s not kid friendly at all. Seriously, don\\'t bring them.\\n* the security is not trained properly for the show. When the curating and design teams collaborate for exhibitions, there is a definite flow. That means visitors should view the art in a certain sequence, whether it be by historical period or cultural significance (this is how audio guides are usually developed). When I arrived in the gallery I could not tell where to start, and security was certainly not helpful. 
I was told to \"just look around\" and \"do whatever.\" \\n\\nAt such a *fine* institution, I find the lack of knowledge and respect for the art appalling.',\n     'date': '2015-04-15 05:21:16'}\n\n\n\n---\n\nThe collection 'tip' below will be created and then the JSON data will be inserted into it.\n\n\n```python\ntip = db['tip']\ntip.insert_many(json_data[3])\ntip.find_one()\n```\n\n\n\n\n    {'_id': ObjectId('5f73de0226a5b639280404e8'),\n     'user_id': 'hf27xTME3EiCp6NL6VtWZQ',\n     'business_id': 'UYX5zL_Xj9WEc_Wp-FrqHw',\n     'text': 'Here for a quick mtg',\n     'date': '2013-11-26 18:20:08',\n     'compliment_count': 0}\n\n\n\nNot unlike the JSON file above, the user JSON file is 3.27 GBs in size and the `insert_many()` function below will take some time, and our solution will be the same.\n\n\n```python\nuser = db['user'] #create collection\nbatches = 100\nbatch_length = math.ceil(len(json_data[4])/batches) #number of items in each batch\ntmp_ = 0 #keep track for number of items in a batch\n\nfor i in range(batches):\n    data = json_data[4][tmp_:tmp_+batch_length]\n    user.insert_many(data)\n    tmp_ += batch_length\n\nuser.find_one()\n```\n\n\n\n\n    {'_id': ObjectId('5f73e31226a5b63928182c21'),\n     'user_id': 'ntlvfPzc8eglqvk92iDIAw',\n     'name': 'Rafael',\n     'review_count': 553,\n     'yelping_since': '2007-07-06 03:27:11',\n     'useful': 628,\n     'funny': 225,\n     'cool': 227,\n     'elite': '',\n     'friends': 'oeMvJh94PiGQnx_6GlndPQ, wm1z1PaJKvHgSDRKfwhfDg, IkRib6Xs91PPW7pon7VVig, A8Aq8f0-XvLBcyMk2GJdJQ, eEZM1kogR7eL4GOBZyPvBA, e1o1LN7ez5ckCpQeAab4iw, _HrJVzFaRFUhPva8cwBjpQ, pZeGZGzX-ROT_D5lam5uNg, 0S6EI51ej5J7dgYz3-O0lA, woDt8raW-AorxQM_tIE2eA, hWUnSE5gKXNe7bDc8uAG9A, c_3LDSO2RHwZ94_Q6j_O7w, -uv1wDiaplY6eXXS0VwQiA, QFjqxXn3acDC7hckFGUKMg, ErOqapICmHPTN8YobZIcfQ, mJLRvqLOKhqEdkgt9iEaCQ, VKX7jlScJSA-ja5hYRw12Q, ijIC9w5PRcj3dWVlanjZeg, CIZGlEw-Bp0rmkP8M6yQ9Q, OC6fT5WZ8EU7tEVJ3bzPBQ, UZSDGTDpycDzrlfUlyw2dQ, deL6e_z9xqZTIODKqnvRXQ, 
5mG2ENw2PylIWElqHSMGqg, Uh5Kug2fvDd51RYmsNZkGg, 4dI4uoShugD9z84fYupelQ, EQpFHqGT9Tk6YSwORTtwpg, o4EGL2-ICGmRJzJ3GxB-vw, s8gK7sdVzJcYKcPv2dkZXw, vOYVZgb_GVe-kdtjQwSUHw, wBbjgHsrKr7BsPBrQwJf2w, p59u2EC_qcmCmLeX1jCi5Q, VSAZI1eHDrOPRWMK4Q2DIQ, efMfeI_dkhpeGykaRJqxfQ, x6qYcQ8_i0mMDzSLsFCbZg, K_zSmtNGw1fu-vmxyTVfCQ, 5IM6YPQCK-NABkXmHhlRGQ, U_w8ZMD26vnkeeS1sD7s4Q, AbfS_oXF8H6HJb5jFqhrLw, hbcjX4_D4KIfonNnwrH-cg, UKf66_MPz0zHCP70mF6p1g, hK2gYbxZRTqcqlSiQQcrtQ, 2Q45w_Twx_T9dXqlE16xtQ, BwRn8qcKSeA77HLaOTbfiQ, jouOn4VS_DtFPtMR2w8VDA, ESteyJabbfvqas6CEDs3pQ',\n     'fans': 14,\n     'average_stars': 3.57,\n     'compliment_hot': 3,\n     'compliment_more': 2,\n     'compliment_profile': 1,\n     'compliment_cute': 0,\n     'compliment_list': 1,\n     'compliment_note': 11,\n     'compliment_plain': 15,\n     'compliment_cool': 22,\n     'compliment_funny': 22,\n     'compliment_writer': 10,\n     'compliment_photos': 0}\n\n\n\n\n```python\ndb.command(\"dbstats\")\n```\n\n\n\n\n    {'db': 'Yelp',\n     'collections': 3,\n     'views': 0,\n     'objects': 8405702,\n     'avgObjSize': 852.5322018315662,\n     'dataSize': 7166131634.0,\n     'storageSize': 4791033856.0,\n     'numExtents': 0,\n     'indexes': 3,\n     'indexSize': 84541440.0,\n     'scaleFactor': 1.0,\n     'fsUsedSize': 460941709312.0,\n     'fsTotalSize': 500068036608.0,\n     'ok': 1.0}\n\n\n\n\n```python\ntmp = db.command( { 'collStats': 'business', 'scale': 1024000 } )\ntmp['size'],tmp['count']\n```\n\n\n\n\n    (156, 209393)\n\n\n\n\n```python\ntmp = db.command( { 'collStats': 'checkin', 'scale': 1024000 } )\ntmp['size'],tmp['count']\n```\n\n\n\n\n    (442, 175187)\n\n\n\n\n```python\ntmp = db.command( { 'collStats': 'review', 'scale': 1024000 } )\ntmp['size'],tmp['count']\n```\n\n\n\n\n    (6398, 8021122)\n\n\n\n\n```python\ntmp = db.command( { 'collStats': 'tip', 'scale': 1024000 } )\ntmp['size'],tmp['count']\n```\n\n\n\n\n    (289, 1320761)\n\n\n\n\n```python\ntmp = db.command( { 'collStats': 'user', 
'scale': 1024000 } )\ntmp['size'],tmp['count']\n```\n\n\n\n\n    (3272, 1968703)\n\n\n\n## MongoDB Initialization\n\n\n```python\nclient = MongoClient('mongodb://localhost:27017/')\ndb = client['Yelp']\nbusiness = db['business']\ncheckin = db['checkin']\nreview = db['review']\ntip = db['tip']\nuser = db['user']\n```\n\n\n```python\npipeline = [\n    {'$lookup':{'from' : 'tip',\n                'localField' : 'business_id',\n                'foreignField' : 'business_id',\n                'as' : 'buis_tip'}},\n    {'$replaceRoot':{'newRoot':{'$mergeObjects':[{'$arrayElemAt':[\"$buis_tip\",0]}, \"$$ROOT\"]}}},\n    {'$project': {'buis_tip':0, 'compliment_count':0, 'address':0, 'postal_code':0, 'latitude':0, 'longitude':0,'attributes':0, 'categories':0, 'hours':0}}\n]\n\n#x = business.aggregate(pipeline)\n#df_agg = pd.DataFrame(list(x))\n```\n\nWe will use the JSON data directly from files from now on since they are large and loading them into dataframes, from a mongoDB database, takes a long time using `pd.DataFrame(list(x))`\n\n\n```python\ndf = pd.read_json(data_path0, lines=True)\n```\n\n\n```python\ndf.head(2)\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003ebusiness_id\u003c/th\u003e\n      \u003cth\u003ename\u003c/th\u003e\n      \u003cth\u003eaddress\u003c/th\u003e\n      \u003cth\u003ecity\u003c/th\u003e\n      \u003cth\u003estate\u003c/th\u003e\n      \u003cth\u003epostal_code\u003c/th\u003e\n      \u003cth\u003elatitude\u003c/th\u003e\n      \u003cth\u003elongitude\u003c/th\u003e\n      \u003cth\u003estars\u003c/th\u003e\n      
\u003cth\u003ereview_count\u003c/th\u003e\n      \u003cth\u003eis_open\u003c/th\u003e\n      \u003cth\u003eattributes\u003c/th\u003e\n      \u003cth\u003ecategories\u003c/th\u003e\n      \u003cth\u003ehours\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e0\u003c/th\u003e\n      \u003ctd\u003ef9NumwFMBDn751xgFiRbNA\u003c/td\u003e\n      \u003ctd\u003eThe Range At Lake Norman\u003c/td\u003e\n      \u003ctd\u003e10913 Bailey Rd\u003c/td\u003e\n      \u003ctd\u003eCornelius\u003c/td\u003e\n      \u003ctd\u003eNC\u003c/td\u003e\n      \u003ctd\u003e28031\u003c/td\u003e\n      \u003ctd\u003e35.462724\u003c/td\u003e\n      \u003ctd\u003e-80.852612\u003c/td\u003e\n      \u003ctd\u003e3.5\u003c/td\u003e\n      \u003ctd\u003e36\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'BusinessAcceptsCreditCards': 'True', 'BikePa...\u003c/td\u003e\n      \u003ctd\u003eActive Life, Gun/Rifle Ranges, Guns \u0026amp; Ammo, Sh...\u003c/td\u003e\n      \u003ctd\u003e{'Monday': '10:0-18:0', 'Tuesday': '11:0-20:0'...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e1\u003c/th\u003e\n      \u003ctd\u003eYzvjg0SayhoZgCljUJRF9Q\u003c/td\u003e\n      \u003ctd\u003eCarlos Santo, NMD\u003c/td\u003e\n      \u003ctd\u003e8880 E Via Linda, Ste 107\u003c/td\u003e\n      \u003ctd\u003eScottsdale\u003c/td\u003e\n      \u003ctd\u003eAZ\u003c/td\u003e\n      \u003ctd\u003e85258\u003c/td\u003e\n      \u003ctd\u003e33.569404\u003c/td\u003e\n      \u003ctd\u003e-111.890264\u003c/td\u003e\n      \u003ctd\u003e5.0\u003c/td\u003e\n      \u003ctd\u003e4\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'GoodForKids': 'True', 'ByAppointmentOnly': '...\u003c/td\u003e\n      \u003ctd\u003eHealth \u0026amp; Medical, Fitness \u0026amp; Instruction, Yoga,...\u003c/td\u003e\n      \u003ctd\u003eNone\u003c/td\u003e\n    \u003c/tr\u003e\n  
\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n# EDA \u0026 Visualizations\n\nLet's look at where most of the businesses in this dataset are located by plotting them on a globe.\n\n\n```python\nplt.figure(figsize=(20,10));\nmap = Basemap(projection='ortho',lat_0=25,lon_0=-100,resolution='l')\nmap.bluemarble()\n# draw coastlines, country boundaries, fill continents.\nmap.drawcoastlines(linewidth=0.25)\nmap.drawcountries(linewidth=0.25)\nmap.fillcontinents(color='green',lake_color='blue')\n# draw the edge of the map projection region (the projection limb)\nmap.drawmapboundary(fill_color='blue')\nlong_lat = map(df['longitude'].tolist(),df['latitude'].tolist())\nmap.scatter(long_lat[0], long_lat[1], s=3, c=\"orange\", lw=3, alpha=1, zorder=5)\nplt.title(\"World-wide Yelp Reviews\")\nplt.show();\n```\n\n    Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).\n\n\n\n![png](img/output_42_1.png)\n\n\n\n```python\nprint(df['longitude'].min())\nprint(df['longitude'].max())\n```\n\n    -158.0255252123\n    -72.80655\n\n\nEvery longitude is negative (between about -158 and -72.8), which is consistent with all the businesses in the Yelp dataset being located in North America, specifically Canada and the United States. 
\n\nLet's take a closer look at the Greater Toronto Area (GTA), the most populous metropolitan area in Canada.\n\n## Greater Toronto Area (GTA)\n\n\n```python\nlon_min, lon_max = -80, -78.8\nlat_min, lat_max = 43.2, 44.2\n\nTOR = ((df[\"longitude\"]\u003elon_min) \u0026(df[\"longitude\"]\u003clon_max)) \u0026\\\n            ((df[\"latitude\"]\u003elat_min) \u0026 (df[\"latitude\"]\u003clat_max))\n\nTOR_business = df[TOR].copy()\n```\n\n\n```python\nplt.figure(figsize=(20,10));\nmap = Basemap(projection='merc',llcrnrlat=lat_min,urcrnrlat=lat_max,llcrnrlon=lon_min,urcrnrlon=lon_max, resolution='f')\n#map.bluemarble()\n# country boundaries, fill continents.\nmap.drawcountries(linewidth=0.25)\nmap.fillcontinents(color='black',lake_color='blue')\n# draw the edge of the map projection region (the projection limb)\nmap.drawmapboundary(fill_color='blue')\nTOR_ = map(TOR_business['longitude'].tolist(),TOR_business['latitude'].tolist())\nmap.scatter(TOR_[0], TOR_[1], s=3, c=\"orange\", lw=3, alpha=1, zorder=5)\nplt.title(\"Yelp Businesses in the Greater Toronto Area\")\nplt.show();\n```\n\n\n![png](img/output_47_0.png)\n\n\n\n```python\nTOR_business.sort_values(by='review_count', ascending=False).head(5)\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003ebusiness_id\u003c/th\u003e\n      \u003cth\u003ename\u003c/th\u003e\n      \u003cth\u003eaddress\u003c/th\u003e\n      \u003cth\u003ecity\u003c/th\u003e\n      \u003cth\u003estate\u003c/th\u003e\n      \u003cth\u003epostal_code\u003c/th\u003e\n      \u003cth\u003elatitude\u003c/th\u003e\n      
\u003cth\u003elongitude\u003c/th\u003e\n      \u003cth\u003estars\u003c/th\u003e\n      \u003cth\u003ereview_count\u003c/th\u003e\n      \u003cth\u003eis_open\u003c/th\u003e\n      \u003cth\u003eattributes\u003c/th\u003e\n      \u003cth\u003ecategories\u003c/th\u003e\n      \u003cth\u003ehours\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003e65694\u003c/th\u003e\n      \u003ctd\u003er_BrIgzYcwo1NAuG9dLbpg\u003c/td\u003e\n      \u003ctd\u003ePai Northern Thai Kitchen\u003c/td\u003e\n      \u003ctd\u003e18 Duncan Street\u003c/td\u003e\n      \u003ctd\u003eToronto\u003c/td\u003e\n      \u003ctd\u003eON\u003c/td\u003e\n      \u003ctd\u003eM5H 3G8\u003c/td\u003e\n      \u003ctd\u003e43.647866\u003c/td\u003e\n      \u003ctd\u003e-79.388685\u003c/td\u003e\n      \u003ctd\u003e4.5\u003c/td\u003e\n      \u003ctd\u003e2758\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'RestaurantsTableService': 'True', 'BikeParki...\u003c/td\u003e\n      \u003ctd\u003eRestaurants, Thai, Specialty Food, Food, Ethni...\u003c/td\u003e\n      \u003ctd\u003e{'Monday': '11:30-22:0', 'Tuesday': '11:30-22:...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e183740\u003c/th\u003e\n      \u003ctd\u003eRtUvSWO_UZ8V3Wpj0n077w\u003c/td\u003e\n      \u003ctd\u003eKINKA IZAKAYA ORIGINAL\u003c/td\u003e\n      \u003ctd\u003e398 Church St\u003c/td\u003e\n      \u003ctd\u003eToronto\u003c/td\u003e\n      \u003ctd\u003eON\u003c/td\u003e\n      \u003ctd\u003eM5B 2A2\u003c/td\u003e\n      \u003ctd\u003e43.660430\u003c/td\u003e\n      \u003ctd\u003e-79.378927\u003c/td\u003e\n      \u003ctd\u003e4.0\u003c/td\u003e\n      \u003ctd\u003e1592\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'RestaurantsAttire': 'u'casual'', 'BusinessPa...\u003c/td\u003e\n      \u003ctd\u003eRestaurants, Tapas/Small Plates, Japanese, Bar...\u003c/td\u003e\n      
\u003ctd\u003e{'Monday': '17:0-0:0', 'Tuesday': '17:0-0:0', ...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e143305\u003c/th\u003e\n      \u003ctd\u003eaLcFhMe6DDJ430zelCpd2A\u003c/td\u003e\n      \u003ctd\u003eKhao San Road\u003c/td\u003e\n      \u003ctd\u003e11 Charlotte Street\u003c/td\u003e\n      \u003ctd\u003eToronto\u003c/td\u003e\n      \u003ctd\u003eON\u003c/td\u003e\n      \u003ctd\u003eM5V 2H5\u003c/td\u003e\n      \u003ctd\u003e43.646411\u003c/td\u003e\n      \u003ctd\u003e-79.393480\u003c/td\u003e\n      \u003ctd\u003e4.0\u003c/td\u003e\n      \u003ctd\u003e1542\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'WiFi': 'u'no'', 'RestaurantsTakeOut': 'True'...\u003c/td\u003e\n      \u003ctd\u003eThai, Restaurants\u003c/td\u003e\n      \u003ctd\u003e{'Monday': '17:0-22:0', 'Tuesday': '17:0-22:0'...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e133959\u003c/th\u003e\n      \u003ctd\u003eiGEvDk6hsizigmXhDKs2Vg\u003c/td\u003e\n      \u003ctd\u003eSeven Lives Tacos Y Mariscos\u003c/td\u003e\n      \u003ctd\u003e69 Kensington Avenue\u003c/td\u003e\n      \u003ctd\u003eToronto\u003c/td\u003e\n      \u003ctd\u003eON\u003c/td\u003e\n      \u003ctd\u003eM5T 2K2\u003c/td\u003e\n      \u003ctd\u003e43.654341\u003c/td\u003e\n      \u003ctd\u003e-79.400480\u003c/td\u003e\n      \u003ctd\u003e4.5\u003c/td\u003e\n      \u003ctd\u003e1285\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'RestaurantsGoodForGroups': 'False', 'Alcohol...\u003c/td\u003e\n      \u003ctd\u003eRestaurants, Seafood, Mexican\u003c/td\u003e\n      \u003ctd\u003e{'Monday': '11:0-19:0', 'Tuesday': '11:0-19:0'...\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e172038\u003c/th\u003e\n      \u003ctd\u003eN93EYZy9R0sdlEvubu94ig\u003c/td\u003e\n      \u003ctd\u003eBanh Mi Boys\u003c/td\u003e\n      \u003ctd\u003e392 Queen Street W\u003c/td\u003e\n      
\u003ctd\u003eToronto\u003c/td\u003e\n      \u003ctd\u003eON\u003c/td\u003e\n      \u003ctd\u003eM5V 2A9\u003c/td\u003e\n      \u003ctd\u003e43.648827\u003c/td\u003e\n      \u003ctd\u003e-79.396970\u003c/td\u003e\n      \u003ctd\u003e4.5\u003c/td\u003e\n      \u003ctd\u003e1097\u003c/td\u003e\n      \u003ctd\u003e1\u003c/td\u003e\n      \u003ctd\u003e{'Alcohol': 'u'none'', 'BikeParking': 'True', ...\u003c/td\u003e\n      \u003ctd\u003eSandwiches, Restaurants, Food, Vietnamese, Asi...\u003c/td\u003e\n      \u003ctd\u003e{'Monday': '11:0-22:0', 'Tuesday': '11:0-22:0'...\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nTOR_business['review_count'].describe()\n```\n\n\n\n\n    count    36636.000000\n    mean        24.111830\n    std         54.647122\n    min          3.000000\n    25%          4.000000\n    50%          8.000000\n    75%         21.000000\n    max       2758.000000\n    Name: review_count, dtype: float64\n\n\n\n\n```python\nTOR_business['review_count'].hist();\n```\n\n\n![png](img/output_50_0.png)\n\n\nThe `review_count` column is the number of reviews a business has received. The histogram above shows that a few businesses have more than a thousand reviews (the maximum is 2758), but 75% of the businesses in this subset have 21 reviews or fewer. 
\n\nThe histogram below shows only businesses in Toronto with fewer than 100 reviews.\n\n\n```python\nt_ = TOR_business[TOR_business['review_count']\u003c100]['review_count'].hist()\n```\n\n\n![png](img/output_52_0.png)\n\n\n\n```python\nfig, ax = plt.subplots(figsize=(20,10))\nax.spines['top'].set_visible(False)\nax.spines['right'].set_visible(False)\nax.tick_params(bottom=False, left=True)\n\nax.set_axisbelow(True)\nax.yaxis.grid(True)\nax.xaxis.grid(False)\nplt.title('Top 10 Type of Businesses in Toronto')\nplt.ylabel('Number of Business with category listed')\nplt.xticks(np.arange(10), TOR_business['categories'].value_counts().head(10), rotation=45)\n\nbars = ax.bar(\n    x=np.arange(10),\n    height=TOR_business['categories'].value_counts().head(10),\n    color='teal',\n    tick_label=TOR_business['categories'].value_counts().head(10).index\n)\n\nfor bar in bars:\n    plt.text(\n      bar.get_x() + bar.get_width() / 2,\n      bar.get_height() + 4,\n      round(bar.get_height(), 1),\n      horizontalalignment='center',\n      color='teal',\n      weight='bold',\n      size=15\n    )\n```\n\n\n![png](img/output_53_0.png)\n\n\nThe figure above shows that, among the top ten business categories (by count), the same combination of categories appears more than once under a different ordering. For example, the most common category string is 'Coffee \u0026 Tea, Food' and the second most common is 'Food, Coffee \u0026 Tea'.\n\n
This needs to be addressed by combining the categories which have the text in their descriptions mixed up.\n\n\n```python\nTOR_business['categories'].value_counts().head(10)\n```\n\n\n\n\n    Coffee \u0026 Tea, Food            306\n    Food, Coffee \u0026 Tea            303\n    Restaurants, Chinese          300\n    Chinese, Restaurants          280\n    Hair Salons, Beauty \u0026 Spas    244\n    Beauty \u0026 Spas, Hair Salons    243\n    Pizza, Restaurants            203\n    Restaurants, Pizza            199\n    Nail Salons, Beauty \u0026 Spas    173\n    Grocery, Food                 169\n    Name: categories, dtype: int64\n\n\n\n\n```python\nlen(TOR_business['categories'].unique())\n```\n\n\n\n\n    19661\n\n\n\n\n```python\nTOR_business.loc[TOR_business['categories'] == 'Food, Coffee \u0026 Tea', 'categories'] = 'Coffee \u0026 Tea, Food'\nTOR_business.loc[TOR_business['categories'] == 'Chinese, Restaurants', 'categories'] = 'Restaurants, Chinese'\nTOR_business.loc[TOR_business['categories'] == 'Beauty \u0026 Spas, Hair Salons', 'categories'] = 'Hair Salons, Beauty \u0026 Spas'\nTOR_business.loc[TOR_business['categories'] == 'Restaurants, Pizza', 'categories'] = 'Pizza, Restaurants'\nTOR_business.loc[TOR_business['categories'] == 'Beauty \u0026 Spas, Nail Salons', 'categories'] = 'Nail Salons, Beauty \u0026 Spas'\nTOR_business.loc[TOR_business['categories'] == 'Restaurants, Italian', 'categories'] = 'Italian, Restaurants'\nTOR_business.loc[TOR_business['categories'] == 'Food, Grocery', 'categories'] = 'Grocery, Food'\nTOR_business.loc[TOR_business['categories'] == 'Food, Bakeries', 'categories'] = 'Bakeries, Food'\nTOR_business.loc[TOR_business['categories'] == 'Indian, Restaurants', 'categories'] = 'Restaurants, Indian'\nTOR_business.loc[TOR_business['categories'] == 'Restaurants, Vietnamese', 'categories'] = 'Vietnamese, Restaurants'\nTOR_business.loc[TOR_business['categories'] == 'Restaurants, Japanese', 'categories'] = 'Japanese, 
Restaurants'\nTOR_business.loc[TOR_business['categories'] == 'Thai, Restaurants', 'categories'] = 'Restaurants, Thai'\n```\n\n\n```python\nTOR_business['categories'].value_counts().head(10)\n```\n\n\n\n\n    Coffee \u0026 Tea, Food            609\n    Restaurants, Chinese          580\n    Hair Salons, Beauty \u0026 Spas    487\n    Pizza, Restaurants            402\n    Nail Salons, Beauty \u0026 Spas    340\n    Grocery, Food                 323\n    Italian, Restaurants          282\n    Bakeries, Food                256\n    Restaurants, Indian           239\n    Japanese, Restaurants         212\n    Name: categories, dtype: int64\n\n\n\n---\n\nNow we can plot the top 10 categories of businesses in Toronto:\n\n\n```python\nfig, ax = plt.subplots(figsize=(20,10))\nax.spines['top'].set_visible(False)\nax.spines['right'].set_visible(False)\nax.tick_params(bottom=False, left=True)\n\nax.set_axisbelow(True)\nax.yaxis.grid(True)\nax.xaxis.grid(False)\nplt.title('Top 10 Type of Businesses in Toronto')\nplt.ylabel('Number of Business with category listed')\nplt.xticks(np.arange(10), TOR_business['categories'].value_counts().head(10), rotation=45)\n\nbars = ax.bar(\n    x=np.arange(10),\n    height=TOR_business['categories'].value_counts().head(10),\n    color='teal',\n    tick_label=TOR_business['categories'].value_counts().head(10).index\n)\n\nfor bar in bars:\n    plt.text(\n      bar.get_x() + bar.get_width() / 2,\n      bar.get_height() + 4,\n      round(bar.get_height(), 1),\n      horizontalalignment='center',\n      color='teal',\n      weight='bold',\n      size=15\n    )\n```\n\n\n![png](img/output_61_0.png)\n\n\nBelow is a table for the average stars (1-5) for the top ten business categories above\n\n\n```python\ntop_ten_list = 
TOR_business['categories'].value_counts().head(10).index.tolist()\nTOR_business.groupby('categories').agg(['mean']).loc[top_ten_list,:].sort_values(by=('stars','mean'),ascending=False)['stars']\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emean\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003ecategories\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003eBakeries, Food\u003c/th\u003e\n      \u003ctd\u003e3.705078\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eItalian, Restaurants\u003c/th\u003e\n      \u003ctd\u003e3.489362\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eHair Salons, Beauty \u0026amp; Spas\u003c/th\u003e\n      \u003ctd\u003e3.434292\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eCoffee \u0026amp; Tea, Food\u003c/th\u003e\n      \u003ctd\u003e3.428571\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eRestaurants, Indian\u003c/th\u003e\n      \u003ctd\u003e3.424686\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eJapanese, Restaurants\u003c/th\u003e\n      \u003ctd\u003e3.318396\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eRestaurants, Chinese\u003c/th\u003e\n      \u003ctd\u003e3.192241\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003ePizza, Restaurants\u003c/th\u003e\n      
\u003ctd\u003e3.136816\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eGrocery, Food\u003c/th\u003e\n      \u003ctd\u003e3.082043\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eNail Salons, Beauty \u0026amp; Spas\u003c/th\u003e\n      \u003ctd\u003e2.867647\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\nAnother approach to the problem above is to split each row of the `categories` column on `, ` (a comma followed by a space) and use `Counter` from the `collections` module to increment a count for every category string encountered. Rows with no categories are counted under `N/A`.\n\n\n```python\nTor_catlist = TOR_business['categories'].tolist()\n```\n\n\n```python\nc = Counter()\nfor n in Tor_catlist:\n    if n is not None:\n        cat_list = n.split(', ')\n        for cat in cat_list:\n            c[cat] += 1\n    else:\n        c['N/A'] += 1\n```\n\n\n```python\nc.most_common(10)\n```\n\n\n\n\n    [('Restaurants', 16227),\n     ('Food', 7979),\n     ('Shopping', 5596),\n     ('Beauty \u0026 Spas', 3610),\n     ('Nightlife', 2743),\n     ('Coffee \u0026 Tea', 2456),\n     ('Bars', 2452),\n     ('Health \u0026 Medical', 1891),\n     ('Chinese', 1817),\n     ('Event Planning \u0026 Services', 1795)]\n\n\n\n\n```python\nTORcat = pd.DataFrame.from_dict(c, orient='index')\n```\n\n\n```python\nTORcat.sort_values(by=0,ascending=False).head(10)\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003e0\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  
\u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003eRestaurants\u003c/th\u003e\n      \u003ctd\u003e16227\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eFood\u003c/th\u003e\n      \u003ctd\u003e7979\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eShopping\u003c/th\u003e\n      \u003ctd\u003e5596\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eBeauty \u0026amp; Spas\u003c/th\u003e\n      \u003ctd\u003e3610\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eNightlife\u003c/th\u003e\n      \u003ctd\u003e2743\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eCoffee \u0026amp; Tea\u003c/th\u003e\n      \u003ctd\u003e2456\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eBars\u003c/th\u003e\n      \u003ctd\u003e2452\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eHealth \u0026amp; Medical\u003c/th\u003e\n      \u003ctd\u003e1891\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eChinese\u003c/th\u003e\n      \u003ctd\u003e1817\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eEvent Planning \u0026amp; Services\u003c/th\u003e\n      \u003ctd\u003e1795\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\ntmp_ = TORcat.sort_values(by=0,ascending=False).head(10)\n\nfig, ax = plt.subplots(figsize=(20,10))\nax.spines['top'].set_visible(False)\nax.spines['right'].set_visible(False)\nax.tick_params(bottom=False, left=True)\n\nax.set_axisbelow(True)\nax.yaxis.grid(True)\nax.xaxis.grid(False)\nplt.title('Top 10 Type of Businesses in Toronto')\nplt.ylabel('Number of Business with category listed')\nplt.xticks(np.arange(10), tmp_.value_counts(), rotation=45)\n\nbars = ax.bar(\n    x=np.arange(10),\n    height=tmp_[0].values,\n    color='teal',\n    
tick_label=tmp_.index\n)\n\nfor bar in bars:\n    plt.text(\n      bar.get_x() + bar.get_width() / 2,\n      bar.get_height() + 100,\n      round(bar.get_height(), 1),\n      horizontalalignment='center',\n      color='teal',\n      weight='bold',\n      size=15\n    )\n```\n\n\n![png](img/output_70_0.png)\n\n\n## More Top 10 Business Categories Analysis for GTA\n\nThe ten most common category combinations, ranked by the number of businesses listing each one, have their average star ratings shown above in descending order.\n\n\n```python\nplt.figure(figsize=(12,6))\nTOR_business['stars'].hist(bins=17)\nplt.title('Histogram of 1-5 Star Reviews on Yelp within Toronto')\nplt.xlabel('Stars')\nplt.ylabel('Number of reviews');\n```\n\n\n![png](img/output_73_0.png)\n\n\n\n```python\ntop_ten_list = TOR_business['categories'].value_counts().head(10).index.tolist()\nTOR_business.groupby('categories').agg(['mean']).loc[top_ten_list,:].sort_values(by=('stars','mean'),ascending=False)['stars']\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003emean\u003c/th\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003ecategories\u003c/th\u003e\n      \u003cth\u003e\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003eBakeries, Food\u003c/th\u003e\n      \u003ctd\u003e3.705078\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eItalian, Restaurants\u003c/th\u003e\n      \u003ctd\u003e3.489362\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eHair Salons, Beauty \u0026amp; 
Spas\u003c/th\u003e\n      \u003ctd\u003e3.434292\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eCoffee \u0026amp; Tea, Food\u003c/th\u003e\n      \u003ctd\u003e3.428571\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eRestaurants, Indian\u003c/th\u003e\n      \u003ctd\u003e3.424686\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eJapanese, Restaurants\u003c/th\u003e\n      \u003ctd\u003e3.318396\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eRestaurants, Chinese\u003c/th\u003e\n      \u003ctd\u003e3.192241\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003ePizza, Restaurants\u003c/th\u003e\n      \u003ctd\u003e3.136816\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eGrocery, Food\u003c/th\u003e\n      \u003ctd\u003e3.082043\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003eNail Salons, Beauty \u0026amp; Spas\u003c/th\u003e\n      \u003ctd\u003e2.867647\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nplt.figure(figsize=(12,6))\nTOR_business[TOR_business['categories'] == 'Bakeries, Food']['stars'].plot.kde();\nplt.title('KDE of Reviews for Bakeries within Toronto')\nplt.xlabel('Number of Stars')\nplt.ylabel('Number of reviews');\n```\n\n\n![png](img/output_75_0.png)\n\n\n\n```python\nTOR_business[TOR_business['categories'] == 'Bakeries, Food'][['stars']].describe()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      
\u003cth\u003e\u003c/th\u003e\n      \u003cth\u003estars\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003ecount\u003c/th\u003e\n      \u003ctd\u003e256.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emean\u003c/th\u003e\n      \u003ctd\u003e3.705078\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003estd\u003c/th\u003e\n      \u003ctd\u003e0.779601\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emin\u003c/th\u003e\n      \u003ctd\u003e1.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e25%\u003c/th\u003e\n      \u003ctd\u003e3.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e50%\u003c/th\u003e\n      \u003ctd\u003e4.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e75%\u003c/th\u003e\n      \u003ctd\u003e4.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emax\u003c/th\u003e\n      \u003ctd\u003e5.000000\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nplt.figure(figsize=(12,6))\nTOR_business[TOR_business['categories'] == 'Italian, Restaurants']['stars'].plot.kde();\nplt.title('KDE of Reviews for Italian Restaurants within Toronto')\nplt.xlabel('Number of Stars')\nplt.ylabel('Number of reviews');\n```\n\n\n![png](img/output_77_0.png)\n\n\n\n```python\nTOR_business[TOR_business['categories'] == 'Italian, Restaurants'][['stars']].describe()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    
\u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003estars\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003ecount\u003c/th\u003e\n      \u003ctd\u003e282.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emean\u003c/th\u003e\n      \u003ctd\u003e3.489362\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003estd\u003c/th\u003e\n      \u003ctd\u003e0.678779\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emin\u003c/th\u003e\n      \u003ctd\u003e1.500000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e25%\u003c/th\u003e\n      \u003ctd\u003e3.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e50%\u003c/th\u003e\n      \u003ctd\u003e3.500000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e75%\u003c/th\u003e\n      \u003ctd\u003e4.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emax\u003c/th\u003e\n      \u003ctd\u003e5.000000\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n\n```python\nplt.figure(figsize=(12,6))\nTOR_business[TOR_business['categories'] == 'Nail Salons, Beauty \u0026 Spas']['stars'].plot.kde();\nplt.title('KDE of Reviews for Nail Salons, Beauty \u0026 Spas within Toronto')\nplt.xlabel('Number of Stars')\nplt.ylabel('Number of reviews');\n```\n\n\n![png](img/output_79_0.png)\n\n\n\n```python\nTOR_business[TOR_business['categories'] == 'Nail Salons, Beauty \u0026 Spas'][['stars']].describe()\n```\n\n\n\n\n\u003cdiv\u003e\n\u003cstyle scoped\u003e\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    
}\n\u003c/style\u003e\n\u003ctable border=\"1\" class=\"dataframe\"\u003e\n  \u003cthead\u003e\n    \u003ctr style=\"text-align: right;\"\u003e\n      \u003cth\u003e\u003c/th\u003e\n      \u003cth\u003estars\u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/thead\u003e\n  \u003ctbody\u003e\n    \u003ctr\u003e\n      \u003cth\u003ecount\u003c/th\u003e\n      \u003ctd\u003e340.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emean\u003c/th\u003e\n      \u003ctd\u003e2.867647\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003estd\u003c/th\u003e\n      \u003ctd\u003e0.916972\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emin\u003c/th\u003e\n      \u003ctd\u003e1.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e25%\u003c/th\u003e\n      \u003ctd\u003e2.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e50%\u003c/th\u003e\n      \u003ctd\u003e3.000000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003e75%\u003c/th\u003e\n      \u003ctd\u003e3.500000\u003c/td\u003e\n    \u003c/tr\u003e\n    \u003ctr\u003e\n      \u003cth\u003emax\u003c/th\u003e\n      \u003ctd\u003e5.000000\u003c/td\u003e\n    \u003c/tr\u003e\n  \u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\n\n\n## Review.json\n\n\n```python\ndf_ = pd.read_json(data_path2, lines=True)\nTOR_review = df_[TOR].copy()\n```\n\n\n```python\nTOR_review.head(2)\n```\n\n\n```python\n\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidhintelmann%2Fyelp_investigation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidhintelmann%2Fyelp_investigation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidhintelmann%2Fyelp_investigation/lists"}