# Cognifyz Data Mastery Program

https://github.com/rubydamodar/cognifyz-data-mastery-program

This repository contains the tasks and projects from the Cognifyz Data Internship Program, focused on data exploration, descriptive analysis, predictive modeling, feature engineering, and visualization. It offers hands-on experience with real-world datasets to help interns build valuable data science skills.
![image](https://github.com/user-attachments/assets/d4e24b8d-1600-4bc7-b8d2-1525eab3d756)

![image](https://github.com/user-attachments/assets/c2b2da51-f1d1-4d22-b3d9-c9d47b3110e7)

```
📦 Data Science Internship Project
│
├── 📄 LICENSE
├── 📄 README.md
│
├── 📁 LEVEL 1 - Data Exploration and Preprocessing
│   ├── 📄 DATAEXPLORATION AND PREPROCESSING.ipynb
│   ├── 📄 DESCRIPTIVE ANALYSIS.ipynb
│   ├── 📄 GEOSPATIAL ANALYSIS.ipynb
│   └── 🌐 restaurant_map.html
│
├── 📁 LEVEL 2 - Advanced Analysis
│   ├── 📄 TABLE BOOKING AND ONLINE DELEIVERY.ipynb
│   ├── 📄 PRICE RANGE ANALYSIS.ipynb
│   └── 📄 FEATURE ENGINEERING.ipynb
│
├── 📁 LEVEL 3 - Modeling and Visualization
│   ├── 📄 PREDICTIVE MODELING.ipynb
│   ├── 📄 CUSTOMER PREFERANCE ANALYSIS.ipynb
│   ├── 📄 Data Visualization.ipynb
│   ├── 📊 average_rating_by_cuisine.png
│   ├── 📊 boxplot_ratings_by_cuisine.png
│   ├── 📊 correlation_heatmap.png
│   ├── 📊 jointplot_votes_rating.png
│   ├── 📊 pair_plot.png
│   ├── 📊 pairplot_votes_rating.png
│   ├── 📊 rating_distribution_boxplot.png
│   ├── 📊 rating_distribution_histogram.png
│   ├── 📊 swarmplot_ratings_cuisines.png
│   ├── 📊 top_cuisines_avg_rating.png
│   ├── 📊 violinplot_votes_by_rating.png
│   ├── 📊 votes_vs_aggregate_rating.png
│   └── 🌐 bubble_chart_votes_rating.html
│
└── 📁 DATASETS
    └── (Dataset files)
```

```mermaid
graph LR
A[📦 Data Science Internship Project] --> D[📄 LICENSE]
A --> E[📄 README.md]
A --> B1[📁 LEVEL 1 - Data Exploration and Preprocessing]
A --> B2[📁 LEVEL 2 - Advanced Analysis]
A --> B3[📁 LEVEL 3 - Modeling and Visualization]
A --> C[📁 DATASETS]

B1 --> B1A[📄 DATAEXPLORATION AND PREPROCESSING.ipynb]
B1 --> B1B[📄 DESCRIPTIVE ANALYSIS.ipynb]
B1 --> B1C[📄 GEOSPATIAL ANALYSIS.ipynb]
B1 --> B1D[🌐 restaurant_map.html]

B2 --> B2A[📄 TABLE BOOKING AND ONLINE DELEIVERY.ipynb]
B2 --> B2B[📄 PRICE RANGE ANALYSIS.ipynb]
B2 --> B2C[📄 FEATURE ENGINEERING.ipynb]

B3 --> B3A[📄 PREDICTIVE MODELING.ipynb]
B3 --> B3B[📄 CUSTOMER PREFERANCE ANALYSIS.ipynb]
B3 --> B3C[📄 Data Visualization.ipynb]
B3 --> B3D[📊 average_rating_by_cuisine.png]
B3 --> B3E[📊 boxplot_ratings_by_cuisine.png]
B3 --> B3F[📊 correlation_heatmap.png]
B3 --> B3G[📊 jointplot_votes_rating.png]
B3 --> B3H[📊 pair_plot.png]
B3 --> B3I[📊 pairplot_votes_rating.png]
B3 --> B3J[📊 rating_distribution_boxplot.png]
B3 --> B3K[📊 rating_distribution_histogram.png]
B3 --> B3L[📊 swarmplot_ratings_cuisines.png]
B3 --> B3M[📊 top_cuisines_avg_rating.png]
B3 --> B3N[📊 violinplot_votes_by_rating.png]
B3 --> B3O[📊 votes_vs_aggregate_rating.png]
B3 --> B3P[🌐 bubble_chart_votes_rating.html]

C --> C1[(Dataset files)]
```

# 🌟 Cognifyz Technologies: Internship Guidelines and Best Practices

## 🔹 About Cognifyz Technologies
Cognifyz Technologies is a leading technology company specializing in data science, artificial intelligence (AI), machine learning (ML), and data analytics solutions. We are committed to delivering innovative, impactful projects and offering skill-enhancing training programs that prepare professionals for industry challenges.

## 💼 Enhancing Your Professional Presence
Maximize your professional growth by sharing your achievements on LinkedIn. Highlight your offer letter, completed tasks, or internship completion certificate to showcase your experience. Be sure to tag **Cognifyz Technologies** and use these hashtags for greater reach:
- #cognifyz
- #cognifyzTech
- #cognifyzTechnologies

## 📋 Key Guidelines
1. **🔍 Maintain Academic Integrity**: Submitting original work is essential. Plagiarism or copying code may lead to termination of your internship and restrict future opportunities.
2. **🎥 Project Showcasing**:
   - Create a professional video highlighting your completed tasks and achievements.
   - Post the video on LinkedIn to establish credibility among peers.
   - Tag **Cognifyz Technologies** and use relevant hashtags for visibility.

## 🏆 Task Levels and Submission
Choose and complete any 2 out of the 3 levels below. Successfully completing **Level 3** (2 out of 4 tasks) may improve your chances of receiving a stipend.

### 🔹 Level 1: Data Exploration and Preprocessing
**Task 1** (a code sketch follows this checklist):
- 🔎 Explore the dataset and identify the number of rows and columns.
- 🚫 Check and handle missing values in the dataset.
- 🔄 Perform data type conversions if necessary.
- 📊 Analyze the distribution of the target variable (e.g., "Aggregate rating") and check for class imbalances.
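A minimal sketch of these checks, assuming the dataset has been loaded into `DATASET` with pandas as shown later in this README:

```python
# Rows and columns
print("Shape (rows, columns):", DATASET.shape)

# Missing values per column
print(DATASET.isnull().sum().sort_values(ascending=False).head(10))

# Example type conversion: make sure 'Votes' is numeric
DATASET['Votes'] = pd.to_numeric(DATASET['Votes'], errors='coerce')

# Distribution of the target variable and a quick class-imbalance check
print(DATASET['Aggregate rating'].describe())
print(DATASET['Aggregate rating'].value_counts(bins=5, sort=False))
```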

### 🔹 Level 2: Descriptive Analysis
**Task 1**:
- 📈 Calculate key statistical measures (mean, median, standard deviation, etc.) for numerical columns.
- 📋 Explore the distribution of categorical variables such as "Country Code," "City," and "Cuisines."
- 🍽️ Identify the top cuisines and cities with the most restaurants.

### 🔹 Level 3: Advanced Analysis (Optional for Stipend)
- Complete any 2 out of 4 advanced tasks to qualify for the stipend.

## 📥 Submission Process
A submission form will be shared at a later date. Until then, continue your tasks and maintain separate files for each level.

## 🌐 Best Practices for LinkedIn Posts
- **📝 Create Quality Content**: Provide detailed explanations and visual evidence of your work.
- **🔖 Tagging and Hashtags**: Tag **Cognifyz Technologies** and include hashtags like #cognifyz, #cognifyzTech, and #cognifyzTechnologies.
- **🎬 Video Demonstrations**: A well-made video can significantly boost engagement and establish your credibility.

### Overview of Directories:
- **LEVEL 1**: Initial data exploration and preprocessing with geospatial visualizations.
- **LEVEL 2**: Includes feature engineering and in-depth analysis of pricing, bookings, and delivery trends.
- **LEVEL 3**: Advanced predictive modeling and visualization with detailed image outputs and plots.
- **DATASETS**: Contains all the data used for analysis.

### 📊 Data Exploration and Preprocessing

First, we need to load the data using **Pandas** for initial exploration:

```python
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

# Load the dataset
file_path = r'Cognifyz-Data-Mastery-Program\DATASETS\Dataset .csv'
DATASET = pd.read_csv(file_path)

# Display the first few rows to get a sense of the data
print("Initial preview of the dataset:")
print(DATASET.head())
```

**Sample Output:**
```
Initial preview of the dataset:
Restaurant ID Restaurant Name Country Code City \
0 6317637 Le Petit Souffle 162 Makati City
1 6304287 Izakaya Kikufuji 162 Makati City
2 6300002 Heat - Edsa Shangri-La 162 Mandaluyong City
3 6318506 Ooma 162 Mandaluyong City
4 6314302 Sambo Kojin 162 Mandaluyong City

Address \
0 Third Floor, Century City Mall, Kalayaan Avenu...
1 Little Tokyo, 2277 Chino Roces Avenue, Legaspi...
2 Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...
3 Third Floor, Mega Fashion Hall, SM Megamall, O...
4 Third Floor, Mega Atrium, SM Megamall, Ortigas...

Locality \
0 Century City Mall, Poblacion, Makati City
1 Little Tokyo, Legaspi Village, Makati City
2 Edsa Shangri-La, Ortigas, Mandaluyong City
3 SM Megamall, Ortigas, Mandaluyong City
4 SM Megamall, Ortigas, Mandaluyong City

Locality Verbose Longitude Latitude \
0 Century City Mall, Poblacion, Makati City, Mak... 121.027535 14.565443
1 Little Tokyo, Legaspi Village, Makati City, Ma... 121.014101 14.553708
2 Edsa Shangri-La, Ortigas, Mandaluyong City, Ma... 121.056831 14.581404
3 SM Megamall, Ortigas, Mandaluyong City, Mandal... 121.056475 14.585318
4 SM Megamall, Ortigas, Mandaluyong City, Mandal... 121.057508 14.584450

Cuisines ... Currency Has Table booking \
0 French, Japanese, Desserts ... Botswana Pula(P) Yes
1 Japanese ... Botswana Pula(P) Yes
2 Seafood, Asian, Filipino, Indian ... Botswana Pula(P) Yes
3 Japanese, Sushi ... Botswana Pula(P) No
4 Japanese, Korean ... Botswana Pula(P) Yes
```

# ๐Ÿฝ๏ธ Restaurant Dataset Analysis & Visualization

Welcome to the **Restaurant Dataset Analysis** project! ๐ŸŽ‰ This project dives deep into a dataset containing various restaurant details. The goal is to explore, analyze, and visualize essential insights that can guide restaurant-related decisions and trends. ๐Ÿด๐Ÿ“Š

## ๐Ÿ“š Libraries Used

- **Pandas** ๐Ÿผ: For data manipulation and analysis.
- **Matplotlib** ๐Ÿ“ˆ: For creating beautiful visualizations.
- **Seaborn** ๐ŸŽจ: For advanced and attractive plots.
- **Warnings** โš ๏ธ: To suppress unnecessary warnings during execution.

## ๐Ÿ—‚๏ธ Dataset Overview

The dataset includes valuable information about restaurants such as:

- **Restaurant ID**: Unique identifier for each restaurant. ๐Ÿ†”
- **Country Code**: The code representing the country of the restaurant. ๐ŸŒ
- **Longitude & Latitude**: Geographical coordinates. ๐ŸŒ
- **Average Cost for Two**: The typical cost for a meal for two people. ๐Ÿ’ธ
- **Price Range**: Price category of the restaurant. ๐Ÿ’ฐ
- **Aggregate Rating**: The overall rating from customers. โญ
- **Votes**: The total number of votes the restaurant has received. ๐Ÿ—ณ๏ธ

## 📊 Basic Statistical Measures

First, we use the `describe()` method to get an overview of numerical columns, providing insights like mean, standard deviation, minimum, and maximum values.

```python
print("Basic statistical measures for numerical columns:")
print(DATASET.describe())
```

### Sample Output:
```plaintext
       Restaurant ID  Country Code  Longitude  Latitude  Average Cost for two  Price range  Aggregate rating    Votes
count          9,551         9,551      9,551     9,551                 9,551        9,551             9,551    9,551
mean       9,051,128         18.37      64.13     25.85                 1,199         1.80              2.67   156.91
std        8,791,521         56.75      41.47     11.00                16,121         0.91              1.52   430.17
min               53          1.00    -157.95    -41.33                     0         1.00              0.00        0
25%          301,962          1.00      77.08     28.48                   250         1.00              2.50        5
50%        6,004,089          1.00      77.19     28.57                   400         2.00              3.20       31
75%       18,352,290          1.00      77.28     28.64                   700         2.00              3.70      131
max       18,500,650        216.00     174.83     55.98               800,000         4.00              4.90   10,934
```

## 🎨 Distribution of Categorical Variables

We explore the distribution of key categorical columns such as **Country Code**, **City**, and **Cuisines**. We use `value_counts()` to uncover the most frequent categories.

```python
categorical_columns = ['Country Code', 'City', 'Cuisines']
for column in categorical_columns:
    print(f"\nDistribution of {column}:")
    print(DATASET[column].value_counts().head(10))
```

### Sample Output:
```plaintext
Distribution of Country Code:
Country Code
1 8,652
216 434
215 80
...

Distribution of City:
City
New Delhi 5,473
Gurgaon 1,118
Noida 1,080
...

Distribution of Cuisines:
Cuisines
North Indian 936
North Indian, Chinese 511
Chinese 354
...
```

### Visualizing Categorical Data 📊

We use Seaborn's `countplot()` to visualize the distribution of categorical variables like **City** and **Cuisines**.

#### Top 10 Cuisines 🍝
```python
plt.figure(figsize=(10, 6))
sns.countplot(y='Cuisines', data=DATASET, order=DATASET['Cuisines'].value_counts().index[:10], palette='coolwarm')
plt.title("Top 10 Cuisines by Number of Restaurants")
plt.xlabel("Count")
plt.ylabel("Cuisines")
plt.show()
```

#### Top 10 Cities 🏙️
```python
plt.figure(figsize=(10, 6))
sns.countplot(y='City', data=DATASET, order=DATASET['City'].value_counts().index[:10], palette='magma')
plt.title("Top 10 Cities by Number of Restaurants")
plt.xlabel("Count")
plt.ylabel("City")
plt.show()
```

## 🥇 Top Cuisines and Cities with the Highest Number of Restaurants

### Most Common Cuisines 🍲
We find the top cuisines by the number of restaurants offering them:

```python
print("Top 10 most common cuisines:")
print(DATASET['Cuisines'].value_counts().head(10))
```

### Most Common Cities 🌆
Similarly, we explore which cities have the highest number of restaurants:

```python
print("Top 10 cities with the highest number of restaurants:")
print(DATASET['City'].value_counts().head(10))
```
# ๐ŸŒ Geospatial Restaurant Analysis

Welcome to the **Geospatial Restaurant Analysis** project! ๐Ÿฝ๏ธ This project focuses on analyzing the geographical distribution of restaurants based on their **latitude** and **longitude** data, exploring patterns and concentrations of restaurant locations across different cities. ๐ŸŒ†

## ๐Ÿ“š Libraries Used

- **Folium** ๐Ÿ—บ๏ธ: For creating interactive maps to visualize restaurant locations.
- **Pandas** ๐Ÿผ: For data manipulation and processing.
- **Matplotlib** ๐Ÿ“Š: For plotting static graphs to visualize restaurant distributions.
- **Seaborn** ๐ŸŽจ: For enhanced visualizations and creating scatter plots.
- **Scikit-learn** ๐Ÿค–: For applying machine learning algorithms like KMeans clustering to group restaurant locations.

## ๐Ÿ—‚๏ธ Dataset Overview

The dataset contains the following columns relevant for geospatial analysis:

- **Latitude** ๐Ÿ“: The latitude of the restaurantโ€™s location.
- **Longitude** ๐ŸŒ: The longitude of the restaurantโ€™s location.
- **Restaurant Name** ๐Ÿด: The name of the restaurant.
- **City** ๐Ÿ™๏ธ: The city in which the restaurant is located.

## ๐Ÿ“ Interactive Map of Restaurant Locations

The first step is to visualize restaurant locations on an interactive map, centered on the average latitude and longitude of all restaurants. The map includes markers for each restaurant, which can be clicked for more details.

```python
import warnings

import folium
import pandas as pd

warnings.filterwarnings('ignore')

# Load dataset
file_path = r'path_to_your_dataset.csv'
DATASET = pd.read_csv(file_path)

# Calculate average latitude and longitude to centre the map
avg_lat = DATASET['Latitude'].mean()
avg_lon = DATASET['Longitude'].mean()
map_restaurants = folium.Map(location=[avg_lat, avg_lon], zoom_start=12)

# Add a marker for each restaurant
for _, row in DATASET.iterrows():
    folium.Marker(
        location=[row['Latitude'], row['Longitude']],
        popup=f"{row['Restaurant Name']} - {row['City']}",
        icon=folium.Icon(color='blue', icon='cutlery', prefix='fa')
    ).add_to(map_restaurants)

# Save map to HTML
map_restaurants.save("restaurant_map.html")
```

๐Ÿ“ **Result**: The map has been saved as **'restaurant_map.html'**. You can open this file in any browser to view an interactive map of restaurant locations.

---

## ๐ŸŒ Distribution Analysis Using Seaborn

We can also visualize the geographical distribution of restaurants on a static map. Using **Seaborn**, we plot the density of restaurants, colored by their city:

```python
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.scatterplot(x='Longitude', y='Latitude', data=DATASET, hue='City', palette='Set1', legend=False)
plt.title('Geospatial Distribution of Restaurants')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.grid(True)
plt.show()
```

### ๐Ÿ” Insights:
- **Scatter Plot**: Displays restaurant locations with colors representing different cities.
- **Grid & Labels**: Helps understand the geographical spread and clusters of restaurants.

---

## 🔎 Geospatial Clustering (Advanced)

To identify high-density restaurant areas, we apply **KMeans clustering** using **Scikit-learn** to group restaurant locations. This can help uncover hot spots or regions with a high concentration of restaurants.

### KMeans Clustering Code:
```python
from sklearn.cluster import KMeans

# Extract coordinates for clustering (rows with missing coordinates are dropped)
coordinates = DATASET[['Latitude', 'Longitude']].dropna()

# Perform KMeans clustering
kmeans = KMeans(n_clusters=5, random_state=42)  # adjust the number of clusters as needed
kmeans.fit(coordinates)

# Assign cluster labels back to the rows that were actually clustered
DATASET.loc[coordinates.index, 'Cluster'] = kmeans.labels_

# Plot the clustered data
plt.figure(figsize=(12, 8))
sns.scatterplot(x='Longitude', y='Latitude', hue='Cluster', data=DATASET, palette='viridis')
plt.title('Geospatial Clustering of Restaurants')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()
```

### 🔑 Key Concepts:
- **KMeans Clustering**: Groups restaurant locations into clusters to identify regions with higher restaurant density.
- **Visualization**: Different colors represent different clusters of restaurant concentrations.
- **High-density Areas**: Clusters help reveal popular dining districts and potential new restaurant zones.
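The choice of `n_clusters=5` above is arbitrary. One common way to pick it (a sketch, not part of the original notebooks) is the elbow method: plot the within-cluster inertia for a range of k values and look for the point where improvement levels off.

```python
# Elbow method: inertia for k = 1..9 on the same coordinates used above
inertias = []
k_values = range(1, 10)
for k in k_values:
    inertias.append(KMeans(n_clusters=k, random_state=42).fit(coordinates).inertia_)

plt.plot(k_values, inertias, marker='o')
plt.title('Elbow Method for Choosing k')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()
```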

---
This **Geospatial Restaurant Analysis** gives you the tools to visualize and analyze the distribution and clustering of restaurants based on geographical data. Using **Folium** for interactive maps, **Seaborn** for static density plots, and **KMeans** clustering for grouping similar locations, we uncover important patterns:

- **Restaurant Concentration**: Where are the most popular restaurant zones?
- **Clustering**: Identifying potential high-density areas for business opportunities.
- **Outliers**: Discovering restaurants in less populated areas, which could point to underserved markets.

Geospatial analysis provides powerful insights into the spatial distribution of restaurants and can aid in making data-driven decisions for expanding or optimizing restaurant locations. 🌟

# 💸 **Price Range Analysis for Restaurants** 🍴

Welcome to the **Price Range Analysis for Restaurants** project! In this analysis, we explore the relationship between **price range categories**, **average ratings**, and the **most common rating color** for different price ranges in the dataset. We uncover insights such as which price ranges have the highest ratings and which color represents the best ratings in each price range. 🎨

## 📚 Libraries Used

- **Pandas** 🐼: For data processing and manipulation.
- **Matplotlib** 📊: For plotting bar charts to visualize results.
- **Seaborn** 🎨: For creating beautiful and informative bar charts.
- **Warnings** ⚠️: To suppress unnecessary warnings during analysis.

---

## 🗂️ Dataset Overview

The dataset includes important columns for this analysis:

- **Average Cost for Two** 💸: The average cost of dining for two people.
- **Aggregate Rating** ⭐: The average rating of the restaurant.
- **Rating Color** 🌈: The color associated with the restaurant's rating.

---

## 🔢 **Price Range Categories**

We categorize restaurants into four **price ranges** based on the average cost for two people:

- **Low** 💰: Cost between ₹0 and ₹500
- **Medium** 💵: Cost between ₹501 and ₹1000
- **High** 💳: Cost between ₹1001 and ₹1500
- **Very High** 💎: Cost between ₹1501 and ₹5000 (the upper bin edge used below)

### Create Price Range Categories:

```python
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

# Load dataset
file_path = r'path_to_your_dataset.csv'
DATASET = pd.read_csv(file_path)

# Create price range categories
DATASET['Price Range Category'] = pd.cut(
    DATASET['Average Cost for two'],
    bins=[0, 500, 1000, 1500, 5000],
    labels=['Low', 'Medium', 'High', 'Very High']
)
```

### 📊 **Most Common Price Range**

We calculate the most common price range based on the number of restaurants in each category:

```python
most_common_price_range = DATASET['Price Range Category'].value_counts()
print("Most common price range:")
print(most_common_price_range)

# Plot the distribution of price ranges
plt.figure(figsize=(10, 5))
sns.barplot(x=most_common_price_range.index, y=most_common_price_range.values, palette='Set2')
plt.title('Most Common Price Range')
plt.xlabel('Price Range Category')
plt.ylabel('Number of Restaurants')
plt.show()
```

### Insights:
- **Low Price Range** 💰: Most common range, with 6056 restaurants.
- **Very High Price Range** 💎: Least common, with only 552 restaurants.

---

## ⭐ **Average Rating for Each Price Range**

Now we calculate the **average rating** for each price range category to understand how ratings vary across price ranges:

```python
avg_rating_per_price_range = DATASET.groupby('Price Range Category')['Aggregate rating'].mean().round(2)
print("Average rating for each price range:")
print(avg_rating_per_price_range)

# Plot the average rating for each price range
plt.figure(figsize=(10, 5))
sns.barplot(x=avg_rating_per_price_range.index, y=avg_rating_per_price_range.values, palette='Blues')
plt.title('Average Rating for Each Price Range')
plt.xlabel('Price Range Category')
plt.ylabel('Average Rating')
plt.ylim(0, 5) # Set y-axis limit for ratings
plt.show()
```

### Insights:
- **Low Price Range** 💰: Average rating of 2.32.
- **Very High Price Range** 💎: Highest average rating of 3.67.

---

## 🌈 **Color Representing the Highest Average Rating**

Next, we identify the **color** representing the highest average rating for each price range:

```python
highest_avg_color_per_price_range = DATASET.groupby('Price Range Category')['Rating color'].agg(lambda x: x.value_counts().idxmax())
print("Color representing the highest average rating for each price range:")
print(highest_avg_color_per_price_range)

# Plot the color associated with the highest average rating
plt.figure(figsize=(10, 5))
sns.barplot(x=highest_avg_color_per_price_range.index, y=avg_rating_per_price_range.values, palette=highest_avg_color_per_price_range.values)
plt.title('Color Representing Highest Average Rating by Price Range')
plt.xlabel('Price Range Category')
plt.ylabel('Average Rating')
plt.show()
```

### Insights:
- **Low Price Range** 💰: Orange 🌟 represents the highest rating.
- **Medium Price Range** 💵: Orange 🌟 again represents the highest rating.
- **Very High Price Range** 💎: Yellow 🌟 is associated with the highest ratings.

---

Through this **Price Range Analysis** for restaurants, we have learned the following key insights:

1. **Most Common Price Range**: Restaurants in the **Low Price Range** 💰 are the most common.
2. **Average Ratings**: **Very High Price Range** 💎 restaurants tend to have the highest average ratings.
3. **Color Insights**: The color **Orange** 🌟 represents the highest average ratings for the **Low** and **Medium** price ranges, while **Yellow** 🌟 is most common for the **Very High** price range.

This analysis helps us understand how price influences customer ratings and how restaurants are perceived across different price ranges. 🎯

# Table Booking and Online Delivery Analysis

**1. Percentage of Restaurants Offering Table Booking and Online Delivery**

- **Percentage of Restaurants Offering Table Booking:**

```python
table_booking_percentage = (DATASET['Has Table booking'].value_counts(normalize=True) * 100).round(2)
print("Percentage of restaurants with and without table booking:")
print(table_booking_percentage)
```

**Output:**
```
Percentage of restaurants with and without table booking:
Has Table booking
No 87.88
Yes 12.12
Name: proportion, dtype: float64
```

- **Percentage of Restaurants Offering Online Delivery:**

```python
online_delivery_percentage = (DATASET['Has Online delivery'].value_counts(normalize=True) * 100).round(2)
print("\nPercentage of restaurants with and without online delivery:")
print(online_delivery_percentage)
```

**Output:**
```
Percentage of restaurants with and without online delivery:
Has Online delivery
No 74.34
Yes 25.66
Name: proportion, dtype: float64
```

**Plot for Distribution:**

```python
plt.figure(figsize=(8, 5))
sns.countplot(x='Has Table booking', data=DATASET, palette='pastel')
plt.title('Distribution of Restaurants with Table Booking')
plt.xlabel('Table Booking Available')
plt.ylabel('Number of Restaurants')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()

sns.countplot(x='Has Online delivery', data=DATASET, palette='coolwarm')
plt.title('Distribution of Restaurants with Online Delivery')
plt.xlabel('Online Delivery Available')
plt.ylabel('Number of Restaurants')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
```

**Explanation:**
- The `value_counts(normalize=True)` method calculates the relative frequency of each category.
- A `countplot` is used to visualize how many restaurants offer Table Booking and Online Delivery.

**2. Compare Average Ratings Based on Table Booking Availability**

```python
avg_rating_with_table_booking = DATASET[DATASET['Has Table booking'] == 'Yes']['Aggregate rating'].mean()
avg_rating_without_table_booking = DATASET[DATASET['Has Table booking'] == 'No']['Aggregate rating'].mean()

print("Average rating of restaurants with table booking:", round(avg_rating_with_table_booking, 2))
print("Average rating of restaurants without table booking:", round(avg_rating_without_table_booking, 2))
```

**Output:**
```
Average rating of restaurants with table booking: 3.44
Average rating of restaurants without table booking: 2.56
```

**Plot for Boxplot:**

```python
plt.figure(figsize=(10, 6))
sns.boxplot(x='Has Table booking', y='Aggregate rating', data=DATASET, palette='muted')
plt.title('Impact of Table Booking on Aggregate Rating')
plt.xlabel('Table Booking Available')
plt.ylabel('Aggregate Rating')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
```

**Explanation:**
- This compares the average ratings for restaurants that offer table booking with those that do not, using a boxplot for better visualization.

**3. Analyze Online Delivery Availability by Price Range**

```python
DATASET['Price Range Category'] = pd.cut(DATASET['Average Cost for two'], bins=[0, 500, 1000, 1500, 5000], labels=['Low', 'Medium', 'High', 'Very High'])
online_delivery_by_price = DATASET.groupby('Price Range Category')['Has Online delivery'].value_counts(normalize=True).unstack().fillna(0) * 100
print("Percentage of restaurants offering online delivery by price range:")
print(online_delivery_by_price.round(2))
```

**Output:**
```
Percentage of restaurants offering online delivery by price range:
Has Online delivery No Yes
Price Range Category
Low 82.12 17.88
Medium 54.91 45.09
High 64.81 35.19
Very High 77.90 22.10
```

**Plot for Stacked Bar Chart:**

```python
online_delivery_by_price.plot(kind='bar', stacked=True, figsize=(10, 6), colormap='viridis')
plt.title('Online Delivery Availability by Price Range')
plt.xlabel('Price Range Category')
plt.ylabel('Percentage (%)')
plt.legend(title='Online Delivery Available')
plt.show()
```

**Explanation:**
- The dataset is categorized into price ranges, and the percentage of restaurants offering online delivery is calculated for each category.
- A stacked bar chart visualizes the distribution of online delivery services across different price ranges.

# **Feature Engineering for Restaurant Dataset 🍽️**

## **Introduction 🎯**

Feature Engineering is a crucial step in enhancing the predictive power of machine learning models. It involves creating new features or modifying existing ones to better represent the underlying patterns in the data. This process helps improve model performance by providing additional useful information derived from raw data. 📊

In this analysis, we explore how feature engineering can be applied to a restaurant dataset by creating new features and encoding categorical variables for easier model interpretation. 🔍

## **Dataset Overview 📋**

The dataset used in this analysis contains restaurant information, including details such as:

- **Restaurant Name 🏢**
- **Address 📍**
- **Has Table Booking 📅**
- **Has Online Delivery 🚗**
- **Average Cost for Two 💸**
- **Aggregate Rating ⭐**

We will focus on transforming and encoding some key columns to generate new insights. 💡

## **Feature Engineering Steps ⚙️**

### 1. **Extracting Features: Length of Restaurant Name and Address 📏**

We begin by creating new features that capture the length of the restaurant's name and address. These features may provide insights into the level of detail in the restaurant's branding and location. 🏷️

#### **Code:**
```python
# Create new features for the length of the restaurant name and address
DATASET['Restaurant Name Length'] = DATASET['Restaurant Name'].apply(len)
DATASET['Address Length'] = DATASET['Address'].apply(len)

# Display the new columns for verification
print("Sample data with new length features:")
print(DATASET[['Restaurant Name', 'Restaurant Name Length', 'Address', 'Address Length']].head())
```

#### **Sample Data:**

| Restaurant Name | Restaurant Name Length | Address | Address Length |
|------------------------|------------------------|-----------------------------------------------------|----------------|
| Le Petit Souffle | 16 | Third Floor, Century City Mall, Kalayaan Avenue... | 71 |
| Izakaya Kikufuji | 16 | Little Tokyo, 2277 Chino Roces Avenue, Legaspi... | 67 |
| Heat - Edsa Shangri-La | 22 | Edsa Shangri-La, 1 Garden Way, Ortigas, Mandaluyong | 56 |
| Ooma | 4 | Third Floor, Mega Fashion Hall, SM Megamall, Ortigas| 70 |
| Sambo Kojin | 11 | Third Floor, Mega Atrium, SM Megamall, Ortigas... | 64 |

### 2. **Visualizing the Length Features 📊**

We plot the distributions of restaurant name length and address length to understand their variations. 📈

#### **Code:**
```python
# Plot distribution of 'Restaurant Name Length'
plt.figure(figsize=(12, 6))
sns.histplot(DATASET['Restaurant Name Length'], kde=True, color='green')
plt.title('Distribution of Restaurant Name Length')
plt.xlabel('Length of Restaurant Name')
plt.ylabel('Frequency')
plt.show()

# Plot distribution of 'Address Length'
plt.figure(figsize=(12, 6))
sns.histplot(DATASET['Address Length'], kde=True, color='orange')
plt.title('Distribution of Address Length')
plt.xlabel('Length of Address')
plt.ylabel('Frequency')
plt.show()
```

### 3. **Encoding Categorical Variables: Table Booking and Online Delivery 🧮**

Next, we encode the categorical features "Has Table Booking" and "Has Online Delivery" into binary numerical values. This transformation makes these features usable in predictive models. 🔢

#### **Code:**
```python
# Encoding 'Has Table booking' as a binary feature
DATASET['Has Table Booking (Encoded)'] = DATASET['Has Table booking'].map({'Yes': 1, 'No': 0})

# Encoding 'Has Online delivery' as a binary feature
DATASET['Has Online Delivery (Encoded)'] = DATASET['Has Online delivery'].map({'Yes': 1, 'No': 0})
print("Sample data with new encoded features:")
print(DATASET[['Has Table booking', 'Has Table Booking (Encoded)', 'Has Online delivery', 'Has Online Delivery (Encoded)']].head())
```

#### **Sample Data:**

| Has Table booking | Has Table Booking (Encoded) | Has Online delivery | Has Online Delivery (Encoded) |
|-------------------|-----------------------------|---------------------|-------------------------------|
| Yes | 1 | No | 0 |
| Yes | 1 | No | 0 |
| Yes | 1 | No | 0 |
| No | 0 | No | 0 |
| Yes | 1 | No | 0 |

### 4. **Visualizing the Encoded Features 📊**

We visualize the distribution of the binary encoded features to understand the proportion of restaurants offering table booking and online delivery. 📅📦

#### **Code:**
```python
# Visualize the proportion of restaurants with table booking
plt.figure(figsize=(8, 5))
sns.countplot(x='Has Table Booking (Encoded)', data=DATASET, palette='pastel')
plt.title('Distribution of Restaurants with Table Booking')
plt.xlabel('Has Table Booking (Encoded)')
plt.ylabel('Count')
plt.xticks([0, 1], ['No', 'Yes'])
plt.show()

# Visualize the proportion of restaurants with online delivery
plt.figure(figsize=(8, 5))
sns.countplot(x='Has Online Delivery (Encoded)', data=DATASET, palette='coolwarm')
plt.title('Distribution of Restaurants with Online Delivery')
plt.xlabel('Has Online Delivery (Encoded)')
plt.ylabel('Count')
plt.xticks([0, 1], ['No', 'Yes'])
plt.show()
```

## **Summary of Feature Engineering 🔧**

### **New Features Created:**
1. **Restaurant Name Length 📏:** The number of characters in the restaurant's name.
2. **Address Length 📏:** The number of characters in the restaurant's address.

### **Categorical Features Encoded:**
1. **Has Table Booking (Encoded) 📅:** Binary variable (1 for Yes, 0 for No) indicating whether a restaurant offers table booking.
2. **Has Online Delivery (Encoded) 🚗:** Binary variable (1 for Yes, 0 for No) indicating whether a restaurant offers online delivery.

### **Visual Insights:**
- The distribution of restaurant name and address lengths shows variation in branding and location descriptions. 🌍
- The encoded features reveal the proportion of restaurants offering table booking and online delivery services. 📅📦
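As a quick sanity check that the engineered features carry signal (a small sketch, not part of the original notebooks), we can correlate them with the target rating:

```python
# Correlation of the engineered features with the target variable
engineered = ['Restaurant Name Length', 'Address Length',
              'Has Table Booking (Encoded)', 'Has Online Delivery (Encoded)']
print(DATASET[engineered + ['Aggregate rating']].corr()['Aggregate rating'].round(3))
```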

### 1. **Data Preparation** 📊

- **Loading the Dataset**: First, we load the dataset using pandas.

```python
import warnings

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

warnings.filterwarnings('ignore')

file_path = r'Cognifyz-Data-Mastery-Program\DATASETS\Dataset .csv'
DATASET = pd.read_csv(file_path)
```

- **Drop Irrelevant Columns**: Drop columns that aren't useful for prediction, like `Restaurant Name`, `Address`, `Locality`, `Longitude`, and `Latitude`.

```python
# Drop irrelevant columns
data = DATASET.drop(columns=['Restaurant Name', 'Address', 'Locality', 'Longitude', 'Latitude'])
```

- **Label Encoding**: Categorical columns such as `Price Range Category`, `Has Table booking`, and `Has Online delivery` are converted into numeric values with Label Encoding so the models receive numeric input.

```python
label_encoder = LabelEncoder()
for col in ['Price Range Category', 'Has Table booking', 'Has Online delivery']:
    data[col] = label_encoder.fit_transform(data[col])
```

- **Selecting Relevant Features**: We now select the features (X) and target variable (y) for the model from the encoded frame.

```python
# Select features (X) and target (y)
X = data[['Price Range Category', 'Average Cost for two', 'Has Table booking', 'Has Online delivery', 'Votes']]
y = data['Aggregate rating']
```

- **Train-Test Split**: Split the data into training and testing sets (80% training, 20% testing).

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

---

### 2. **Model Building** 🤖

We will experiment with the following three models:

- **Linear Regression**
- **Decision Tree Regressor**
- **Random Forest Regressor**

#### **Linear Regression Model**:
```python
from sklearn.linear_model import LinearRegression

# Instantiate and train the Linear Regression model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Predict on the test set
lr_pred = lr_model.predict(X_test)
```

#### **Decision Tree Regressor**:
```python
from sklearn.tree import DecisionTreeRegressor

# Instantiate and train the Decision Tree model
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)

# Predict on the test set
dt_pred = dt_model.predict(X_test)
```

#### **Random Forest Regressor**:
```python
from sklearn.ensemble import RandomForestRegressor

# Instantiate and train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict on the test set
rf_pred = rf_model.predict(X_test)
```

---

### 3. **Model Evaluation** 📈

We will evaluate each model using the following metrics:

- **R-Squared**: Measures how well the model explains the variance in the target variable. (Higher is better)
- **Mean Squared Error (MSE)**: Measures the average squared difference between predicted and actual values. (Lower is better)
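For reference, with $y_i$ the actual ratings, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean rating, the two metrics are defined as:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$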

```python
from sklearn.metrics import mean_squared_error, r2_score

# Linear Regression evaluation
lr_r2 = r2_score(y_test, lr_pred)
lr_mse = mean_squared_error(y_test, lr_pred)

# Decision Tree evaluation
dt_r2 = r2_score(y_test, dt_pred)
dt_mse = mean_squared_error(y_test, dt_pred)

# Random Forest evaluation
rf_r2 = r2_score(y_test, rf_pred)
rf_mse = mean_squared_error(y_test, rf_pred)

# Print the results
print("Linear Regression - R2:", lr_r2, "MSE:", lr_mse)
print("Decision Tree - R2:", dt_r2, "MSE:", dt_mse)
print("Random Forest - R2:", rf_r2, "MSE:", rf_mse)
```

The results will likely show that Random Forest performs better due to its ability to handle non-linearity and interactions in the data.

---

### 4. **Compare the Models** 🤔

Compare the performance of the models based on **R-squared** and **MSE**:

- **R-squared**: The closer it is to 1, the better the model fits the data.
- **MSE**: The smaller it is, the more accurate the model.

You will typically see that Random Forest will have the highest R-squared and the lowest MSE.
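One convenient way to compare them side by side (a sketch reusing the metric variables computed above) is to collect the results in a small DataFrame:

```python
# Summarize model performance in one table, best R2 first
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Decision Tree', 'Random Forest'],
    'R2': [lr_r2, dt_r2, rf_r2],
    'MSE': [lr_mse, dt_mse, rf_mse],
}).sort_values('R2', ascending=False)

print(results.to_string(index=False))
```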

---

### 5. **Feature Importance (Random Forest)** 🔑

Random Forest allows us to visualize which features are contributing the most to predicting the target variable (`Aggregate rating`).

```python
importances = rf_model.feature_importances_
sorted_idx = importances.argsort()

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(X.columns[sorted_idx], importances[sorted_idx])
plt.title("Feature Importance (Random Forest)")
plt.xlabel("Feature Importance")
plt.show()
```

This plot will give you an idea of which features have the most significant impact on predicting restaurant ratings.

---

### 6. **Hyperparameter Tuning for Decision Tree and Random Forest** 🛠️

We can use **GridSearchCV** to find the best hyperparameters for the models. Here's an example for Random Forest:

```python
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter grid for Random Forest
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2', 1.0]  # 1.0 uses all features (the old 'auto' behaviour, removed in recent scikit-learn)
}

# Initialize GridSearchCV
grid_search_rf = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid_rf,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=2
)

# Fit the grid search on the training data
grid_search_rf.fit(X_train, y_train)

# Get the best hyperparameters
print("Best hyperparameters for Random Forest:", grid_search_rf.best_params_)
```

Once we find the best hyperparameters, we use the tuned model to predict:

```python
# Use the best model to predict
best_rf_model = grid_search_rf.best_estimator_
rf_pred_tuned = best_rf_model.predict(X_test)

# Evaluate the tuned Random Forest model
rf_r2_tuned = r2_score(y_test, rf_pred_tuned)
rf_mse_tuned = mean_squared_error(y_test, rf_pred_tuned)

print("Tuned Random Forest - R2:", rf_r2_tuned, "MSE:", rf_mse_tuned)
```

---

### 7. **Model Residual Analysis** 🔍

Residual analysis helps assess the model's fit and reveal whether it is underfitting or overfitting.

```python
# Calculate residuals for the Random Forest model (or any other model)
residuals_rf = y_test - rf_pred_tuned

# Plot residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_test, residuals_rf, color='blue', edgecolor='k', alpha=0.7)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residuals vs Actual Values (Random Forest)')
plt.xlabel('Actual Aggregate Rating')
plt.ylabel('Residuals')
plt.show()

# Histogram of residuals
plt.figure(figsize=(10, 6))
plt.hist(residuals_rf, bins=20, color='green', edgecolor='black', alpha=0.7)
plt.title('Distribution of Residuals (Random Forest)')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
```

---

### 8. **Model Interpretability (SHAP or LIME)** 🌟

If you want to understand how the model makes predictions, you can use **SHAP** or **LIME** for interpretability.

Here's how to use **SHAP** for the Random Forest model:

```python
import shap

# Create the SHAP explainer
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the test set
shap_values = explainer.shap_values(X_test)

# Plot summary of SHAP values
shap.summary_plot(shap_values, X_test)
```

This summary plot will show you how much each feature contributes to each individual prediction.

By the end of this process, you will have built, evaluated, and tuned three regression models: **Linear Regression**, **Decision Tree**, and **Random Forest**. Random Forest is likely to be the best-performing model due to its ability to handle complex relationships in the data. You will also have visualized feature importance, performed residual analysis, and used SHAP for model interpretability.

Feature engineering has provided us with insightful new features and encoded variables that can be used in predictive models. These transformations help capture meaningful patterns and relationships, making the data more suitable for machine learning algorithms. 💻

By understanding and applying these techniques, you can significantly improve the performance of your models and gain deeper insights into the data. 🚀

---

# ๐Ÿฝ๏ธ **Cuisine Ratings Analysis** ๐Ÿด

aims to analyze cuisine ratings, votes, and uncover patterns between them. We use clustering techniques and visualization to explore how votes and ratings interact and which cuisines perform best across various metrics. Below, youโ€™ll find the full analysis workflow, from data cleaning to clustering, along with in-depth explanations and visualizations.

## 🚀 **Workflow Overview**

### **1. Data Loading and Preprocessing** 🧹

We begin by loading the data and cleaning it. The dataset contains several cuisines with aggregate ratings, votes, and cuisine names. We clean the data, fill missing values, and prepare it for further analysis.

```python
import pandas as pd
import numpy as np

# Load the dataset
cuisine_data = pd.read_csv("Cuisine_Rating_Votes.csv")

# Forward-fill missing values (fillna(method='ffill') is deprecated in recent pandas)
cuisine_data.ffill(inplace=True)

# Summary of the dataset
cuisine_data.info()
```

- **Missing Values Handling**: `ffill()` forward-fills missing values from the previous row.
- **Dataset Overview**: `info()` gives a basic overview of the dataset's structure.

---

### **2. Exploratory Data Analysis (EDA)** 🔍

#### **Cuisines with Consistent Ratings** 💯

Next, we identify the cuisines that have consistent ratings by calculating the standard deviation of the aggregate ratings.

```python
# Calculate standard deviation of ratings for each cuisine
rating_std = cuisine_data.groupby('Cuisines')['Aggregate rating'].std()

# Cuisines with lowest standard deviation (consistent ratings)
consistent_cuisines = rating_std.sort_values().head(10)
```

- **Consistent Ratings**: Cuisines like `Italian`, `Hawaiian`, and `American` are identified as having the most consistent ratings, with low standard deviation.

#### **Top Cuisines by Average Rating** 🌟

We then calculate the average rating for each cuisine to find out which ones have the best average rating.

```python
# Calculate the average rating by cuisine
avg_rating_by_cuisine = cuisine_data.groupby('Cuisines')['Aggregate rating'].mean()

# Top 10 cuisines with highest average ratings
top_cuisines = avg_rating_by_cuisine.sort_values(ascending=False).head(10)
```

- **Top Cuisines**: This code highlights the cuisines with the highest average ratings, such as `Italian`, `Hawaiian`, and `American`.

#### **Cuisines Rated by the Most People** 👥

We now identify which cuisines have the most ratings, as more ratings usually indicate greater popularity.

```python
# Count the number of ratings for each cuisine
ratings_count = cuisine_data.groupby('Cuisines')['Votes'].sum()

# Top 10 cuisines rated by the most people
top_cuisines_by_votes = ratings_count.sort_values(ascending=False).head(10)
```

- **Most Rated Cuisines**: The most rated cuisines are those that have the highest number of votes, such as `American` and `Italian`.

---

### **3. Data Visualization** 📊

#### **Distribution of Aggregate Ratings** 📉

We visualize the distribution of ratings using a histogram to see the overall spread.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Histogram for Aggregate Ratings
sns.histplot(cuisine_data['Aggregate rating'], kde=True)
plt.title('Distribution of Aggregate Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()
```

- **Histogram**: The histogram shows the distribution of ratings across all cuisines, with a clear concentration of ratings between 4 and 5.

#### **Votes vs. Aggregate Rating** 📈

We use a scatter plot to visualize how the number of votes relates to the aggregate ratings.

```python
sns.scatterplot(x=cuisine_data['Votes'], y=cuisine_data['Aggregate rating'])
plt.title('Votes vs. Aggregate Rating')
plt.xlabel('Number of Votes')
plt.ylabel('Aggregate Rating')
plt.show()
```

- **Scatter Plot**: The plot shows that as the number of votes increases, the aggregate rating generally increases, with some outliers.
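To put a number on that relationship (a small sketch, not part of the original notebook), we can compute the correlation between the two columns:

```python
# Pearson and Spearman correlation between votes and rating
pearson = cuisine_data['Votes'].corr(cuisine_data['Aggregate rating'])
spearman = cuisine_data['Votes'].corr(cuisine_data['Aggregate rating'], method='spearman')
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```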

#### **Cuisines with the Most Consistent Ratings** 📏

We create a bar plot to display the cuisines with the most consistent ratings.

```python
sns.barplot(x=consistent_cuisines.index, y=consistent_cuisines.values)
plt.title('Cuisines with Most Consistent Ratings')
plt.xlabel('Cuisine')
plt.ylabel('Standard Deviation of Ratings')
plt.xticks(rotation=90)
plt.show()
```

- **Bar Plot**: The plot highlights the top cuisines with the lowest standard deviations in their ratings, indicating consistency.

---

### **4. Clustering Cuisines** 🤖

We apply KMeans clustering to group cuisines based on their `Votes` and `Aggregate rating` values. This allows us to find patterns in how cuisines are rated and voted upon.

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Select relevant features for clustering
X = cuisine_data[['Votes', 'Aggregate rating']]

# Normalize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=0)
cuisine_data['Cluster'] = kmeans.fit_predict(X_scaled)

# Visualizing the clusters
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=cuisine_data['Cluster'], palette='viridis')
plt.title('Clustering of Cuisines Based on Votes and Ratings')
plt.xlabel('Normalized Votes')
plt.ylabel('Normalized Aggregate Rating')
plt.show()
```

- **Clustering**: We use KMeans clustering to categorize cuisines into three groups based on their vote count and rating.
- **Visualization**: The scatter plot visualizes how different cuisines are clustered based on these features.

---

### **5. Insights and Summary** 💡

From the analysis, we gain the following insights:

- **Top Rated Cuisines**: `Italian`, `American`, and `Hawaiian` are among the top-rated cuisines.
- **Consistency**: Cuisines with low standard deviation in ratings, like `Italian`, `American`, and `Mexican`, are highly consistent in their ratings.
- **Popularity**: Cuisines with the most votes are generally those that have more global recognition, such as `Italian` and `American`.
- **Cluster Groupings**: Clustering based on `Votes` and `Aggregate rating` reveals that cuisines like `Italian` and `Mexican` form their own clusters based on higher ratings and votes.

---

## 🛠️ **Libraries Used**

- **Pandas**: For data manipulation and analysis.
- **Matplotlib & Seaborn**: For static and statistical visualizations.
- **Scikit-learn**: For clustering techniques.
- **NumPy**: For numerical operations.
- **Plotly**: For creating interactive visualizations.

This project provides an in-depth analysis of cuisine ratings, votes, and how they correlate with each other. By using clustering techniques, we uncover hidden patterns and gain insight into which cuisines are consistently rated highly and which are most popular by votes. The combination of data cleaning, EDA, and clustering makes this a comprehensive exploration of the cuisine ratings dataset.

---

### 📊 **Votes vs Aggregate Rating Dashboard** 🚀

The following visualization showcases the relationship between the **number of votes** and the **Aggregate Rating** for each cuisine. It helps us understand how higher ratings correlate with more votes, providing insight into the popularity and consistency of cuisines.

#### **Visualization** 🖼️

The plot below visualizes the correlation between `Votes` and `Aggregate Rating` for each cuisine:

![Votes vs Aggregate Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/Capture.PNG?raw=true)

## 📊 **Data Visualization Gallery** 🚀

### Here are various visualizations for a better understanding of the data:

| ![Average Rating by Cuisine](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/average_rating_by_cuisine.png?raw=true) | ![Boxplot Ratings by Cuisine](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/boxplot_ratings_by_cuisine.png?raw=true) |
| --- | --- |
| **Average Rating by Cuisine** | **Boxplot Ratings by Cuisine** |

| ![Correlation Heatmap](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/correlation_heatmap.png?raw=true) | ![Jointplot Votes vs Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/jointplot_votes_rating.png?raw=true) |
| --- | --- |
| **Correlation Heatmap** | **Jointplot Votes vs Rating** |

| ![Pair Plot](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/pair_plot.png?raw=true) | ![Pairplot Votes vs Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/pairplot_votes_rating.png?raw=true) |
| --- | --- |
| **Pair Plot** | **Pairplot Votes vs Rating** |

| ![Rating Distribution Boxplot](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/rating_distribution_boxplot.png?raw=true) | ![Rating Distribution Histogram](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/rating_distribution_histogram.png?raw=true) |
| --- | --- |
| **Rating Distribution Boxplot** | **Rating Distribution Histogram** |

| ![Swarmplot Ratings by Cuisines](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/swarmplot_ratings_cuisines.png?raw=true) | ![Top Cuisines Average Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/top_cuisines_avg_rating.png?raw=true) |
| --- | --- |
| **Swarmplot Ratings by Cuisines** | **Top Cuisines Average Rating** |

| ![Violinplot Votes by Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/violinplot_votes_by_rating.png?raw=true) | ![Votes vs Aggregate Rating](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/votes_vs_aggregate_rating.png?raw=true) |
| --- | --- |
| **Violinplot Votes by Rating** | **Votes vs Aggregate Rating** |

| ![Votes vs Rating Scatter](https://github.com/rubydamodar/Cognifyz-Data-Mastery-Program/blob/main/LEVEL%203%20TASK%203%20Data%20Visualization/votes_vs_rating_scatter.png?raw=true) | |
| --- | --- |
| **Votes vs Rating Scatter** | |