https://github.com/harshitwaldia/exploratory-data-analysis

Exploratory Data Analysis with data cleaning, visualization, and insights discovery.

Host: GitHub
URL: https://github.com/harshitwaldia/exploratory-data-analysis
Owner: HarshitWaldia
Created: 2025-09-26T04:21:37.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-09-26T04:45:25.000Z (9 months ago)
Last Synced: 2025-09-26T06:20:44.393Z (9 months ago)
Topics: exploratory-data-analysis, jypyternotebook, outlier-detection, python, sentiment-analysis, textblob, wordcloud-visualization
Language: Jupyter Notebook
Homepage:
Size: 6.05 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# 📊 Amazon Product Reviews - Exploratory Data Analysis (EDA)

## 📌 Overview
This project performs **Exploratory Data Analysis (EDA)** on an Amazon product dataset.
The dataset contains product details, prices, discounts, ratings, reviews, and user information.

The goal of this analysis is to:
- Understand the structure and quality of the dataset.
- Identify trends in pricing, discounting, and ratings.
- Explore customer review patterns.
- Detect potential issues like missing values, duplicates, or imbalances.

---

## 🗂️ Dataset Description
The dataset includes the following key columns:

| Column | Description |
|--------|-------------|
| `product_id` | Unique identifier for each product |
| `product_name` | Name/description of the product |
| `category` | Product category (e.g., Electronics, Accessories) |
| `discounted_price` | Selling price after discount |
| `actual_price` | Original price before discount |
| `discount_percentage` | Percentage discount offered |
| `rating` | Customer rating (out of 5) |
| `rating_count` | Number of ratings |
| `about_product` | Short description/features |
| `user_id` | Unique ID of reviewer |
| `user_name` | Name of reviewer |
| `review_id` | Unique ID of review |
| `review_title` | Title of review |
| `review_content` | Full review text |
| `img_link` | Product image link |
| `product_link` | Product page link |

---

## 🔍 Steps in EDA
### 1. **Data Inspection**
- Used `.info()` to check data types, null values, and dataset size.
- Found that most columns are complete, with very few missing values.
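
The inspection step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the column names come from the dataset description above, but the sample rows are invented and the real data would be loaded from the project's dataset file instead.

```python
import pandas as pd

# Hypothetical sample rows; the values are invented for illustration only.
df = pd.DataFrame({
    "product_id": ["P1", "P2", "P3", "P4"],
    "category": ["Electronics", "Electronics", "Home", None],
    "rating": ["4.2", "4.0", None, "3.9"],
    "rating_count": ["1,200", "85", "310", None],
})

# Data types, non-null counts, and memory usage in one summary.
df.info()

# Missing values per column, largest first.
missing = df.isna().sum().sort_values(ascending=False)
print(missing)

# Dataset size as (rows, columns).
print(df.shape)
```

Sorting the per-column null counts makes it easy to see at a glance whether missing values are concentrated in a few columns.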

### 2. **Descriptive Statistics**
- Applied `.describe()` to both numeric and categorical columns.
- Found **mean ≈ median** in prices → suggests a roughly symmetric price distribution.
- Ratings cluster around **4.1**, indicating a bias toward positive ratings.
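
A minimal sketch of this step is shown below, assuming the price and rating columns have already been converted to numeric types. The sample values are invented, so the mean and median here will not match the project's figures.

```python
import pandas as pd

# Hypothetical, already-cleaned sample; values are invented for illustration.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Home", "Home", "Office"],
    "discounted_price": [399.0, 499.0, 1299.0, 899.0, 249.0],
    "rating": [4.2, 4.0, 4.1, 4.3, 3.9],
})

# Numeric columns: count, mean, std, min, quartiles, max.
numeric_summary = df.describe()

# Non-numeric columns: count, unique, top, freq.
categorical_summary = df.select_dtypes(exclude="number").describe()

# Comparing mean and median is a quick symmetry check:
# a mean well above the median hints at a right-skewed distribution.
mean_price = df["discounted_price"].mean()
median_price = df["discounted_price"].median()

print(numeric_summary)
print(categorical_summary)
print(f"mean={mean_price:.2f}, median={median_price:.2f}")
```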

### 3. **Correlation Analysis**
- Computed correlation matrix for numeric features.
- Observed a strong negative correlation between `discount_percentage` and `discounted_price`.
- Weak/no correlation between `rating` and price → ratings do not appear to be price-driven.
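
One way to compute such a matrix is sketched below. It assumes the numeric columns are already cleaned and uses invented sample values, so the coefficients are illustrative only. Note that `DataFrame.corr()` defaults to Pearson correlation, which only measures linear relationships.

```python
import pandas as pd

# Hypothetical, already-cleaned sample; values are invented for illustration.
df = pd.DataFrame({
    "product_id": ["P1", "P2", "P3", "P4", "P5"],
    "actual_price": [1000.0, 1200.0, 800.0, 1500.0, 900.0],
    "discounted_price": [900.0, 840.0, 400.0, 600.0, 270.0],
    "discount_percentage": [10.0, 30.0, 50.0, 60.0, 70.0],
    "rating": [4.2, 4.0, 4.1, 4.3, 3.9],
})

# Keep only numeric columns so identifiers and text are excluded.
numeric = df.select_dtypes("number")

# Pairwise Pearson correlation between all numeric features.
corr = numeric.corr()
print(corr.round(2))

# Correlation of every feature with the rating, strongest first.
print(corr["rating"].drop("rating").abs().sort_values(ascending=False))
```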

### 4. **Visualizations**
- **Bar Chart**: Average rating per category.
- **Boxplot**: Discount % distribution across categories.
- **Scatterplot**: Discounted price vs rating.
- **Word Cloud**: Most frequent terms in reviews.
- **Heatmap**: Correlations between numeric features.
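
The sketch below shows how three of these charts could be drawn. It is not the notebook's code: it uses plain Matplotlib only (rather than Seaborn or WordCloud) to keep dependencies minimal, and the sample values are invented.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical, already-cleaned sample; values are invented for illustration.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Home", "Home", "Office"],
    "discounted_price": [399.0, 499.0, 1299.0, 899.0, 249.0],
    "discount_percentage": [64.0, 50.0, 35.0, 40.0, 10.0],
    "rating": [4.2, 4.0, 4.4, 4.2, 3.8],
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Bar chart: average rating per category, highest first.
avg_rating = df.groupby("category")["rating"].mean().sort_values(ascending=False)
axes[0].bar(list(avg_rating.index), avg_rating.to_numpy())
axes[0].set_title("Average rating per category")

# Scatterplot: discounted price vs rating.
axes[1].scatter(df["discounted_price"], df["rating"])
axes[1].set_xlabel("discounted_price")
axes[1].set_ylabel("rating")
axes[1].set_title("Discounted price vs rating")

# Heatmap: correlation matrix of the numeric features.
corr = df.select_dtypes("number").corr()
image = axes[2].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[2].set_xticks(range(len(corr.columns)))
axes[2].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[2].set_yticks(range(len(corr.columns)))
axes[2].set_yticklabels(corr.columns)
axes[2].set_title("Correlation heatmap")
fig.colorbar(image, ax=axes[2])

fig.tight_layout()
# In a notebook, call plt.show(); in a script, fig.savefig(...) writes the figure to disk.
```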

### 5. **Data Quality Checks**
- Found duplicate product IDs (the same product appears in multiple review rows).
- Prices and discounts were stored as strings containing `₹` and `%` → cleaned and converted to numeric.
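
A minimal sketch of both checks follows. The helper name `to_number`, the set of characters it strips (currency symbol, thousands separators, percent sign, whitespace), and the sample values are assumptions made for illustration; the notebook may clean the columns differently.

```python
import pandas as pd

# Hypothetical raw rows; values are invented for illustration.
df = pd.DataFrame({
    "product_id": ["P1", "P1", "P2"],
    "discounted_price": ["₹399", "₹399", "₹1,299"],
    "actual_price": ["₹1,099", "₹1,099", "₹2,599"],
    "discount_percentage": ["64%", "64%", "50%"],
})


def to_number(series: pd.Series) -> pd.Series:
    """Strip currency symbols, thousands separators, and % signs, then parse."""
    cleaned = series.astype(str).str.replace(r"[₹,%\s]", "", regex=True)
    # Values that still cannot be parsed become NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


for col in ["discounted_price", "actual_price", "discount_percentage"]:
    df[col] = to_number(df[col])

# Rows that share a product_id with at least one other row.
dup_rows = df[df.duplicated(subset="product_id", keep=False)]

# Number of rows whose product_id has already appeared earlier.
n_repeated_ids = int(df["product_id"].duplicated().sum())

print(df.dtypes)
print(dup_rows)
print(n_repeated_ids)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected values, at the cost of turning them into missing values that should then be counted and reviewed.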

---

## 📈 Insights
- Many products receive **4★ or higher** → customer reviews skew positive.
- Discounts are widely offered, with discounts of around 50% being the most frequent.
- Certain categories dominate the dataset (e.g., Electronics & Accessories).
- Some reviews and users appear multiple times → dataset contains duplicate/overlapping entries.
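
Figures like these could be derived along the following lines. This is a sketch under stated assumptions: the sample values are invented, the 10-point discount bins are an arbitrary choice, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical, already-cleaned sample; values are invented for illustration.
df = pd.DataFrame({
    "category": ["Electronics", "Electronics", "Electronics",
                 "Home", "Home", "Office"],
    "discount_percentage": [50, 52, 48, 10, 75, 55],
    "rating": [4.2, 4.0, 4.5, 3.6, 4.1, 3.9],
    "user_id": ["U1", "U2", "U1", "U3", "U4", "U2"],
})

# Share of rows rated 4 stars or higher.
share_4_plus = (df["rating"] >= 4).mean()

# Most common discount range, using left-closed 10-point bins: [0, 10), [10, 20), ...
# A discount of exactly 100 would fall outside these bins and become NaN.
discount_bins = pd.cut(df["discount_percentage"], bins=list(range(0, 101, 10)), right=False)
most_common_bin = discount_bins.value_counts().idxmax()

# Fraction of rows per category, to spot dominant categories.
category_share = df["category"].value_counts(normalize=True)

# Users that appear in more than one row.
user_counts = df["user_id"].value_counts()
repeat_users = user_counts[user_counts > 1]

print(f"Rated 4+ : {share_4_plus:.0%}")
print(f"Most common discount range: {most_common_bin}")
print(category_share)
print(repeat_users)
```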

---

## 🛠️ Tools & Libraries
- **Python 3**
- **Pandas** → data cleaning & manipulation
- **NumPy** → numerical operations
- **Matplotlib / Seaborn** → data visualization
- **WordCloud** → review text analysis

---

## 📌 How to Run
1. Clone the repository:
   ```
   git clone https://github.com/HarshitWaldia/Exploratory-Data-Analysis.git
   cd Exploratory-Data-Analysis
   ```
2. Install required libraries:
   ```
   pip install -r requirements.txt
   ```
3. Open the Jupyter Notebook:
   ```
   jupyter notebook Amazon_EDA.ipynb
   ```
4. Run the cells step by step to reproduce the analysis.

## 🚀 Future Work

- **Build a recommendation system using ratings & categories.**

- **Perform sentiment analysis on review text.**

- **Use ML models to predict product ratings based on price & discount.**

## 👨‍💻 Author

**Harshit Waldia**
