Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/yash22222/web-scraping-for-data-analysis-predictive-model-on-customer-data
Web scraping was used to collect customer feedback on Air India, followed by in-depth data analysis and machine-learning-based predictive modeling, supporting data-driven decisions to improve services and customer satisfaction.
- Host: GitHub
- URL: https://github.com/yash22222/web-scraping-for-data-analysis-predictive-model-on-customer-data
- Owner: Yash22222
- Created: 2024-01-05T13:57:45.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-05T14:54:06.000Z (about 1 year ago)
- Last Synced: 2024-11-09T23:22:27.637Z (2 months ago)
- Topics: data-analysis, data-preprocessing, data-science, data-visualization, exploratory-data-analysis, machine-learning, powerbi, random-forest-classifier, sentiment-analysis, tableau, web-scraping
- Language: Jupyter Notebook
- Homepage: https://yashashokshirsath.netlify.app/
- Size: 5.08 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Web Scraping for Data Analysis & Predictive Model on Customer’s Data
Customer feedback on Air India was collected via web scraping, analyzed in depth, and used to build machine-learning models for predictive insights, supporting data-driven decisions that improve services and customer satisfaction.

## Table of Contents
* [Web Scraping for Data Analysis](#web-scraping-for-data-analysis)
* [Web Scraping](#web-scraping)
* [Data Preprocessing](#data-preprocessing)
* [Sentiment Analysis](#sentiment-analysis)
* [Data Visualization](#data-visualization)
* [Predictive Modelling on Customer's Data](#predictive-modelling-on-customers-data)
* [Exploratory Data Analysis](#exploratory-data-analysis)
* [Mutual Information graphs](#mutual-information-graphs)
* [Test and Train Model](#test-and-train-model)
* [Validate Model](#validate-model)
* [Conclusion](#conclusion)
* [Libraries Utilized](#libraries-utilized)

# **Web Scraping for Data Analysis**
## **Web Scraping**
Web scraping was employed to gather customer reviews of Air India from [Airline Quality](https://www.airlinequality.com/airline-reviews/air-india/). Customer comments, ratings, and other relevant information were extracted from the site and compiled into the "Reviews Dataset" for further analysis, such as predicting customer buying behaviour or understanding customer sentiment towards Air India's services.
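A minimal sketch of how such a scrape could look with requests and BeautifulSoup. The page count, the `text_content` CSS class, and the output filename are assumptions about the site's markup and the notebook, not details confirmed by the repository.

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

BASE_URL = "https://www.airlinequality.com/airline-reviews/air-india/"
reviews = []

# The number of pages and the review container class are assumptions.
for page in range(1, 11):
    response = requests.get(f"{BASE_URL}page/{page}/", timeout=30)
    soup = BeautifulSoup(response.content, "html.parser")
    for block in soup.find_all("div", class_="text_content"):
        reviews.append(block.get_text(strip=True))

# Compile the scraped comments into the "Reviews Dataset".
pd.DataFrame({"reviews": reviews}).to_csv("air_india_reviews.csv", index=False)
```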
## **Data Preprocessing**
Data preprocessing is a crucial step in the data mining process, involving cleaning, transforming, and integrating data for analysis.
Its goal is to enhance data quality and suitability for the task at hand.

### **Data Cleaning**
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. Two cleaning steps were applied here (see the sketch after this list):
- Removal of the text before the '|' separator in each review
- Removal of all special characters from the dataframe
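A minimal pandas sketch of the two cleaning steps above; the `reviews` column name and CSV filename are assumptions carried over from the scraping sketch.

```python
import re
import pandas as pd

df = pd.read_csv("air_india_reviews.csv")

# Keep only the text after the '|' separator (e.g. "Trip Verified | ...").
df["reviews"] = df["reviews"].str.split("|").str[-1].str.strip()

# Remove all special characters, keeping letters, digits and whitespace.
df["reviews"] = df["reviews"].apply(lambda text: re.sub(r"[^A-Za-z0-9\s]", "", str(text)))
```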
### **Tokenization**
- Tokenization is the process of dividing text into a set of meaningful pieces.
- Tokens were converted to (word, POS-tag) tuples via POS tagging and then reduced to their base forms through lemmatization, as sketched below.
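A minimal NLTK sketch of this step, assuming the cleaned text lives in `df["reviews"]`; the Penn-Treebank-to-WordNet tag mapping is the usual convention, not necessarily the repository's exact code.

```python
import nltk
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

for resource in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()

def wordnet_pos(tag):
    # Map Penn Treebank tags (NN, VB, JJ, RB, ...) to WordNet POS constants.
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(tag[0], wordnet.NOUN)

def lemmatize_review(text):
    tagged = pos_tag(word_tokenize(text.lower()))  # list of (token, POS) tuples
    return " ".join(lemmatizer.lemmatize(token, wordnet_pos(tag)) for token, tag in tagged)

df["lemmatized"] = df["reviews"].apply(lemmatize_review)
```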
## **Sentiment Analysis**
Sentiment analysis is the process of analyzing digital text to determine whether the emotional tone of the message is positive, negative, or neutral.

### **VADER**
- VADER (Valence Aware Dictionary and sEntiment Reasoner) is a sentiment analyzer, available through NLTK and as the standalone vaderSentiment package, that provides sentiment scores based on the words used.
- It is a rule-based sentiment analyzer in which terms are generally labeled according to their semantic orientation as positive, negative, or neutral.
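A minimal sketch using the vaderSentiment package listed under Libraries Utilized; the ±0.05 compound-score thresholds are the conventional VADER cut-offs, assumed here rather than taken from the notebook.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def label_sentiment(text):
    # The compound score ranges from -1 (most negative) to +1 (most positive).
    compound = analyzer.polarity_scores(text)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

df["sentiment"] = df["reviews"].apply(label_sentiment)
print(df["sentiment"].value_counts())
```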
## **Data Visualization**
Data visualization uses graphics like charts, plots, infographics, and animations to represent complex data relationships and provide easy-to-understand insights.

### **via. Matplotlib**
Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.

### **via. WordCloud**
A word cloud is a visualization technique that represents the frequency of words in a text, where the size of each word reflects how often it appears (see the sketch below).
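A minimal sketch of both visualization routes, assuming the cleaned reviews and the sentiment labels from the earlier sketches; figure sizes and titles are illustrative.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

# Bar chart of sentiment counts via Matplotlib / pandas.
df["sentiment"].value_counts().plot.bar(title="Sentiment of Air India reviews")
plt.show()

# Word cloud: the size of each word reflects how often it appears in the reviews.
text = " ".join(df["reviews"])
cloud = WordCloud(width=800, height=400, background_color="white",
                  stopwords=STOPWORDS).generate(text)
plt.figure(figsize=(10, 5))
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```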
# **Predictive Modelling on Customer's Data**
Predictive models are machine learning algorithms trained on high-quality customer data, which must be manipulated and prepared before the models can accurately predict target outcomes.

## **Exploratory Data Analysis**
- Exploratory Data Analysis is a crucial step in the data analysis process, where the primary goal is to understand the data, gain insights, and identify patterns or relationships between variables.
- The Chardet library (a universal character-encoding detector) was used to determine the CSV file's encoding before loading it, and the data was then checked for null values (see the sketch below).
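A minimal sketch of this step; the customer-data filename (`customer_booking.csv`) is an assumption, not a name confirmed by the repository.

```python
import chardet
import pandas as pd

# Detect the file's character encoding before loading it.
with open("customer_booking.csv", "rb") as f:
    encoding = chardet.detect(f.read(100_000))["encoding"]

customers = pd.read_csv("customer_booking.csv", encoding=encoding)
print(customers.isnull().sum())   # count missing values per column
customers.describe(include="all")
```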
## **Mutual Information graphs**
- MI score graphs visualize feature relevance to the target variable, measuring dependency and aiding feature selection.
- The scikit-learn (sklearn) library computes the MI score between each feature and the target, and the scores are plotted for comparison, as sketched below.
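A minimal scikit-learn sketch; the target column name (`booking_complete`) and the integer encoding of categorical features are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_selection import mutual_info_classif

X = customers.drop(columns=["booking_complete"])  # assumed target column name
y = customers["booking_complete"]

# Encode categorical columns as integer codes so MI can be computed.
X = X.apply(lambda col: col.factorize()[0] if col.dtype == "object" else col)

mi_scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
mi_scores.sort_values().plot.barh(title="Mutual information with the target")
plt.show()
```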
## **Test and Train Model**
- The train/test split is a crucial step in building and evaluating machine learning models, dividing the dataset into training and test sets.
- Training sets typically contain 70-80% of the data, while test sets hold the remaining 20-30%.
- Here the data is split into training, validation, and test sets so that the model is trained, tuned, and evaluated on different subsets, which helps prevent overfitting and gives a more reliable evaluation (see the combined sketch below).

**MinMaxScaler**
Min-Max Scaling is a preprocessing technique for scaling numerical features to a fixed range, ensuring consistent scaling across all features.
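A minimal sketch combining the split and the scaling; the 70/15/15 ratios and the random seed are assumptions within the 70-80 / 20-30 ranges mentioned above.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# First carve out 30% for validation + test, then split that portion half-and-half.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42, stratify=y_temp)

# Fit the scaler on the training data only, then reuse it for the other sets.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)
```

Fitting the scaler only on the training set avoids leaking information from the validation and test sets into training.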
### **via. Random Forest Classifier**
Random Forest is an ensemble learning method that combines multiple decision trees, capturing complex relationships and interactions for more accurate and robust models; a training sketch follows the results below.
- For top-6 features (Accuracy = 74.5762)
- For all features (Accuracy = 71.1864)
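A minimal training-and-scoring sketch for this step; the hyperparameters are scikit-learn defaults plus a fixed seed, not values confirmed by the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Score on the validation split created earlier.
print(f"Random Forest accuracy: {accuracy_score(y_val, rf.predict(X_val)):.4f}")
```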
### **via. XGB (Extreme Gradient Boosting) Classifier**
XGBoost is a popular machine learning algorithm that uses gradient boosting to optimize model performance and computational efficiency; a comparable sketch follows the results below.
- For top-6 features (Accuracy = 71.1864)
- For all features (Accuracy = 71.1864)
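A comparable sketch with XGBoost; the hyperparameters shown are common defaults, assumed rather than taken from the repository.

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

xgb = XGBClassifier(n_estimators=100, learning_rate=0.1,
                    eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)

print(f"XGBoost accuracy: {accuracy_score(y_val, xgb.predict(X_val)):.4f}")
```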
## **Validate Model**
Validating the model on the test dataset is an essential step in the machine learning workflow to assess how well the model performs on unseen data.
- Accuracy = 71.1864

# **Conclusion**
**The Random Forest classifier with the top 6 features showed slightly higher accuracy than XGBoost. It can be used to predict customer satisfaction or similar target variables on comparable datasets, although performance may vary depending on data quality and representativeness.**

## **Libraries Utilized**
- BeautifulSoup (bs4)
- Chardet
- Matplotlib
- Natural Language Toolkit (nltk)
- Numpy (np)
- Pandas (pd)
- Requests (requests)
- Seaborn (sns)
- Scikit-learn (sklearn)
- VaderSentiment (SentimentIntensityAnalyzer)
- Warnings
- WordCloud

## **License**
This project is distributed under the MIT License [![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT).