https://github.com/mishaa931/web-scraping-data-cleaning-feature-extraction-sentiment-analysis
A script to scrape articles about Transportation from a website. The data is then preprocessed to apply feature extraction and sentiment analysis techniques.
- Host: GitHub
- URL: https://github.com/mishaa931/web-scraping-data-cleaning-feature-extraction-sentiment-analysis
- Owner: Mishaa931
- Created: 2022-09-30T17:35:08.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-22T19:52:17.000Z (almost 2 years ago)
- Last Synced: 2025-01-12T01:49:44.499Z (4 months ago)
- Topics: data-cleaning, feature-extraction, polarity, scraping-websites, sentiment-analysis, subjectivity
- Language: Jupyter Notebook
- Homepage:
- Size: 199 KB
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Web Scraping - Data Cleaning - Feature Extraction - Sentiment Analysis
# Web scraping using BeautifulSoup
# Step 01: Import Libraries
The most important libraries for web scraping and data preprocessing are as follows (a sketch of the corresponding imports appears after this list):
- Pandas
- NumPy
- Requests
- BeautifulSoup
- NLTK
- string
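A minimal sketch of the imports this setup implies (the exact modules in the notebook may differ):

```python
import string

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads for the NLTK resources used in the cleaning step
nltk.download("stopwords")
nltk.download("wordnet")
```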
# Step 02: Extracting sublinks (Pages) from the main URL
While extracting articles from the URL, we also collect the links of the other pages reachable from the main link. This process follows the steps below (see the sketch after the list):
1. Access the link using the get() function from the Requests library.
2. Parse the response with BeautifulSoup.
3. Find all the href attributes and save them in a list using a loop.
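A sketch of these three steps, assuming a hypothetical main_url (the actual site is not named in the README):

```python
main_url = "https://example.com/transportation"  # hypothetical; replace with the real site

response = requests.get(main_url)                    # 1. access the link
soup = BeautifulSoup(response.text, "html.parser")   # 2. parse the HTML

# 3. collect every href into a list, keeping only absolute links
sublinks = []
for anchor in soup.find_all("a", href=True):
    href = anchor["href"]
    if href.startswith("http"):
        sublinks.append(href)
```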
# Step 03: Extracting and Storing Data
1. The extracted URLs are now accessed one by one in a loop and parsed with BeautifulSoup.
2. After accessing each URL, the get_text() function extracts the article text.
3. The data is stored in a DataFrame using Pandas (see the sketch below).
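Continuing the sketch, one way to loop over the collected sublinks and store the article text:

```python
articles = []
for link in sublinks:
    page = requests.get(link)
    page_soup = BeautifulSoup(page.text, "html.parser")
    # get_text() flattens the page into plain text
    articles.append(page_soup.get_text(separator=" ", strip=True))

df = pd.DataFrame({"url": sublinks, "article": articles})
```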
# Step 04: Data Preprocessing
Text cleaning: before the data can be processed by any machine learning model, the text must be cleaned. This involves (a sketch follows the list):
- removing extra spaces
- removing punctuation marks
- removing special characters
- removing stop words
- lemmatizing each word
- lowercasing all text using lower()
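A minimal cleaning function along these lines (the exact order of operations in the notebook may differ):

```python
import re

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def clean_text(text):
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"[^a-z\s]", " ", text)                             # special characters
    text = re.sub(r"\s+", " ", text).strip()                          # extra spaces
    words = [lemmatizer.lemmatize(w) for w in text.split()
             if w not in stop_words]                                  # stop words, lemmas
    return " ".join(words)

df["clean"] = df["article"].apply(clean_text)
```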
# Step 05: Feature Extraction
Feature extraction is the process of transforming raw data into numerical features that can be processed while preserving the information in the original data set. It generally yields better results than applying machine learning directly to the raw data.
For this purpose, we tokenize the text with a TF-IDF vectorizer and apply its fit_transform() function to the cleaned data.
We can then call get_feature_names() to get the features of each article. This new, reduced set of features should summarize most of the information contained in the original set of features.
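A sketch of the TF-IDF step; note that scikit-learn 1.0 renamed get_feature_names() to get_feature_names_out(), and the max_features cap here is an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.probability import FreqDist

vectorizer = TfidfVectorizer(max_features=500)       # vocabulary cap is an assumption
tfidf_matrix = vectorizer.fit_transform(df["clean"])

features = vectorizer.get_feature_names_out()        # get_feature_names() on older scikit-learn

# Frequency distribution of the tokens in the first article
freq = FreqDist(df["clean"].iloc[0].split())
print(freq.most_common(10))
```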
For further clarity, we find the frequency distribution of each feature (as in the FreqDist call above).

# Step 06: Sentiment Analysis
Sentiment analysis is the process of determining the attitude or emotion of the writer, i.e., whether it is positive, negative, or neutral.
The sentiment property of TextBlob returns two values: polarity and subjectivity.
Polarity is a float in the range [-1, 1], where 1 means a positive statement and -1 a negative one. Subjective sentences generally express personal opinion, emotion, or judgment, whereas objective sentences convey factual information. Subjectivity is also a float, in the range [0, 1]. For our scraped articles, we also have values for negative, positive, neutral, and compound sentiment; each value defines the scope of each article.
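The polarity/subjectivity pair comes from TextBlob, while the negative/neutral/positive/compound scores described above match NLTK's VADER analyzer; a sketch combining both, assuming that pairing:

```python
import nltk
from textblob import TextBlob
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")
vader = SentimentIntensityAnalyzer()

def sentiment_scores(text):
    blob = TextBlob(text)
    return {
        "polarity": blob.sentiment.polarity,          # float in [-1, 1]
        "subjectivity": blob.sentiment.subjectivity,  # float in [0, 1]
        **vader.polarity_scores(text),                # neg, neu, pos, compound
    }

scores = df["clean"].apply(sentiment_scores).apply(pd.Series)
df = pd.concat([df, scores], axis=1)
```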