https://github.com/2000pawan/sentiment-analysis-
This project performs sentiment analysis on text extracted from URLs using Natural Language Processing (NLP) techniques. It analyzes the sentiment and readability of the extracted content and saves the results in an Excel file.
- Host: GitHub
- URL: https://github.com/2000pawan/sentiment-analysis-
- Owner: 2000pawan
- License: mit
- Created: 2025-03-26T18:14:56.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-28T10:35:36.000Z (about 1 year ago)
- Last Synced: 2025-07-20T07:32:59.845Z (9 months ago)
- Topics: artificial-intelligence, beautifulsoup, machine-learning, nlp, nltk, python, requests-library-python, selenium-python, sentiment-analysis, webscraping
- Language: Jupyter Notebook
- Homepage:
- Size: 367 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
- Readme: README.md
- License: LICENSE
---
# **Sentiment Analysis Project**
## **📌 Project Overview**
This project performs **sentiment analysis** on text extracted from URLs using **Natural Language Processing (NLP)** techniques. It analyzes the sentiment and readability of the extracted content and saves the results in an Excel file.
---
## **1. Approach to the Solution**
### **🔹 Step 1: Load Necessary Data**
- The input file `input.xlsx` contains **URLs** with their corresponding **URL_IDs**.
- Stopwords are loaded from the [`StopWords`](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/StopWords) folder to filter out unnecessary words.
- A **positive** and **negative** words dictionary is loaded from [`MasterDictionary`](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/MasterDictionary).
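The loading step can be sketched as follows. The helper name `load_word_set`, the `;` comment convention, and the exact file layout are assumptions for illustration, not code from the repository:

```python
import os
import pandas as pd

def load_word_set(folder):
    """Merge every .txt file in a folder into one lowercase word set,
    skipping blank lines and ';'-prefixed comment lines (an assumed
    convention for the dictionary files)."""
    words = set()
    for name in os.listdir(folder):
        if not name.endswith(".txt"):
            continue
        with open(os.path.join(folder, name), encoding="ISO-8859-1") as f:
            for line in f:
                word = line.strip().lower()
                if word and not word.startswith(";"):
                    words.add(word)
    return words

# Assumed layout: input.xlsx (columns URL_ID, URL) next to the script,
# with StopWords/ and MasterDictionary/ folders as downloaded from the repo.
# df = pd.read_excel("input.xlsx")
# stop_words = load_word_set("StopWords")
```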
### **🔹 Step 2: Extract Text from URLs**
- The script fetches the webpage content using `requests` and **BeautifulSoup**.
- It extracts the **title** and **article text** from the page's HTML tags.
- The extracted text is stored in the `url_text/` folder as a `.txt` file for each URL.
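A minimal sketch of the extraction step, assuming the script reads the page `<title>` and all `<p>` paragraphs; the repository's exact tag selection may differ:

```python
import requests
from bs4 import BeautifulSoup

def parse_article(html):
    """Pull the title and the concatenated paragraph text out of raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return title, body

def fetch_article(url):
    """Download a page and parse it; a User-Agent header helps avoid 403s."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    resp.raise_for_status()
    return parse_article(resp.text)
```

Each result can then be written to `url_text/<URL_ID>.txt`, one file per URL.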
### **🔹 Step 3: Preprocessing the Extracted Text**
- Tokenization is performed using `nltk.word_tokenize()`.
- Stopwords and non-alphanumeric words are removed.
### **🔹 Step 4: Sentiment Analysis & Text Complexity Metrics**
For each extracted text, the script computes:
✔ **Positive Score** (count of positive words)
✔ **Negative Score** (count of negative words)
✔ **Polarity Score** (positive vs. negative balance)
✔ **Subjectivity Score** (extent of opinion-based content)
✔ **Complexity Measures**:
- **Fog Index** (readability metric)
- **Syllables per word**
- **Complex word count** (words with >2 syllables)
- **Personal Pronoun Count** (e.g., "I", "we", "my")
✔ **General Statistics**:
- **Average sentence length**
- **Average word length**
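The metrics above can be computed with the conventional definitions, e.g. Fog Index = 0.4 × (average sentence length + fraction of complex words). Both these formulas and the syllable heuristic are assumptions about the repository's implementation:

```python
import re

def count_syllables(word):
    """Approximate syllables as vowel groups, discounting 'es'/'ed' endings."""
    groups = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith(("es", "ed")) and groups > 1:
        groups -= 1
    return max(groups, 1)

def text_metrics(tokens, sentence_count, positive, negative):
    """Sentiment and readability scores for one document's cleaned tokens."""
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    n = len(tokens)
    complex_words = sum(count_syllables(t) > 2 for t in tokens)
    avg_sentence_len = n / max(sentence_count, 1)
    return {
        "positive_score": pos,
        "negative_score": neg,
        "polarity": (pos - neg) / (pos + neg + 1e-6),
        "subjectivity": (pos + neg) / (n + 1e-6),
        "avg_sentence_length": avg_sentence_len,
        "complex_word_count": complex_words,
        "fog_index": 0.4 * (avg_sentence_len + complex_words / max(n, 1)),
        "avg_word_length": sum(len(t) for t in tokens) / max(n, 1),
        "personal_pronouns": sum(t in {"i", "we", "my", "ours", "us"}
                                 for t in tokens),
    }
```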
### **🔹 Step 5: Save Output to Excel**
- The computed scores and metrics are saved in `output.xlsx`.
- Each row contains a **URL_ID, URL, and sentiment analysis results**.
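Writing the report reduces to one `pandas` call (the `openpyxl` engine handles `.xlsx`). The row contents below are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical rows produced by the earlier steps; one dict per URL.
rows = [{"URL_ID": 1, "URL": "https://example.com",
         "positive_score": 2, "negative_score": 1, "polarity": 0.33}]
pd.DataFrame(rows).to_excel("output.xlsx", index=False)  # requires openpyxl
```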
---
## **2. How to Run the Script**
### **💻 Prerequisites**
Ensure you have **Python 3.x** installed on your system.
### **📌 Steps to Run the Script**
1️⃣ **Clone the Repository**
Run the following command to download the project:
```sh
git clone https://github.com/2000pawan/Sentiment-Analysis-
```
2️⃣ **Install Dependencies**
Run the following command in your terminal or command prompt:
```sh
pip install pandas requests beautifulsoup4 nltk openpyxl
```
3️⃣ **Prepare the Input File**
- Place `input.xlsx` in the same directory as the script.
- Ensure it has **two columns: `URL_ID` and `URL`**.
4️⃣ **Download Master Dictionary & Stopwords**
- **Master Dictionary:** [Download Here](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/MasterDictionary)
- **Stopwords:** [Download Here](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/StopWords)
Make sure these are placed in their respective folders.
5️⃣ **Run the Script**
Navigate to the project folder and execute:
```sh
python Sentiment_analysis.py
```
6️⃣ **View the Output**
- Extracted webpage text is saved in `url_text/`.
- The final **sentiment analysis report** is saved as `output.xlsx`.
---
## **3. Required Dependencies**
Ensure you have the following Python libraries installed:
| **Library** | **Purpose** | **Installation Command** |
|--------------------|-------------|-------------------------|
| `pandas` | Handling Excel data | `pip install pandas` |
| `requests` | Fetching webpage content | `pip install requests` |
| `beautifulsoup4` | Parsing HTML | `pip install beautifulsoup4` |
| `nltk` | Natural Language Processing | `pip install nltk` |
| `openpyxl` | Handling Excel files | `pip install openpyxl` |
Additionally, download the **NLTK tokenizer data** by running:
```python
import nltk
nltk.download('punkt')
```
---
## **4. Import Commands**
Ensure your script includes the following imports at the beginning:
```python
import os
import pandas as pd
import requests
from bs4 import BeautifulSoup
import nltk
import re
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')
```
---
## **📌 Notes**
✅ If an error occurs due to a missing directory (`url_text/`), manually create it or ensure the script includes:
```python
os.makedirs("url_text", exist_ok=True)
```
✅ Ensure your input Excel file (`input.xlsx`) is correctly formatted.
✅ If you face encoding issues, use `ISO-8859-1` while reading text files.
---
## **License**
This project is released under the MIT License and is open to modification and enhancement.
**🎯 Project Completed 🚀**