https://github.com/2000pawan/sentiment-analysis-

Sentiment analysis of text scraped from articles on different websites.
artificial-intelligence beautifulsoup machine-learning nlp nltk python requests-library-python selenium-python sentiment-analysis webscraping


Host: GitHub
URL: https://github.com/2000pawan/sentiment-analysis-
Owner: 2000pawan
License: mit
Created: 2025-03-26T18:14:56.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-03-28T10:35:36.000Z (7 months ago)
Last Synced: 2025-03-28T11:23:33.865Z (7 months ago)
Topics: artificial-intelligence, beautifulsoup, machine-learning, nlp, nltk, python, requests-library-python, selenium-python, sentiment-analysis, webscraping
Language: Jupyter Notebook
Homepage:
Size: 367 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# **Sentiment Analysis Project**

## **📌 Project Overview**

This project performs **sentiment analysis** on text extracted from URLs using **Natural Language Processing (NLP)** techniques. It analyzes the sentiment and readability of the extracted content and saves the results in an Excel file.

---

## **1. Approach to the Solution**

### **🔹 Step 1: Load Necessary Data**

- The input file `input.xlsx` contains **URLs** with their corresponding **URL_IDs**.

- Stopwords are loaded from the [`StopWords`](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/StopWords) folder to filter out unnecessary words.

- A **positive** and **negative** words dictionary is loaded from [`MasterDictionary`](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/MasterDictionary).
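A minimal sketch of the loading step, assuming each folder holds plain-text files with one word per line. The helper names `load_word_file` and `load_word_set` are hypothetical, and the file names inside `MasterDictionary` are not stated here, so they are left to the caller.

```python
from pathlib import Path


def load_word_file(path, encoding="ISO-8859-1"):
    """Return the lowercased, non-empty lines of one word-list file as a set."""
    text = Path(path).read_text(encoding=encoding)
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_word_set(folder, encoding="ISO-8859-1"):
    """Merge every .txt word list found in `folder` into a single set."""
    words = set()
    for path in sorted(Path(folder).glob("*.txt")):
        words |= load_word_file(path, encoding=encoding)
    return words
```

With this layout, `load_word_set("StopWords")` would return the combined stopword set, while `load_word_file` can be pointed at the positive and negative word lists separately.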

### **🔹 Step 2: Extract Text from URLs**

- The script fetches the webpage content using `requests` and **BeautifulSoup**.

- It extracts the **title** and **article text** from the page's HTML tags.

- The extracted text is stored in the `url_text/` folder as a `.txt` file for each URL.
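The extraction step might look roughly like the sketch below. The tag names are an assumption (the title is taken from the first `<h1>` and the article text from every `<p>`), and the function names and the `timeout` value are hypothetical, not taken from the project's script.

```python
import requests
from bs4 import BeautifulSoup


def extract_title_and_text(html):
    """Parse an HTML string and return (title, article_text)."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find("h1")  # assumption: the title lives in the first <h1>
    title = heading.get_text().strip() if heading else ""
    # assumption: the article body is the text of all <p> tags, one per line
    paragraphs = [p.get_text().strip() for p in soup.find_all("p")]
    return title, "\n".join(p for p in paragraphs if p)


def fetch_article(url, timeout=10):
    """Download a page and extract its title and text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return extract_title_and_text(response.text)
```

Keeping the parsing separate from the download makes it possible to try the extraction on a saved HTML string without network access.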

### **🔹 Step 3: Preprocessing the Extracted Text**

- Tokenization is performed using `nltk.word_tokenize()`.

- Stopwords and non-alphanumeric words are removed.
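As a rough illustration of this step, the sketch below lowercases the tokens and drops stopwords and non-alphanumeric tokens. The function name `preprocess` and its `tokenize` parameter are hypothetical; the stopword set is assumed to be lowercase already.

```python
def preprocess(text, stopwords, tokenize=None):
    """Tokenize `text` and keep lowercase alphanumeric tokens that are not stopwords."""
    if tokenize is None:
        # Imported lazily because word_tokenize needs the NLTK tokenizer data.
        from nltk.tokenize import word_tokenize
        tokenize = word_tokenize
    tokens = (token.lower() for token in tokenize(text))
    return [t for t in tokens if t.isalnum() and t not in stopwords]
```

Passing a different tokenizer, for example `str.split`, lets the filter be tried without downloading any NLTK data.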

### **🔹 Step 4: Sentiment Analysis & Text Complexity Metrics**

For each extracted text, the script computes:

✔ **Positive Score** (count of positive words)  

✔ **Negative Score** (count of negative words)  

✔ **Polarity Score** (positive vs. negative balance)  

✔ **Subjectivity Score** (extent of opinion-based content)  

✔ **Complexity Measures**:

   - **Fog Index** (readability metric)

   - **Syllables per word**

   - **Complex word count** (words with >2 syllables)

   - **Personal Pronoun Count** (e.g., "I", "we", "my")

✔ **General Statistics**:

   - **Average sentence length**

   - **Average word length**
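The README does not spell out the formulas, so the sketch below is one plausible reading rather than the script's actual implementation. It assumes polarity is `(pos - neg) / (pos + neg)`, subjectivity is `(pos + neg) / total words`, the Fog Index is the standard Gunning form `0.4 * (average sentence length + percentage of complex words)`, and syllables are estimated by counting vowel groups. All function names and the pronoun list are hypothetical.

```python
import re

EPS = 1e-6  # keeps the ratios defined when a count is zero

# Illustrative pronoun list; "US" in capitals is deliberately not matched.
PRONOUNS = re.compile(r"\b(?:I|[Ww]e|[Mm]y|[Oo]urs|[Uu]s)\b")


def sentiment_scores(tokens, positive, negative):
    """Count dictionary hits and derive polarity and subjectivity (assumed formulas)."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return {
        "positive": pos,
        "negative": neg,
        "polarity": (pos - neg) / (pos + neg + EPS),
        "subjectivity": (pos + neg) / (len(tokens) + EPS),
    }


def count_syllables(word):
    """Heuristic syllable count: number of vowel groups, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_metrics(sentences, tokens):
    """Readability and length statistics for a list of sentences and word tokens."""
    n_words = max(1, len(tokens))
    n_sentences = max(1, len(sentences))
    syllables = [count_syllables(t) for t in tokens]
    complex_count = sum(1 for s in syllables if s > 2)  # words with >2 syllables
    avg_sentence_length = len(tokens) / n_sentences
    return {
        "avg_sentence_length": avg_sentence_length,
        "complex_word_count": complex_count,
        "fog_index": 0.4 * (avg_sentence_length + 100 * complex_count / n_words),
        "syllables_per_word": sum(syllables) / n_words,
        "avg_word_length": sum(len(t) for t in tokens) / n_words,
    }


def count_personal_pronouns(text):
    """Count pronoun matches in the raw (not lowercased) text."""
    return len(PRONOUNS.findall(text))
```

The pronoun count runs on the raw text rather than the lowercased tokens so that the country abbreviation "US" is not mistaken for the pronoun "us".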

### **🔹 Step 5: Save Output to Excel**

- The computed scores and metrics are saved in `output.xlsx`.

- Each row contains a **URL_ID, URL, and sentiment analysis results**.
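Writing the results could be sketched as below. Only `URL_ID` and `URL` are column names given in this README; the metric column names in the usage note and the helper names are hypothetical.

```python
import pandas as pd


def build_output_frame(rows):
    """Turn a list of per-URL result dicts into a DataFrame, one row per URL."""
    return pd.DataFrame(rows)


def save_output(frame, path="output.xlsx"):
    """Write the frame to an Excel file (pandas uses openpyxl for .xlsx)."""
    frame.to_excel(path, index=False)
```

For example, `save_output(build_output_frame([{"URL_ID": 1, "URL": "https://example.com", "POSITIVE SCORE": 4}]))` would produce a one-row `output.xlsx`.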

---

## **2. How to Run the Script**

### **💻 Prerequisites**

Ensure you have **Python 3.x** installed on your system.

### **📌 Steps to Run the Script**

1️⃣ **Clone the Repository**  

Run the following command to download the project:

```sh
git clone https://github.com/2000pawan/Sentiment-Analysis-
```

2️⃣ **Install Dependencies**  

Run the following command in your terminal or command prompt:

```sh
pip install pandas requests beautifulsoup4 nltk openpyxl
```

3️⃣ **Prepare the Input File**  

- Place `input.xlsx` in the same directory as the script.

- Ensure it has **two columns: `URL_ID` and `URL`**.

4️⃣ **Download Master Dictionary & Stopwords**  

- **Master Dictionary:** [Download Here](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/MasterDictionary)  

- **Stopwords:** [Download Here](https://github.com/2000pawan/Sentiment-Analysis-/tree/main/StopWords)  

Make sure these are placed in folders named `MasterDictionary/` and `StopWords/`, matching the repository layout.

5️⃣ **Run the Script**  

Navigate to the project folder and execute:

```sh
python Sentiment_analysis.py
```

6️⃣ **View the Output**  

- Extracted webpage text is saved in `url_text/`.

- The final **sentiment analysis report** is saved as `output.xlsx`.

---

## **3. Required Dependencies**

Ensure you have the following Python libraries installed:

| **Library**        | **Purpose**  | **Installation Command** |
|--------------------|-------------|-------------------------|
| `pandas`          | Handling Excel data  | `pip install pandas` |
| `requests`        | Fetching webpage content  | `pip install requests` |
| `beautifulsoup4`  | Parsing HTML  | `pip install beautifulsoup4` |
| `nltk`            | Natural Language Processing  | `pip install nltk` |
| `openpyxl`        | Handling Excel files  | `pip install openpyxl` |

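Additionally, download the **NLTK tokenizer data** by running:

```python
import nltk

# Tokenizer models used by word_tokenize() and sent_tokenize().
# Recent NLTK releases may look for 'punkt_tab' instead; if a LookupError
# names that resource, download it the same way.
nltk.download('punkt')
```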

---

## **4. Import Commands**

Ensure your script includes the following imports at the beginning:

```python
import os
import re

import nltk
import pandas as pd
import requests
from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')
```

---

## **📌 Notes**

✅ If an error occurs due to a missing directory (`url_text/`), manually create it or ensure the script includes:

```python
os.makedirs("url_text", exist_ok=True)
```

✅ Ensure your input Excel file (`input.xlsx`) is correctly formatted.

✅ If you face encoding issues, use `ISO-8859-1` while reading text files.
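One way to apply the encoding note is to try UTF-8 first and fall back to `ISO-8859-1` only when decoding fails. This is a sketch; the helper name `read_text_file` is hypothetical.

```python
def read_text_file(path):
    """Read a text file as UTF-8, falling back to ISO-8859-1 on decode errors."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this read cannot fail
        # on decoding, although some characters may come out wrong.
        with open(path, encoding="ISO-8859-1") as handle:
            return handle.read()
```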

---

## **License**

This project is open-source under the MIT license (see `LICENSE`) and is available for modification and enhancement.

**🎯 Project Completed 🚀**
