https://github.com/sonwaneshivani/web-scraping-and-text-analysis-script
Assignment by Blackcoffer using NLP
- Host: GitHub
- URL: https://github.com/sonwaneshivani/web-scraping-and-text-analysis-script
- Owner: sonwaneshivani
- Created: 2024-06-15T05:22:31.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-06-15T05:37:12.000Z (12 months ago)
- Last Synced: 2025-01-14T06:28:23.423Z (5 months ago)
- Topics: nlp, webscraping
- Language: Python
- Homepage:
- Size: 336 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Web Scraping and Text Analysis Script
This Python script performs web scraping, text processing, and analysis on a dataset of URLs. It generates an output file containing various metrics and analysis results.

### Dependencies
- Python 3.x
- pandas
- requests
- BeautifulSoup (beautifulsoup4)
- regex
- nltk
You can install the required dependencies using pip:
```bash
pip install pandas requests beautifulsoup4 regex nltk
```

### How to Run
### Clone the Repository
Clone this repository to your local machine.

### Install Dependencies
Install the required dependencies using the command mentioned above.

### Prepare Dataset
Ensure that you have a dataset named Input.xlsx in the Dataset directory. This dataset should contain two columns: URL and URL_ID.
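A minimal sketch for checking the input file before running the script, assuming it is read with pandas (reading .xlsx files also requires the openpyxl package):

```python
# Verify that Input.xlsx has the columns the script expects.
import pandas as pd

df = pd.read_excel("Dataset/Input.xlsx")
assert {"URL", "URL_ID"}.issubset(df.columns), "Input.xlsx must contain URL and URL_ID columns"
print(df.head())
```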
### Run the Script
Execute the Python script web_scraping_text_analysis.py using the following command:
```bash
python web_scraping_text_analysis.py
```

### Check Output
Once the script execution is complete, you'll find the output file named Output Data Structure.xlsx in the root directory. This file contains the analysis results.
### Approach
**Web Scraping:** The script iterates through the URLs provided in the dataset, scrapes the text content from the web pages, and stores it in a DataFrame.

**Text Processing:** The scraped text is preprocessed by converting it to lowercase, removing non-alphabetic characters, tokenizing, removing stopwords, and lemmatizing.
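A minimal sketch of these two steps, assuming requests/BeautifulSoup for scraping and NLTK for preprocessing. The function names are illustrative, not taken from the repository, and the NLTK corpora (punkt, stopwords, wordnet) must be downloaded once with nltk.download:

```python
import re

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def scrape_text(url: str) -> str:
    """Fetch a page and return its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ")


def preprocess(text: str) -> list[str]:
    """Lowercase, drop non-alphabetic characters, tokenize, remove stopwords, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
```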
**Sentiment Analysis:** Positive and negative word counts are calculated based on predefined dictionaries of positive and negative words. From these counts, sentiment and polarity scores are calculated.
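A minimal sketch of the scoring, assuming the common polarity formula (positive minus negative over their sum); the word-list arguments stand in for the predefined dictionaries:

```python
def sentiment_scores(tokens: list[str], positive_words: set[str], negative_words: set[str]) -> dict:
    """Count dictionary hits and derive a polarity score in [-1, 1]."""
    positive_score = sum(1 for tok in tokens if tok in positive_words)
    negative_score = sum(1 for tok in tokens if tok in negative_words)
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 1e-6)
    return {
        "POSITIVE SCORE": positive_score,
        "NEGATIVE SCORE": negative_score,
        "POLARITY SCORE": polarity_score,
    }
```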
**Text Metrics Calculation:** Various text metrics, such as average sentence length, percentage of complex words, Fog index, and average words per sentence, are calculated using NLTK functions.
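A minimal sketch of these metrics using NLTK tokenizers. The vowel-group syllable heuristic and the "more than two syllables" threshold for complex words are assumptions, and the Fog index here follows the standard Gunning formula:

```python
import re

from nltk.tokenize import sent_tokenize, word_tokenize


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic for syllable counting."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability_metrics(text: str) -> dict:
    """Compute sentence-level readability metrics for one document."""
    sentences = sent_tokenize(text)
    words = [w for w in word_tokenize(text) if w.isalpha()]
    complex_words = [w for w in words if count_syllables(w) > 2]

    avg_sentence_length = len(words) / max(len(sentences), 1)  # also avg words per sentence
    pct_complex = 100 * len(complex_words) / max(len(words), 1)
    fog_index = 0.4 * (avg_sentence_length + pct_complex)

    return {
        "AVG SENTENCE LENGTH": avg_sentence_length,
        "PERCENTAGE OF COMPLEX WORDS": pct_complex,
        "FOG INDEX": fog_index,
    }
```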
**Output Generation:** The calculated metrics are added to the DataFrame, unnecessary columns are dropped, and the final DataFrame is saved to an Excel file.
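A minimal sketch of this final step; the column name dropped below is hypothetical, and the actual columns the script removes may differ:

```python
import pandas as pd


def save_output(df: pd.DataFrame, metrics: list[dict]) -> None:
    """Merge per-URL metric dicts into the DataFrame and write the Excel output."""
    result = df.join(pd.DataFrame(metrics, index=df.index))
    result = result.drop(columns=["scraped_text"], errors="ignore")  # hypothetical helper column
    result.to_excel("Output Data Structure.xlsx", index=False)
```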