https://github.com/sonwaneshivani/web-scraping-and-text-analysis-script
Assignment by Blackcoffer using NLP
- Host: GitHub
- URL: https://github.com/sonwaneshivani/web-scraping-and-text-analysis-script
- Owner: sonwaneshivani
- Created: 2024-06-15T05:22:31.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-06-15T05:37:12.000Z (12 months ago)
- Last Synced: 2025-01-14T06:28:23.423Z (5 months ago)
- Topics: nlp, webscraping
- Language: Python
- Homepage:
- Size: 336 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
## Web Scraping and Text Analysis Script
This Python script performs web scraping, text processing, and analysis on a dataset of URLs. It generates an output file containing various metrics and analysis results.

### Dependencies
- Python 3.x
- pandas
- requests
- BeautifulSoup (beautifulsoup4)
- regex
- nltk
You can install the required dependencies using pip:
```bash
pip install pandas requests beautifulsoup4 regex nltk
```

### How to Run
### Clone the Repository
Clone this repository to your local machine.

### Install Dependencies
Install the required dependencies using the command mentioned above.

### Prepare Dataset
Ensure that you have a dataset named Input.xlsx in the Dataset directory. This dataset should contain two columns: URL and URL_ID.
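A minimal sketch for checking the input file before running the script, assuming it is read with pandas (reading .xlsx files also requires the openpyxl package):

```python
# Verify that Input.xlsx has the columns the script expects.
import pandas as pd

df = pd.read_excel("Dataset/Input.xlsx")
assert {"URL", "URL_ID"}.issubset(df.columns), "Input.xlsx must contain URL and URL_ID columns"
print(df.head())
```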
### Run the Script
Execute the Python script web_scraping_text_analysis.py using the following command:
```bash
python web_scraping_text_analysis.py
```

### Check Output
Once the script execution is complete, you'll find the output file named Output Data Structure.xlsx in the root directory. This file contains the analysis results.
### Approach
**Web Scraping:** The script iterates through the URLs provided in the dataset, scrapes the text content from the web pages, and stores it in a DataFrame.

**Text Processing:** The scraped text is preprocessed by converting it to lowercase, removing non-alphabetic characters, tokenizing, removing stopwords, and lemmatizing.
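A minimal sketch of these two steps, assuming requests/BeautifulSoup for scraping and NLTK for preprocessing. The function names are illustrative, not taken from the repository, and the NLTK corpora (punkt, stopwords, wordnet) must be downloaded once with nltk.download:

```python
import re

import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize


def scrape_text(url: str) -> str:
    """Fetch a page and return its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.get_text(separator=" ")


def preprocess(text: str) -> list[str]:
    """Lowercase, drop non-alphabetic characters, tokenize, remove stopwords, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words]
```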
**Sentiment Analysis:** Positive and negative word counts are calculated based on predefined dictionaries of positive and negative words. From these counts, sentiment and polarity scores are calculated.
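A minimal sketch of the scoring, assuming the common polarity formula (positive minus negative over their sum); the word-list arguments stand in for the predefined dictionaries:

```python
def sentiment_scores(tokens: list[str], positive_words: set[str], negative_words: set[str]) -> dict:
    """Count dictionary hits and derive a polarity score in [-1, 1]."""
    positive_score = sum(1 for tok in tokens if tok in positive_words)
    negative_score = sum(1 for tok in tokens if tok in negative_words)
    polarity_score = (positive_score - negative_score) / (positive_score + negative_score + 1e-6)
    return {
        "POSITIVE SCORE": positive_score,
        "NEGATIVE SCORE": negative_score,
        "POLARITY SCORE": polarity_score,
    }
```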
**Text Metrics Calculation:** Various text metrics, such as average sentence length, percentage of complex words, Fog index, and average words per sentence, are calculated using NLTK functions.
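A minimal sketch of these metrics using NLTK tokenizers. The vowel-group syllable heuristic and the "more than two syllables" threshold for complex words are assumptions, and the Fog index here follows the standard Gunning formula:

```python
import re

from nltk.tokenize import sent_tokenize, word_tokenize


def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic for syllable counting."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def readability_metrics(text: str) -> dict:
    """Compute sentence-level readability metrics for one document."""
    sentences = sent_tokenize(text)
    words = [w for w in word_tokenize(text) if w.isalpha()]
    complex_words = [w for w in words if count_syllables(w) > 2]

    avg_sentence_length = len(words) / max(len(sentences), 1)  # also avg words per sentence
    pct_complex = 100 * len(complex_words) / max(len(words), 1)
    fog_index = 0.4 * (avg_sentence_length + pct_complex)

    return {
        "AVG SENTENCE LENGTH": avg_sentence_length,
        "PERCENTAGE OF COMPLEX WORDS": pct_complex,
        "FOG INDEX": fog_index,
    }
```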
**Output Generation:** The calculated metrics are added to the DataFrame, unnecessary columns are dropped, and the final DataFrame is saved to an Excel file.
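A minimal sketch of this final step; the column name dropped below is hypothetical, and the actual columns the script removes may differ:

```python
import pandas as pd


def save_output(df: pd.DataFrame, metrics: list[dict]) -> None:
    """Merge per-URL metric dicts into the DataFrame and write the Excel output."""
    result = df.join(pd.DataFrame(metrics, index=df.index))
    result = result.drop(columns=["scraped_text"], errors="ignore")  # hypothetical helper column
    result.to_excel("Output Data Structure.xlsx", index=False)
```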