https://github.com/shadowfaxx1/web-sraper
web-scraper with sentiment analysis
- Host: GitHub
- URL: https://github.com/shadowfaxx1/web-sraper
- Owner: shadowfaxx1
- Created: 2023-06-12T19:52:02.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-08-16T13:03:16.000Z (almost 2 years ago)
- Last Synced: 2025-01-26T08:41:44.206Z (4 months ago)
- Language: Python
- Homepage:
- Size: 637 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: readme.md
README
# 🔰 WEB SCRAPING
### What it does 🩹
It extracts every article along with its heading, stores each one in a new .txt file, and then runs a sentiment analysis on the article, producing the fields below.
### Fields ⤵️
- "url"
- "Positive Sentences"
- "Negative Sentences"
- "Polarity"
- "Subjectivity",
- "Average Sentence Length"
- "Complex Word Percentage",
- "Fog Index"
- "Average WordLength"
- "Complex Word Count"
- "Word Count""Syllable Count"
- "Personal Pronouns"## IMPORTANT LIBRARIES TO INSTALL
## IMPORTANT LIBRARIES TO INSTALL
- beautifulsoup4 (BeautifulSoup)
- requests
- pandas
- nltk
- os, re, string (Python standard library; no installation needed)
## INSTALLATION
1. Clone this repository to your local machine.
2. Install the required dependencies by running `pip install -r requirements.txt`.
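A plausible `requirements.txt` for the libraries above; the package names are assumed, `openpyxl` is added because `pandas.read_excel` needs it for .xlsx files, and the repository's own file may differ:

```
beautifulsoup4
requests
pandas
nltk
openpyxl   # used by pandas.read_excel for .xlsx input
```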
## CHANGING PATHS
Update the following paths to match their locations on your machine:
- stopword folder
- dict_negative
- dict_positive
- inputfile.xlsx
- output folder for storing the text file created for every scraped article

## USAGE
FOLLOW THESE STEPS:
1. Change the paths of the stopwords folder and the dictionary files:
```
import os

stopw = set()   # global set of stop words

def initialization():
    # Paths: adjust them to the location of your data.
    stopword_folder = r"StopWords"            # a folder, not a single file
    dictionary_positive = r"positivewords.txt"
    dictionary_negative = r"negativewords.txt"
    # Collect every stop word from every file in the folder, lower-cased.
    for filename in os.listdir(stopword_folder):
        with open(os.path.join(stopword_folder, filename), 'r') as file:
            stopw.update(word.lower() for word in file.read().splitlines())
```
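The positive and negative dictionaries can be loaded the same way and used for scoring. A minimal sketch follows; the `load_dictionary` and `polarity_score` helpers and the epsilon-based polarity formula are common conventions for this kind of analysis, not code confirmed from the repo:

```
def load_dictionary(path):
    # One sentiment word per line; lower-case and skip blanks.
    with open(path, 'r') as f:
        return {w.lower() for w in f.read().splitlines() if w.strip()}

positive_words = load_dictionary(r"positivewords.txt")
negative_words = load_dictionary(r"negativewords.txt")

def polarity_score(words):
    # words: lower-cased tokens with stop words already removed.
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    return (pos - neg) / (pos + neg + 1e-6)   # epsilon avoids division by zero
```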
2. Change the path of the input file; provide an Excel (.xlsx) file containing the article URLs:
```
import pandas as pd   # reading .xlsx files also requires openpyxl

def file_open():
    filepath = r"input.xlsx"
    df = pd.read_excel(filepath)   # load the sheet of article URLs
    dataset = list()
```
## SCREENSHOTS
> ### SAMPLE REQUESTS PINGING

> ### SAMPLE OUTPUT
