https://github.com/shadowfaxx1/web-sraper
web-scraper with sentiment analysis
- Host: GitHub
- URL: https://github.com/shadowfaxx1/web-sraper
- Owner: shadowfaxx1
- Created: 2023-06-12T19:52:02.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-08-16T13:03:16.000Z (almost 2 years ago)
- Last Synced: 2025-01-26T08:41:44.206Z (4 months ago)
- Language: Python
- Homepage:
- Size: 637 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: readme.md
README
# 🔰 WEB SCRAPING
### What it does 🩹
It extracts every article along with its heading, stores each one in a new .txt file, and then runs a sentiment analysis on the article, producing the fields below.
### Fields ⤵️
- "url"
- "Positive Sentences"
- "Negative Sentences"
- "Polarity"
- "Subjectivity",
- "Average Sentence Length"
- "Complex Word Percentage",
- "Fog Index"
- "Average WordLength"
- "Complex Word Count"
- "Word Count""Syllable Count"
- "Personal Pronouns"## IMPORTANT LIBRARIES TO INSTALL
## IMPORTANT LIBRARIES TO INSTALL
- beautifulsoup4 (BeautifulSoup)
- requests
- pandas
- nltk
- os, re, string (Python standard library; no installation needed)
## INSTALLATION
1. Clone this repository to your local machine.
2. Install the required dependencies by running `pip install -r requirements.txt`.
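A plausible `requirements.txt` for the libraries above; the package names are assumed, `openpyxl` is added because `pandas.read_excel` needs it for .xlsx files, and the repository's own file may differ:

```
beautifulsoup4
requests
pandas
nltk
openpyxl   # used by pandas.read_excel for .xlsx input
```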
## CHANGING PATHS
Update the following paths to match their locations on your machine:
- stopword folder
- dict_negative
- dict_positive
- inputfile.xlsx
- output folder for storing the text file created for every scraped article

## USAGE
FOLLOW THESE STEPS:
1. Change the paths of the stopwords folder and the dictionary files:
```
import os

stopw = set()   # global set of stop words

def initialization():
    # Paths: adjust them to the location of your data.
    stopword_folder = r"StopWords"            # a folder, not a single file
    dictionary_positive = r"positivewords.txt"
    dictionary_negative = r"negativewords.txt"
    # Collect every stop word from every file in the folder, lower-cased.
    for filename in os.listdir(stopword_folder):
        with open(os.path.join(stopword_folder, filename), 'r') as file:
            stopw.update(word.lower() for word in file.read().splitlines())
```
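The positive and negative dictionaries can be loaded the same way and used for scoring. A minimal sketch follows; the `load_dictionary` and `polarity_score` helpers and the epsilon-based polarity formula are common conventions for this kind of analysis, not code confirmed from the repo:

```
def load_dictionary(path):
    # One sentiment word per line; lower-case and skip blanks.
    with open(path, 'r') as f:
        return {w.lower() for w in f.read().splitlines() if w.strip()}

positive_words = load_dictionary(r"positivewords.txt")
negative_words = load_dictionary(r"negativewords.txt")

def polarity_score(words):
    # words: lower-cased tokens with stop words already removed.
    pos = sum(1 for w in words if w in positive_words)
    neg = sum(1 for w in words if w in negative_words)
    return (pos - neg) / (pos + neg + 1e-6)   # epsilon avoids division by zero
```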
2. Change the path of the input file; provide an Excel (.xlsx) file containing the article URLs:
```
import pandas as pd   # reading .xlsx files also requires openpyxl

def file_open():
    filepath = r"input.xlsx"
    df = pd.read_excel(filepath)   # load the sheet of article URLs
    dataset = list()
```
## SCREENSHOTS
> ### SAMPLE REQUESTS PINGING

> ### SAMPLE OUTPUT
