https://github.com/samueleleli/webscraping_py

Python project for scraping a list of sites contained in a CSV file. Specifically, the script searches each site for keywords or phrases and provides statistics on the results obtained.

beatifulsoup csv parser python webscraping

Last synced: over 1 year ago

Host: GitHub
URL: https://github.com/samueleleli/webscraping_py
Owner: samueleleli
Created: 2022-05-29T17:49:38.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-08-06T11:20:15.000Z (almost 4 years ago)
Last Synced: 2025-02-02T01:31:55.954Z (over 1 year ago)
Topics: beatifulsoup, csv, parser, python, webscraping
Language: Python
Homepage:
Size: 9.77 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# webscraping_py
A Python project for web scraping a list of sites contained in a CSV file. Specifically, the script searches each site for keywords or phrases.

The search results are saved to a separate CSV file.
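The keyword search and the CSV output described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the names `find_keywords` and `write_results`, the `;` separator, and the two-column output are all assumptions. It also uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies.

```python
import csv
import io
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def find_keywords(html, keywords):
    """Return the keywords that occur (case-insensitively) in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse whitespace so phrases still match across line breaks.
    text = " ".join(" ".join(parser.chunks).split()).lower()
    return [kw for kw in keywords if kw.lower() in text]


def write_results(rows, out_file, header, sep=";"):
    """Write a header row followed by the result rows to an open text file."""
    writer = csv.writer(out_file, delimiter=sep)
    writer.writerow(header)
    writer.writerows(rows)


# Usage with an in-memory page and an in-memory output file.
page = "<html><body><p>Contact us for a Free Quote</p></body></html>"
found = find_keywords(page, ["free quote", "pricing"])
buffer = io.StringIO()
write_results([["https://example.com", ",".join(found)]], buffer, ["url", "keywords"])
```

Matching on the extracted text rather than the raw HTML avoids false hits inside scripts and style sheets; whether the real script makes the same choice is not stated in this README.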

The `config.py` file allows you to configure:
- the paths, separators, and character encoding of the input and output datasets;
- the indexes of the columns to read from the input CSV;
- the header of the output file;
- the keywords to search for in the web page;
- the header of the HTTP request;
- the delay between requests.
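As a rough picture of what such a configuration module might contain, here is a hypothetical sketch. None of the variable names or values below are taken from the repository; they only mirror the items in the list above, so check the real `config.py` for the actual names.

```python
# Hypothetical config.py layout; every name and value here is an assumption.

# Paths, separators and character encoding of the input and output datasets
INPUT_PATH = "input.csv"
INPUT_SEPARATOR = ";"
INPUT_ENCODING = "utf-8"
OUTPUT_PATH = "output.csv"
OUTPUT_SEPARATOR = ";"
OUTPUT_ENCODING = "utf-8"

# Zero-based indexes of the columns to read from the input CSV
NAME_COLUMN_INDEX = 0
URL_COLUMN_INDEX = 1

# Header row of the output file
OUTPUT_HEADER = ["name", "url", "status", "keywords_found"]

# Keywords or phrases to search for in each web page
KEYWORDS = ["contact", "privacy policy"]

# Header sent with each HTTP request
REQUEST_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper)"}

# Delay between consecutive requests, in seconds
REQUEST_DELAY_SECONDS = 1.0
```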

## Startup
Run the following command to install the libraries:
```bash
pip install -r requirements.txt
```
From the project root, run the following command to start parsing the URLs:
```bash
python script_webscraping.py
```
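Given the configuration options for column indexes, request header and delay, a run presumably reads each URL from the input CSV, requests the page, and pauses before the next request. The sketch below illustrates that flow under stated assumptions: `fetch` and `scrape_all` are hypothetical names, the input file is assumed to have no header row, and the `OK`/`ERROR` status values are invented for illustration. It is not the code in `script_webscraping.py`.

```python
import csv
import io
import time
import urllib.request


def fetch(url, headers, timeout=10):
    """Download a page and decode it using the charset the server declares."""
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_all(in_file, url_index, keywords, headers, delay,
               sep=";", fetch_page=fetch, sleep=time.sleep):
    """Visit every URL in the input CSV and record which keywords appear."""
    results = []
    for row in csv.reader(in_file, delimiter=sep):
        if len(row) <= url_index:
            continue  # skip blank or short rows
        url = row[url_index]
        try:
            page = fetch_page(url, headers).lower()
            found = [kw for kw in keywords if kw.lower() in page]
            results.append([url, "OK", ",".join(found)])
        except Exception as exc:
            # A failing site is recorded instead of stopping the whole run.
            results.append([url, "ERROR", str(exc)])
        sleep(delay)  # pause between requests to avoid hammering servers
    return results
```

Passing `fetch_page` and `sleep` as parameters is a design choice of this sketch only; it lets the loop be exercised without network access or real waiting.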

N.B. Changing the number of columns in the output file requires minor changes to `script_webscraping.py`, specifically to the variables `data_error` and `data_OK` and to lines 80 to 82.
