https://github.com/samueleleli/webscraping_py
Python project that scrapes a list of sites contained in a CSV file. Specifically, the script searches each site for keywords or phrases and provides statistics on the results obtained.
beatifulsoup csv parser python webscraping
Last synced: about 1 year ago
- Host: GitHub
- URL: https://github.com/samueleleli/webscraping_py
- Owner: samueleleli
- Created: 2022-05-29T17:49:38.000Z (about 4 years ago)
- Default Branch: main
- Last Pushed: 2022-08-06T11:20:15.000Z (almost 4 years ago)
- Last Synced: 2025-02-02T01:31:55.954Z (over 1 year ago)
- Topics: beatifulsoup, csv, parser, python, webscraping
- Language: Python
- Homepage:
- Size: 9.77 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# webscraping_py
Python project that scrapes a list of sites contained in a CSV file. Specifically, the script searches each site for keywords or phrases.
The search results are saved to a separate CSV file.
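The keyword search described above can be sketched as follows. This is a minimal illustration, not the code in `script_webscraping.py`: the repository is tagged with BeautifulSoup, but the sketch swaps in the standard-library `html.parser` so it runs without extra dependencies. The names `TextExtractor` and `count_keywords` are hypothetical, and case-insensitive substring counting is an assumption about how matches are defined.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def count_keywords(html, keywords):
    """Return {keyword: occurrences} for a page, ignoring case and extra whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so multi-word phrases match across line breaks.
    text = " ".join(" ".join(parser.parts).split()).lower()
    return {kw: text.count(kw.lower()) for kw in keywords}
```

For example, `count_keywords(page_html, ["privacy policy", "cookies"])` returns one count per keyword, which could then be written as one row of the output CSV.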
The `config.py` file allows you to configure:
- the paths, separators and character encodings of the input and output datasets;
- the indexes of the columns to read from the input CSV;
- the header of the output file;
- the keywords to search for in the web page;
- the headers of the HTTP request;
- the delay between requests.
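To make the options above concrete, here is a sketch of what such a configuration module might look like. Every variable name and value below is an assumption chosen for illustration; the actual names in `config.py` may differ, so check the file in the repository before editing it.

```python
# Hypothetical layout for config.py; all names and values are assumptions.

# Input dataset: path, column separator and character encoding
INPUT_PATH = "data/input.csv"
INPUT_SEPARATOR = ";"
INPUT_ENCODING = "utf-8"

# Output dataset: path, column separator and character encoding
OUTPUT_PATH = "data/output.csv"
OUTPUT_SEPARATOR = ";"
OUTPUT_ENCODING = "utf-8"

# Zero-based indexes of the columns to read from the input CSV
NAME_COLUMN_INDEX = 0
URL_COLUMN_INDEX = 1

# Header row of the output file
OUTPUT_HEADER = ["name", "url", "status", "matches"]

# Keywords or phrases to search for in each web page
KEYWORDS = ["privacy policy", "cookies"]

# Headers sent with every HTTP request
REQUEST_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}

# Pause between consecutive requests, in seconds
REQUEST_DELAY_SECONDS = 1.0
```

Keeping these values in one module means the script itself does not need to change when the input file, the keyword list or the request pacing changes.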
## Startup
Run the following command to install the required libraries:
```bash
pip install -r requirements.txt
```
From the project root, run the following command to start parsing the URLs:
```bash
python script_webscraping.py
```
N.B. Changing the number of columns in the output file requires minor changes in `script_webscraping.py`, specifically to the variables `data_error` and `data_OK` and to lines 80 to 82.
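The sketch below illustrates why the header and the row variables have to change together. It assumes that `data_OK` and `data_error` are lists holding one output row each, which the README does not confirm; the helper names, the column set and the `;` delimiter are hypothetical.

```python
import csv
import io

# Assumed output columns; the real header is defined in config.py.
OUTPUT_HEADER = ["name", "url", "status", "matches"]


def build_ok_row(name, url, matches):
    """Row for a page that was fetched and searched (the role assumed for data_OK)."""
    return [name, url, "OK", matches]


def build_error_row(name, url, error):
    """Row for a page that could not be fetched (the role assumed for data_error)."""
    return [name, url, f"ERROR: {error}", ""]


def write_rows(stream, rows, header=OUTPUT_HEADER, delimiter=";"):
    """Write the header and rows, rejecting any row whose length differs from the header."""
    writer = csv.writer(stream, delimiter=delimiter)
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError(
                f"row has {len(row)} columns, header has {len(header)}"
            )
        writer.writerow(row)


# Usage: write one successful and one failed result to an in-memory buffer.
buffer = io.StringIO()
write_rows(
    buffer,
    [
        build_ok_row("Example", "https://example.com", 3),
        build_error_row("Broken", "https://example.invalid", "timeout"),
    ],
)
```

If a column is added to the header, both row builders need a matching extra value; the length check makes a mismatch fail immediately instead of producing a misaligned CSV.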