https://github.com/pablolec/oc_web_scraper
Simple web scraper made for OpenClassrooms studies.
- Host: GitHub
- URL: https://github.com/pablolec/oc_web_scraper
- Owner: PabloLec
- License: mit
- Created: 2021-03-01T13:28:12.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2022-03-03T15:22:32.000Z (over 4 years ago)
- Last Synced: 2025-03-03T21:15:05.947Z (over 1 year ago)
- Topics: python, scraper
- Language: Python
- Homepage:
- Size: 90.8 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
README
# oc_web_scraper
[Releases](https://github.com/PabloLec/oc_web_scraper/releases/) | [Licence](https://github.com/PabloLec/oc_web_scraper/blob/main/LICENCE) | [Code style: black](https://github.com/psf/black)
:books: Made for an [OpenClassrooms](https://openclassrooms.com) studies project.
oc_web_scraper scrapes a [dummy book store website](https://books.toscrape.com/) and saves its entire library locally.
## Installation
#### :penguin: Linux / :apple: macOS
```bash
git clone https://github.com/pablolec/oc_web_scraper
cd oc_web_scraper
python3 -m venv env
source env/bin/activate
pip install .
```
#### :framed_picture: Windows
```powershell
git clone https://github.com/pablolec/oc_web_scraper
cd oc_web_scraper
py -m venv env
.\env\Scripts\activate
pip install .
```
## Usage
**Before execution**, review `config.yml` to set the path where scraped content is saved. You can also customize the logging behavior there.
#### :penguin: Linux / :apple: macOS
```bash
python3 -m oc_web_scraper
```
#### :framed_picture: Windows
```powershell
py -m oc_web_scraper
```
_:floppy_disk: The website content is saved into a folder named `data`. One subfolder is created per category, containing a CSV file with that category's book information, with book cover images stored under `data/CATEGORY_NAME/images/`._
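To make this layout concrete, here is a minimal sketch of how the paths for one category could be derived. The function name `category_paths`, the `save_path` parameter, and the CSV file name are assumptions made for illustration; they are not taken from the project's code.

```python
from pathlib import Path


def category_paths(save_path, category_name):
    """Return the folders and CSV file expected for one category.

    Hypothetical helper: only the `data/CATEGORY_NAME/images/` layout
    comes from the README. The CSV file name is an assumption.
    """
    category_dir = Path(save_path) / "data" / category_name
    return {
        "category_dir": category_dir,
        "csv_file": category_dir / f"{category_name}.csv",  # assumed name
        "images_dir": category_dir / "images",
    }


if __name__ == "__main__":
    for name, path in category_paths(".", "Travel").items():
        print(f"{name}: {path}")
```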
## Improvements
As the MIT Licence puts it, the software is provided 'as is'. Since this is a study project targeting one particular website, it can hardly be extended to other uses.
:bulb: That said, performance and UX could be improved by:
- Multithreading, using a pool of workers for either individual GET requests or whole category scrapes.
- Including the date and time in directory and file names, which would make periodic scraping easier.
- Incremental saving: since the whole process takes several minutes, this would help prevent data loss.
- Comparing scraped results with previously stored results to bring relevant changes to the user's attention.
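
Three of these ideas can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not part of the project: `scrape_all`, `timestamped_dir`, and `diff_books` are hypothetical names, `scrape_category` stands for whatever callable scrapes one category, and books are assumed to be stored as a mapping from some identifier to a row.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
from pathlib import Path


def scrape_all(category_urls, scrape_category, max_workers=8):
    """Run scrape_category over every URL using a pool of worker threads.

    Results come back in the same order as category_urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scrape_category, category_urls))


def timestamped_dir(base, now=None):
    """Build a per-run directory path such as data/2021-03-01_13-28-12."""
    if now is None:
        now = datetime.now()
    return Path(base) / now.strftime("%Y-%m-%d_%H-%M-%S")


def diff_books(previous, current):
    """Compare two {book_id: row} mappings and report what changed."""
    shared = previous.keys() & current.keys()
    return {
        "added": sorted(current.keys() - previous.keys()),
        "removed": sorted(previous.keys() - current.keys()),
        "changed": sorted(k for k in shared if previous[k] != current[k]),
    }


if __name__ == "__main__":
    # Stand-in for a real scrape function: no network access here.
    print(scrape_all(["travel", "poetry"], str.upper, max_workers=2))
    print(timestamped_dir("data"))
    print(diff_books({"a": 10.0}, {"a": 12.5, "b": 7.0}))
```

Threads suit this workload because the time is spent waiting on HTTP responses rather than on computation. Writing each run into its own timestamped directory also gives `diff_books` a previous run to compare against.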