{"id":16538118,"url":"https://github.com/pablolec/oc_web_scraper","last_synced_at":"2026-06-15T20:31:53.544Z","repository":{"id":43397981,"uuid":"343425927","full_name":"PabloLec/oc_web_scraper","owner":"PabloLec","description":"Simple web scraper made for OpenClassrooms studies.","archived":false,"fork":false,"pushed_at":"2022-03-03T15:22:32.000Z","size":93,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-03T21:15:05.947Z","etag":null,"topics":["python","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PabloLec.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-03-01T13:28:12.000Z","updated_at":"2021-05-05T09:49:23.000Z","dependencies_parsed_at":"2022-08-24T09:11:08.019Z","dependency_job_id":null,"html_url":"https://github.com/PabloLec/oc_web_scraper","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/PabloLec/oc_web_scraper","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PabloLec%2Foc_web_scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PabloLec%2Foc_web_scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PabloLec%2Foc_web_scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PabloLec%2Foc_web_scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PabloLec","download_url":"https://codeload.github.com/PabloLec/oc_web_scraper/tar.gz/refs/heads/main"
,"sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PabloLec%2Foc_web_scraper/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34379915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","scraper"],"created_at":"2024-10-11T18:44:27.551Z","updated_at":"2026-06-15T20:31:53.522Z","avatar_url":"https://github.com/PabloLec.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# oc_web_scraper [![GitHub release (latest by date)](https://img.shields.io/github/v/release/pablolec/oc_web_scraper)](https://github.com/PabloLec/oc_web_scraper/releases/) [![GitHub](https://img.shields.io/github/license/pablolec/oc_web_scraper)](https://github.com/PabloLec/oc_web_scraper/blob/main/LICENCE) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n:books: Made for an [OpenClassrooms](https://openclassrooms.com) studies project.\n\noc_web_scraper scrapes a [dummy book store website](https://books.toscrape.com/) and saves its entire library locally.\n\n## Installation\n\n#### :penguin: Linux / :apple: macOS\n\n```bash\ngit clone https://github.com/pablolec/oc_web_scraper\ncd oc_web_scraper\npython3 -m venv env\nsource 
env/bin/activate\npip install .\n```\n\n#### :framed_picture: Windows \n\n```powershell\ngit clone https://github.com/pablolec/oc_web_scraper\ncd oc_web_scraper\npy -m venv env\n.\\env\\Scripts\\activate\npip install .\n```\n\n## Usage\n\n**Before execution**, review `config.yml` to set the path where scraped content is saved. You can also customize the logging behavior.\n\n#### :penguin: Linux / :apple: macOS\n\n```bash\npython3 -m oc_web_scraper\n```\n\n#### :framed_picture: Windows \n\n```powershell\npy -m oc_web_scraper\n```\n\n_:floppy_disk: The website content will be saved into a folder named `data`. One subfolder is created per category, with the corresponding book information in a CSV file and book cover images stored under `data/CATEGORY_NAME/images/`._\n\n## Improvement\n\nAs the MIT Licence once said, the software is provided 'as is'. As a study project targeting one particular website, it can hardly be reused elsewhere.\n\n:bulb: That said, performance and UX could be improved by:\n\n- Multithreading, using a worker pool for either individual GET requests or whole category scrapes.\n- Including the date and time in directory and file names, which would make periodic scraping easier.\n- Incremental saving. Since the whole process takes several minutes, this would help prevent data loss.\n- Comparing scraped results with previously stored results to bring relevant changes to the user's attention.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpablolec%2Foc_web_scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpablolec%2Foc_web_scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpablolec%2Foc_web_scraper/lists"}