{"id":15598642,"url":"https://github.com/jofaval/webscraping","last_synced_at":"2025-10-06T01:59:52.142Z","repository":{"id":125618491,"uuid":"438067102","full_name":"jofaval/webscraping","owner":"jofaval","description":"WebScraper providing tools to scrape tons of websites with the same base","archived":false,"fork":false,"pushed_at":"2022-01-04T00:36:58.000Z","size":59,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-04T12:52:04.058Z","etag":null,"topics":["crawler","e-commerce","python","scraper","webscraper","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jofaval.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-14T00:31:09.000Z","updated_at":"2022-03-11T20:51:45.000Z","dependencies_parsed_at":"2023-08-12T06:02:36.312Z","dependency_job_id":null,"html_url":"https://github.com/jofaval/webscraping","commit_stats":{"total_commits":32,"total_committers":2,"mean_commits":16.0,"dds":0.03125,"last_synced_commit":"7d3b2658619e4a1d47e8e054f6da7902adb1c3c0"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Fwebscraping","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Fwebscraping/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Fwebscraping/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jofaval%2Fwebscraping/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jofaval","download_url":"https://codeload.github.com/jofaval/webscraping/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246180923,"owners_count":20736460,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","e-commerce","python","scraper","webscraper","webscraping"],"created_at":"2024-10-03T01:40:53.387Z","updated_at":"2025-10-06T01:59:47.085Z","avatar_url":"https://github.com/jofaval.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Web Scraping #\nWeb scraping is data scraping (retrieval) used for extracting data from websites.\n\n## Why?\nThis was part of a degree project, and I expanded on it much further than initially intended.\n\u003cbr /\u003e\nIt started as such, but continued with the idea of having a complex system made simple for scraping a ton of sites easily. Initially focused on the E-Commerce [thewhiskyexchange.com](https://thewhiskyexchange.com).\n\n### Advice\nAlways check `domain.com/robots.txt` to see whether the website allows and acknowledges the use of scrapers on its pages.\n\n## Usage\n### Disclaimer\nWebsites that use, or are like React (Vue, Angular, etc.) 
and/or with any sort of lazy loading **won't** work properly, or at all.\n\nThe same applies to websites that use JS for anything beyond very basic user interaction.\n\n### Requirements\n- Python \u003e= v3.6 installed.\n- Decent Internet connection.\n- Modules (if not previously installed, they will be downloaded on execution):\n  - `requests`, `bs4`, `validators`, `cchardet`, `lxml`.\n\n### How to use?\nMove to your working directory:\n`cd /my/working/directory`\n\nAnd execute the Python script (use `python3` on a Linux OS):\\\n`python script.py` or `python3 script.py`\\\nor the absolute path\\\n`/usr/bin/python3 /my/working/directory/script.py`\n\n#### Tip\nYou can always modify the `THREADS_LIMIT` constant to change the number of *workers* (threads) that will be executed simultaneously. It can really speed up the process, but only if your computer can handle the number you set.\n\nFor example, I have 8 virtual cores, but 50 *workers* work perfectly fine; more slows it down, and fewer is not enough.\n\n### How to download?\nThe basis of web scraping is replicating what a user would do to check prices and products.\n\nThe important parts to modify on this scraper, at the moment of writing the README (2021-12-19), are:\n- `FIELDS` all the fields to download.\n  - `name` the dict `key`, the name of the field to download, the label.\n  - `query` the CSS query from which to get the element(s). 
NOT tested with multiple queries at once.\n  - `parser` the function to parse the data received; if used, it overrides the default data retrieval.\n  - `default` the value used when the retrieved value is `None`, whether or not a parser is used.\n- `CATEGORY_LINK_QUERY` the CSS query to get all the category links.\n- `is_category_url` whether a URL is a category URL.\n- `is_product_url` whether a URL is a product URL.\n- `CATEGORY_PRODUCT_LINK_QUERY` the CSS query to get all the product links in a category.\n- `BASE_DOMAIN` the base domain of the website; a base path under the domain, e.g. `google.com/es`, would also work.\n- `IMG_DOMAIN` the base domain of the images. Some websites use different domains for their images.\n- `RETRY` whether to retry a failed download. Enabled by default.\n- `DOWNLOAD_ATTEMPTS` how many times to retry. 3 by default.\n- `get_category_products` if your page allows pagination without JS, it should be implemented here.\n\n*To see it fully implemented, take a look at [`websites/thewhiskyexchange/configuration.py`](./websites/thewhiskyexchange/configuration.py)*\n\n### How to deploy?\nYou'll need cron (UNIX-based OS; not tested on Windows).\nFor a cron job to work you need to use absolute paths (the /usr/bin/... 
command), otherwise it won't work.\n\nYou also need to specify the time it will execute at, e.g. 1 AM, 4 AM (the one I used), or 9 AM.\n\n#### Cron example\n`0 4 * * * /usr/bin/python3 /my/working/directory/script.py`\\\nIt will execute *every day* at exactly 4:00 AM *forever*.\n\nNow with a more realistic path:\n`0 4 * * * /usr/bin/python3 /home/username/webscraping/script.py`\n\n## Testing\nRun them individually and modify the limit so that only N categories or products are downloaded, just to check that it works.\\\nLater on, you could pick just the categories you want to download (my example and use case), or simply remove the `category_urls` param and use `limit = ALL` to scrape everything on the website.\n\nThere's no unit testing in web scraping as such; you just manually check that it works as required.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjofaval%2Fwebscraping","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjofaval%2Fwebscraping","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjofaval%2Fwebscraping/lists"}