{"id":17279501,"url":"https://github.com/reagentx/scraper-platform","last_synced_at":"2025-06-30T20:31:37.030Z","repository":{"id":68633706,"uuid":"132377014","full_name":"ReagentX/scraper-platform","owner":"ReagentX","description":"A simple multiprocessed Python scraper platform powered by RegEx and requests.","archived":false,"fork":false,"pushed_at":"2018-09-07T05:26:27.000Z","size":24,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-26T14:53:04.475Z","etag":null,"topics":["multithreading","regex","scraper"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ReagentX.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-05-06T21:21:26.000Z","updated_at":"2025-02-22T22:18:47.000Z","dependencies_parsed_at":null,"dependency_job_id":"e6a20b43-efe7-4ae8-8557-6a3be50f8fc5","html_url":"https://github.com/ReagentX/scraper-platform","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ReagentX/scraper-platform","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ReagentX%2Fscraper-platform","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ReagentX%2Fscraper-platform/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ReagentX%2Fscraper-platform/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ReagentX%2Fscraper-platform/man
ifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ReagentX","download_url":"https://codeload.github.com/ReagentX/scraper-platform/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ReagentX%2Fscraper-platform/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":262846069,"owners_count":23373750,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["multithreading","regex","scraper"],"created_at":"2024-10-15T09:17:48.249Z","updated_at":"2025-06-30T20:31:36.944Z","avatar_url":"https://github.com/ReagentX.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scraper-platform\n\nA simple multiprocessed Python scraper platform powered by RegEx and `requests`.\n\n## Classes\n\n### HTMLCache\n\nThis class contains the methods to construct a cache of HTML information for scraping.\n\nThis class is constructed with a `timedelta` for the expiration length of the cache database and a boolean for whether the cache will be used or bypassed.\n\n### Scraper\n\nThis class contains the methods to apply a set of `Rules()` to a list of URLs in multiple threads.\n\nThis class is constructed with an integer that sets the maximum number of threads.\n\n### Rules\n\nThis class contains the code to parse URLs. It is initialized with an `HTMLCache` object. 
Rules classes must be kept in the `scraper_rules` package.\n\nThis class is constructed with an `HTMLCache`, which the rules use to get the HTML of webpages.\n\nThe methods of the class are rules: each takes a single URL, applies some transformation to it, and returns a list of data.\n\n## Examples\n\nTo install the package, download or clone it, `cd` to the directory, and run `python setup.py develop`.\n\nThe platform allows for fast testing of RegEx rules across a set of URLs. To begin, we need to import a few things:\n\n    from scraper_platform import scraper, cache\n    from datetime import timedelta\n    from scraper_rules import sample_rules\n\nHere, we import the `scraper` and `cache` modules from the `scraper_platform` package as well as the `sample_rules` module from the `scraper_rules` package. We also import `timedelta` so we can construct an `HTMLCache` object.\n\nNext, we need to create the `Scraper` and `HTMLCache` objects:\n\n    s = scraper.Scraper(2)\n    c = cache.HTMLCache(timedelta(days=2), True)\n\nHere, we create a `Scraper` object that will use at most two threads and an `HTMLCache` object whose cached HTML expires 2 days from access. Changing the boolean argument to `False` will bypass the cache altogether.\n\nIf we print `s` and `c` we should get:\n\n    \u003cScraper object: max 2 threads\u003e\n    \u003cHTMLCache object: expiry: 2 days, 0:00:00, cached\u003e\n\nNext, we can create an instance of our sample rules class and define a list of URLs to analyze. 
For this example, I use Google's store pages and a regex that captures all of the http/https links:\n\n    r = sample_rules.Rules(c)\n    urls = [\n        'https://store.google.com/category/phones',\n        'https://store.google.com/category/home_entertainment',\n        'https://store.google.com/category/laptops_tablets',\n        'https://store.google.com/category/virtual_reality'\n    ]\n\nTo get the data, we then run the scraper by passing the list of URLs and the function we want to apply to them:\n\n    data = s.scrape(urls, r.get_links)\n\nThis applies the rule method `get_links` to each URL and returns a list of the results, which we store in `data`.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freagentx%2Fscraper-platform","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Freagentx%2Fscraper-platform","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Freagentx%2Fscraper-platform/lists"}