{"id":19557177,"url":"https://github.com/gingray/pycrawl","last_synced_at":"2025-06-25T18:33:46.534Z","repository":{"id":13889050,"uuid":"16587308","full_name":"gingray/PyCrawl","owner":"gingray","description":null,"archived":false,"fork":false,"pushed_at":"2014-02-08T09:39:28.000Z","size":140,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-26T08:14:58.712Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gingray.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-02-06T17:28:49.000Z","updated_at":"2014-02-08T09:39:29.000Z","dependencies_parsed_at":"2022-08-23T14:50:54.359Z","dependency_job_id":null,"html_url":"https://github.com/gingray/PyCrawl","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gingray/PyCrawl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gingray%2FPyCrawl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gingray%2FPyCrawl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gingray%2FPyCrawl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gingray%2FPyCrawl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gingray","download_url":"https://codeload.github.com/gingray/PyCrawl/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gingray%2FPyCrawl/sbom","host":{"name":"GitHub","url":"https://githu
b.com","kind":"github","repositories_count":261931102,"owners_count":23232015,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-11T04:40:41.919Z","updated_at":"2025-06-25T18:33:46.501Z","avatar_url":"https://github.com/gingray.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"A simple way to fetch data over the internet\n\nExample:\n\nfrom data_fetcher import RangeFetcher\nfrom web_client import WebClient \nimport lxml.html\n\ndef worker(url, content):\n    print url\n    root = lxml.html.fromstring(content)\n    hrefs = root.xpath(\".//a/@href\")\n    for row in hrefs:\n        print row\n\nweb_client = WebClient()\ntemplate = \"http://somesite.com/%s\"\nrange_parser = RangeFetcher(web_client, worker, template, 1, 3)\nrange_parser.process()\n\n\n\nThe main idea is that you have a site whose URLs are organized like this:\n\n\nhttp://somesite.com/1\nhttp://somesite.com/2\nhttp://somesite.com/3\n\n\n\nYou can set a URL template and a range; the fetcher will crawl every URL in the range and run the worker on each one.\nYou can also use UrlFileFetcher. The idea is the same, but the links are taken from a file.\n\nExample:\n\nfrom data_fetcher import UrlFileFetcher\nfrom web_client import WebClient \nimport lxml.html\n\ndef worker(url, content):\n    print url\n    root = lxml.html.fromstring(content)\n    hrefs = root.xpath(\".//a/@href\")\n    for row in hrefs:\n        print row\n\nweb_client = WebClient()\nfilename = \"links.txt\"\nrange_parser = UrlFileFetcher(web_client, worker, 
filename)\nrange_parser.process()\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgingray%2Fpycrawl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgingray%2Fpycrawl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgingray%2Fpycrawl/lists"}