{"id":13404801,"url":"https://github.com/crwlrsoft/crawler","last_synced_at":"2025-05-15T05:06:01.476Z","repository":{"id":37097289,"uuid":"447387652","full_name":"crwlrsoft/crawler","owner":"crwlrsoft","description":"Library for Rapid (Web) Crawler and Scraper Development","archived":false,"fork":false,"pushed_at":"2025-04-23T15:47:42.000Z","size":1225,"stargazers_count":360,"open_issues_count":2,"forks_count":13,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-23T16:50:44.624Z","etag":null,"topics":["crawler","crawling","hacktoberfest","php","scraper","scraping","scraping-websites","web-crawler","web-crawling","web-scraper","web-scraping"],"latest_commit_sha":null,"homepage":"https://www.crwlr.software/packages/crawler","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crwlrsoft.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-01-12T22:20:59.000Z","updated_at":"2025-04-23T15:46:54.000Z","dependencies_parsed_at":"2024-02-05T23:25:24.257Z","dependency_job_id":"211a64d5-d4a1-4c14-9dd3-edfe8cbf0a3a","html_url":"https://github.com/crwlrsoft/crawler","commit_stats":{"total_commits":269,"total_committers":4,"mean_commits":67.25,"dds":0.06319702602230481,"last_synced_commit":"d3e9a41bc35ca47ed306c8cc1d0d18811d52a76b"},"previous_names":[],"tags_count":79,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwlrsoft%2Fcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositorie
s/crwlrsoft%2Fcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwlrsoft%2Fcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwlrsoft%2Fcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crwlrsoft","download_url":"https://codeload.github.com/crwlrsoft/crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254276447,"owners_count":22043867,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawling","hacktoberfest","php","scraper","scraping","scraping-websites","web-crawler","web-crawling","web-scraper","web-scraping"],"created_at":"2024-07-30T19:01:51.619Z","updated_at":"2025-05-15T05:06:01.410Z","avatar_url":"https://github.com/crwlrsoft.png","language":"PHP","funding_links":[],"categories":["PHP"],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\u003ca href=\"https://www.crwlr.software\" target=\"_blank\"\u003e\u003cimg src=\"https://github.com/crwlrsoft/graphics/blob/eee6cf48ee491b538d11b9acd7ee71fbcdbe3a09/crwlr-logo.png\" alt=\"crwlr.software logo\" width=\"260\"\u003e\u003c/a\u003e\u003c/p\u003e\n\n# Library for Rapid (Web) Crawler and Scraper Development\n\nThis library provides a kind of framework, along with many ready-to-use, so-called __steps__ that serve as building blocks for your own crawlers and scrapers.\n\nTo give you an overview, here's a list of things that it helps you with:\n* [Crawler 
__Politeness__](https://www.crwlr.software/packages/crawler/the-crawler/politeness) \u0026#128519; (respecting robots.txt, throttling,...)\n* Load URLs using\n    * [a __(PSR-18) HTTP client__](https://www.crwlr.software/packages/crawler/the-crawler/loaders) (default is of course Guzzle)\n    * or a [__headless browser__](https://www.crwlr.software/packages/crawler/the-crawler/loaders#using-a-headless-browser) (Chrome) to get the source after JavaScript execution\n* [Get __absolute links__ from HTML documents](https://www.crwlr.software/packages/crawler/included-steps/html#html-get-link) \u0026#x1F517;\n* [Get __sitemaps__ from robots.txt and get all URLs from those sitemaps](https://www.crwlr.software/packages/crawler/included-steps/sitemap)\n* [__Crawl__ (load) all pages of a website](https://www.crwlr.software/packages/crawler/included-steps/http#crawling) \u0026#x1F577;\n* [Use __cookies__ (or don't)](https://www.crwlr.software/packages/crawler/the-crawler/loaders#http-loader) \u0026#x1F36A;\n* [Use any __HTTP methods__ (GET, POST,...) 
and send any headers or body](https://www.crwlr.software/packages/crawler/included-steps/http#http-requests)\n* [Easily iterate over __paginated__ list pages](https://www.crwlr.software/packages/crawler/included-steps/http#paginating) \u0026#x1F501;\n* Extract data from:\n    * [__HTML__](https://www.crwlr.software/packages/crawler/included-steps/html#extracting-data) and also [__XML__](https://www.crwlr.software/packages/crawler/included-steps/xml) (using CSS selectors or XPath queries)\n    * [__JSON__](https://www.crwlr.software/packages/crawler/included-steps/json) (using dot notation)\n    * [__CSV__](https://www.crwlr.software/packages/crawler/included-steps/csv) (map columns)\n* [Extract __schema.org__ structured data](https://www.crwlr.software/packages/crawler/included-steps/html#schema-org) in __JSON-LD__ format from HTML documents\n* [Keep memory usage low](https://www.crwlr.software/packages/crawler/crawling-procedure#memory-usage) by using PHP __Generators__ \u0026#x1F4AA;\n* [__Cache__ HTTP responses](https://www.crwlr.software/packages/crawler/response-cache) during development, so you don't have to load pages again and again after every code change\n* [Get __logs__](https://www.crwlr.software/packages/crawler/the-crawler#loggers) about what your crawler is doing (accepts any PSR-3 LoggerInterface)\n* And a lot more...\n\n## Documentation\n\nYou can find the documentation at [crwlr.software](https://www.crwlr.software/packages/crawler/getting-started).\n\n## Contributing\n\nIf you consider contributing something to this package, read the [contribution guide (CONTRIBUTING.md)](CONTRIBUTING.md).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrwlrsoft%2Fcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrwlrsoft%2Fcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrwlrsoft%2Fcrawler/lists"}