Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Library for Rapid (Web) Crawler and Scraper Development
https://github.com/crwlrsoft/crawler
crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping
- Host: GitHub
- URL: https://github.com/crwlrsoft/crawler
- Owner: crwlrsoft
- License: MIT
- Created: 2022-01-12T22:20:59.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-24T11:02:55.000Z (3 months ago)
- Last Synced: 2024-10-25T09:57:00.967Z (3 months ago)
- Topics: crawler, crawling, hacktoberfest, php, scraper, scraping, scraping-websites, web-crawler, web-crawling, web-scraper, web-scraping
- Language: PHP
- Homepage: https://www.crwlr.software/packages/crawler
- Size: 737 KB
- Stars: 334
- Watchers: 4
- Forks: 12
- Open Issues: 1
- Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# Library for Rapid (Web) Crawler and Scraper Development
This library provides a sort of framework and a lot of ready-to-use, so-called __steps__ that you can use as building blocks to build your own crawlers and scrapers.
To give you an overview, here's a list of things that it helps you with (see the short example sketch further below):
* [Crawler __Politeness__](https://www.crwlr.software/packages/crawler/the-crawler/politeness) 😇 (respecting robots.txt, throttling,...)
* Load URLs using
  * [a __(PSR-18) HTTP client__](https://www.crwlr.software/packages/crawler/the-crawler/loaders) (default is of course Guzzle)
  * or a [__headless browser__](https://www.crwlr.software/packages/crawler/the-crawler/loaders#using-a-headless-browser) (Chrome) to get the source after JavaScript execution
* [Get __absolute links__ from HTML documents](https://www.crwlr.software/packages/crawler/included-steps/html#html-get-link) 🔗
* [Get __sitemaps__ from robots.txt and get all URLs from those sitemaps](https://www.crwlr.software/packages/crawler/included-steps/sitemap)
* [__Crawl__ (load) all pages of a website](https://www.crwlr.software/packages/crawler/included-steps/http#crawling) 🕷
* [Use __cookies__ (or don't)](https://www.crwlr.software/packages/crawler/the-crawler/loaders#http-loader) 🍪
* [Use any __HTTP methods__ (GET, POST,...) and send any headers or body](https://www.crwlr.software/packages/crawler/included-steps/http#http-requests)
* [Easily iterate over __paginated__ list pages](https://www.crwlr.software/packages/crawler/included-steps/http#paginating) 🔁
* Extract data from:
  * [__HTML__](https://www.crwlr.software/packages/crawler/included-steps/html#extracting-data) and also [__XML__](https://www.crwlr.software/packages/crawler/included-steps/xml) (using CSS selectors or XPath queries)
  * [__JSON__](https://www.crwlr.software/packages/crawler/included-steps/json) (using dot notation)
  * [__CSV__](https://www.crwlr.software/packages/crawler/included-steps/csv) (map columns)
* [Extract __schema.org__ structured data](https://www.crwlr.software/packages/crawler/included-steps/html#schema-org) in __JSON-LD__ format from HTML documents
* [Keep memory usage low](https://www.crwlr.software/packages/crawler/crawling-procedure#memory-usage) by using PHP __Generators__ 💪
* [__Cache__ HTTP responses](https://www.crwlr.software/packages/crawler/response-cache) during development, so you don't have to load pages again and again after every code change
* [Get __logs__](https://www.crwlr.software/packages/crawler/the-crawler#loggers) about what your crawler is doing (accepts any PSR-3 LoggerInterface)
* And a lot more...

## Documentation
You can find the documentation at [crwlr.software](https://www.crwlr.software/packages/crawler/getting-started).
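For a first impression of how such steps are chained, here's a minimal sketch of a crawler that loads a start page, follows its links, and extracts some data from each linked page. It's based on the API as shown in the package docs, but the start URL and the CSS selectors (`h1`, `.article-date`) are made-up placeholders, and names and signatures can differ between library versions, so treat it as an illustrative sketch rather than copy-paste-ready code:

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Create a crawler that politely identifies itself with a bot user agent.
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/articles')   // hypothetical start URL
    ->addStep(Http::get())                        // load the page via the (PSR-18) HTTP client
    ->addStep(Html::getLinks())                   // get all absolute links from the document
    ->addStep(Http::get())                        // load each linked page
    ->addStep(
        Html::root()->extract([                   // extract data using CSS selectors
            'title' => 'h1',
            'date' => '.article-date',
        ])
    );

// run() returns a Generator, so results are yielded one by one,
// which keeps memory usage low even for large crawls.
foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```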
## Contributing
If you consider contributing to this package, please read the [contribution guide (CONTRIBUTING.md)](CONTRIBUTING.md).