Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Library for Rapid (Web) Crawler and Scraper Development
https://github.com/crwlrsoft/crawler
crawler crawling hacktoberfest php scraper scraping scraping-websites web-crawler web-crawling web-scraper web-scraping
- Host: GitHub
- URL: https://github.com/crwlrsoft/crawler
- Owner: crwlrsoft
- License: MIT
- Created: 2022-01-12T22:20:59.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-10-24T11:02:55.000Z (3 months ago)
- Last Synced: 2024-10-25T09:57:00.967Z (3 months ago)
- Topics: crawler, crawling, hacktoberfest, php, scraper, scraping, scraping-websites, web-crawler, web-crawling, web-scraper, web-scraping
- Language: PHP
- Homepage: https://www.crwlr.software/packages/crawler
- Size: 737 KB
- Stars: 334
- Watchers: 4
- Forks: 12
- Open Issues: 1
- Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
README
# Library for Rapid (Web) Crawler and Scraper Development
This library provides a sort of framework and a lot of ready-to-use, so-called __steps__ that you can use as building blocks to build your own crawlers and scrapers.
To give you an overview, here's a list of things that it helps you with (see the short example sketch further below):
* [Crawler __Politeness__](https://www.crwlr.software/packages/crawler/the-crawler/politeness) 😇 (respecting robots.txt, throttling,...)
* Load URLs using
  * [a __(PSR-18) HTTP client__](https://www.crwlr.software/packages/crawler/the-crawler/loaders) (default is of course Guzzle)
  * or a [__headless browser__](https://www.crwlr.software/packages/crawler/the-crawler/loaders#using-a-headless-browser) (Chrome) to get the source after JavaScript execution
* [Get __absolute links__ from HTML documents](https://www.crwlr.software/packages/crawler/included-steps/html#html-get-link) 🔗
* [Get __sitemaps__ from robots.txt and get all URLs from those sitemaps](https://www.crwlr.software/packages/crawler/included-steps/sitemap)
* [__Crawl__ (load) all pages of a website](https://www.crwlr.software/packages/crawler/included-steps/http#crawling) 🕷
* [Use __cookies__ (or don't)](https://www.crwlr.software/packages/crawler/the-crawler/loaders#http-loader) 🍪
* [Use any __HTTP methods__ (GET, POST,...) and send any headers or body](https://www.crwlr.software/packages/crawler/included-steps/http#http-requests)
* [Easily iterate over __paginated__ list pages](https://www.crwlr.software/packages/crawler/included-steps/http#paginating) 🔁
* Extract data from:
  * [__HTML__](https://www.crwlr.software/packages/crawler/included-steps/html#extracting-data) and also [__XML__](https://www.crwlr.software/packages/crawler/included-steps/xml) (using CSS selectors or XPath queries)
  * [__JSON__](https://www.crwlr.software/packages/crawler/included-steps/json) (using dot notation)
  * [__CSV__](https://www.crwlr.software/packages/crawler/included-steps/csv) (map columns)
* [Extract __schema.org__ structured data](https://www.crwlr.software/packages/crawler/included-steps/html#schema-org) in __JSON-LD__ format from HTML documents
* [Keep memory usage low](https://www.crwlr.software/packages/crawler/crawling-procedure#memory-usage) by using PHP __Generators__ 💪
* [__Cache__ HTTP responses](https://www.crwlr.software/packages/crawler/response-cache) during development, so you don't have to load pages again and again after every code change
* [Get __logs__](https://www.crwlr.software/packages/crawler/the-crawler#loggers) about what your crawler is doing (accepts any PSR-3 LoggerInterface)
* And a lot more...

## Documentation
You can find the documentation at [crwlr.software](https://www.crwlr.software/packages/crawler/getting-started).
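For a first impression of how such steps are chained, here's a minimal sketch of a crawler that loads a start page, follows its links, and extracts some data from each linked page. It's based on the API as shown in the package docs, but the start URL and the CSS selectors (`h1`, `.article-date`) are made-up placeholders, and names and signatures can differ between library versions, so treat it as an illustrative sketch rather than copy-paste-ready code:

```php
<?php

use Crwlr\Crawler\HttpCrawler;
use Crwlr\Crawler\Steps\Html;
use Crwlr\Crawler\Steps\Loading\Http;

// Create a crawler that politely identifies itself with a bot user agent.
$crawler = HttpCrawler::make()->withBotUserAgent('MyCrawler');

$crawler
    ->input('https://www.example.com/articles')   // hypothetical start URL
    ->addStep(Http::get())                        // load the page via the (PSR-18) HTTP client
    ->addStep(Html::getLinks())                   // get all absolute links from the document
    ->addStep(Http::get())                        // load each linked page
    ->addStep(
        Html::root()->extract([                   // extract data using CSS selectors
            'title' => 'h1',
            'date' => '.article-date',
        ])
    );

// run() returns a Generator, so results are yielded one by one,
// which keeps memory usage low even for large crawls.
foreach ($crawler->run() as $result) {
    var_dump($result->toArray());
}
```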
## Contributing
If you consider contributing to this package, please read the [contribution guide (CONTRIBUTING.md)](CONTRIBUTING.md).