{"id":17968981,"url":"https://github.com/sobak/scrawler","last_synced_at":"2025-03-25T10:32:41.411Z","repository":{"id":57054542,"uuid":"172265322","full_name":"Sobak/scrawler","owner":"Sobak","description":"Declarative, scriptable web robot (crawler) and scrapper","archived":false,"fork":false,"pushed_at":"2020-04-09T04:17:16.000Z","size":254,"stargazers_count":9,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"develop","last_synced_at":"2025-03-20T01:11:23.826Z","etag":null,"topics":["crawler","crawler-engine","robots-txt","scraper","scraping-websites"],"latest_commit_sha":null,"homepage":"http://scrawler.sobak.pl","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sobak.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-02-23T21:19:49.000Z","updated_at":"2025-01-23T18:41:27.000Z","dependencies_parsed_at":"2022-08-24T14:00:15.856Z","dependency_job_id":null,"html_url":"https://github.com/Sobak/scrawler","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sobak%2Fscrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sobak%2Fscrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sobak%2Fscrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sobak%2Fscrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sobak","download_url":"https://codeload.github.com/Sobak/scrawler/tar.gz/refs/heads/develop","host":{"
name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245444484,"owners_count":20616388,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","crawler-engine","robots-txt","scraper","scraping-websites"],"created_at":"2024-10-29T14:42:02.911Z","updated_at":"2025-03-25T10:32:40.908Z","avatar_url":"https://github.com/Sobak.png","language":"PHP","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Scrawler\n\n[![Packagist](https://img.shields.io/packagist/v/sobak/scrawler.svg?style=flat-square)](https://packagist.org/packages/sobak/scrawler)\n[![Travis build](https://img.shields.io/travis/Sobak/scrawler.svg?style=flat-square)](https://travis-ci.org/Sobak/scrawler)\n[![Test Coverage](https://api.codeclimate.com/v1/badges/87612b6a62c95287a108/test_coverage)](https://codeclimate.com/github/Sobak/scrawler/test_coverage)\n\nScrawler is a declarative, scriptable web robot (crawler) and scraper which\nyou can easily configure to parse any website and process the information into\nthe desired format.\n\nConfiguration is based on building _blocks_, for which you can provide your\nown implementations, allowing for further customization of the process.\n\n## Install\nAs usual, start by installing the library with Composer:\n\n```bash\ncomposer require sobak/scrawler\n```\n\n## Usage\n\n```php\n\u003c?php\n\nuse App\\PostEntity;\nuse Sobak\\Scrawler\\Block\\Matcher\\CssSelectorHtmlMatcher;\nuse Sobak\\Scrawler\\Block\\Matcher\\CssSelectorListMatcher;\nuse 
Sobak\\Scrawler\\Block\\ResultWriter\\FilenameProvider\\EntityPropertyFilenameProvider;\nuse Sobak\\Scrawler\\Block\\ResultWriter\\JsonFileResultWriter;\nuse Sobak\\Scrawler\\Block\\UrlListProvider\\ArgumentAdvancerUrlListProvider;\nuse Sobak\\Scrawler\\Configuration\\Configuration;\nuse Sobak\\Scrawler\\Configuration\\ObjectConfiguration;\n\nrequire 'vendor/autoload.php';\n\n$scrawler = new Configuration();\n\n$scrawler\n    -\u003esetOperationName('Sobakowy Blog')\n    -\u003esetBaseUrl('http://sobak.pl')\n    -\u003eaddUrlListProvider(new ArgumentAdvancerUrlListProvider('/page/%u', 2))\n    -\u003eaddObjectDefinition('post', new CssSelectorListMatcher('article.hentry'), function (ObjectConfiguration $object) {\n        $object\n            -\u003eaddFieldDefinition('date', new CssSelectorHtmlMatcher('time.entry-date'))\n            -\u003eaddFieldDefinition('content', new CssSelectorHtmlMatcher('div.entry-content'))\n            -\u003eaddFieldDefinition('title', new CssSelectorHtmlMatcher('h1.entry-title a'))\n            -\u003eaddEntityMapping(PostEntity::class)\n            -\u003eaddResultWriter(PostEntity::class, new JsonFileResultWriter([\n                'directory' =\u003e 'posts/',\n                'filename' =\u003e new EntityPropertyFilenameProvider([\n                    'property' =\u003e 'slug',\n                ]),\n            ]))\n        ;\n    })\n;\n\nreturn $scrawler;\n```\n\nAfter saving the configuration file (perhaps as `config.php`), all you have to\ndo is execute this command:\n\n```bash\nphp vendor/bin/scrawler crawl config.php\n```\n\nThe example shown above will fetch the [http://sobak.pl](http://sobak.pl) page, then iterate\nover all existing post pages (limited by the first 404 occurrence) starting from the 2nd,\nget all posts on each page, map them to `App\\PostEntity` objects and finally write\nthe results to individual JSON files using post slugs as filenames.\n\nAs you can see, with this short piece of code, almost half of it being imports,\nyou 
can easily accomplish quite a tedious task for which you would otherwise need\nto pull in a few libraries, define rules to follow, provide a correct mapping to write\ndown the file... Scrawler does it all for you!\n\n\u003e **Note:** Scrawler _does not_ aim to execute client side code, by design.\n\u003e This is completely doable (look at headless Chrome or even phantom.js if\n\u003e you like history) but I consider it out of scope for this project and have\n\u003e no interest in developing it. Thanks for understanding.\n\n## Documentation\nFor detailed documentation, please check the table of contents below.\n\n- [Getting started](docs/getting-started.md)\n- [Configuration](docs/configuration.md)\n- [Entities](docs/entities.md)\n- [Blocks](docs/blocks.md)\n- [Cookbook](docs/cookbook.md)\n- [Changelog](CHANGELOG.md)\n\nIf you are already familiar with the basic Scrawler concepts you will probably\nbe mostly interested in the _\"Blocks\"_ chapter. A _block_ in Scrawler is an\nabstracted, swappable piece of logic defining the crawling, scraping or result\nprocessing operations which you can customize using one of many built-in classes\nor even your own, tailored implementation. Looking at the example above, you\ncould provide custom logic for `UrlListProvider` or `ResultWriter` (just\ntwo examples of the many available block types).\n\n\u003e **Note:** I have to admit I am not a fan of excessive DocBlocks usage.\n\u003e That's why documentation in the code is sparse and focuses mainly\n\u003e on interfaces, especially ones for creating custom implementations\n\u003e of blocks. Use the documentation linked above and, obviously, read the\n\u003e code.\n\n## Just be polite\nBefore you start tinkering with the library, please remember: some people do not want\ntheir websites to be scraped by bots. With a growing percentage of bandwidth being\nconsumed by bots, it might not only be considered problematic from the business\nstandpoint but also expensive to handle all that traffic. 
Please respect that.\nEven though Scrawler provides implementations for some blocks which might be useful\nto mimic an actual internet user, you should not use them to bypass anti-scraping\nmeasures taken by some website owners.\n\n\u003e **Note:** For testing purposes, you can freely crawl [my website](http://sobak.pl),\n\u003e _excluding_ its subdomains. Just please leave the default user agent.\n\n## License\nScrawler is distributed under the MIT license. For details, please check the\ndedicated [LICENSE](LICENSE.md) file.\n\n## Contributing\nFor details on how to contribute, please check the dedicated\n[CONTRIBUTING](CONTRIBUTING.md) file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsobak%2Fscrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsobak%2Fscrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsobak%2Fscrawler/lists"}