{"id":18552031,"url":"https://github.com/baraja-core/webcrawler","last_synced_at":"2025-09-03T02:41:11.349Z","repository":{"id":48404222,"uuid":"198583964","full_name":"baraja-core/webcrawler","owner":"baraja-core","description":"Simple crawling websites by following links.","archived":false,"fork":false,"pushed_at":"2024-06-09T20:13:55.000Z","size":90,"stargazers_count":6,"open_issues_count":1,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-24T13:05:09.840Z","etag":null,"topics":["bot","crawler","crawling-websites","fast","php","robot","speed"],"latest_commit_sha":null,"homepage":"https://en.php.brj.cz/downloading-the-whole-site-by-links-in-php","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baraja-core.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"janbarasek","custom":["https://brj.app","https://baraja.cz","https://php.baraja.cz"]}},"created_at":"2019-07-24T07:40:29.000Z","updated_at":"2024-11-21T15:12:43.000Z","dependencies_parsed_at":"2024-06-15T14:31:03.593Z","dependency_job_id":null,"html_url":"https://github.com/baraja-core/webcrawler","commit_stats":{"total_commits":58,"total_committers":2,"mean_commits":29.0,"dds":"0.017241379310344862","last_synced_commit":"e69b21e31c509fd827d7ef4dfb6decb8acdcd8e3"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baraja-core%2Fwebcrawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baraja-
core%2Fwebcrawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baraja-core%2Fwebcrawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baraja-core%2Fwebcrawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baraja-core","download_url":"https://codeload.github.com/baraja-core/webcrawler/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248123697,"owners_count":21051513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bot","crawler","crawling-websites","fast","php","robot","speed"],"created_at":"2024-11-06T21:11:39.939Z","updated_at":"2025-04-09T22:31:47.100Z","avatar_url":"https://github.com/baraja-core.png","language":"PHP","funding_links":["https://github.com/sponsors/janbarasek","https://brj.app","https://baraja.cz","https://php.baraja.cz"],"categories":[],"sub_categories":[],"readme":"\u003cdiv align='center'\u003e\n  \u003cpicture\u003e\n    \u003csource media='(prefers-color-scheme: dark)' srcset='https://cdn.brj.app/images/brj-logo/logo-regular.png'\u003e\n    \u003cimg src='https://cdn.brj.app/images/brj-logo/logo-dark.png' alt='BRJ logo'\u003e\n  \u003c/picture\u003e\n  \u003cbr\u003e\n  \u003ca href=\"https://brj.app\"\u003eBRJ organisation\u003c/a\u003e\n\u003c/div\u003e\n\u003chr\u003e\n\nWeb crawler\n===========\n\n![Integrity check](https://github.com/baraja-core/webcrawler/workflows/Integrity%20check/badge.svg)\n\nA simple library for crawling websites by following links, with minimal 
dependencies.\n\n[Czech documentation](https://php.baraja.cz/stazeni-celeho-webu-po-odkazech)\n\n📦 Installation\n---------------\n\nIt's best to use [Composer](https://getcomposer.org) for installation, and you can also find the package on\n[Packagist](https://packagist.org/packages/baraja-core/webcrawler) and\n[GitHub](https://github.com/baraja-core/webcrawler).\n\nTo install, simply use the command:\n\n```\n$ composer require baraja-core/webcrawler\n```\n\nYou can use the package manually by creating an instance of the internal classes, or register a DIC extension to link the services directly to the Nette Framework.\n\nHow to use\n----------\n\nThe crawler can run without dependencies.\n\nWith the default settings, create an instance and call the `crawl()` method:\n\n```php\n$crawler = new \\Baraja\\WebCrawler\\Crawler;\n\n$result = $crawler-\u003ecrawl('https://example.com');\n```\n\nThe `$result` variable will contain an entity of type `CrawledResult`.\n\nAdvanced checking of multiple URLs\n----------------------------------\n\nIn practice, you often need to download multiple URLs within a single domain and check whether specific URLs work.\n\nSimple example:\n\n```php\n$crawler = new \\Baraja\\WebCrawler\\Crawler;\n\n$result = $crawler-\u003ecrawlList(\n    'https://example.com', // Starting (main) URL\n    [ // Additional URLs\n        'https://example.com/error-404',\n        '/robots.txt', // Relative links are also allowed\n        '/web.config',\n    ]\n);\n```\n\nNote: The **robots.txt** file and the sitemap are downloaded automatically if they exist.\n\nSettings\n--------\n\nIn the constructor of the `Crawler` service you can define your project-specific configuration.\n\nFor example:\n\n```php\n$crawler = new \\Baraja\\WebCrawler\\Crawler(\n    new \\Baraja\\WebCrawler\\Config([\n        // key =\u003e value\n    ])\n);\n```\n\nNo value is required. 
Pass the configuration as a key-value array.\n\nConfiguration options:\n\n| Option                  | Default value | Type and description |\n|-------------------------|---------------|----------------------|\n| `followExternalLinks`   | `false`       | `Bool`: Follow links to other domains? By default the crawler stays within the given domain. |\n| `sleepBetweenRequests`  | `1000`        | `Int`: Pause between requests, in milliseconds. |\n| `maxHttpRequests`       | `1000000`     | `Int`: Maximum number of HTTP requests (crawl budget). |\n| `maxCrawlTimeInSeconds` | `30`          | `Int`: Stop crawling once this time limit (in seconds) is exceeded. |\n| `allowedUrls`           | `['.+']`      | `String[]`: List of valid regular expressions matching allowed URLs. |\n| `forbiddenUrls`         | `['']`        | `String[]`: List of valid regular expressions matching forbidden URLs. |\n\n\n📄 License\n-----------\n\n`baraja-core/webcrawler` is licensed under the MIT license. See the [LICENSE](https://github.com/baraja-core/webcrawler/blob/master/LICENSE) file for more details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaraja-core%2Fwebcrawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaraja-core%2Fwebcrawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaraja-core%2Fwebcrawler/lists"}