{"id":17046636,"url":"https://github.com/sebobo/shel.crawler","last_synced_at":"2025-04-12T15:33:38.293Z","repository":{"id":44780647,"uuid":"73498425","full_name":"Sebobo/Shel.Crawler","owner":"Sebobo","description":"Neos based crawler for nodes and sites","archived":false,"fork":false,"pushed_at":"2022-12-14T08:08:10.000Z","size":87,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-05-02T06:13:15.439Z","etag":null,"topics":["crawler","neos-cms"],"latest_commit_sha":null,"homepage":"","language":"PHP","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Sebobo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":"FUNDING.yml","license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null},"funding":{"patreon":"shelzle","github":"sebobo"}},"created_at":"2016-11-11T17:40:58.000Z","updated_at":"2023-11-15T11:58:06.000Z","dependencies_parsed_at":"2023-01-28T19:45:52.710Z","dependency_job_id":null,"html_url":"https://github.com/Sebobo/Shel.Crawler","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sebobo%2FShel.Crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sebobo%2FShel.Crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sebobo%2FShel.Crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Sebobo%2FShel.Crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Sebobo","download_url":"https://codeload.github.com/Sebobo/Shel.Crawler/tar.gz/refs/heads/master","host":{"name":"Git
Hub","url":"https://github.com","kind":"github","repositories_count":248589883,"owners_count":21129697,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","neos-cms"],"created_at":"2024-10-14T09:46:57.016Z","updated_at":"2025-04-12T15:33:38.257Z","avatar_url":"https://github.com/Sebobo.png","language":"PHP","funding_links":["https://patreon.com/shelzle","https://github.com/sponsors/sebobo"],"categories":[],"sub_categories":[],"readme":"# Shel.Crawler for Neos CMS\n\nCrawler for Neos CMS nodes and sites.\nIt can be used to warm up the caches after a release or to dump your site as HTML files.\n\n## Installation\n\nRun the following command in your project:\n\n    composer require shel/crawler\n    \n## Usage\n    \nTo crawl all pages based on a single sitemap, run:\n\n```console\n./flow crawler:crawlsitemap --url=http://huve.de.test/sitemap.xml --simultaneousLimit=10 --delay=0\n```\n    \nTo crawl all pages based on all sitemaps listed in a robots.txt file, run:\n\n```console\n./flow crawler:crawlrobotstxt --url=http://huve.de.test/robots.txt --simultaneousLimit=10 --delay=0\n```\n    \n## Node based crawling    \n\nThis command tries to generate the HTML of all pages without making actual requests; it only renders them internally.\nDue to the complexity of the page context, this might not give the desired results, but the resulting \nHTML of all crawled pages can be stored for further use.\n\nThis can be much faster as all pages are rendered in one process and all caches are reused.\n\nTo make this work, you need to provide a valid hostname. 
\n\nThis can be done in one of the following ways:\n\n* have an active domain set up for a site (recommended, the crawler will use the first active domain)\n* set the `Neos.Flow.http.baseUri` setting for Neos in your `Settings.yaml`\n* provide the `baseUri` in general via the environment variable `CRAWLER_BASE_URI` and use the example in `Configuration/Production/Settings.yaml`\n\n```console\n./flow crawler:crawlnodes --siteNodeName \u003csitename\u003e\n```\n\nTo crawl all sites based on their primary active domain:\n\n```console\n./flow crawler:crawlsites\n```\n\nTo crawl all sites based on their primary active domain and use the URLs listed in robots.txt:\n\n```console\n./flow crawler:crawlsites --method robotstxt\n```\n\n### Experimental static file cache \n    \nBy providing the `outputPath`, you can store all crawled content as HTML files. \n\n```console\n./flow crawler:crawlnodes --siteNodeName \u003csitename\u003e --outputPath=Web/cache\n```\n    \nYou can use this as a very simple static file cache by adapting your webserver configuration.\nHere is an example for nginx:\n\n```nginx\n# Serve a cached page matching the request if it exists \nlocation / {\n    default_type \"text/html\";\n    try_files /cache/$uri $uri $uri/ /index.php?$args;\n}\n\n# Serve cache/index(.html) instead of / if it exists\nlocation = / {\n    default_type \"text/html\";\n    try_files /cache/index.html /cache/index /index.php?$args;\n} \n```\n\nReplace the existing `try_files` part with the given code and adapt the path `cache` if you use a different one.\nThis cache feature is experimental, and you are currently responsible for keeping the files up to date and removing old ones.\n\n* Doesn't clear cache\n* Doesn't update automatically on publish\n* Ignores Fusion caching configuration\n* Shortcuts are ignored (open TODO)\n\n## Contributing\n\nContributions or sponsorships are very 
welcome.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebobo%2Fshel.crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsebobo%2Fshel.crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsebobo%2Fshel.crawler/lists"}