Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/igrigorik/canicrawl
Hosted robots.txt permissions verifier
- Host: GitHub
- URL: https://github.com/igrigorik/canicrawl
- Owner: igrigorik
- Created: 2011-05-15T21:02:59.000Z (over 13 years ago)
- Default Branch: master
- Last Pushed: 2014-06-07T18:09:31.000Z (over 10 years ago)
- Last Synced: 2024-05-08T18:35:47.509Z (8 months ago)
- Language: Go
- Homepage: http://canicrawl.appspot.com
- Size: 134 KB
- Stars: 23
- Watchers: 4
- Forks: 5
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Can I Crawl (this URL)
Hosted robots.txt permissions verifier.
## ENDPOINTS
- [`/`](http://canicrawl.appspot.com/) This page.
- [`/check`](http://canicrawl.appspot.com/check) Runs the robots.txt verification check.

## Description
Verifies whether the provided URL may be crawled by your User-Agent. Pass in the destination URL and the service will download, parse, and check the [robots.txt](http://www.robotstxt.org/) file for permissions. If you're allowed to continue, it issues a **3XX** redirect; otherwise, it returns a **4XX** code.
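The permission check described above can be sketched roughly as follows. This is not the service's actual implementation: `isAllowed` is a hypothetical helper, and it assumes a deliberately simplified reading of robots.txt (only `User-agent` and `Disallow` lines, plain prefix matching, no `Allow` rules or wildcards).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// isAllowed reports whether userAgent may fetch path according to robotsTxt.
// Hypothetical, simplified logic: only User-agent and Disallow lines are
// honored, with plain prefix matching. A group naming the agent takes
// precedence over the "*" group.
func isAllowed(robotsTxt, userAgent, path string) bool {
	ua := strings.ToLower(userAgent)
	var specific, wildcard []string // Disallow prefixes per group kind
	var matchesUA, matchesStar, inAgentLines, sawSpecific bool

	sc := bufio.NewScanner(strings.NewReader(robotsTxt))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue // blank or malformed line
		}
		key = strings.ToLower(strings.TrimSpace(key))
		val = strings.TrimSpace(val)

		switch key {
		case "user-agent":
			if !inAgentLines {
				// A new group starts: forget the previous group's matches.
				matchesUA, matchesStar = false, false
			}
			inAgentLines = true
			token := strings.ToLower(val)
			if token == "*" {
				matchesStar = true
			} else if token != "" && strings.Contains(ua, token) {
				matchesUA = true
				sawSpecific = true
			}
		case "disallow":
			inAgentLines = false
			if val == "" {
				continue // an empty Disallow permits everything
			}
			if matchesUA {
				specific = append(specific, val)
			}
			if matchesStar {
				wildcard = append(wildcard, val)
			}
		default:
			inAgentLines = false
		}
	}

	rules := wildcard
	if sawSpecific {
		rules = specific
	}
	for _, prefix := range rules {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}

func main() {
	// Sample rules made up for illustration.
	robots := "User-agent: *\nDisallow: /search\n"
	for _, p := range []string{"/", "/search"} {
		fmt.Printf("%-8s allowed=%v\n", p, isAllowed(robots, "curl", p))
	}
}
```

A service built on such a helper would map a `true` result to the 3XX redirect and a `false` result to the 4XX response.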
## Examples
### $ curl -v http://canicrawl.appspot.com/check?url=http://google.com/

    < HTTP/1.0 302 Found
    < Location: http://www.google.com/

### $ curl -v http://canicrawl.appspot.com/check?url=http://google.com/search

    < HTTP/1.0 403 Forbidden
    < Content-Length: 23
    {"status":"disallowed"}

### $ curl -H'User-Agent: MyCustomAgent' -v http://canicrawl.appspot.com/check?url=http://google.com/

    > User-Agent: MyCustomAgent
    < HTTP/1.0 302 Found
    < Location: http://www.google.com/

Note: [google.com/robots.txt](http://google.com/robots.txt) disallows requests to _/search_.
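The same requests can be made from Go. The sketch below is illustrative only: `canCrawl` and `newStubService` are hypothetical names, and the stub server merely imitates the responses shown above (a 302 with a `Location` header, or a 403 with `{"status":"disallowed"}` for paths under `/search`) so that the example does not depend on the hosted service being reachable. The client disables redirect following so it can read the 3XX answer itself, and it URL-escapes the `url` parameter, which the curl examples leave unescaped.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// canCrawl asks a canicrawl-style /check endpoint whether target may be
// crawled by userAgent. On success it also returns the redirect Location.
func canCrawl(serviceBase, target, userAgent string) (bool, string, error) {
	client := &http.Client{
		// Return the 3XX response itself instead of following it.
		CheckRedirect: func(*http.Request, []*http.Request) error {
			return http.ErrUseLastResponse
		},
	}
	req, err := http.NewRequest(http.MethodGet,
		serviceBase+"/check?url="+url.QueryEscape(target), nil)
	if err != nil {
		return false, "", err
	}
	req.Header.Set("User-Agent", userAgent)

	resp, err := client.Do(req)
	if err != nil {
		return false, "", err
	}
	defer resp.Body.Close()

	switch {
	case resp.StatusCode >= 300 && resp.StatusCode < 400:
		return true, resp.Header.Get("Location"), nil
	case resp.StatusCode >= 400 && resp.StatusCode < 500:
		return false, "", nil
	default:
		return false, "", fmt.Errorf("unexpected status %s", resp.Status)
	}
}

// newStubService starts a local stand-in for the hosted service. It is an
// assumption for demonstration: it disallows any target path under /search
// and redirects everything else to the target URL.
func newStubService() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		target, err := url.Parse(r.URL.Query().Get("url"))
		if err != nil || r.URL.Path != "/check" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if strings.HasPrefix(target.Path, "/search") {
			w.Header().Set("Content-Type", "application/json")
			w.WriteHeader(http.StatusForbidden)
			fmt.Fprint(w, `{"status":"disallowed"}`)
			return
		}
		http.Redirect(w, r, target.String(), http.StatusFound)
	}))
}

func main() {
	srv := newStubService()
	defer srv.Close()

	for _, target := range []string{"http://google.com/", "http://google.com/search"} {
		ok, location, err := canCrawl(srv.URL, target, "MyCustomAgent")
		fmt.Println(target, ok, location, err)
	}
}
```

To query the hosted service instead, pass its base URL (for example `http://canicrawl.appspot.com`) as `serviceBase`.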
### License
MIT License - Copyright (c) 2011 [Ilya Grigorik](http://www.igvita.com/)