https://github.com/crackcomm/crawl-cache
- Host: GitHub
- URL: https://github.com/crackcomm/crawl-cache
- Owner: crackcomm
- License: apache-2.0
- Created: 2017-03-19T16:39:57.000Z
- Default Branch: master
- Last Pushed: 2017-03-24T22:42:07.000Z
- Last Synced: 2024-06-20T03:47:10.899Z
- Topics: cache, crawl
- Language: Go
- Size: 9.77 KB
- Stars: 1
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# crawl-cache
[CircleCI](https://circleci.com/gh/crackcomm/crawl-cache)
An NSQ crawl queue interceptor that caches requests.
The `http://`, `https://`, and `www.` prefixes are ignored when URLs are cached.
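
As a rough illustration of the prefix handling, the sketch below derives a cache key by stripping those three prefixes. The function name `normalizeURL` is hypothetical and not taken from the crawl-cache source, and the exact matching rules (for example, case sensitivity) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeURL is a hypothetical helper: it drops a leading "http://" or
// "https://" and then a leading "www." so that variants of one address
// map to the same cache key. Matching is case-sensitive (an assumption).
func normalizeURL(raw string) string {
	key := strings.TrimSpace(raw)
	for _, prefix := range []string{"http://", "https://", "www."} {
		key = strings.TrimPrefix(key, prefix)
	}
	return key
}

func main() {
	urls := []string{
		"https://www.google.com/search?q=Github",
		"http://google.com/search?q=Github",
		"www.google.com/search?q=Github",
	}
	for _, u := range urls {
		// Each line prints: google.com/search?q=Github
		fmt.Println(normalizeURL(u))
	}
}
```

With this scheme, all three spellings of the address share one key, so only the first would count as new.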
## Usage
Example usage from command line:
```sh
# Install command line application for crawl scheduling
$ go install github.com/crackcomm/crawl/nsq/crawl-schedule
# It will consume `google_search_cache` and produce `google_search`
$ crawl-cache --topic google_search_cache:google_search &
# Schedule crawl of google search results
$ crawl-schedule \
--topic google_search_cache \
--callback github.com/crackcomm/go-google-search/spider.Google \
"https://www.google.com/search?q=Github"
```
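
The `--topic` value in the example pairs a topic to consume with a topic to produce, separated by a colon. A minimal sketch of parsing such a value might look like the following; `topicMapping` and `parseTopicMapping` are hypothetical names, and the validation rules are assumptions rather than the tool's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// topicMapping is a hypothetical type holding the two halves of a
// "consume:produce" flag value.
type topicMapping struct {
	Consume string // topic the interceptor reads from
	Produce string // topic that forwarded requests are published to
}

// parseTopicMapping splits a value such as
// "google_search_cache:google_search" at the first colon.
// Rejecting empty halves is an assumption made for this sketch.
func parseTopicMapping(value string) (topicMapping, error) {
	parts := strings.SplitN(value, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return topicMapping{}, fmt.Errorf("invalid topic mapping %q, want consume:produce", value)
	}
	return topicMapping{Consume: parts[0], Produce: parts[1]}, nil
}

func main() {
	m, err := parseTopicMapping("google_search_cache:google_search")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("consume=%s produce=%s\n", m.Consume, m.Produce)
}
```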
Callbacks are currently ignored; only URLs are cached.
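
To make the URL-only behavior concrete, here is a small in-memory sketch in which two requests with different callbacks but the same address collide in the cache. The `request` and `urlCache` types, the `shouldForward` method, and the callback `example.com/spider.Other` are all hypothetical. The sketch also assumes that a cached URL means the request is not forwarded again, which the README does not spell out.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// request is a hypothetical stand-in for a queued crawl request.
type request struct {
	URL       string
	Callbacks []string
}

// urlCache remembers which URLs have been seen. It lives in memory
// only; how crawl-cache actually stores entries is not covered here.
type urlCache struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newURLCache() *urlCache {
	return &urlCache{seen: make(map[string]struct{})}
}

// cacheKey strips the scheme and "www." prefixes from the URL.
func cacheKey(rawURL string) string {
	key := rawURL
	for _, prefix := range []string{"http://", "https://", "www."} {
		key = strings.TrimPrefix(key, prefix)
	}
	return key
}

// shouldForward records the request's URL and reports whether it was
// new. The Callbacks field is deliberately not part of the key.
func (c *urlCache) shouldForward(r request) bool {
	key := cacheKey(r.URL)
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.seen[key]; ok {
		return false
	}
	c.seen[key] = struct{}{}
	return true
}

func main() {
	cache := newURLCache()
	first := request{
		URL:       "https://www.google.com/search?q=Github",
		Callbacks: []string{"github.com/crackcomm/go-google-search/spider.Google"},
	}
	second := request{
		URL:       "http://google.com/search?q=Github",
		Callbacks: []string{"example.com/spider.Other"}, // hypothetical callback
	}
	fmt.Println(cache.shouldForward(first))  // true: URL not seen before
	fmt.Println(cache.shouldForward(second)) // false: same key, callback ignored
}
```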
## License
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
## Authors
* [Łukasz Kurowski](https://github.com/crackcomm)