https://github.com/crackcomm/crawl-cache


Host: GitHub
URL: https://github.com/crackcomm/crawl-cache
Owner: crackcomm
License: apache-2.0
Created: 2017-03-19T16:39:57.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2017-03-24T22:42:07.000Z (about 9 years ago)
Last Synced: 2024-06-20T03:47:10.899Z (almost 2 years ago)
Topics: cache, crawl
Language: Go
Size: 9.77 KB
Stars: 1
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# crawl-cache

[![Circle CI](https://img.shields.io/circleci/project/crackcomm/crawl-cache.svg)](https://circleci.com/gh/crackcomm/crawl-cache)

An NSQ crawl queue interceptor that caches requests.

The `http://`, `https://`, and `www.` prefixes are ignored when caching URLs.
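To illustrate the prefix rule, here is a minimal sketch of how a cache key could be derived from a URL. It is not taken from this repository: `cacheKey` and the in-memory `seen` set are hypothetical names, and treating an already-seen key as a duplicate is an assumption about how the cache is used.

```go
package main

import (
	"fmt"
	"strings"
)

// cacheKey is a hypothetical helper (not part of crawl-cache's API).
// It drops a leading "https://" or "http://" and then a leading "www.",
// so URLs that differ only in those prefixes map to the same key.
func cacheKey(rawURL string) string {
	key := strings.TrimPrefix(rawURL, "https://")
	key = strings.TrimPrefix(key, "http://")
	key = strings.TrimPrefix(key, "www.")
	return key
}

func main() {
	// Assumption: an in-memory set stands in for whatever store the real tool uses.
	seen := make(map[string]struct{})

	urls := []string{
		"https://www.google.com/search?q=Github",
		"http://google.com/search?q=Github",
	}
	for _, u := range urls {
		key := cacheKey(u)
		if _, ok := seen[key]; ok {
			fmt.Printf("already cached: %s (key %s)\n", u, key)
			continue
		}
		seen[key] = struct{}{}
		fmt.Printf("new request:    %s (key %s)\n", u, key)
	}
}
```

Under this sketch both URLs share the key `google.com/search?q=Github`, so the second one is reported as already cached.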

## Usage

Example usage from the command line:

```sh
# Install the command-line application for crawl scheduling
$ go install github.com/crackcomm/crawl/nsq/crawl-schedule
# Consume `google_search_cache` and produce to `google_search`
$ crawl-cache --topic google_search_cache:google_search &
# Schedule a crawl of Google search results
$ crawl-schedule \
    --topic google_search_cache \
    --callback github.com/crackcomm/go-google-search/spider.Google \
    "https://www.google.com/search?q=Github"
```
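The `--topic` value in the example above pairs the topic to consume with the topic to produce, separated by a colon. The following sketch shows one way such a value could be split. `parseTopicFlag` is a hypothetical function written for illustration, and its validation rules are assumptions, not the tool's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseTopicFlag is a hypothetical helper (not part of crawl-cache).
// It splits a "<consume>:<produce>" value into its two topic names
// and rejects values where either side is missing.
func parseTopicFlag(value string) (consume, produce string, err error) {
	parts := strings.SplitN(value, ":", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("expected <consume>:<produce>, got %q", value)
	}
	return parts[0], parts[1], nil
}

func main() {
	consume, produce, err := parseTopicFlag("google_search_cache:google_search")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("consume=%s produce=%s\n", consume, produce)
}
```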

Callbacks are currently ignored; only URLs are cached.

## License

Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/

## Authors

* [Łukasz Kurowski](https://github.com/crackcomm)
