Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ceteri/slinky
Slinky, a high-performance web crawler / text analytics in Python, Redis, Hadoop, R, Gephi
- Host: GitHub
- URL: https://github.com/ceteri/slinky
- Owner: ceteri
- Created: 2010-08-01T00:54:35.000Z (over 14 years ago)
- Default Branch: master
- Last Pushed: 2010-08-30T20:37:56.000Z (over 14 years ago)
- Last Synced: 2024-10-26T23:39:46.303Z (about 2 months ago)
- Language: Python
- Homepage: http://ceteri.blogspot.com/
- Size: 135 KB
- Stars: 41
- Watchers: 5
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README
Awesome Lists containing this project
- awesome-starred - ceteri/slinky - Slinky, a high-performance web crawler / text analytics in Python, Redis, Hadoop, R, Gephi (others)
README
## Copyright (C) 2010, Paco Nathan. This work is licensed under
## the BSD License. To view a copy of this license, visit:
## http://creativecommons.org/licenses/BSD/
## or send a letter to:
## Creative Commons, 171 Second Street, Suite 300
## San Francisco, California, 94105, USA
##
## @author Paco Nathan

Slinky provides an open source, high-performance Web Crawler, plus
common Text Analytics, implemented in Python. It:

* uses Redis key/value store for both CrawlQueue and PageStore (sketched below)
* uses SQLite to persist crawled URI content
* uses Neo4j to persist and analyze URI metadata
* uses Hadoop, R, Gephi for Text Analytics and Link Analytics
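As a concrete illustration, the CrawlQueue and PageStore map naturally
onto Redis list and hash structures via redis-py. This is a minimal
sketch, not the actual schema in src/slinky.py; the key names
"crawl_queue" and "page_store" are hypothetical:

import redis

# Minimal sketch of the CrawlQueue / PageStore idea, assuming a local
# Redis server and hypothetical key names -- see src/slinky.py for the
# real schema.
r = redis.Redis(host="localhost", port=6379, db=0)

def enqueue_url(url):
    # CrawlQueue: push a URL for some worker thread to fetch
    r.rpush("crawl_queue", url)

def dequeue_url():
    # CrawlQueue: pop the next URL, or None when the queue is drained
    raw = r.lpop("crawl_queue")
    return raw.decode("utf-8") if raw else None

def store_page(url, html):
    # PageStore: keep fetched content keyed by URI
    r.hset("page_store", url, html)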
This leverages a "Particle Cluster" design pattern. In contrast to
MapReduce, a Particle Cluster is particularly well-suited for
combining highly reliable servers with low-cost/unreliable VMs. In
other words, you can take advantage of CPU + memory + I/O on
available but relatively ephemeral resources -- which might get taken
away without notice. For example, in AWS the key/value store could
run on a large EC2 node, while the distributed tasks run on Spot
Instances, based on pricing and availability. This pattern helps
maximize throughput and reliability while minimizing the cost of
scale-out for long-running jobs.

Required installs for worker nodes:
http://github.com/andymccurdy/redis-py
http://www.crummy.com/software/BeautifulSoup/
http://components.neo4j.org/neo4j.py/
http://jpype.sourceforge.net/
http://henry.precheur.org/python/rfc3339 (already included)

Additional required installs for server nodes:
http://code.google.com/p/redis/downloads
http://www.sqlite.org/download.html
http://neo4j.org/download/

Usage:
# initialize Redis; run on server node...
cd PATH_TO_REDIS
nohup ./redis-server &
# you probably want to config so it does "BGSAVE"
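As an optional sanity check (not part of the original instructions),
redis-py can confirm the server is reachable and trigger the
background save:

import redis

# assumption: the server runs on localhost:6379, db 0
r = redis.Redis(host="localhost", port=6379, db=0)
assert r.ping()   # raises if the server is unreachable
r.bgsave()        # ask Redis to snapshot to disk in the background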
# edit "config.tsv" for your settings...
# e.g., Slinky handles ~100 crawler threads/node, but not in the default config

# initialize CrawlQueue and PageStore; run from any node...
./src/slinky.py redis_host:port:db flush
./src/slinky.py redis_host:port:db config < config.tsv
./src/slinky.py redis_host:port:db whitelist < whitelist.tsv
./src/slinky.py redis_host:port:db seed < urls.tsv
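For illustration, a seed step along these lines could read URLs from
stdin and push them onto the crawl queue; this is a hypothetical
re-implementation, not the code in src/slinky.py:

import sys
import redis

# illustrative only -- "crawl_queue" is a placeholder key name
r = redis.Redis(host="localhost", port=6379, db=0)

for line in sys.stdin:
    url = line.strip().split("\t")[0]  # assume the URL is the first column
    if url:
        r.rpush("crawl_queue", url)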
# perform a crawl; run this on each worker node...
nohup ./src/slinky.py redis_host:port:db perform &
# will poll/sleep indefinitely; use "kill -9 PID" to terminate# persist the crawled URI content; run from any reliable node with attached storage...
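Conceptually, each worker runs a poll/fetch/parse loop like the
following sketch (modern imports and placeholder key names; the actual
logic lives in src/slinky.py):

import time
import urllib.request
import redis
from bs4 import BeautifulSoup

# simplified sketch of the poll/fetch/parse loop; the original targets
# the 2010-era BeautifulSoup and Python 2
r = redis.Redis(host="localhost", port=6379, db=0)

while True:
    raw = r.lpop("crawl_queue")          # placeholder key name
    if raw is None:
        time.sleep(5)                    # queue drained: poll/sleep
        continue
    url = raw.decode("utf-8")
    try:
        html = urllib.request.urlopen(url, timeout=10).read()
    except OSError:
        continue                         # skip unreachable URIs
    r.hset("page_store", url, html)      # placeholder PageStore key
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=True):
        print(url, a["href"], sep="\t")  # emit link edges for analysis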
# persist the crawled URI content; run from any reliable node with attached storage...
nohup ./src/slinky.py redis_host:port:db persist &
# will poll/sleep indefinitely; use "kill -2 PID" to close with no data loss# analyze the crawled URI metadata; run from any reliable node with attached storage...
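The graceful shutdown on "kill -2" suggests a pattern like the
following sketch: drain Redis into SQLite, commit as you go, and catch
SIGINT so the database closes cleanly (table and key names are
placeholders, not Slinky's actual schema):

import signal
import sqlite3
import sys
import time
import redis

# sketch of the persist idea: drain the Redis PageStore into SQLite,
# committing as we go, and catch SIGINT ("kill -2") so the database
# closes with no data loss
r = redis.Redis(host="localhost", port=6379, db=0)
db = sqlite3.connect("pages.db")
db.execute("CREATE TABLE IF NOT EXISTS pages (uri TEXT PRIMARY KEY, content BLOB)")

def shutdown(signum, frame):
    db.commit()
    db.close()
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)

while True:
    for uri, content in r.hgetall("page_store").items():
        db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)",
                   (uri.decode("utf-8"), content))
        r.hdel("page_store", uri)
    db.commit()
    time.sleep(5)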
# analyze the crawled URI metadata; run from any reliable node with attached storage...
nohup ./src/slinky.py redis_host:port:db analyze &
# will poll/sleep indefinitely; use "kill -2 PID" to close with no data loss
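Downstream, the link metadata can be handed off to Gephi or R as a
plain edge list. For example (illustrative only, with a placeholder
Redis key, not Slinky's actual analyze code):

import redis

# dump crawled link edges as a TSV edge list that Gephi or R can import
r = redis.Redis(host="localhost", port=6379, db=0)

with open("edges.tsv", "w") as out:
    out.write("source\ttarget\n")
    for raw in r.lrange("link_edges", 0, -1):
        src, dst = raw.decode("utf-8").split("\t", 1)
        out.write(f"{src}\t{dst}\n")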