{"id":13558018,"url":"https://github.com/ceteri/slinky","last_synced_at":"2025-04-17T06:31:05.978Z","repository":{"id":998652,"uuid":"810001","full_name":"ceteri/slinky","owner":"ceteri","description":"Slinky, a high-performance web crawler / text analytics in Python, Redis, Hadoop, R, Gephi","archived":true,"fork":false,"pushed_at":"2010-08-30T20:37:56.000Z","size":138,"stargazers_count":41,"open_issues_count":0,"forks_count":4,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-29T22:01:33.985Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://ceteri.blogspot.com/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"avendael/atomic-emacs","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ceteri.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-08-01T00:54:35.000Z","updated_at":"2025-03-07T17:36:47.000Z","dependencies_parsed_at":"2022-08-06T10:01:20.859Z","dependency_job_id":null,"html_url":"https://github.com/ceteri/slinky","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fslinky","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fslinky/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fslinky/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ceteri%2Fslinky/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ceteri","download_url":"https://codeload.github.com/ceteri/slinky/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github"
,"repositories_count":249319594,"owners_count":21250578,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T12:04:41.319Z","updated_at":"2025-04-17T06:31:05.616Z","avatar_url":"https://github.com/ceteri.png","language":"Python","funding_links":[],"categories":["Python","others"],"sub_categories":[],"readme":"## Copyright (C) 2010, Paco Nathan. This work is licensed under \n## the BSD License. To view a copy of this license, visit:\n##    http://creativecommons.org/licenses/BSD/\n## or send a letter to:\n##    Creative Commons, 171 Second Street, Suite 300\n##    San Francisco, California, 94105, USA\n##\n## @author Paco Nathan \u003cceteri@gmail.com\u003e\n\n\nSlinky provides an open source, high-performance Web Crawler, plus\ncommon Text Analytics, implemented in Python. \n\n  * uses Redis key/value store for both CrawlQueue and PageStore\n  * uses SQLite to persist crawled URI content\n  * uses Neo4j to persist and analyze URI metadata\n  * uses Hadoop, R, Gephi for Text Analytics and Link Analytics\n\nThis leverages a \"Particle Cluster\" design pattern. In contrast to\nMapReduce, a Particle Cluster is particularly well-suited for\ncombining highly reliable servers plus low-cost/unreliable VMs. In\nother words, you can take advantage of CPU + memory + I/O on available\nbut relatively ephemeral resources -- which might get taken away\nwithout notice. For example, in AWS, the key/value store could run on a\nlarge EC2 node, while the distributed tasks run on Spot Instances --\nbased on pricing and availability. 
This pattern helps maximize\nthroughput and reliability while minimizing the cost of scale-out for\nlong-running jobs.\n\n\nRequired installs for worker nodes:\n\n\thttp://github.com/andymccurdy/redis-py\n\thttp://www.crummy.com/software/BeautifulSoup/\n\thttp://components.neo4j.org/neo4j.py/\n\thttp://jpype.sourceforge.net/\n\thttp://henry.precheur.org/python/rfc3339 (already included)\n\nAdditional required installs for server nodes:\n\n\thttp://code.google.com/p/redis/downloads\n\thttp://www.sqlite.org/download.html\n\thttp://neo4j.org/download/\n\nUsage:\n\t# initialize Redis; run on server node...\n\tcd PATH_TO_REDIS\n\tnohup ./redis-server \u0026\n\t# you probably want to config so it does \"BGSAVE\"\n\n\t# edit \"config.tsv\" for your settings...\n\t# e.g., Slinky handles ~100 crawler threads/node, but not in default config\n\n\t# initialize CrawlQueue and PageStore; run from any node...\n\t./src/slinky.py redis_host:port:db flush\n\t./src/slinky.py redis_host:port:db config \u003c config.tsv\n\t./src/slinky.py redis_host:port:db whitelist \u003c whitelist.tsv\n\t./src/slinky.py redis_host:port:db seed \u003c urls.tsv\n\n\t# perform a crawl; run this on each worker node...\n\tnohup ./src/slinky.py redis_host:port:db perform \u0026\n\t# will poll/sleep indefinitely; use \"kill -9 PID\" to terminate\n\n\t# persist the crawled URI content; run from any reliable node with attached storage...\n\tnohup ./src/slinky.py redis_host:port:db persist \u0026\n\t# will poll/sleep indefinitely; use \"kill -2 PID\" to close with no data loss\n\n\t# analyze the crawled URI metadata; run from any reliable node with attached storage...\n\tnohup ./src/slinky.py redis_host:port:db analyze \u0026\n\t# will poll/sleep indefinitely; use \"kill -2 PID\" to close with no data 
loss\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceteri%2Fslinky","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fceteri%2Fslinky","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fceteri%2Fslinky/lists"}