https://github.com/odynvolk/web-snake
A simple web crawler in Python that crawls and returns the urls.
- Host: GitHub
- URL: https://github.com/odynvolk/web-snake
- Owner: odynvolk
- Created: 2015-01-18T21:40:59.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2015-05-10T16:29:18.000Z (about 11 years ago)
- Last Synced: 2025-05-15T23:42:07.926Z (about 1 year ago)
- Topics: links, python, scraper, web, web-crawler
- Language: Python
- Size: 1.65 MB
- Stars: 2
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# web-snake
A simple web crawler in Python that crawls pages and returns the URLs it finds.
# INSTALL
If you have downloaded the source code:
```bash
python setup.py install
```
Create an index on each of the two collections in the `web_snake` MongoDB database (run in the mongo shell):
```
use web_snake
db.crawled_urls.createIndex( { "hash" : 1 } )
db.crawled_domains.createIndex( { "domain" : 1 } )
```
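The same two indexes could also be created from Python. The sketch below is not part of web-snake: `ensure_indexes` is a hypothetical helper, and it assumes it is handed a pymongo `Database` handle for `web_snake`. The only library call it relies on is pymongo's `Collection.create_index`.

```python
def ensure_indexes(db):
    """Create the two single-field ascending indexes from the shell snippet.

    `db` is assumed to be a pymongo Database handle for `web_snake`
    (hypothetical helper, not part of the web-snake package).
    """
    # 1 means ascending order, matching { "hash" : 1 } in the shell.
    urls_index = db.crawled_urls.create_index([("hash", 1)])
    domains_index = db.crawled_domains.create_index([("domain", 1)])
    return urls_index, domains_index
```

With pymongo installed and a MongoDB server reachable on the default host and port, this could be called as `ensure_indexes(MongoClient().web_snake)`. The README does not state any connection details, so adjust that part to your setup.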
## python
```python
# Python 2 example: the standard-library module is named `Queue` there (`queue` in Python 3).
from Queue import Queue

from web_snake.crawler import Crawler
from web_snake.proxies import Proxies
from web_snake.domain_storage import DomainStorage
from web_snake.url_storage import UrlStorage
from web_snake.result_set import ResultSet

# Seed the crawl queue with the start URL.
crawl_queue = Queue()
crawl_queue.put('http://www.reddit.com/')

result = ResultSet()
proxies = Proxies('../../commondata/proxies.txt')
urls = UrlStorage()
domains = DomainStorage()

crawler = Crawler(crawl_queue=crawl_queue, result=result, domains=domains,
                  urls=urls, max_level=3, proxies=proxies)
# Start the crawler and wait for it to finish.
crawler.start()
crawler.join()

# Parenthesized print works as a statement in Python 2 and as a function in Python 3.
print("Found {number} links...".format(number=len(result.all())))
```
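The README does not say how `max_level` is interpreted. A plausible reading is that it bounds how many link hops the crawler follows from the seed URL. The sketch below illustrates that reading only; it is not web-snake's implementation. `crawl_levels` and the in-memory `links` graph are hypothetical stand-ins for fetching and parsing real pages.

```python
from collections import deque


def crawl_levels(start_url, links, max_level):
    """Breadth-first walk over `links`, at most `max_level` hops from the start."""
    queue = deque([(start_url, 0)])
    seen = {start_url}
    found = []
    while queue:
        url, level = queue.popleft()
        found.append(url)
        if level == max_level:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in seen:  # skip URLs that are already queued or visited
                seen.add(target)
                queue.append((target, level + 1))
    return found


# Hypothetical link graph standing in for fetched pages.
links = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://d.example/"],
    "http://d.example/": ["http://e.example/"],
}
print(crawl_levels("http://a.example/", links, max_level=2))
```

Under this assumption, `http://e.example/` is never reached with `max_level=2` because it sits three hops from the seed.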