https://github.com/odynvolk/web-snake

A simple web crawler in Python that crawls and returns the urls.

links python scraper web web-crawler

Last synced: about 2 months ago

Host: GitHub
URL: https://github.com/odynvolk/web-snake
Owner: odynvolk
Created: 2015-01-18T21:40:59.000Z (over 11 years ago)
Default Branch: master
Last Pushed: 2015-05-10T16:29:18.000Z (about 11 years ago)
Last Synced: 2025-05-15T23:42:07.926Z (about 1 year ago)
Topics: links, python, scraper, web, web-crawler
Language: Python
Size: 1.65 MB
Stars: 2
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

README

# web-snake

A simple web crawler written in Python that crawls pages and returns the URLs it finds.

# INSTALL

If you have downloaded the source code:

```bash
python setup.py install
```

Create indexes on both MongoDB collections (`crawled_urls` and `crawled_domains`) in the `web_snake` database:

```
use web_snake
db.crawled_urls.createIndex( { "hash" : 1 } )
db.crawled_domains.createIndex( { "domain" : 1 } )
```
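
If you would rather create the same indexes from Python than from the mongo shell, a minimal sketch using PyMongo could look like the following. `MongoClient` and `Collection.create_index` are real PyMongo calls; the helper names `ensure_indexes` and `connect_and_index` and the default connection URI are assumptions made for illustration and are not part of web-snake.

```python
def ensure_indexes(db):
    """Create the two ascending indexes from the INSTALL section.

    `db` is expected to be a PyMongo Database for `web_snake`.
    """
    # 1 means ascending order, matching { "hash" : 1 } in the shell.
    db.crawled_urls.create_index([("hash", 1)])
    db.crawled_domains.create_index([("domain", 1)])


def connect_and_index(uri="mongodb://localhost:27017/"):
    """Hypothetical helper: connect to MongoDB and index `web_snake`."""
    # Imported lazily so ensure_indexes() can be used without PyMongo loaded.
    from pymongo import MongoClient

    client = MongoClient(uri)  # the URI is an assumed local default
    ensure_indexes(client["web_snake"])
```

Calling `connect_and_index()` against a running MongoDB instance should have the same effect as the shell commands above; creating an index that already exists with the same definition is a no-op.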

## python

```python
# Python 2 syntax (the `Queue` module and the print statement).
from Queue import Queue
from web_snake.crawler import Crawler
from web_snake.proxies import Proxies
from web_snake.domain_storage import DomainStorage
from web_snake.url_storage import UrlStorage
from web_snake.result_set import ResultSet

# Seed the queue with the URL to start crawling from.
crawl_queue = Queue()
crawl_queue.put('http://www.reddit.com/')

result = ResultSet()
proxies = Proxies('../../commondata/proxies.txt')
urls = UrlStorage()
domains = DomainStorage()

crawler = Crawler(crawl_queue=crawl_queue, result=result, domains=domains,
                  urls=urls, max_level=3, proxies=proxies)
crawler.start()
crawler.join()

print "Found {number} links...".format(number=len(result.all()))
```
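
The `max_level` argument above appears to bound how deep the crawl goes, but this README does not describe the mechanism. As a rough illustration of the general idea only, and not web-snake's actual implementation, here is a depth-limited breadth-first crawl using the Python 3 standard library. The names `LinkParser`, `extract_links`, `crawl` and the `fetch` callback are all hypothetical, as is the choice to treat the seed URL as level 0.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs linked from `html`."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, h)).url for h in parser.hrefs]


def crawl(seed, fetch, max_level=3):
    """Breadth-first crawl starting at `seed`.

    `fetch(url)` must return the page HTML as a string, or None on failure.
    Pages at levels 0 .. max_level - 1 are fetched (assumed semantics).
    Returns the set of every URL discovered, including the seed.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= max_level:
            continue  # discovered, but too deep to fetch
        html = fetch(url)
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return seen
```

Passing `fetch` in as a callback keeps the sketch free of network code, so it can be exercised with an in-memory dictionary of pages (`crawl(seed, pages.get)`) or wired to an HTTP client of your choice.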
