https://github.com/gingray/pycrawl

Last synced: 12 months ago

Host: GitHub
URL: https://github.com/gingray/pycrawl
Owner: gingray
Created: 2014-02-06T17:28:49.000Z (over 12 years ago)
Default Branch: master
Last Pushed: 2014-02-08T09:39:28.000Z (over 12 years ago)
Last Synced: 2025-02-26T08:14:58.712Z (over 1 year ago)
Language: Python
Size: 137 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README

README

          A simple way to fetch data over the internet

Example:

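from data_fetcher import RangeFetcher
from web_client import WebClient
import lxml.html

# Called for each fetched page with the page URL and its content.
def worker(url, content):
    print(url)
    root = lxml.html.fromstring(content)
    # Collect the href attribute of every link on the page.
    hrefs = root.xpath(".//a/@href")
    for row in hrefs:
        print(row)

web_client = WebClient()
template = "http://somesite.com/%s"
# Crawl the URLs built from the template for the range 1 to 3.
range_parser = RangeFetcher(web_client, worker, template, 1, 3)
range_parser.process()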

The main idea is that you have a site whose URLs are organized like this:

http://somesite.com/1

http://somesite.com/2

http://somesite.com/3

You can set a URL template and a range; the fetcher will crawl all of the resulting URLs and run worker on each one.
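
To make the template-and-range idea concrete, here is a minimal sketch of how such a template could expand into URLs. It is not taken from the pycrawl source: expand_template is a hypothetical helper, and the inclusive upper bound is an assumption based on the example above, where the arguments 1 and 3 correspond to the three listed URLs.

```python
def expand_template(template, start, stop):
    """Build one URL per number by substituting it into the template.

    Hypothetical helper, not part of pycrawl. Both bounds are treated as
    inclusive, which is an assumption drawn from the README example.
    """
    return [template % number for number in range(start, stop + 1)]


# Print each URL that the sketch generates for the README's template.
for url in expand_template("http://somesite.com/%s", 1, 3):
    print(url)
```

The "%s" placeholder is filled with Python's printf-style string formatting, so any template containing exactly one placeholder would work the same way.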

You can also use UrlFileFetcher. The main idea is the same, but the links are taken from a file.

Example:

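from data_fetcher import UrlFileFetcher
from web_client import WebClient
import lxml.html

# Called for each fetched page with the page URL and its content.
def worker(url, content):
    print(url)
    root = lxml.html.fromstring(content)
    # Collect the href attribute of every link on the page.
    hrefs = root.xpath(".//a/@href")
    for row in hrefs:
        print(row)

web_client = WebClient()
filename = "links.txt"
# Crawl the links listed in the file instead of a numeric range.
range_parser = UrlFileFetcher(web_client, worker, filename)
range_parser.process()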
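
The README does not describe the layout of links.txt. As a rough sketch, assuming the file holds one URL per line, the links could be read as shown below. read_urls is a hypothetical helper and not part of pycrawl, and skipping blank lines is likewise an assumption.

```python
def read_urls(lines):
    """Return the stripped, non-empty lines from an iterable of text lines.

    Hypothetical helper, not part of pycrawl. It assumes one URL per line
    and ignores blank lines.
    """
    urls = []
    for line in lines:
        line = line.strip()
        if line:
            urls.append(line)
    return urls


# Sample data standing in for the contents of links.txt.
sample = ["http://somesite.com/1\n", "\n", "http://somesite.com/2\n"]
for url in read_urls(sample):
    print(url)
```

Because the helper accepts any iterable of lines, an open file object can be passed to it directly.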
