# Pythonic Crawling / Scraping Framework Built on Eventlet

------------------------------------------------------------------

[![Build Status](https://travis-ci.org/crawley-project/crawley.svg)](https://travis-ci.org/crawley-project/crawley)
[![Code Climate](https://codeclimate.com/github/crawley-project/crawley/badges/gpa.svg)](https://codeclimate.com/github/crawley-project/crawley)
[![Stories in Ready](https://badge.waffle.io/crawley-project/crawley.png?label=ready&title=Ready)](https://waffle.io/crawley-project/crawley)

### Features

* High speed web crawler built on Eventlet.
* Supports relational database engines such as PostgreSQL, MySQL, Oracle and SQLite.
* Supports NoSQL databases such as MongoDB and CouchDB. **New!**
* Export your data to JSON, XML or CSV formats. **New!**
* Command line tools.
* Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
* Cookie handlers.
* Very easy to use (see the example below).

### Documentation

http://packages.python.org/crawley/

### Project WebSite

http://project.crawley-cloud.com/

------------------------------------------------------------------

### To install crawley, run

```bash
~$ python setup.py install
```

### Or install it with pip

```bash
~$ pip install crawley
```

------------------------------------------------------------------

### To start a new project, run

```bash
~$ crawley startproject [project_name]
~$ cd [project_name]
```
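The `startproject` command generates a new directory containing, among other files, the three modules edited in the steps below; this listing is a sketch based on the files shown in this README rather than an exhaustive layout:

```bash
~$ ls [project_name]
crawlers.py  models.py  settings.py  ...
```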

------------------------------------------------------------------

### Write your Models

```python
""" models.py """

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):

    #add your table fields here
    updated = Field(Unicode(255))
    package = Field(Unicode(255))
    description = Field(Unicode(255))
```
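Each `Entity` subclass maps to a database table and each `Field` to a column. As a purely hypothetical extension (not part of the pypi example), a second table can be declared with the same API:

```python
""" models.py -- hypothetical additional entity """

from crawley.persistance import Entity, Field, Unicode

class Release(Entity):

    #a second table, declared exactly like Package above
    package = Field(Unicode(255))
    version = Field(Unicode(255))
```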

------------------------------------------------------------------

### Write your Scrapers

```python
""" crawlers.py """

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):

    #specify the urls that can be scraped by this class
    matching_urls = ["%"]

    def scrape(self, response):

        #getting the current document's url.
        current_url = response.url
        #getting the html table.
        table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]

        #for each row, skipping the header and the last row
        for tr in table[1:-1]:

            #obtaining the searched html inside the rows
            td_updated = tr[0]
            td_package = tr[1]
            package_link = td_package[0]
            td_description = tr[2]

            #storing data in the Package table
            Package(updated=td_updated.text, package=package_link.text, description=td_description.text)

class pypiCrawler(BaseCrawler):

    #add your starting urls here
    start_urls = ["http://pypi.python.org/pypi"]

    #add your scraper classes here
    scrapers = [pypiScraper]

    #specify your maximum crawling depth level
    max_depth = 0

    #select your favourite HTML parsing tool
    extractor = XPathExtractor
```
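The scraper above uses `matching_urls = ["%"]`, so it is applied to every crawled page. To run a scraper only on a subset of URLs, list more specific patterns instead; the wildcard pattern in this sketch is an assumption, while the `BaseScraper` interface and the `response` attributes are the ones used in the example above:

```python
""" crawlers.py -- hypothetical scraper restricted to package pages """

from crawley.scrapers import BaseScraper

class pypiPackageScraper(BaseScraper):

    #assumed wildcard pattern: only match package detail pages
    matching_urls = ["%/pypi/%"]

    def scrape(self, response):

        #response.url and response.html are available, as in pypiScraper
        title = response.html.xpath("//title")[0]
        print(title.text)
```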

### Configure your settings

```python
""" settings.py """

import os
PATH = os.path.dirname(os.path.abspath(__file__))

#Don't change this unless you have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)

DATABASE_ENGINE = 'sqlite'
DATABASE_NAME = 'pypi'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''

SHOW_DEBUG_INFO = True
```
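The same setting names work for the other supported engines. A MySQL configuration might look like the sketch below; the engine string and connection values are assumptions, only the setting names come from the file above:

```python
""" settings.py -- hypothetical MySQL variant """

DATABASE_ENGINE = 'mysql'       #assumed engine identifier
DATABASE_NAME = 'pypi'
DATABASE_USER = 'crawley'
DATABASE_PASSWORD = 'secret'
DATABASE_HOST = 'localhost'
DATABASE_PORT = '3306'
```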

------------------------------------------------------------------

### Finally, just run the crawler

```bash
~$ crawley run
```
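
With the SQLite settings above, the scraped rows end up in a local database file. A quick way to inspect them is the standard library `sqlite3` module; the database file name and table name in this sketch are assumptions derived from `DATABASE_NAME = 'pypi'` and the `Package` entity, not values documented by crawley:

```python
""" check_results.py -- hypothetical inspection script """

import sqlite3

#'pypi.sqlite' and the 'package' table name are assumptions
connection = sqlite3.connect("pypi.sqlite")
for row in connection.execute("SELECT * FROM package LIMIT 5"):
    print(row)
connection.close()
```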