https://github.com/jmg/crawley
Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.
- Host: GitHub
- URL: https://github.com/jmg/crawley
- Owner: jmg
- Created: 2011-09-07T04:59:35.000Z (about 14 years ago)
- Default Branch: master
- Last Pushed: 2023-04-14T18:37:02.000Z (over 2 years ago)
- Last Synced: 2024-10-29T17:51:37.910Z (about 1 year ago)
- Language: Python
- Homepage: http://project.crawley-cloud.com
- Size: 1.63 MB
- Stars: 186
- Watchers: 22
- Forks: 33
- Open Issues: 10
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-crawler-cn - crawley - A Python crawler framework based on non-blocking I/O (NIO). (Python)
- awesome-crawler - crawley - Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations. (Python)
README
# Pythonic Crawling / Scraping Framework Built on Eventlet
------------------------------------------------------------------
[Build Status](https://travis-ci.org/crawley-project/crawley)
[Code Climate](https://codeclimate.com/github/crawley-project/crawley)
[Waffle.io](https://waffle.io/crawley-project/crawley)
### Features
* High-speed web crawler built on Eventlet.
* Supports relational database engines such as PostgreSQL, MySQL, Oracle, and SQLite.
* Supports NoSQL databases such as MongoDB and CouchDB. **New!**
* Export your data to JSON, XML, or CSV formats. **New!**
* Command line tools.
* Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
* Cookie Handlers.
* Very easy to use (see the example).
### Documentation
http://packages.python.org/crawley/
### Project WebSite
http://project.crawley-cloud.com/
------------------------------------------------------------------
### To install crawley, run
```bash
~$ python setup.py install
```
### Or install it with pip
```bash
~$ pip install crawley
```
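A quick, optional sanity check (not from the official docs; just a plain import) confirms the package installed correctly:
```python
# If this import succeeds, crawley is available on the current interpreter.
import crawley
print(crawley.__file__)  # path where the package was installed
```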
------------------------------------------------------------------
### To start a new project, run
```bash
~$ crawley startproject [project_name]
~$ cd [project_name]
```
------------------------------------------------------------------
### Write your Models
```python
""" models.py """
from crawley.persistance import Entity, UrlEntity, Field, Unicode
class Package(Entity):
#add your table fields here
updated = Field(Unicode(255))
package = Field(Unicode(255))
description = Field(Unicode(255))
```
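Each `Entity` subclass maps to a database table and each `Field` to a column. As a purely illustrative sketch (the `Release` entity and its fields below are hypothetical, not part of the original example), the same pattern extends to additional tables:
```python
""" models.py (continued) -- hypothetical extra table, same pattern as Package """
from crawley.persistance import Entity, Field, Unicode

class Release(Entity):

    # every Field becomes a column in the generated table
    package = Field(Unicode(255))
    link = Field(Unicode(1024))
```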
------------------------------------------------------------------
### Write your Scrapers
```python
""" crawlers.py """
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *
class pypiScraper(BaseScraper):
#specify the urls that can be scraped by this class
matching_urls = ["%"]
def scrape(self, response):
#getting the current document's url.
current_url = response.url
#getting the html table.
table = response.html.xpath("/html/body/div[5]/div/div/div[3]/table")[0]
#for rows 1 to n-1
for tr in table[1:-1]:
#obtaining the searched html inside the rows
td_updated = tr[0]
td_package = tr[1]
package_link = td_package[0]
td_description = tr[2]
#storing data in Packages table
Package(updated=td_updated.text, package=package_link.text, description=td_description.text)
class pypiCrawler(BaseCrawler):
#add your starting urls here
start_urls = ["http://pypi.python.org/pypi"]
#add your scraper classes here
scrapers = [pypiScraper]
#specify you maximum crawling depth level
max_depth = 0
#select your favourite HTML parsing tool
extractor = XPathExtractor
```
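The `matching_urls` list controls which pages a scraper handles; `"%"` acts as a wildcard matching every crawled URL. As a hedged illustration (the URL pattern and XPath below are invented, not taken from the real PyPI markup), a second scraper could be limited to a narrower set of pages:
```python
""" crawlers.py (continued) -- hypothetical scraper restricted to package pages """
from crawley.scrapers import BaseScraper

class pypiDetailScraper(BaseScraper):

    # '%' is a wildcard, so this scraper only runs on URLs containing '/pypi/'
    matching_urls = ["%/pypi/%"]

    def scrape(self, response):
        # illustrative XPath only; adapt it to the actual page structure
        heading = response.html.xpath("//h1")
        if heading:
            print(heading[0].text)
```
To take effect, such a scraper would also have to be added to the crawler's `scrapers` list.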
### Configure your settings
```python
""" settings.py """
import os
PATH = os.path.dirname(os.path.abspath(__file__))
# Don't change this unless you have renamed the project
PROJECT_NAME = "pypi"
PROJECT_ROOT = os.path.join(PATH, PROJECT_NAME)
DATABASE_ENGINE = 'sqlite'
DATABASE_NAME = 'pypi'
DATABASE_USER = ''
DATABASE_PASSWORD = ''
DATABASE_HOST = ''
DATABASE_PORT = ''
SHOW_DEBUG_INFO = True
```
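The same keys can point at a server-backed engine instead of SQLite. A minimal sketch, assuming the engine string `'mysql'` is accepted by your installed version (the credentials below are placeholders):
```python
""" settings.py -- hypothetical MySQL configuration reusing the same keys """
DATABASE_ENGINE = 'mysql'
DATABASE_NAME = 'pypi'
DATABASE_USER = 'crawley_user'      # placeholder credentials
DATABASE_PASSWORD = 'change-me'
DATABASE_HOST = 'localhost'
DATABASE_PORT = '3306'
```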
------------------------------------------------------------------
### Finally, just run the crawler
```bash
~$ crawley run
```
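With the SQLite settings above, the scraped rows end up in a local database file named after `DATABASE_NAME`. A hedged way to inspect the results afterwards with Python's standard `sqlite3` module (the exact file name and the table name generated for the `Package` entity may differ, so the snippet lists the tables first):
```python
# Inspect the database written by 'crawley run'.
import sqlite3

conn = sqlite3.connect("pypi")  # adjust if the file is created with an extension
cur = conn.cursor()

# list the tables crawley created; generated table names may vary
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cur.fetchall())

conn.close()
```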