https://github.com/redapple/parslepy

Python implementation of the Parsley language for extracting structured data from web pages
https://github.com/redapple/parslepy

Last synced: 6 months ago
JSON representation

Python implementation of the Parsley language for extracting structured data from web pages

Host: GitHub
URL: https://github.com/redapple/parslepy
Owner: redapple
License: mit
Created: 2013-06-10T16:10:47.000Z (almost 12 years ago)
Default Branch: master
Last Pushed: 2017-10-26T10:49:47.000Z (over 7 years ago)
Last Synced: 2024-11-03T11:49:52.355Z (6 months ago)
Language: Python
Size: 226 KB
Stars: 92
Watchers: 10
Forks: 15
Open Issues: 3
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG
- License: LICENSE

Awesome Lists containing this project

starred-awesome - parslepy - Python implementation of the Parsley language for extracting structured data from web pages (Python)

README

        parslepy

========

[![Build Status](https://travis-ci.org/redapple/parslepy.png?branch=master)](https://travis-ci.org/redapple/parslepy)

*parslepy* (pronounced *"parsley-pie"*, */ˈpɑːslipaɪ/*) is a Python implementation

(built on top of [lxml](http://lxml.de) and [cssselect](https://github.com/SimonSapin/cssselect)) of the

[Parsley DSL](https://github.com/fizx/parsley)

for extracting structured data from web pages, as defined by Kyle Maxwell and Andrew Cantino

(see [Parsley's wiki](https://github.com/fizx/parsley/wiki) for more details and original C implementation).

Kudos to Kyle Maxwell (@fizx) for coming up with this smart and easy syntax to define extracting rules.

> Please note that this *Parsley DSL* is **NOT** the same as the Parsley parsing library at https://pypi.python.org/pypi/Parsley

Check out the [official docs](http://pythonhosted.org/parslepy) for more information on how to install

and use *parslepy*. There is also some useful information at the [parslepy Wiki](https://github.com/redapple/parslepy/wiki)

Here is an example of a parselet script that extracts the questions from StackOverflow first page:

    {

        "first_page_questions(//div[contains(@class,'question-summary')])": [{

            "title": ".//h3/a",

            "tags": "div.tags",

            "votes": "div.votes div.mini-counts",

            "views": "div.views div.mini-counts",

            "answers": "div.status div.mini-counts"

        }]

    }

### Install

Install via pip with:

    sudo pip install parslepy

Alternatively, you can install from the latest source code:

    git clone https://github.com/redapple/parslepy.git

    sudo python setup.py install

### Online Resources ###

* [Official Documentation](http://pythonhosted.org/parslepy)

* [Wiki with examples and tutorials](https://github.com/redapple/parslepy/wiki)

* [Parsley DSL](https://github.com/fizx/parsley)

* [JSON Structure details -- Parsley wiki](https://github.com/fizx/parsley/wiki/JSON-Structure)

* [Example Scrapy Spider using Parsley](http://snipplr.com/view/67016/parsley-spider/)

* [Parsley DSL on Hacker News](https://news.ycombinator.com/item?id=1585301)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/redapple/parslepy

Awesome Lists containing this project

README