https://github.com/neegor/wanish
Open Source implementation of Summly
- Host: GitHub
- URL: https://github.com/neegor/wanish
- Owner: neegor
- License: mit
- Created: 2015-02-27T10:24:02.000Z (over 9 years ago)
- Default Branch: master
- Last Pushed: 2016-12-11T12:48:47.000Z (almost 8 years ago)
- Last Synced: 2024-10-03T20:06:57.924Z (about 1 month ago)
- Topics: parsing, python, readability, summly
- Language: Python
- Homepage:
- Size: 1.85 MB
- Stars: 47
- Watchers: 5
- Forks: 15
- Open Issues: 1
Metadata Files:
- Readme: README.rst
- License: LICENSE
.. image:: https://codeclimate.com/github/reefeed/wanish/badges/gpa.svg
   :target: https://codeclimate.com/github/reefeed/wanish
   :alt: Code Climate

About
-----

This package summarizes text by reducing an article to a few sentences
that retain the main idea of the text.

In addition, the package extracts the following from the document:

1. Canonical URL of the article
2. Title of the article
3. URL of the image characterizing this article
4. A clean HTML version of the document, stripped of excess information
   (headers, footers, navigation, advertisements, etc.) and formed on the
   basis of schema.org structured data

`DEMO`_

Installation
------------

::

    easy_install wanish

or

::

    pip install wanish

Usage
-----

.. code:: python

    from wanish import Wanish

    # placeholder: replace with the URL of the article to process
    document_url = "https://example.com/article"

    wanish = Wanish()
    wanish.perform_url(document_url)

    # getting doc's source canonical url
    url = wanish.url

    # getting document's title
    title = wanish.title

    # getting url of related image if document has it
    image_url = wanish.image_url

    # getting two-letter code of the document's language (en, de, es...)
    language_code = wanish.language

    # getting a clean html page of a document with article
    clean_html = wanish.clean_html

    # getting a short summarized description of the article reduced to several sentences (5 by default)
    description = wanish.description

Available kwarg options for the *Wanish()* class (all are optional):

.. code:: python

    wanish = Wanish(url=document_url,
                    positive_keywords=["main", "story"],
                    negative_keywords=["banner", "adv", "similar", "top-ad"],
                    summary_sentences_qty=5,
                    headers={'user-agent': 'test-purposes/0.0.1'})

- **url:** URL of a document, passed to the constructor. If set,
  *self.perform_url(url)* is launched automatically after
  initialization. Default is None.
- **positive_keywords:** A list of positive search patterns in classes
  and ids, for example: *["main", "story"]*. Default is None.
- **negative_keywords:** A list of negative search patterns in classes
  and ids, for example: *["banner", "adv", "similar", "top-ad"]*.
  Default is None.
- **summary_sentences_qty:** Maximum number of sentences in the
  summarized text of the document. Default is 5.
- **headers:** Dict of additional custom headers for the GET request that
  obtains the web page of the article. Default is None.

Special Thanks
--------------

- https://github.com/nltk/nltk
- https://github.com/buriy/python-readability
- https://github.com/saffsd/langid.py

.. _DEMO: http://reefeed.com
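
Example: collecting the results
-------------------------------

The attributes listed under Usage can be gathered into a plain dictionary, for
instance to serialize them as JSON. The following is a minimal sketch and not
part of the package: the names ``FIELDS``, ``collect_fields`` and ``to_json``
are hypothetical helpers, and the code only assumes an object that exposes the
attributes shown in the Usage section, with values that are strings or None.

```python
import json

# Attribute names taken from the Usage section of this README.
FIELDS = ("url", "title", "image_url", "language", "clean_html", "description")


def collect_fields(article, fields=FIELDS):
    """Copy the listed attributes of ``article`` into a dict.

    ``article`` is assumed to be a Wanish instance after perform_url() has
    run, but any object exposing these attributes works. Attributes that are
    missing are stored as None instead of raising AttributeError.
    """
    return {name: getattr(article, name, None) for name in fields}


def to_json(article):
    """Serialize the collected fields as a JSON string.

    Assumes every collected value is JSON-serializable (str or None).
    """
    return json.dumps(collect_fields(article), ensure_ascii=False, indent=2)
```

With the package installed, the intended call would be along the lines of
``to_json(Wanish(url=document_url))``, relying on the *url* option described
above to fetch and process the page during initialization.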