https://github.com/svven/summary

Summary is a complete solution to extract the title, image and description from any URL.

Host: GitHub
URL: https://github.com/svven/summary
Owner: svven
License: mit
Archived: true
Created: 2014-05-09T09:14:55.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2023-11-25T11:28:04.000Z (about 1 year ago)
Last Synced: 2024-09-20T01:06:31.119Z (5 months ago)
Language: Python
Homepage:
Size: 3.74 MB
Stars: 18
Watchers: 6
Forks: 3
Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: CHANGES.txt
- License: LICENSE.txt

README

# Summary

### RETIRED ###

This project is no longer maintained.

## Simple usage

Working with the `summary` package:

    >>> import summary
    >>> s = summary.Summary('https://github.com/svven/summary')
    >>> s.extract()
    >>> s.title
    u'svven/summary'
    >>> s.image
    https://avatars0.githubusercontent.com/u/7524085?s=400
    >>> s.description
    u'summary - Summary is a complete solution to extract the title, image and description from any URL.'

## Batch usage with HTML rendering

If you fork or clone the repo you can use summarize.py like this:

    >>> import summary
    >>> summary.GET_ALL_DATA = True # default is False
    >>> urls = [
    ...     'http://www.wired.com/',
    ...     'http://www.nytimes.com/',
    ...     'http://www.technologyreview.com/lists/technologies/2014/'
    ... ]
    >>> from summarize import summarize, render
    >>> summaries, result, speed = summarize(urls)
    -> http://www.wired.com/
    [BadImage] RatioImageException(398, 82): http://www.wired.com/wp-content/vendor/condenast/pangea/themes/wired/assets/images/wired_logo.gif
    -> http://www.nytimes.com/
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nyt.png
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nytcom.png
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-4panel-opinion.png
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375173/CRS-1572_nytpinion_EARS_L_184x90_CP2.gif
    [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375174/CRS-1572_nytpinion_EARS_R_184x90_ER1.gif
    [BadImage] RatioImageException(379, 64): http://i1.nyt.com/images/misc/nytlogo379x64.gif
    [BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/facebook.gif
    [BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/twitter.gif
    -> http://www.technologyreview.com/lists/technologies/2014/
    Success: 3.
    >>> html = render(template="news.html",
    ...     summaries=summaries, result=result, speed=speed)
    >>> with open('demo.html', 'w') as file:
    ...   file.write(html.encode('utf8'))
    >>>

## In a nutshell

Summary requests the page from the URL, then uses
[extraction](https://github.com/lethain/extraction) to parse the HTML.
It downloads the head tag first, performs specific extraction techniques,
and goes on to the body only if the extracted data is incomplete, unless
`summary.GET_ALL_DATA = True`, in which case the body is always processed.
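The head-first strategy can be sketched roughly as follows. This is only an illustration, not Summary's actual code: `read_until_head_end` is a hypothetical helper, and it assumes the chunks come from a streamed response, for example `requests.get(url, stream=True).iter_content(chunk_size=1024)`.

```python
def read_until_head_end(chunks, marker=b"</head>"):
    """Collect byte chunks until the closing head tag has been seen.

    Returns (html_so_far, found_marker). `chunks` is any iterable of bytes.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Search the accumulated buffer (case-insensitively) because the
        # marker may straddle two chunks.
        position = buffer.lower().find(marker)
        if position != -1:
            # Stop here: the remaining chunks are never requested.
            return buffer[: position + len(marker)], True
    return buffer, False


# Simulated response body, split into small chunks.
page = [
    b"<html><head><title>Hi</ti",
    b"tle></he",
    b"ad><body>lots of content",
    b"</body></html>",
]
head, complete = read_until_head_end(page)
print(head, complete)
```

If the data extracted from `head` turns out to be incomplete, the caller would keep reading the rest of the response and parse the body too.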

The resulting lists of titles, images, and descriptions are filtered on
the fly to rule out unwanted items like ads, tiny images (tracking
images or sharing buttons), and plain white images. See the whole list
of filters below.

Many thanks to Will Larson ([@lethain](https://github.com/lethain)) for
adapting his [extraction](https://github.com/lethain/extraction) library
to version 0.2 to accommodate summary.

## Rendering

The purpose of the HTML rendering mechanism is just to visualize
extracted data. The included Jinja2 template (news.html) is built on top
of bootstrap and displays the summaries in a nice responsive grid
layout.

You can completely disregard the rendering mechanism and just import
the `summary` module for data extraction and filtering. You probably have
your own means to render the data, so you only need the summary folder.

![image](img/sample.png)

And this one was produced much faster (see footer) with
`summary.GET_ALL_DATA = False`. It contains only the first valid item of
each kind: title, image, and description. This is the default
behaviour.
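The difference between the two modes can be pictured with a small sketch. `pick` is a hypothetical helper written for this illustration, not a function from `summary`; it only shows why stopping at the first valid item saves work.

```python
def pick(candidates, is_valid, get_all=False):
    """Return all valid candidates, or only the first one.

    The generator is lazy, so with get_all=False the validity check
    stops running as soon as one candidate passes.
    """
    valid = (c for c in candidates if is_valid(c))
    if get_all:
        return list(valid)
    first = next(valid, None)
    return [] if first is None else [first]


titles = ["", "svven/summary", "Another title"]
print(pick(titles, bool))                # first valid item only
print(pick(titles, bool, get_all=True))  # every valid item
```

When `is_valid` is expensive, as with a check that has to download an image, the lazy variant avoids most of that cost.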

## Installation

Pip it for simple usage:

    $ pip install summary-extraction

Or clone the repo if you need rendering:

    $ virtualenv env
    $ source env/bin/activate
    $ git clone https://github.com/svven/summary.git
    $ pip install -r summary/requirements.txt

    $ cd summary
    $ python # see the usage instructions above

## Requirements

Base required packages are `extraction` and `requests`, but it doesn't
do much without `adblockparser` and `Pillow`:

    Jinja2==2.7.2 # only for rendering
    Pillow==2.4.0
    adblockparser==0.2
    extraction==0.2
    lxml==3.3.5
    re2==0.2.20 # good for adblockparser
    requests==2.2.1
    w3lib==1.6

## Filters

Filters are *callable* classes that perform specific data checks.

For the moment there are only image filters. The image URL is passed as
the input parameter to the first filter. The check is performed and the
URL is returned if it is valid, so it can be passed to the second filter,
and so on. When a check fails, the filter returns `None`.

This pattern makes it possible to write the filtering routine like this:

    def _filter_image(self, url):
      "The param is the image URL, which is returned if it passes *all* the filters."
      return reduce(lambda f, g: f and g(f),
        [
          filters.AdblockURLFilter()(url),
          filters.NoImageFilter(),
          filters.SizeImageFilter(),
          filters.MonoImageFilter(),
          filters.FormatImageFilter(),
        ])

    images = filter(None, map(self._filter_image, image_urls))
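The routine above is written for Python 2. To try the same chaining idea in isolation on Python 3, where `reduce` lives in `functools`, a self-contained sketch could look like this. The two filters `HttpsOnlyFilter` and `NoGifFilter` are made up for the example and are not part of `summary.filters`; each returns its input when the check passes and `None` otherwise.

```python
from functools import reduce


class HttpsOnlyFilter:
    """Hypothetical filter: passes only https URLs."""

    def __call__(self, url):
        return url if url.startswith("https://") else None


class NoGifFilter:
    """Hypothetical filter: rejects URLs ending in .gif."""

    def __call__(self, url):
        return None if url.lower().endswith(".gif") else url


def filter_image(url):
    # The first filter is applied eagerly. Each later filter g receives the
    # previous result f only when f is truthy, so a None short-circuits
    # the rest of the chain.
    return reduce(lambda f, g: f and g(f),
                  [HttpsOnlyFilter()(url), NoGifFilter()])


urls = [
    "https://example.com/a.png",
    "http://example.com/b.png",
    "https://example.com/c.gif",
]
images = [u for u in map(filter_image, urls) if u]
print(images)  # prints ['https://example.com/a.png']
```

Because every filter returns either its input or `None`, the order of the list decides which checks run first, which is why cheap checks belong ahead of expensive ones.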

  - **AdblockURLFilter**

    > Uses [adblockparser](https://github.com/scrapinghub/adblockparser)
    > and returns `None` if it `should_block` the URL.
    >
    > Hats off to Mikhail Korobov ([@kmike](https://github.com/kmike))
    > for the awesome work. It gives a lot of value to this mashup repo.

  - **NoImageFilter**

    > Retrieves the actual image file, and returns `None` if it fails.
    >
    > Otherwise it returns an instance of the `filters.Image` class
    > containing the URL, together with the size and format of the
    > actual image. Basically it hydrates this instance, which is passed
    > to the following filters. The `Image.__repr__` override returns just
    > the URL so we can write the beautiful filtering routine you can
    > see above.
    >
    > Worth mentioning again that it only gets the first few chunks of
    > the image file, until the PIL parser gets the size and format of
    > the image.

  - **SizeImageFilter**

    > Checks that the `filters.Image` instance has a proper size.
    >
    > This can raise the following exceptions based on defined limits:
    > `TinyImageException`, `HugeImageException`, or
    > `RatioImageException`. If any of these happens it returns `None`.

  - **MonoImageFilter**

    > Checks whether the image is plain white and returns `None` if it is.
    >
    > This filter retrieves the whole image file, so it runs an extra
    > regex check first. E.g. it rules out these URLs:
    >
    >   - 
    >   - 

  - **FormatImageFilter**

    > Rules out animated gif images for the moment. This can be extended
    > to exclude other image formats based on file contents.
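As a rough illustration of the size check described for **SizeImageFilter**, the sketch below raises one of three exceptions and turns any of them into `None`. The exception names come from the text above, but the base class, the limits (`MIN_SIDE`, `MAX_SIDE`, `MAX_RATIO`) and the function names are assumptions chosen for the example, not the values or code used by `summary`.

```python
class ImageSizeError(Exception):
    """Hypothetical common base class; carries the offending dimensions."""

    def __init__(self, width, height):
        super().__init__(width, height)
        self.width, self.height = width, height


class TinyImageException(ImageSizeError):
    pass


class HugeImageException(ImageSizeError):
    pass


class RatioImageException(ImageSizeError):
    pass


# Assumed limits, for illustration only.
MIN_SIDE = 50      # pixels
MAX_SIDE = 5000    # pixels
MAX_RATIO = 3.0    # longer side divided by shorter side


def check_size(width, height):
    """Raise a specific exception when the dimensions are out of bounds."""
    if width < MIN_SIDE or height < MIN_SIDE:
        raise TinyImageException(width, height)
    if width > MAX_SIDE or height > MAX_SIDE:
        raise HugeImageException(width, height)
    # Both sides are at least MIN_SIDE here, so the division is safe.
    if max(width, height) / min(width, height) > MAX_RATIO:
        raise RatioImageException(width, height)


def size_filter(width, height):
    """Return the size if it is acceptable, otherwise log it and return None."""
    try:
        check_size(width, height)
    except ImageSizeError as error:
        print("[BadImage] %r" % error)
        return None
    return (width, height)


for size in [(16, 16), (398, 82), (400, 300)]:
    print(size, "->", size_filter(*size))
```

Raising distinct exceptions internally and converting them to `None` at the filter boundary keeps the reason for rejection available for logging while preserving the "value or `None`" contract the chain relies on.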

That's it for now. You're very welcome to contribute.

Comments and suggestions are welcome as well. Cheers,

[@ducu](http://twitter.com/ducu)