Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/vinta/haul
An Extensible Image Crawler
https://github.com/vinta/haul
Last synced: 2 days ago
JSON representation
An Extensible Image Crawler
- Host: GitHub
- URL: https://github.com/vinta/haul
- Owner: vinta
- License: mit
- Created: 2013-10-01T02:39:55.000Z (over 11 years ago)
- Default Branch: master
- Last Pushed: 2017-01-07T17:48:15.000Z (about 8 years ago)
- Last Synced: 2025-01-02T16:11:40.967Z (10 days ago)
- Language: Python
- Homepage:
- Size: 409 KB
- Stars: 158
- Watchers: 11
- Forks: 38
- Open Issues: 10
-
Metadata Files:
- Readme: README.rst
- Changelog: HISTORY.rst
- License: LICENSE
Awesome Lists containing this project
- awesome-python - haul - An Extensible Image Crawler (Awesome Python / Web Content Extracting)
README
Haul
====.. image:: https://travis-ci.org/vinta/Haul.png?branch=master
:alt: Build Badge
:target: https://travis-ci.org/vinta/Haul.. image:: https://coveralls.io/repos/vinta/Haul/badge.png?branch=master
:alt: Coverage Badge
:target: https://coveralls.io/r/vinta/Haul.. image:: https://badge.fury.io/py/haul.png
:alt: Version Badge
:target: http://badge.fury.io/py/haul.. image:: https://d2weczhvl823v0.cloudfront.net/vinta/haul/trend.png
:alt: Bitdeli Badge
:target: https://bitdeli.com/freeFind thumbnails and original images from URL or HTML file.
Demo
====`Hauler on Heroku `_
Installation
============on Ubuntu
.. code-block:: bash
$ sudo apt-get install build-essential python-dev libxml2-dev libxslt1-dev
$ pip install haulon Mac OS X
.. code-block:: bash
$ pip install haul
Fail to install haul? `It is probably caused by lxml `_.
Usage
=====Find images from ``img src``, ``a href`` and even ``background-image``:
.. code-block:: python
import haul
url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url)print(result.image_urls)
"""
output:
[
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
]
"""Find original (or bigger size) images with ``extend=True``:
.. code-block:: python
import haul
url = 'http://gibuloto.tumblr.com/post/62525699435/fuck-yeah'
result = haul.find_images(url, extend=True)print(result.image_urls)
"""
output:
[
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_500.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_16.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_16.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_16.png',
# bigger size, extended from above urls
'http://25.media.tumblr.com/3f5f10d7216f1dd5eacb5eb3e302286a/tumblr_mtpcwdzKBT1qh9n5lo1_1280.png',
...
'http://24.media.tumblr.com/avatar_a3a119b674e2_128.png',
'http://25.media.tumblr.com/avatar_9b04f54875e1_128.png',
'http://31.media.tumblr.com/avatar_0acf8f9b4380_128.png',
]
"""Advanced Usage
==============Custom finder / extender pipeline:
.. code-block:: python
from haul import Haul
from haul.compat import strdef img_data_src_finder(pipeline_index,
soup,
finder_image_urls=[],
*args, **kwargs):
"""
Find image URL in 's data-src attribute
"""now_finder_image_urls = []
for img in soup.find_all('img'):
src = img.get('data-src', None)
if src:
src = str(src)
now_finder_image_urls.append(src)output = {}
output['finder_image_urls'] = finder_image_urls + now_finder_image_urlsreturn output
MY_FINDER_PIPELINE = (
'haul.finders.pipeline.html.img_src_finder',
'haul.finders.pipeline.css.background_image_finder',
img_data_src_finder,
)GOOGLE_SITES_EXTENDER_PIEPLINE = (
'haul.extenders.pipeline.google.blogspot_s1600_extender',
'haul.extenders.pipeline.google.ggpht_s1600_extender',
'haul.extenders.pipeline.google.googleusercontent_s1600_extender',
)url = 'http://fashion-fever.nl/dressing-up/'
h = Haul(parser='lxml',
finder_pipeline=MY_FINDER_PIPELINE,
extender_pipeline=GOOGLE_SITES_EXTENDER_PIEPLINE)
result = h.find_images(url, extend=True)Run Tests
=========.. code-block:: bash
$ python setup.py test