https://github.com/alephdata/memorious

Lightweight web scraping toolkit for documents and structured data.

crawling scraping scraping-framework

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/alephdata/memorious
Owner: alephdata
License: mit
Created: 2017-09-04T15:22:02.000Z (almost 8 years ago)
Default Branch: main
Last Pushed: 2024-01-10T16:13:27.000Z (over 1 year ago)
Last Synced: 2025-04-04T17:03:11.073Z (3 months ago)
Topics: crawling, scraping, scraping-framework
Language: Python
Homepage: https://docs.alephdata.org/developers/memorious
Size: 1.39 MB
Stars: 311
Watchers: 16
Forks: 62
Open Issues: 25
Metadata Files:
- Readme: README.rst
- License: LICENSE

Awesome Lists containing this project

starred-awesome - memorious - Distributed crawling framework for documents and structured data. (Python)
awesome-starred - alephdata/memorious - Lightweight web scraping toolkit for documents and structured data. (others)

README

=========
Memorious
=========

    The solitary and lucid spectator of a multiform, instantaneous and almost
    intolerably precise world.

    -- Funes the Memorious, Jorge Luis Borges

.. image:: https://github.com/alephdata/memorious/workflows/memorious/badge.svg

``memorious`` is a lightweight web scraping toolkit. It supports scrapers that
collect structured or unstructured data. Its goals are to:

* Make crawlers modular and simple tasks reusable
* Provide utility functions for common tasks such as data storage and HTTP session management
* Integrate crawlers with the Aleph and FollowTheMoney ecosystem
* Get out of your way as much as possible

Design
------

When writing a scraper, you often need to paginate through an index page,
then download an HTML page for each result, and finally parse that page and
insert or update a record in a database.

``memorious`` handles this by managing a set of ``crawlers``, each of which
can be composed of multiple ``stages``. Each ``stage`` is implemented using a
Python function, which can be re-used across different ``crawlers``.
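
The idea of chaining small, reusable stage functions can be sketched in plain
Python. This is only an illustration of the pattern described above, not how
``memorious`` is implemented: the ``run`` helper, the ``emit`` callback, the
stage names and the in-memory ``PAGES`` data are all hypothetical stand-ins.

```python
from collections import deque

# Hypothetical in-memory "website" so the sketch runs without network access.
PAGES = {
    "index?page=1": {"results": ["doc/1", "doc/2"], "next": "index?page=2"},
    "index?page=2": {"results": ["doc/3"], "next": None},
}


def paginate(data, emit):
    """Walk one index page: emit each result, then queue the next page."""
    page = PAGES[data["url"]]
    for url in page["results"]:
        emit("fetch", {"url": url})
    if page["next"]:
        emit("paginate", {"url": page["next"]})


def fetch(data, emit):
    """Pretend to download a document and pass its body on for storage."""
    emit("store", {"url": data["url"], "body": f"<html>{data['url']}</html>"})


def make_store(database):
    """Build a stage that inserts or updates a record keyed by URL."""
    def store(data, emit):
        database[data["url"]] = data["body"]
    return store


def run(stages, start, data):
    """Process queued (stage, payload) tasks in FIFO order until none remain."""
    queue = deque([(start, data)])

    def emit(stage, payload):
        queue.append((stage, payload))

    while queue:
        name, payload = queue.popleft()
        stages[name](payload, emit)


database = {}
stages = {"paginate": paginate, "fetch": fetch, "store": make_store(database)}
run(stages, "paginate", {"url": "index?page=1"})
print(sorted(database))
```

Because each stage only receives a payload and a way to hand work to another
stage, the same ``fetch`` or ``store`` function could be reused by a second
crawler with a different starting stage.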

The basic steps of writing a Memorious crawler:

1. Create a YAML crawler configuration file
2. Add different stages
3. Write code for stage operations (optional)
4. Test, rinse, repeat
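
For step 3, a custom stage operation might look roughly like the sketch
below. It assumes that an operation is a function receiving ``context`` and
``data``, reads its settings from ``context.params``, and passes results on
with ``context.emit``. This README does not show that interface, so treat
those names as assumptions and check the project documentation. The
``extract_title`` function, the ``prefix`` parameter and ``StubContext`` are
hypothetical and exist only so the example can run on its own.

```python
def extract_title(context, data):
    """Hypothetical stage: pull the <title> out of fetched HTML and pass it on."""
    html = data.get("html", "")
    start = html.find("<title>")
    end = html.find("</title>")
    if start == -1 or end == -1:
        # Nothing to extract: end this branch without emitting anything.
        return
    title = html[start + len("<title>"):end].strip()
    # "prefix" is an invented parameter, as it might be set in the YAML file.
    prefix = context.params.get("prefix", "")
    context.emit(data={**data, "title": prefix + title})


class StubContext:
    """Test double standing in for the context object passed to a stage."""

    def __init__(self, params=None):
        self.params = params or {}
        self.emitted = []

    def emit(self, rule="pass", data=None):
        # Record what the stage emitted instead of scheduling a next stage.
        self.emitted.append((rule, data))


ctx = StubContext(params={"prefix": "Doc: "})
extract_title(ctx, {"url": "https://example.org/1", "html": "<title> Report </title>"})
print(ctx.emitted[0][1]["title"])
```

Calling the function with a stub context like this is one way to cover the
"test, rinse, repeat" step without running a full crawl.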

Documentation
-------------

The documentation for Memorious is available at
`alephdata.github.io/memorious <https://alephdata.github.io/memorious>`_.
Feel free to edit the source files in the ``docs`` folder and send pull requests for improvements.

To build the documentation, run ``make html`` inside the ``docs`` folder.

You'll find the resulting HTML files in ``docs/_build/html``.
