https://github.com/alephdata/memorious
Lightweight web scraping toolkit for documents and structured data.
- Host: GitHub
- URL: https://github.com/alephdata/memorious
- Owner: alephdata
- License: MIT
- Created: 2017-09-04T15:22:02.000Z (about 8 years ago)
- Default Branch: main
- Last Pushed: 2024-01-10T16:13:27.000Z (almost 2 years ago)
- Last Synced: 2025-04-04T17:03:11.073Z (7 months ago)
- Topics: crawling, scraping, scraping-framework
- Language: Python
- Homepage: https://docs.alephdata.org/developers/memorious
- Size: 1.39 MB
- Stars: 311
- Watchers: 16
- Forks: 62
- Open Issues: 25
Metadata Files:
- Readme: README.rst
- License: LICENSE
Awesome Lists containing this project
- starred-awesome - memorious - Distributed crawling framework for documents and structured data. (Python)
- awesome-starred - alephdata/memorious - Lightweight web scraping toolkit for documents and structured data. (others)
README
=========
Memorious
=========

    The solitary and lucid spectator of a multiform, instantaneous and almost
    intolerably precise world.

    -- "Funes the Memorious", Jorge Luis Borges

.. image:: https://github.com/alephdata/memorious/workflows/memorious/badge.svg

``memorious`` is a lightweight web scraping toolkit. It supports scrapers that
collect structured or unstructured data. This includes the following use cases:

* Make crawlers modular and simple tasks re-usable
* Provide utility functions to do common tasks such as data storage and HTTP session management
* Integrate crawlers with the Aleph and FollowTheMoney ecosystem
* Get out of your way as much as possible

Design
------

When writing a scraper, you often need to paginate through an index
page, then download an HTML page for each result and finally parse that page
and insert or update a record in a database.

``memorious`` handles this by managing a set of ``crawlers``, each of which
can be composed of multiple ``stages``. Each ``stage`` is implemented using a
Python function, which can be re-used across different ``crawlers``.

The basic steps of writing a Memorious crawler:

1. Make a YAML crawler configuration file
2. Add different stages
3. Write code for stage operations (optional)
4. Test, rinse, repeat

Documentation
-------------

The documentation for Memorious is available at
alephdata.github.io/memorious.
Feel free to edit the source files in the ``docs`` folder and send pull requests for improvements.

To build the documentation, run ``make html`` inside the ``docs`` folder.
You'll find the resulting HTML files in ``/docs/_build/html``.
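
As a rough illustration of the design described earlier (paginate an index,
download each result, parse it, then insert or update a record), the sketch
below models stages as plain Python functions that hand data to one another
through a queue. It does not use the ``memorious`` API: ``FAKE_SITE``,
``run_crawler``, ``emit`` and the stage names are all hypothetical and exist
only for this example. A real crawler would wire its stages together in the
YAML configuration file instead.

```python
from collections import deque

# Hypothetical in-memory "website": index pages link to detail pages.
FAKE_SITE = {
    "/index?page=1": {"links": ["/doc/1", "/doc/2"], "next": "/index?page=2"},
    "/index?page=2": {"links": ["/doc/3"], "next": None},
    "/doc/1": {"title": "First"},
    "/doc/2": {"title": "Second"},
    "/doc/3": {"title": "Third"},
}


def paginate(data, emit):
    """Walk one index page; queue its results and the next index page."""
    page = FAKE_SITE[data["url"]]
    for link in page["links"]:
        emit("fetch", {"url": link})
    if page["next"]:
        emit("paginate", {"url": page["next"]})


def fetch(data, emit):
    """'Download' a detail page (here just a dictionary lookup)."""
    emit("parse", {"url": data["url"], "body": FAKE_SITE[data["url"]]})


def parse(data, emit):
    """Extract the fields of interest from the downloaded page."""
    emit("store", {"url": data["url"], "title": data["body"]["title"]})


def make_store(database):
    """Build a stage that inserts or updates a record keyed by URL."""
    def store(data, emit):
        database[data["url"]] = data["title"]
    return store


def run_crawler(stages, start_stage, start_data):
    """Run stages from a FIFO queue until no work is left."""
    queue = deque([(start_stage, start_data)])

    def emit(stage_name, data):
        queue.append((stage_name, data))

    while queue:
        stage_name, data = queue.popleft()
        stages[stage_name](data, emit)


database = {}
stages = {
    "paginate": paginate,
    "fetch": fetch,
    "parse": parse,
    "store": make_store(database),
}
run_crawler(stages, "paginate", {"url": "/index?page=1"})
print(database)
```

Because each stage only receives data and emits to a named successor, the
same ``fetch`` or ``store`` function could be reused by a second crawler with
a different ``paginate`` stage, which is the kind of re-use the design aims for.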