{"id":20652316,"url":"https://github.com/openpolis/ooetl","last_synced_at":"2025-09-07T12:36:13.613Z","repository":{"id":57448918,"uuid":"402455224","full_name":"openpolis/ooetl","owner":"openpolis","description":"Minimal opinionated object oriented ETL framework","archived":false,"fork":false,"pushed_at":"2022-11-21T15:30:16.000Z","size":176,"stargazers_count":4,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-29T06:33:22.908Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/openpolis.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-09-02T14:40:29.000Z","updated_at":"2023-11-14T22:56:22.000Z","dependencies_parsed_at":"2023-01-23T11:46:04.426Z","dependency_job_id":null,"html_url":"https://github.com/openpolis/ooetl","commit_stats":null,"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openpolis%2Fooetl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openpolis%2Fooetl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openpolis%2Fooetl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/openpolis%2Fooetl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/openpolis","download_url":"https://codeload.github.com/openpolis/ooetl/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249381010,"ow
ners_count":21261228,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T17:33:51.839Z","updated_at":"2025-04-18T16:19:04.031Z","avatar_url":"https://github.com/openpolis.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![Latest Version](https://img.shields.io/pypi/v/ooetl.svg)](https://pypi.python.org/pypi/ooetl)\n[![Latest Version](https://img.shields.io/pypi/pyversions/ooetl.svg)](https://pypi.python.org/pypi/ooetl)\n[![License](https://img.shields.io/pypi/l/ooetl.svg)](https://pypi.python.org/pypi/ooetl)\n[![Downloads](https://pepy.tech/badge/ooetl/month)](https://pepy.tech/project/ooetl/month)\n\n[![Twitter Follow](https://img.shields.io/twitter/follow/openpolislab)](https://twitter.com/openpolislab)\n\n![Tests Badge](https://op-badges.s3.eu-west-1.amazonaws.com/ooetl/tests-badge.svg?2)\n![Coverage Badge](https://op-badges.s3.eu-west-1.amazonaws.com/ooetl/coverage-badge.svg?2)\n![Flake8](https://op-badges.s3.eu-west-1.amazonaws.com/ooetl/flake8-badge.svg?2)\n\n\n**ooetl** is a minimal opinionated object oriented ETL framework.\n\nThe class-based nature of the framework makes it possible to build complex dedicated classes,\nstarting from simple abstract ones.\n\n\n## Installation\n\nPython 3.7.1 and later are supported.\n\nThe package is hosted on PyPI and can be installed, for example, using pip:\n\n    pip install --upgrade \"ooetl[all]\"\n    pip install \"ooetl[elastic]==1.1.2\"\n\nor poetry:\n\n    poetry add ooetl -Eall\n    poetry add ooetl -Eelastic\n\n\n## Usage\n\nThis is a quick tutorial on how to use 
the `ooetl` module to fetch\ndata from a SQL source and store it in a CSV destination.\n\nThe ETL process is performed by invoking the `etl()` method on an `ooetl.ETL` instance.\n\nThe `etl()` method is a shortcut to the sequence `ooetl.ETL::extract().transform().load()`,\nwhich is possible because each method returns a reference to the `ooetl.ETL` instance.\n\nWhen the `ooetl.ETL` instance invokes the `ooetl.ETL::extract()` method, it invokes the corresponding\n`ooetl.Extractor::extract()` method of the *extractor*. The method extracts the data from the source\ninto the `ooetl.ETL::original_data` attribute of the `ooetl.ETL` instance.\n\nThe `ooetl.ETL::transform()` method is overridden in the instance and may be used to apply\ncustom data transformation, before the loading phase.\nThe data from `ooetl.ETL::original_data` are then transformed into `ooetl.ETL::processed_data`.\n\nThe `ooetl.ETL::load()` method invokes the `ooetl.Loader::load()` method, storing the data from\n`ooetl.ETL::processed_data` into the defined destination.\n\nThe package provides a series of simple Extractors and Loaders, derived from common abstract classes.\n\nExtractors:\n\n- CSVExctractor - extracts data from a remote CSV\n- ZIPCSVExctractor - extracts data from a remote zipped CSV\n- HTMLParserExtractor - extracts data from a remote HTML page (requires the html extra and needs to be extended)\n- SparqlExtractor - extracts data from a remote SPARQL endpoint (requires the sparql extra)\n- SqlExtractor - extracts data from an RDBMS (requires the mysql or postgresql extra)\n- XSLExtractor - extracts data from a remote Excel file\n- ZIPXLSExctractor - extracts data from an Excel file within a remote zipped archive\n\nLoaders:\n\n- CSVLoader - loads data into a CSV\n- JsonLoader - loads data into a JSON file\n- ESLoader - loads data into an ES instance (requires the elastic extra)\n- DjangoBulkLoader - adds data in bulk to a Django model (only works inside a Django project)\n- 
DjangoUpdateOrCreateLoader - adds data with an update or create logic into a Django model (slow, only works within a Django project)\n\nThe `ooetl.ETL` abstract class is defined in the `__init__.py` file of the `ooetl` package.\n\nETL classes implement a pipeline of extraction, transformation and load logic.\n\nAs a very basic example, here is how to extract data from a PostgreSQL query into a CSV file.\n\n```python\n    from ooetl import ETL\n    from ooetl.extractors import SqlExtractor\n    from ooetl.loaders import CSVLoader\n    \n    ETL(\n        extractor=SqlExtractor(\n            conn_url=\"postgresql://postgres:@localhost:5432/opdm\",\n            sql=\"select id, name, inhabitants from popolo_area where istat_classification='COM' \"\n                \"order by inhabitants desc\"\n        ),\n        loader=CSVLoader(\n            csv_path=\"./\",\n            label='opdm_areas'\n        )\n    )()\n```\n\nExtractors (and Loaders) may be easily extended within the projects using the `ooetl` package.\nAs an example, consider the following snippet, which extends the abstract `ooetl.HTMLParserExtractor` to parse\nthe Italian government's site and extract the list of officials as CSV.\n\nThis example requires the html extra to be installed.\n\n```python\n    import requests\n    from lxml import html\n    from lxml.cssselect import CSSSelector\n    \n    from ooetl import ETL\n    from ooetl.extractors import HTMLParserExtractor\n    from ooetl.loaders import CSVLoader\n    \n    class GovernoExtractor(HTMLParserExtractor):\n    \n        def parse(self, html_content):\n            list_tree = html.fromstring(html_content)\n            items = []\n            for e in CSSSelector('div.content div.box_text a')(list_tree):\n                item_name = e.text_content().strip()\n                item_url = e.get('href').strip()\n                item_page = requests.get(item_url)\n                item_tree = html.fromstring(item_page.content)\n                
item_par = CSSSelector('div.content div.field')(item_tree)[0]\n                item_charge = CSSSelector('blockquote p')(item_par)[0].text_content().strip()\n                item_descr = \" \".join([\n                  e.text_content() for e in CSSSelector('p')(item_par)[1:] if\\\n                     e.text_content() is not None\n                ])\n                items.append({\n                    'nome': item_name,\n                    'url': item_url,\n                    'incarico': item_charge,\n                    'descrizione': item_descr\n                })\n    \n                print(item_name)\n    \n            return items\n    \n    ETL(\n        extractor=GovernoExtractor(\"https://www.governo.it/it/ministri-e-sottosegretari\"),\n        loader=CSVLoader(\n            csv_path=\"./\",\n            label='governo'\n        )\n    )()\n```\n\nOther, more complex examples can be found in the `examples` directory.\n\n## Support\n\nThere is no guaranteed support available, but the authors will try to keep up with issues \nand merge proposed solutions into the code base.\n\n## Project Status\nThis project is currently being developed by the [Openpolis Foundation](https://www.openpolis.it/openpolis-foundation/)\nand is being used internally.\n\nCurrently, extras for elasticsearch and sparql have been developed.\n \nShould more be needed, you can either ask to increase the coverage, or try to contribute, following the instructions below.\n\n## Contributing\nIn order to contribute to this project:\n* verify that python 3.7.1+ is being used (or use [pyenv](https://github.com/pyenv/pyenv))\n* verify or install [poetry](https://python-poetry.org/), to handle packages and dependencies in a leaner way, \n  with respect to pip and requirements\n* clone the project `git clone git@github.com:openpolis/ooetl.git` \n* install the dependencies in the virtualenv, with `poetry install -Eall`,\n  this will also install the dev dependencies and all extras\n* develop and test \n* 
create a [pull request](https://docs.github.com/en/github/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests)\n* wait for the maintainers to review and eventually merge your pull request into the main repository\n\n### Testing\nTests are under the tests folder, and can be launched with \n\n    pytest\n\nRequests and responses from ATOKA's API are mocked, in order to avoid having to connect to \nthe remote service during tests (slow and needs an API key).\n\nCoverage is installed as a dev dependency and can be used to see how much of the package's code is covered by tests:\n\n    coverage run -m pytest\n\n    # sends coverage report to terminal\n    coverage report -m \n\n    # generate and open a web page with interactive coverage report\n    coverage html\n    open htmlcov/index.html \n\nSyntax can be checked with `flake8`.\n\nCoverage and flake8 configurations are in their sections within `setup.cfg`.\n\n## Authors\nGuglielmo Celata - guglielmo@openpolis.it\n\n## Licensing\nThis package is released under an MIT License, see details in the LICENSE.txt file.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenpolis%2Fooetl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fopenpolis%2Fooetl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fopenpolis%2Fooetl/lists"}