{"id":19607545,"url":"https://github.com/andreax79/airflow-provider-xlsx","last_synced_at":"2025-08-22T08:07:28.761Z","repository":{"id":57409303,"uuid":"391968520","full_name":"andreax79/airflow-provider-xlsx","owner":"andreax79","description":"Airflow operators for converting XLSX files from/to Parquet/CSV/JSON","archived":false,"fork":false,"pushed_at":"2022-03-25T16:07:37.000Z","size":428,"stargazers_count":5,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-13T14:44:53.277Z","etag":null,"topics":["airflow","apache-airflow","excel","parquet"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andreax79.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-08-02T13:55:57.000Z","updated_at":"2024-06-05T10:58:23.000Z","dependencies_parsed_at":"2022-08-24T18:51:20.446Z","dependency_job_id":null,"html_url":"https://github.com/andreax79/airflow-provider-xlsx","commit_stats":null,"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreax79%2Fairflow-provider-xlsx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreax79%2Fairflow-provider-xlsx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreax79%2Fairflow-provider-xlsx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andreax79%2Fairflow-provider-xlsx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andreax79","download_url":"htt
ps://codeload.github.com/andreax79/airflow-provider-xlsx/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251204548,"owners_count":21552239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","apache-airflow","excel","parquet"],"created_at":"2024-11-11T10:11:18.447Z","updated_at":"2025-04-27T20:32:25.726Z","avatar_url":"https://github.com/andreax79.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Airflow Provider XLSX\n\n[Apache Airflow](https://github.com/apache/airflow) operators for converting XLSX files from/to Parquet, CSV and JSON.\n\n[![Build Status](https://github.com/andreax79/airflow-provider-xlsx/workflows/Tests/badge.svg)](https://github.com/andreax79/airflow-provider-xlsx/actions)\n[![PyPI version](https://badge.fury.io/py/airflow-provider-xlsx.svg)](https://badge.fury.io/py/airflow-provider-xlsx)\n[![PyPI](https://img.shields.io/pypi/pyversions/airflow-provider-xlsx.svg)](https://pypi.org/project/airflow-provider-xlsx)\n[![Downloads](https://pepy.tech/badge/airflow-provider-xlsx/month)](https://pepy.tech/project/airflow-provider-xlsx)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n### System Requirements\n\n* Airflow Versions\n    * 2.0 or newer\n\n### Installation\n\n```console\n$ pip install airflow-provider-xlsx\n```\n\n### Operators\n\n#### FromXLSXOperator\n\nRead an XLSX or XLS file and convert it into a Parquet, CSV, JSON, or JSON Lines (one line per record) file.\n\n[API 
Documentation](https://airflow-provider-xlsx.readthedocs.io/en/latest/#module-xlsx_provider.operators.from_xlsx_operator)\n\n##### Example\n\nXLSX Source\n\n![image](https://user-images.githubusercontent.com/1288154/130972144-e33f01af-2f9a-4e34-803a-907324a7adbf.png)\n\nAirflow Task\n\n```python\nfrom xlsx_provider.operators.from_xlsx_operator import FromXLSXOperator\n\nxlsx_to_jsonl = FromXLSXOperator(\n   task_id='xlsx_to_jsonl',\n   source='{{ var.value.tmp_path }}/test.xlsx',\n   target='{{ var.value.tmp_path }}/test.jsonl',\n   file_format='jsonl',\n   dag=dag\n)\n```\n\nJSON Lines Output\n\n```json\n{\"month\": \"Jan\", \"high\": -12.2, \"mean\": -16.2, \"low\": -20.1, \"precipitation\": 19}\n{\"month\": \"Feb\", \"high\": -10.3, \"mean\": -14.7, \"low\": -19.1, \"precipitation\": 14}\n{\"month\": \"Mar\", \"high\": -2.6, \"mean\": -7.2, \"low\": -11.8, \"precipitation\": 15}\n{\"month\": \"Apr\", \"high\": 8.1, \"mean\": 3.2, \"low\": -1.7, \"precipitation\": 24}\n{\"month\": \"May\", \"high\": 17.5, \"mean\": 11.6, \"low\": 5.6, \"precipitation\": 36}\n{\"month\": \"Jun\", \"high\": 24, \"mean\": 18.2, \"low\": 12.3, \"precipitation\": 58}\n{\"month\": \"Jul\", \"high\": 25.7, \"mean\": 20.2, \"low\": 14.7, \"precipitation\": 72}\n{\"month\": \"Aug\", \"high\": 22.2, \"mean\": 17, \"low\": 11.7, \"precipitation\": 66}\n{\"month\": \"Sep\", \"high\": 16.6, \"mean\": 11.5, \"low\": 6.4, \"precipitation\": 44}\n{\"month\": \"Oct\", \"high\": 6.8, \"mean\": 3.4, \"low\": 0, \"precipitation\": 38}\n```\n\n#### FromXLSXQueryOperator\n\nExecute an SQL query on an XLSX/XLS file and export the result into a Parquet, CSV, JSON, or JSON Lines file.\n\nThis operator loads an XLSX or XLS file into an in-memory SQLite database, executes a query on the database, and stores the result into a Parquet, CSV, JSON, or JSON Lines (one line per record) file. 
The output column names and types are determined by the SQL query output.\n\n[API Documentation](https://airflow-provider-xlsx.readthedocs.io/en/latest/#xlsx-provider-operators-operators-from-xlsx-query-operator)\n\n##### Example\n\nXLSX Source\n\n![image](https://user-images.githubusercontent.com/1288154/130963470-f7f05ca0-a952-47e1-86ec-c6cd322746f6.png)\n\nSQL Query\n\n```sql\n select\n     g as high_tech_sector,\n     h as eur_bilion,\n     i as share\n from\n     high_tech\n where\n     _index \u003e 1\n     and high_tech_sector \u003c\u003e ''\n     and lower(high_tech_sector) \u003c\u003e 'total'\n```\n\nAirflow Task\n\n```python\nfrom xlsx_provider.operators.from_xlsx_query_operator import FromXLSXQueryOperator\n\nxlsx_to_csv = FromXLSXQueryOperator(\n   task_id='xlsx_to_csv',\n   source='{{ var.value.tmp_path }}/high_tech.xlsx',\n   target='{{ var.value.tmp_path }}/high_tech.csv',\n   file_format='csv',\n   csv_delimiter=',',\n   table_name='high_tech',\n   worksheet='Figure 3',\n   query='''\n       select\n           g as high_tech_sector,\n           h as eur_bilion,\n           i as share\n       from\n           high_tech\n       where\n           _index \u003e 1\n           and high_tech_sector \u003c\u003e ''\n           and lower(high_tech_sector) \u003c\u003e 'total'\n   ''',\n   dag=dag\n)\n```\n\nOutput\n\n```\nhigh_tech_sector,value,share\nPharmacy,78280,0.231952169555313\nElectronics-telecommunications,75243,0.222954583130376\nScientific instruments,64010,0.189670433253542\nAerospace,44472,0.131776952366115\nComputers office machines,21772,0.0645136852766778\nNon-electrical machinery,20813,0.0616714981835167\nChemistry,19776,0.058598734453222\nElectrical machinery,9730,0.028831912195612\nArmament,3384,0.0100300315856265\n```\n\n#### ToXLSXOperator\n\nRead a Parquet, CSV, JSON, or JSON Lines (one line per record) file and convert it into XLSX.\n\n[API 
Documentation](https://airflow-provider-xlsx.readthedocs.io/en/latest/#xlsx-provider-operators-operators-to-xlsx-operator)\n\n##### Example\n\n```python\nfrom xlsx_provider.operators.to_xlsx_operator import ToXLSXOperator\n\nparquet_to_xlsx = ToXLSXOperator(\n   task_id='parquet_to_xlsx',\n   source='{{ var.value.tmp_path }}/test.parquet',\n   target='{{ var.value.tmp_path }}/test.xlsx',\n   dag=dag\n)\n```\n\n### Links\n\n* Apache Airflow - https://github.com/apache/airflow\n* Project home page (GitHub) - https://github.com/andreax79/airflow-provider-xlsx\n* Documentation (Read the Docs) - https://airflow-provider-xlsx.readthedocs.io/en/latest\n* openpyxl, library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files - https://foss.heptapod.net/openpyxl/openpyxl\n* xlrd, library for reading data and formatting information from Excel files in the historical .xls format - https://github.com/python-excel/xlrd\n* Python library for Apache Arrow - https://github.com/apache/arrow/tree/master/python\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreax79%2Fairflow-provider-xlsx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandreax79%2Fairflow-provider-xlsx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreax79%2Fairflow-provider-xlsx/lists"}