{"id":19084390,"url":"https://github.com/pawsanie/luigi_etl","last_synced_at":"2025-11-12T12:30:34.385Z","repository":{"id":158648152,"uuid":"469152026","full_name":"Pawsanie/Luigi_ETL","owner":"Pawsanie","description":"Universal Luigi ETL pipeline. Validates data received from external sources. Extracts, transforms them and lands.","archived":false,"fork":false,"pushed_at":"2023-03-02T09:36:33.000Z","size":98,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-22T06:41:22.835Z","etag":null,"topics":["etl","etl-automation","etl-pipeline","luigi","luigi-pipeline","luigi-task","luigi-tasks","python","python-3","python3"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"0bsd","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Pawsanie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-03-12T17:25:49.000Z","updated_at":"2023-02-23T05:51:30.000Z","dependencies_parsed_at":null,"dependency_job_id":"3427a67d-0f4d-45c5-a8c1-cf18b83ddefd","html_url":"https://github.com/Pawsanie/Luigi_ETL","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Pawsanie/Luigi_ETL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FLuigi_ETL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FLuigi_ETL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FLuigi_ETL/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FLuigi_ETL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Pawsanie","download_url":"https://codeload.github.com/Pawsanie/Luigi_ETL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Pawsanie%2FLuigi_ETL/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":284037614,"owners_count":26936681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-12T02:00:06.336Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["etl","etl-automation","etl-pipeline","luigi","luigi-pipeline","luigi-task","luigi-tasks","python","python-3","python3"],"created_at":"2024-11-09T02:51:10.628Z","updated_at":"2025-11-12T12:30:34.368Z","avatar_url":"https://github.com/Pawsanie.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Luigi ETL pipeline\n\n## Disclaimer:\n:warning:**Using** some or all of the elements of this code, **You** assume **responsibility for any consequences!**\u003cbr\u003e\n\n:warning:The **licenses** for the technologies on which the code **depends** are subject to **change by their authors**.\u003cbr\u003e\u003cbr\u003e\n\n## Description of the pipeline:\nThe pipeline collects data from sources, in the form of (csv tables / json dictionaries) data, so that in the end:\n* 
Collects data from external sources into Luigi targets.\n* Cleans the data.\n* Lands the data in the DWH.\n\n## Required:\nThe application is written in Python and depends on it.\u003cbr\u003e\n**Python** version 3.6 [Python Software Foundation License / with Zero-Clause BSD license from Python 3.8.6 onward]:\n* :octocat:[Python GitHub](https://github.com/python)\n* :bookmark_tabs:[Python internet page](https://www.python.org/)\n\n## Required Packages:\n**Luigi** [Apache License 2.0]:\n* :octocat:[Luigi GitHub](https://github.com/spotify/luigi)\n\nUsed to run the pipeline of Luigi tasks.\n\n**Pandas** [BSD-3-Clause license]:\n* :octocat:[Pandas GitHub](https://github.com/pandas-dev/pandas/)\n* :bookmark_tabs:[Pandas internet page](https://pandas.pydata.org/)\n\nUsed to work with tabular data.\n\n**NumPy** [BSD-3-Clause license]:\n* :octocat:[NumPy GitHub](https://github.com/numpy/numpy)\n* :bookmark_tabs:[NumPy internet page](https://numpy.org/)\n\nUsed to convert table cells to the desired values.\n\n**PyArrow** [Apache-2.0 license]:\n* :octocat:[PyArrow GitHub](https://github.com/apache/arrow)\n* :bookmark_tabs:[PyArrow internet page](https://arrow.apache.org/)\n\nUsed to save data in parquet format.\n\n## Installing the Required Packages:\n```bash\npip install luigi\npip install pandas\npip install numpy\npip install pyarrow\n```\n\n## Description of tasks:\n### ExternalData:\nWrappers for data from external sources.\u003cbr/\u003e\n* Reads datasets in the directory received from the parameter '**external_data_path**'.\u003cbr/\u003e\n:warning:All paths to partitions inside the root directory of the passed ExternalData **must** be in the format '**Dataset_Name/YYYY/MM/DD/**'.\u003cbr/\u003e\n* For all partitions where a '**\\_Validate**' flag file was found, creates a new '**\\_Validate_Success**' flag as Luigi.LocalTarget.\n\n### ExtractTask:\n* Reads data from ExternalData by dates.\n* Merges them into one array.\n* If the '**drop_list**' parameter is not 
'**None**' (the default), the task drops every column named in this Luigi.ListParameter.\u003cbr/\u003e\n**Example of 'drop_list' Luigi.ListParameter:**\n```json\n[\"drop_name\", \"Delete\"]\n```\n* Takes the '**extract_file_mask**' Luigi.Parameter as the output file format and '**external_data_file_mask**' as the input file format.\n\n### TransformTask:\n* Removes all rows matching the transform_parsing_rules_drop parameter.\u003cbr/\u003e\n**Example of 'transform_parsing_rules_drop' Luigi.DictParameter:**\n```json\n{\"column_to_drop\": [\"False\", \"NaN\", 0]}\n```\n* Rows are discarded if at least one value matches for ALL keys of transform_parsing_rules_filter.\u003cbr/\u003e\n**Example of 'transform_parsing_rules_filter' Luigi.DictParameter:**\n```json\n{\"column_to_filter\": [\"drop_if_not_in_vip\", \"drop_too\"], \"filter_too\": [\"0\"]}\n```\n* This applies only if the row contains none of the values listed under the transform_parsing_rules_vip keys.\u003cbr/\u003e\n**Example of 'transform_parsing_rules_vip' Luigi.DictParameter:**\n```json\n{\"data_to_save_like_vip\": [\"vip_value_1\", \"vip_value_2\"], \"save_too\": [\"vip_value_3\"]}\n```\n* Has a '**date_parameter**' Luigi.DateParameter (today as default).\n* Takes the '**transform_file_mask**' Luigi.Parameter as the output file format and '**extract_file_mask**' as the input file format.\n\n### LoadTask:\n* Lands the resulting data in the directory received from the Luigi.Parameter '**load_data_path**'.\n* Has a '**date_parameter**' Luigi.DateParameter (today as default).\n* Takes the '**load_file_mask**' Luigi.Parameter as the output file format and '**transform_file_mask**' as the input file format.\n\n## Launch:\n### Launch with 'luigi_config' and Luigi.build:\nFor a simple launch that passes Luigi **parameters** through a **configuration** file: \n1) Fill in the '**luigi_config.cfg**' file with the correct data.\n2) Then run the script '**luigi_pipeline.py**'.\n**Files location:**\u003cbr\u003e\n**./**:open_file_folder:Luigi_ETL\u003cbr\u003e\n   └── :file_folder:Pipeline\u003cbr\u003e\n            ├── 
:page_facing_up:luigi_pipeline.py\u003cbr\u003e\n            └── :file_folder:My_Beautiful_Tasks\u003cbr\u003e\n                     └── :file_folder:Configuration\u003cbr\u003e\n                              └── :page_facing_up:luigi_config.cfg\u003cbr\u003e\n\nPlease note that rows with optional parameters can be removed from the 'luigi_config' if you do not need them.\n\n**Example of run script:**\n```bash\npython3 -B -m luigi_pipeline\n```\n### Launch with terminal or command line:\nFirst, replace '**build**' with '**run**' in the '**Pipeline_launcher.py**' script \nand remove all the parameters passed to it.\u003cbr\u003e\nThen remove all parameters from the Luigi task instances created in the '**luigi_pipeline.py**' script.\u003cbr\u003e\n\nAfter that, you can start Luigi by passing parameters through the terminal, or by using the '**start_luigi_etl_pipeline.sh**' script.\n\n**Files location:**\u003cbr\u003e\n**./**:open_file_folder:Luigi_ETL\u003cbr\u003e\n   └── :file_folder:Pipeline\u003cbr\u003e\n            ├── :page_facing_up:luigi_pipeline.py\u003cbr\u003e\n            ├── :page_facing_up:start_luigi_etl_pipeline.sh\u003cbr\u003e\n            └── :file_folder:My_Beautiful_Tasks\u003cbr\u003e\n                     └── :page_facing_up:Pipeline_launcher.py\u003cbr\u003e\n\nIf your OS has a bash shell, the ETL pipeline can be started with the bash script:\n```bash\n./start_luigi_etl_pipeline.sh\n```\nThe script contains an example of all the arguments needed for a run.\u003cbr/\u003e\nTo launch the pipeline through this script, do not forget to make it executable.\n```bash\nchmod +x ./start_luigi_etl_pipeline.sh\n```\nThe pipeline can also be run directly with Python.\u003cbr/\u003e\n**Example of run script:**\n```bash\npython3 -B -m luigi_pipeline Load.LoadTask --local-scheduler \\\n--ExternalData.ExternalData-external-data-path \"~/luigi_tasks/ExternalData\" \\\n\\\n--Extract.ExtractTask-extract-data-path 
\"~/luigi_tasks/ExtractTask\" \\\n--Extract.ExtractTask-extract-file-mask \"csv\" \\\n--Extract.ExtractTask-external-data-file-mask \"csv\" \\\n--Extract.ExtractTask-drop-list \"['column_drop_name', 'column_to_delete']\" \\\n\\\n--Transform.TransformTask-file-to-transform-path \"~/luigi_tasks/TransformTask\" \\\n--Transform.TransformTask-transform-file-mask \"json\" \\\n--Transform.TransformTask-transform-parsing-rules-drop \"{'column_to_drop': [False, 'NaN', 0]}\" \\\n--Transform.TransformTask-transform-parsing-rules-filter \"{'column_to_filter': ['drop_if_not_in_vip', 'drop_too'], 'filter_too': ['0']}\" \\\n--Transform.TransformTask-transform-parsing-rules-vip \"{'data_to_save_like_vip': ['vip_value_1', 'vip_value_2'], 'save_too': ['vip_value_3']}\" \\\n--Transform.TransformTask-date-path-part $(date +%F --date \"2022-12-01\") \\\n\\\n--Load.LoadTask-load-data-path \"~/luigi_tasks/LoadTask\" \\\n--Load.LoadTask-load-file-mask \"parquet\"\n```\nThe example above shows the launch of all tasks.\n\n## Tests:\nTests are embedded inside the pipeline.\n\n***\n\n**Thank you** for your interest in my work.\u003cbr\u003e\u003cbr\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpawsanie%2Fluigi_etl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpawsanie%2Fluigi_etl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpawsanie%2Fluigi_etl/lists"}