https://github.com/rafaeljurkfitz/etl-excel
A case study in developing a simple ETL project that merges multiple Excel files into a single one.
- Host: GitHub
- URL: https://github.com/rafaeljurkfitz/etl-excel
- Owner: rafaeljurkfitz
- Created: 2024-04-18T19:24:24.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-11-28T20:50:00.000Z (about 1 month ago)
- Last Synced: 2024-11-28T21:28:44.694Z (about 1 month ago)
- Topics: ci-cd, data-engineering, data-science, etl-pipeline, excel, lvgalvao, mkdocs, pep8, poetry, precommit-hooks, pyenv-virtualenv, pytest, taskipy
- Language: Python
- Homepage: https://rafaeljurkfitz.github.io/etl-excel/
- Size: 887 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# ETL Excel
![Flow](docs/static/fluxo.png)
## About the Project
This repository serves as a portfolio piece. The goal is to demonstrate the benefits of software development best practices in the data field and to provide a standardized structure for starting data engineering, data science, and data analysis projects.
**The main focus is on best practices, automation, testing, and documentation.**
### Requirements
There are two things to set up before starting any Python project:
- Python version management.
- Package and virtual environment management.
#### Pyenv
```Pyenv``` allows you to manage **multiple Python versions on the same system**, ensuring you can use the correct version for each project.
#### Poetry
```Poetry``` is a tool for managing **dependencies**, **virtual environments**, and Python project packaging.
**Advantages of Poetry:**
- Centralized management in the ```pyproject.toml``` file.
- Automatic creation of isolated virtual environments.
- Simplified installation flow.

**Poetry automatically uses the Python version configured locally in the project via Pyenv to ensure seamless integration between the tools.**
### Dependencies
#### Project Dependencies
These are the essential dependencies required for the project to run. They include libraries for processing and handling Excel files.
- ```pandas```: Library for data analysis and manipulation.
- ```openpyxl```: Library for reading and writing Excel files.
#### Development Dependencies
These dependencies are needed during project development, such as tools for code formatting, linting, and task automation.
- ```taskipy```: For automating tasks like running scripts and tests.
- ```pre-commit```: For configuring pre-commit hooks to ensure the code adheres to project conventions.
- ```pip-audit```: For auditing dependencies and checking for vulnerabilities.
- ```pydocstyle```: To check code documentation style.
- ```blue```: Code formatter similar to Black.
- ```isort```: For consistently organizing imports.
- ```loguru```: For logging.
#### Testing Dependencies
These dependencies are required for running the project tests, such as the testing framework and its plugins.
- ```pytest```: Framework for writing and running automated tests.
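As a minimal sketch of what a `pytest` test for this kind of project might look like, the example below checks a small function that stacks tables. The function `combine_frames` and the column names are hypothetical and are not taken from the repository; the sketch assumes `pandas` is installed.

```python
import pandas as pd


def combine_frames(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Hypothetical function under test: stack several tables into one."""
    return pd.concat(frames, ignore_index=True)


def test_combine_frames_keeps_all_rows():
    # Two small in-memory tables stand in for two Excel sheets.
    first = pd.DataFrame({"product": ["a", "b"], "amount": [1, 2]})
    second = pd.DataFrame({"product": ["c"], "amount": [3]})

    result = combine_frames([first, second])

    # Every input row is present and the columns are unchanged.
    assert len(result) == 3
    assert list(result.columns) == ["product", "amount"]
    assert result["amount"].sum() == 6
```

Saved in a file whose name starts with `test_`, a function like this is collected automatically when `pytest` runs.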
#### Documentation Dependencies
These dependencies are used to generate and serve the project documentation. They include tools for building documentation sites and generating dynamic content.
- ```mkdocstrings-python```: For rendering Python docstrings in documentation generated by MkDocs.
- ```pygments```: For syntax highlighting in the documentation.
- ```pymdown-extensions```: Extensions for MkDocs, enabling advanced Markdown usage.
- ```mkdocs-bootstrap386```: Bootstrap theme for MkDocs.
- ```mkdocs-material```: Material theme for MkDocs.
- ```mkdocs```: Tool for creating documentation websites using Markdown.
### Installation and Configuration
1. Clone the repository:
```bash
git clone https://github.com/rafaeljurkfitz/etl-excel.git
cd etl-excel
```
2. Set up the correct Python version using `pyenv`:
```bash
pyenv install 3.12.0
pyenv local 3.12.0
```
3. Configure Poetry for Python version 3.12.0 and activate the virtual environment:
```bash
poetry env use 3.12.0
poetry shell
```
4. Install the project dependencies:
```bash
poetry install
```
5. Run the tests to ensure everything is correct and working:
```bash
task test
```
6. Run the command to view the project documentation:
```bash
task doc
```
7. Start the pipeline execution by running the command to initiate the ETL:
```bash
task run
```
8. Check the ```data/output``` folder path to ensure the generated file is correct.
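This README does not show the code behind `task run`. As a rough sketch only, an extract-transform-load flow that merges Excel files with `pandas` could look like the following. It assumes the input workbooks share the same columns; the function names are illustrative and not taken from the project.

```python
from pathlib import Path

import pandas as pd


def extract(input_dir: Path) -> list[pd.DataFrame]:
    """Read every .xlsx file in input_dir (pandas uses openpyxl for .xlsx)."""
    return [pd.read_excel(path) for path in sorted(input_dir.glob("*.xlsx"))]


def transform(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Stack the frames vertically and renumber the rows from zero."""
    if not frames:
        # pd.concat raises on an empty list, so return an empty table instead.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def load(frame: pd.DataFrame, output_file: Path) -> None:
    """Write the combined frame to a single Excel file, without the index."""
    output_file.parent.mkdir(parents=True, exist_ok=True)
    frame.to_excel(output_file, index=False)


def run(input_dir: Path, output_file: Path) -> None:
    """Chain the three steps: extract, transform, load."""
    load(transform(extract(input_dir)), output_file)
```

Under these assumptions, calling `run(Path("data/input"), Path("data/output/output.xlsx"))` would read each workbook in a hypothetical `data/input` folder and write one combined file; both paths are placeholders to adapt to the real project layout. Keeping `transform` free of file access makes it easy to test with in-memory tables.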
## Contact
For questions, suggestions, or feedback:
- **Rafael Jurkfitz** - [[email protected]](mailto:[email protected])
## License
This project is licensed under the MIT License.