https://github.com/pperrinn/final-thesis
This repository contains the research, code, and examples related to the orchestration of data pipelines using a microservices architecture. The project explores the challenges of constructing modern data-centric pipelines and evaluates the role of orchestration tools in Data Science workflows.
- Host: GitHub
- URL: https://github.com/pperrinn/final-thesis
- Owner: pperrinn
- License: mit
- Created: 2025-03-17T10:13:27.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-03-17T10:16:27.000Z (3 months ago)
- Last Synced: 2025-03-17T11:29:37.070Z (3 months ago)
- Topics: data-science, docker, etl-pipeline, mage-ai, python
- Homepage:
- Size: 2.93 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Final Thesis: Problems in Building Data Pipelines
This repository contains the research, code, and examples related to the orchestration of data pipelines using a microservices architecture. The project explores the challenges of constructing modern data-centric pipelines and evaluates the role of orchestration tools in Data Science workflows.
## About
Data Science is a data-centric scientific discipline: data is the main actor. The varied and complex processing that data requires is organized into sequences of steps, called data pipelines, through which data flows as if through conduits.
This thesis therefore analyzes the problems that arise in Data Science at the different stages of developing data pipelines on a microservices architecture. More specifically, it evaluates both the need for and the feasibility of orchestrating these services.
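To make the idea of a pipeline as a sequence of steps concrete, here is a minimal sketch in plain Python. It is not taken from the repository's code and does not use Mage AI; the function names (`extract`, `drop_incomplete`, `normalize`, `run_pipeline`) and the sample records are hypothetical, chosen only for illustration.

```python
from typing import Callable

# A record is a plain dictionary; a step maps a list of records to a new list.
Record = dict
Step = Callable[[list[Record]], list[Record]]


def extract() -> list[Record]:
    # Hypothetical in-memory source standing in for a file, API, or database.
    return [
        {"name": " Ada ", "score": "91"},
        {"name": "Linus", "score": ""},
        {"name": "Grace", "score": "78"},
    ]


def drop_incomplete(rows: list[Record]) -> list[Record]:
    # Discard records whose score field is empty.
    return [row for row in rows if row["score"]]


def normalize(rows: list[Record]) -> list[Record]:
    # Trim whitespace from names and convert scores to integers.
    return [
        {"name": row["name"].strip(), "score": int(row["score"])}
        for row in rows
    ]


def run_pipeline(rows: list[Record], steps: list[Step]) -> list[Record]:
    # Feed the output of each step into the next one, in order.
    for step in steps:
        rows = step(rows)
    return rows


result = run_pipeline(extract(), [drop_incomplete, normalize])
print(result)
```

In this sketch every step runs inside a single process. In a microservices setting each step would typically run as its own service or container, which is where the need for an orchestrator to schedule and connect the steps comes in.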
## What will you find in this repository?
This repository provides code that illustrates the challenges of building data pipelines today. The pipelines are written primarily in Python, but working with them also requires some knowledge of Docker and database query languages.
## What do you need to understand this project?
You'll need a local machine with at least 4 GB of memory and Python 3.10, installed either globally or in a virtual environment used to test the scripts with this version. Familiarity with container technology such as Docker or Kubernetes, the YAML serialization language, and a general understanding of Mage AI are also required.
## Meet the Author
Laura began her university studies in biomedical engineering. However, seeing that technology and organizations were increasingly focused on data and sustainability (among other things), she decided to pursue a degree in Data Science and Artificial Intelligence at the Polytechnic University of Madrid (UPM). Her interests include Big Data, Artificial Intelligence (AI), and software development, which she believes can be applied to various sectors and improved through best practices. Laura currently works as a Cyber Security Specialist at Roche Diagnostics S.L. (Sant Cugat del Vallés, Spain).