{"id":27443897,"url":"https://github.com/josephmachado/beginner_de_project","last_synced_at":"2025-05-15T19:08:34.098Z","repository":{"id":37964366,"uuid":"266409331","full_name":"josephmachado/beginner_de_project","owner":"josephmachado","description":"Beginner data engineering project - batch edition","archived":false,"fork":false,"pushed_at":"2025-01-22T00:44:05.000Z","size":32564,"stargazers_count":513,"open_issues_count":6,"forks_count":158,"subscribers_count":10,"default_branch":"master","last_synced_at":"2025-04-15T02:58:10.344Z","etag":null,"topics":["airflow","database","docker","emr","engineering","etl","python","redshift","redshift-cluster","spark"],"latest_commit_sha":null,"homepage":"https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/josephmachado.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-05-23T19:55:14.000Z","updated_at":"2025-04-14T04:02:55.000Z","dependencies_parsed_at":"2024-06-13T23:22:27.364Z","dependency_job_id":"a3328b94-b7ad-40f1-87f8-bcdc19516385","html_url":"https://github.com/josephmachado/beginner_de_project","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fbeginner_de_project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fbeginner_de_project/tags","releases_url":"https://repo
s.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fbeginner_de_project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/josephmachado%2Fbeginner_de_project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/josephmachado","download_url":"https://codeload.github.com/josephmachado/beginner_de_project/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254404357,"owners_count":22065641,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","database","docker","emr","engineering","etl","python","redshift","redshift-cluster","spark"],"created_at":"2025-04-15T02:58:06.957Z","updated_at":"2025-05-15T19:08:34.073Z","avatar_url":"https://github.com/josephmachado.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n\n* [Beginner DE Project - Batch Edition](#beginner-de-project---batch-edition)\n    * [Run Data Pipeline](#run-data-pipeline)\n        * [Run on codespaces](#run-on-codespaces)\n        * [Run locally](#run-locally)\n    * [Architecture](#architecture)\n\n# Beginner DE Project - Batch Edition\n\nCode for blog at [Data Engineering Project for Beginners](https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/).\n\n## Run Data Pipeline\n\nCode available at **[beginner_de_project](https://github.com/josephmachado/beginner_de_project)** repository.\n\n### Run on codespaces\n\nYou can run this data pipeline using GitHub codespaces. 
Follow the instructions below.\n\n1. Create codespaces by going to the **[beginner_de_project](https://github.com/josephmachado/beginner_de_project)** repository, cloning it (or clicking the `Use this template` button), and then clicking the `Create codespaces on main` button.\n2. Wait for codespaces to start, then in the terminal type `make up`.\n3. Wait for `make up` to complete, and then wait 30s for Airflow to start.\n4. After 30s, go to the `ports` tab and click on the link exposing port `8080` to access the Airflow UI (username and password are both `airflow`).\n\n![Codespace](assets/images/cs1.png)\n![Codespace make up](assets/images/cs2.png)\n![Codespace Airflow UI](assets/images/cs3.png)\n\n**Note**: Make sure to switch off your codespaces instance; you only have limited free usage. See the docs [here](https://github.com/features/codespaces#pricing).\n\n### Run locally\n\nTo run locally, you need:\n\n1. [git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)\n2. A [GitHub account](https://github.com/)\n3. [Docker](https://docs.docker.com/engine/install/) with at least 4GB of RAM and [Docker Compose](https://docs.docker.com/compose/install/) v1.27.0 or later\n\nClone the repo and run the following commands to start the data pipeline:\n\n```bash\ngit clone https://github.com/josephmachado/beginner_de_project.git\ncd beginner_de_project\nmake up\nsleep 30 # wait for Airflow to start\nmake ci # run checks and tests\n```\n\nGo to [http://localhost:8080](http://localhost:8080) to see the Airflow UI. Username and password are both `airflow`.\n\n## Architecture\n\nThis data engineering project includes the following:\n\n1. **`Airflow`**: To schedule and orchestrate DAGs.\n2. **`Postgres`**: To store Airflow's metadata (which you can see via the Airflow UI); it also has a schema to represent upstream databases.\n3. **`DuckDB`**: To act as our warehouse.\n4. 
**`Quarto with Plotly`**: To convert code in `markdown` format to HTML files that can be embedded in your app or served as is.\n5. **`Apache Spark`**: To process our data, specifically to run a classification algorithm.\n6. **`minio`**: To provide an S3-compatible, open source storage system.\n\nFor simplicity, services 1-5 above are installed and run in one container, defined [here](./containers/airflow/Dockerfile).\n\n![Data pipeline design](assets/images/arch.png)\n\nThe `user_analytics_dag` DAG in the [Airflow UI](http://localhost:8080) will look like the image below:\n\n![DAG](assets/images/dag.png)\n\nOn completion, you can see the dashboard HTML rendered at [./dags/scripts/dashboard/dashboard.html](./dags/scripts/dashboard/dashboard.html).\n\nRead **[this post](https://www.startdataengineering.com/post/data-engineering-projects-with-free-template/)** for information on setting up CI/CD, IaC (Terraform), `make` commands, and automated testing.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fbeginner_de_project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjosephmachado%2Fbeginner_de_project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjosephmachado%2Fbeginner_de_project/lists"}