{"id":19073615,"url":"https://github.com/hvignolo87/analytics_engineer_assignment","last_synced_at":"2025-07-30T14:11:31.844Z","repository":{"id":223205306,"uuid":"759461489","full_name":"hvignolo87/analytics_engineer_assignment","owner":"hvignolo87","description":"Resolution of the Analytics Engineering assignment of Clara","archived":false,"fork":false,"pushed_at":"2024-02-19T12:23:52.000Z","size":2745,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-02T17:27:40.755Z","etag":null,"topics":["airflow","analytics-engineering","data-engineering","dbt","postgresql","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Makefile","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hvignolo87.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-18T16:55:02.000Z","updated_at":"2024-09-20T13:44:40.000Z","dependencies_parsed_at":"2024-02-19T01:09:57.174Z","dependency_job_id":null,"html_url":"https://github.com/hvignolo87/analytics_engineer_assignment","commit_stats":null,"previous_names":["hvignolo87/analytics_engineer_assignment"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hvignolo87%2Fanalytics_engineer_assignment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hvignolo87%2Fanalytics_engineer_assignment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hvignolo87%2Fanalytics_engineer_assignment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hvignolo87%2Fanalytics_engineer_assignment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hvignolo87","download_url":"https://codeload.github.com/hvignolo87/analytics_engineer_assignment/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240125911,"owners_count":19751834,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","analytics-engineering","data-engineering","dbt","postgresql","python","sql"],"created_at":"2024-11-09T01:47:39.975Z","updated_at":"2025-02-22T04:25:40.535Z","avatar_url":"https://github.com/hvignolo87.png","language":"Makefile","readme":"# Analytics Engineer assignment resolution\n\n[![Apache Airflow](https://img.shields.io/badge/Apache%20Airflow-2.6.3-green.svg?logo=apacheairflow)](https://airflow.apache.org/docs/apache-airflow/2.6.3/index.html) [![Python 3.10.12](https://img.shields.io/badge/python-3.10.12-blue.svg?labelColor=%23FFE873\u0026logo=python)](https://www.python.org/downloads/release/python-31012/) ![dbt-version](https://img.shields.io/badge/dbt-version?style=flat\u0026logo=dbt\u0026label=1.5\u0026link=https%3A%2F%2Fdocs.getdbt.com%2Fdocs%2Fintroduction)\u003cbr\u003e[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://docs.astral.sh/ruff/) [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://black.readthedocs.io/en/stable/) [![Imports: isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat\u0026labelColor=ef8336)](https://pycqa.github.io/isort/)\u003cbr\u003e[![Conventional Commits](https://img.shields.io/badge/Conventional%20Commits-1.0.0-%23FE5196?logo=conventionalcommits\u0026logoColor=white)](https://conventionalcommits.org) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://pre-commit.com/)\n\nIn this document, you'll find information and instructions about my solution to the analytics engineer assignment.\n\n## Directories structure\n\nThis is the structure of the project.\n\n```text\n.\n├── .dockerignore\n├── .env.airflow.local.example\n├── .env.dbt.local.example\n├── .env.postgres.local.example\n├── .gitignore\n├── .pre-commit-config.yaml\n├── .python-version\n├── .sqlfluff\n├── .sqlfluffignore\n├── .vscode\n│   ├── extensions.json\n│   └── settings.json\n├── Analytics Engineer Assessment.docx\n├── Dockerfile.airflow\n├── Dockerfile.dbt\n├── LICENSE\n├── Makefile\n├── README.md\n├── analytics_engineer_assessment.pdf\n├── dags\n│   ├── .airflowignore\n│   ├── settings.py\n│   └── transformations.py\n├── dbt\n│   ├── analyses\n│   │   └── .gitkeep\n│   ├── dbt_project.yml\n│   ├── macros\n│   │   ├── .gitkeep\n│   │   ├── generate_raw_data.sql\n│   │   ├── generate_schema_name.sql\n│   │   └── macros.yml\n│   ├── models\n│   │   ├── intermediate\n│   │   │   ├── int_commits.sql\n│   │   │   ├── int_events.sql\n│   │   │   ├── int_repos.sql\n│   │   │   ├── int_users.sql\n│   │   │   └── intermediate.yml\n│   │   ├── landing.yml\n│   │   ├── marts\n│   │   │   ├── marts.yml\n│   │   │   └── reporting\n│   │   │       ├── dim_commits.sql\n│   │   │       ├── dim_repos.sql\n│   │   │       ├── dim_users.sql\n│   │   │       ├── fct_events.sql\n│   │   │       └── reporting.yml\n│   │   └── staging\n│   │       ├── staging.yml\n│   │       ├── stg_actors.sql\n│   │       ├── stg_commits.sql\n│   │       ├── stg_events.sql\n│   │       └── stg_repos.sql\n│   ├── packages.yml\n│   ├── profiles.yml\n│   ├── seeds\n│   │   ├── .gitkeep\n│   │   ├── raw\n│   │   │   ├── actors.csv\n│   │   │   ├── commits.csv\n│   │   │   ├── events.csv\n│   │   │   └── repos.csv\n│   │   └── seeds.yml\n│   ├── snapshots\n│   │   └── .gitkeep\n│   └── tests\n│       └── .gitkeep\n├── docker-compose.yml\n├── images\n│   ├── airflow_dag.png\n│   ├── lineage.png\n│   └── raw_erd.png\n├── mypy.ini\n├── noxfile.py\n├── poetry.lock\n├── pyproject.toml\n└── scripts\n    └── postgres_init.sh\n\n17 directories, 63 files\n```\n\n## What you'll need\n\nThis solution is containerized, so you'll need to [install docker and docker-compose](https://docs.docker.com/get-docker/).\n\nAlso, it's recommended to have a desktop SQL client like [DBeaver](https://dbeaver.io/download/).\n\nOn a secondary stage, you can install the recommended VS Code extensions.\n\n## Setup\n\nLet's dive into the setup process.\n\n### 1. Generate the environment variables\n\nOpen a shell in your machine, and navigate to this directory. Then run:\n\n```bash\nmake generate-dotenv\n```\n\nThis will generate three `.env` files with predefined values. Please, go ahead and open it! If you want to modify some values, just take into account that this may break some things.\n\n### 2. Install the project dependencies\n\nRun these commands in this sequence:\n\n```bash\nmake install-poetry\nmake install-project\nmake dbt-install-pkgs\n```\n\nOptionally, if you've cloned the repo, you can run:\n\n```bash\nmake install-pre-commit\n```\n\nTo install the pre-commit hooks and play around with them.\n\n### 3. Build the images\n\nRun:\n\n```bash\nmake build services=\"postgres bootstrap-dbt\"\n```\n\nThis will build all the required images.\n\n### 4. Create the services\n\nRun:\n\n```bash\nmake up services=\"postgres bootstrap-dbt\"\n```\n\nThis will create a PostgreSQL database, and all the raw tables, and run a command that populates those tables with the provided data.\n\n### 5. Connect to the DB locally\n\nOpen DBeaver, and set up the connection to the database. If you didn't modify the `.env` files, you can use these credentials:\n\n- User: `clara`\n- Password: `clara`\n- Host: `localhost`\n- Port: `5440`\n- DB: `clara`\n\nThen, please open the `queries.sql` and `view.sql` files and run queries in DBeaver to verify the results.\n\nIf you don't have DBeaver, you can run your queries from PostgreSQL's terminal with [psql](https://www.postgresql.org/docs/14/app-psql.html). To do this, please run:\n\n```bash\nmake execute-sql\n```\n\nThen you can run them from the terminal.\n\n## Creating the data model\n\nIn this section, we'll materialize the data model with `dbt`.\n\nIn your terminal, run:\n\n```bash\nmake dbt-run-model node=\"--target prod\"\n```\n\nAnd wait until all the models are finished.\n\n## Assignment resolution\n\n### How I've created the data model\n\n#### 1. Understanding the raw data\n\nFirst of all, I've manually inspected the provided raw data by digging into it. Then, I took a look at [the GitHub Events API docs](https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28).\n\nOnce I had that in mind, I understood the relations between the provided data. Here's an ERD:\n\n\u003cimg src=\"./images/raw_erd.png\" alt=\"raw_erd\" width=\"500\" height=\"250\" style=\"vertical-align:middle\"\u003e\u003cbr\u003e\n\nThe relationship highlights are:\n\n- One actor/user can have multiple events (e.g., `event_type = 'PushEvent'` and different commit SHAs)\n- One repository can have multiple events\n- One commit represents one single transaction\n\n#### 2. Analyzing deeply the raw data\n\nTaking a closer look into the raw data, I realized that there were some duplicates in the `repos` and `users` tables, and I've found (mainly) 2 strange things in those tables.\n\nFirst, there are 2 different usernames with the same id (`59176384`):\n\n```sql\nSELECT\n    id\n    , COUNT(DISTINCT username) AS num_of_users_per_id\nFROM raw.actors\nGROUP BY 1\nORDER BY 2 DESC\nLIMIT 5\n```\n\nThe usernames are:\n\n| id | username |\n|---|---|\n| 59176384 | starnetwifi |\n| 59176384 | starwifi88 |\n\nSo I decided to use `DISTINCT ON` in the pipeline as deduplication logic, so the first row remains.\n\nSecond, there are 14 repositories ID repeated with different names:\n\n```sql\nSELECT\n    id\n    , COUNT(DISTINCT name) AS num_of_names_per_id\nFROM raw.repos\nGROUP BY 1\nORDER BY 2 DESC\nLIMIT 15\n```\n\nFor example, the ID `230999134` has the following names:\n\n| id | name |\n|---|---|\n| 230999134 | hseera/dynamodb-billing-mode |\n| 230999134 | hseera/dynamodb-update-capacity |\n| 230999134 | hseera/update-dynamodb-capacity |\n\nSo I took the same logic into account in the pipeline.\n\nThese decisions were taken because no further explanations were provided.\n\nAnother thing that is worth mentioning is that the `PullRequestEvent` event doesn't have [the payload data](https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28#event-payload-object-for-pullrequestevent), so it's impossible to distinguish the events between opened, edited, closed, etc. I've assumed that the `PullRequestEvent` corresponds to the PR `opened` event.\n\nThis is because of the nature of the first question:\n\n\u003e Top 10 active users sorted by the amount of PRs created and commits pushed\n\nThe real question that I'll be answering is:\n\n\u003e Top 10 active users sorted by the amount of PRs events and commits pushed\n\nPlease take into account that, as per the question, the commits do not necessarily have to be related to the same PR.\n\nFinally, I understood that the phrase `active users` refers not to a bot.\n\n#### 3. Create draft queries to answer the questions\n\nI thought:\n\n\u003e I have the questions that I need to answer, so... how does a SQL query that answer them might look like?\n\n_I'm assuming that the data consumers are familiar with SQL. If this is not the case, the solution might be to create a specific report schema and tables with the results of the following queries._\n\nLet's think about the first one:\n\n\u003e Top 10 active users sorted by the amount of PRs created and commits pushed\n\nIt will look somehow like these:\n\n```sql\nSELECT\n    dim_users.user_id\n    , dim_users.username\n    , COUNT(*) AS num_prs_created_and_commits_pushed\nFROM some_schema.fct_events\nLEFT JOIN some_schema.dim_users\n    ON fct_events.user_id = dim_users.id\nWHERE fct_events.\"type\" IN ('PushEvent', 'PullRequestEvent')\nGROUP BY 1, 2\nORDER BY 3 DESC\nLIMIT 10\n```\n\nWhere:\n\n- `dim_users` is a dimension table, containing the user ID and username\n- `fct_events` is the fact table, containing all the events\n\nSo at first sight, the `dim_users` can be an [SCD type 2](https://en.wikipedia.org/wiki/Slowly_changing_dimension), as the username rarely changes over time, but it can. It seemed an overkill for this specific case, so I decided to model it as a type 0.\n\nPerforming a similar thing for the rest of the questions:\n\n\u003e Top 10 repositories sorted by the amount of commits pushed\n\u003e\n\u003e Top 10 repositories sorted by the amount of watch events\n\nI realized that the queries would be quite similar to the previous one, and the other dimensions were very straightforward. So, these tables were created too:\n\n- `dim_commits` is a dimension table, containing the commit ID, the commit SHA, and the event ID\n- `dim_repos` is a dimension table, containing the repo ID and name\n\n#### 4. Create the models\n\nI decided to use [classic modular data modeling techniques](https://www.getdbt.com/analytics-engineering/modular-data-modeling-technique#what-is-modular-data-modeling), and thought about these layers:\n\n- `staging`: just a copy of the landing/source tables with some types casting (if needed), in order to standardize\n- `intermediate`: here I'll place reusable models, with some deduplication logic\n- `marts`: here I'll place the final models, in a star schema (facts surrounded by dimensions)\n\nSince the raw data doesn't need much processing (just some deduplication logic), all of the models in the `staging` and `intermediate` layers will be quite similar, and the only difference will be the deduplication logic. I've created a macro to apply [the DRY principle](https://docs.getdbt.com/terms/dry) in these layers.\n\nThe final lineage graph is as follows:\n\n![lineage_graph](./images/lineage.png)\n\n### SQL queries for reporting\n\nUsing the data model created with `dbt`, you can answer the required questions.\n\nPlease, run these queries in DBeaver to verify the results.\n\n```sql\n-- Top 10 active users sorted by the amount of PRs created and commits pushed\nSELECT\n    fct_events.user_id AS user_id\n    , dim_users.username AS username\n    , COUNT(*) AS num_prs_created_and_commits_pushed\nFROM reporting.fct_events\nLEFT JOIN reporting.dim_users\n    ON fct_events.user_id = dim_users.id\nWHERE fct_events.\"type\" IN ('PushEvent', 'PullRequestEvent')\n    AND NOT username ~* '-bot|\\[bot\\]|bot$'\nGROUP BY 1, 2\nORDER BY 3 DESC\nLIMIT 10\n```\n\n```sql\n-- Top 10 repositories sorted by the amount of commits pushed\nSELECT\n    fct_events.repo_id AS repo_id\n    , dim_repos.name AS repo_name\n    , COUNT(*) AS num_commits_per_repo\nFROM reporting.fct_events\nLEFT JOIN reporting.dim_repos\n    ON fct_events.repo_id = dim_repos.id\nWHERE fct_events.commit_sha IS NOT NULL\nGROUP BY 1, 2\nORDER BY 3 DESC\nLIMIT 10\n```\n\n```sql\n-- Top 10 repositories sorted by the amount of watch events\nSELECT\n    fct_events.repo_id AS repo_id\n    , dim_repos.name AS repo_name\n    , COUNT(*) AS num_watch_events_per_repo\nFROM reporting.fct_events\nLEFT JOIN reporting.dim_repos\n    ON fct_events.repo_id = dim_repos.id\nWHERE fct_events.\"type\" = 'WatchEvent'\nGROUP BY 1, 2\nORDER BY 3 DESC\nLIMIT 10\n```\n\n### Model contracts and tests\n\nI've added some tests in the `intermediate` and `reporting` layers to verify the correctness of the data and ensure the data quality.\n\nGenerally speaking, the tests aim to ensure:\n\n- No ID is missing\n- Data types are as expected\n- There are no duplicates\n- Critical columns are present\n\nTo run the tests, open a terminal and run:\n\n```bash\nmake dbt-test-model node=\"--target prod\"\n```\n\nAlso, there are some model contracts enforced in the `intermediate` and `reporting` layers, in order to avoid inserting duplicated fields, nulls, etc., and to ensure the models' relations.\n\n## Running the dbt workflow in an Airflow DAG\n\nIf you're an Airflow fan (like me), you can set up an environment to run the dbt pipeline in an Airflow DAG.\n\nTo do this, please run (these commands are similar to the setup process):\n\n```bash\nmake build\nmake up\n```\n\nThen, go to [http://localhost:8080/](http://localhost:8080/) and log in with the credentials `airflow:airflow`. You'll find a DAG named `transformations`, please go ahead and click on it. You'll see a DAG like this:\n\n![airflow_dag](./images/airflow_dag.png)\n\nIf you want to test it, just click on the toggle button and run the pipeline.\n\nPlease note that the models are run before their tests in the same DAG.\n\n## More commands and help\n\nIf you're struggling with some commands, please run `make help` to get all the available commands.\n\n## About the development tools\n\nI've used [poetry](https://python-poetry.org/) to manage the project's dependencies. If you want to install it in your local machine, please run:\n\n```bash\nmake install-poetry\n```\n\nAnd then run:\n\n```bash\nmake install-project\n```\n\nThen you'll have all the dependencies installed, and a virtual environment created in this very directory. This is useful, for example, if you're using VS Code and want to explore the code. Also, you might want to use [pyenv](https://github.com/pyenv/pyenv) to install Python 3.10.12.\n\nAll the code in this project has been linted and formatted with these tools:\n\n- [black](https://black.readthedocs.io/en/stable/)\n- [isort](https://pycqa.github.io/isort/)\n- [mypy](https://mypy.readthedocs.io/en/stable/)\n- [ruff](https://docs.astral.sh/ruff/)\n- [sqlfluff](https://docs.astral.sh/sqlfluff/)\n\nI just cloned the repo and want to play around with the pre-commit framework? Just run:\n\n```bash\nmake nox-hooks\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhvignolo87%2Fanalytics_engineer_assignment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhvignolo87%2Fanalytics_engineer_assignment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhvignolo87%2Fanalytics_engineer_assignment/lists"}