# Workshop #1: Data Engineer 
Developed by **Martín García** ([@mitgar14](https://github.com/mitgar14)).
## Overview ✨
In this workshop we work with randomly generated candidate data stored in a CSV file. We load, clean, and transform this data to find interesting insights, using the following tools:
* Python 3.12 ➜ [Download site](https://www.python.org/downloads/)
* Jupyter Notebook ➜ [VS Code tool for using notebooks](https://youtu.be/ZYat1is07VI?si=BMHUgk7XrJQksTkt)
* PostgreSQL ➜ [Download site](https://www.postgresql.org/download/)
* Power BI (Desktop version) ➜ [Download site](https://www.microsoft.com/es-es/power-platform/products/power-bi/desktop)
The Python libraries needed are:
* Pandas
* Matplotlib
* Seaborn
* SQLAlchemy
* Dotenv
These libraries are included in the Poetry project config file (*pyproject.toml*). The step-by-step installation will be described later.
## Dataset Information 
The dataset used (*candidates.csv*) has 50,000 rows and 10 columns describing each candidate registered for the recruitment process.
The dataset is later transformed so it can be consumed more easily by the visualization tool (a quick load-and-inspect sketch follows the column list below).
Initially, the dataset's columns and their respective dtypes are:
* First Name ➜ Object
* Last Name ➜ Object
* Email ➜ Object
* Country ➜ Object
* Application Date ➜ Object
* YOE (years of experience) ➜ Integer
* Seniority ➜ Object
* Technology ➜ Object
* Code Challenge Score ➜ Integer
* Technical Interview Score ➜ Integer
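For a quick sanity check, the dataset can be loaded with Pandas to confirm these dtypes. Below is a minimal sketch, assuming *candidates.csv* sits at the repository root (adjust the path and separator to your copy); the date parsing at the end is purely illustrative, not necessarily the transformation used in the notebooks:
```python
import pandas as pd

# Load the raw dataset; adjust the path (and `sep=`) to match your copy.
df = pd.read_csv("candidates.csv")

# Confirm the shape and the initial dtypes listed above.
print(df.shape)   # expected: (50000, 10)
print(df.dtypes)

# Illustrative transformation: "Application Date" arrives as object (string),
# so parsing it as a datetime makes time-based analysis easier.
df["Application Date"] = pd.to_datetime(df["Application Date"])
```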
## Run the project 
> Although this example uses Ubuntu through WSL, the process can be done on any operating system (OS).
### Clone the repository
Execute the following command to clone the repository:
```bash
git clone https://github.com/mitgar14/etl-workshop-1.git
```
#### Demonstration of the process

### Environment variables
> From now on, the steps will be done in VS Code.
To establish the connection to the database, we use a module called *connection.py*. This Python script reads a file where our environment variables are stored. This is how we create that file:
1. We create a directory named ***env*** inside our cloned repository.
2. There we create a file called ***.env***.
3. In that file we declare 6 environment variables. Remember that in this case the values go without double quotes (`"`), i.e. without string notation:
```env
PG_HOST = # host address, e.g. localhost or 127.0.0.1
PG_PORT = # PostgreSQL port, e.g. 5432
PG_USER = # your PostgreSQL user
PG_PASSWORD = # your user password
PG_DRIVER = postgresql+psycopg2
PG_DATABASE = # your database name, e.g. postgres
```
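With these variables in place, *connection.py* can build a SQLAlchemy engine. The following is a minimal sketch of what such a module could look like, assuming it uses *python-dotenv* to load the file and `create_engine` to open the connection; it is an illustration, not necessarily the exact code in the repository:
```python
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine

# Load the variables declared in env/.env (the path created in step 1).
load_dotenv("env/.env")


def get_engine():
    """Build a SQLAlchemy engine from the environment variables."""
    url = (
        f"{os.getenv('PG_DRIVER')}://"
        f"{os.getenv('PG_USER')}:{os.getenv('PG_PASSWORD')}"
        f"@{os.getenv('PG_HOST')}:{os.getenv('PG_PORT')}"
        f"/{os.getenv('PG_DATABASE')}"
    )
    return create_engine(url)
```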
#### Demonstration of the process

### Installing the dependencies with *Poetry*
> To install Poetry follow [this link](https://elcuaderno.notion.site/Poetry-8f7b23a0f9f340318bbba4ef36023d60?pvs=4).
1. Enter the Poetry shell with `poetry shell`.
2. Once the virtual environment is active, execute `poetry install` to install the dependencies. If you run into an error with the *.lock* file, execute `poetry lock` to regenerate it.
3. Now you can execute the notebooks!
#### Demonstration of the process

### Running the notebooks
Execute the 3 notebooks in the following order. You can run each one by pressing the "Run All" button:
1. *001_rawDataLoad.ipynb*
2. *002_candidatesEDA.ipynb*
3. *003_cleanDataLoad.ipynb*

Remember to choose **the right Python kernel** (the Poetry virtual environment) when running the notebooks and to **install *ipykernel*** so VS Code can run Jupyter notebooks.
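At their core, the load notebooks write DataFrames into PostgreSQL through the SQLAlchemy engine. Here is a condensed sketch of that step, reusing the `get_engine()` helper sketched earlier; the table name `candidates` is illustrative, not necessarily the one created by the notebooks:
```python
import pandas as pd

from connection import get_engine  # the module sketched above

engine = get_engine()

# Write the raw data into PostgreSQL, replacing the table if it already exists.
df = pd.read_csv("candidates.csv")
df.to_sql("candidates", engine, if_exists="replace", index=False)

# Read a quick count back to verify the load.
print(pd.read_sql("SELECT COUNT(*) FROM candidates", engine))
```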
### Connecting the database to Power BI
1. Open Power BI Desktop and create a new dashboard. Select the *Get data* option and be sure to choose the "PostgreSQL Database" option.

2. Enter the PostgreSQL server and database name.

3. Fill in the following fields with your credentials.

4. If the connection succeeds, the following tables will appear:

5. Choose the *candidates_hired* table and start making your own visualizations!

## Thank you! 💕🐍
Thanks for visiting my project. Suggestions and contributions are always welcome 👄.