https://github.com/deepakramani/de_zc2023
data engineering zoom camp 2023
https://github.com/deepakramani/de_zc2023
Last synced: 2 months ago
JSON representation
data engineering zoom camp 2023
- Host: GitHub
- URL: https://github.com/deepakramani/de_zc2023
- Owner: deepakramani
- License: gpl-3.0
- Created: 2023-01-21T22:53:58.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-06T20:25:43.000Z (over 2 years ago)
- Last Synced: 2024-11-26T15:50:36.396Z (6 months ago)
- Language: Python
- Size: 75.2 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Course repo for DE zoom camp 2023
All the course related things will be here.
## Clone, install docker and docker-compose
The instructions were tested on Ubuntu 20.04 LTS. It is recommended to run these instruction in an Ubuntu based machine.
```
sudo apt update && sudo apt install make git gzip unzip -y
cd ~ && git clone https://github.com/dr563105/de_zc2023.git
cd de_zc2023
make install_docker
source ~/.bashrc
```
Logout and log in back to the instance. To test docker if it is working, run```
docker run --rm hello-world # should return "Hello from Docker!" without errors
```## Download the dataset
Make a folder and download the necessary files.
```
cd ~/de_zc2023
make download_data
```
## Initialise Postgres DB and start itExport postgres environment variables.
```
export POSTGRES_USER=yourDBname #postgres
export POSTGRES_PASSWORD=yourDBuserpassword #postgres
export POSTGRES_DB=yourDBname #nyc-taxi-db
```Run the following to initialise DB and start it. On the first run, the command will create a DB, create the tables and ingest data into it. It will take about 2 minutes to be active. Successive calls will be much quicker.
```
cd ~/de_zc2023
make start_db (if you get an docker-compose not found error. Run `source ~/.bashrc`)
```To log into the docker container and check if DB is setup properly,
```
docker exec -it de_zc2023-nytaxi_postgres-1 psql -U postgres nyc-taxi-db
```To learn more about PSQL, read [this](https://gist.github.com/Kartones/dd3ff5ec5ea238d4c546) Github gist.
Navigate to the [homework](./hw/) folder and try to execute queries inside the `sql` file. If everything is setup properly, you should be able to get the results.
## Stop DB```
make stop_db
```## ETL using Prefect
Install Miniconda. We will conda as base and use `pipenv` to install other libraries.
```
make install_conda
```
Log out/in again into the machine.Install pipenv
```
cd ~/de_zc2023
make setup_pipenv
```
Add port `4200` to ingress firewall rule.Setup and start prefect
**In terminal 1**
```
export EC2_IP="" # replace double quotes with the EC2 IP address
make setup_prefect && make start_prefect
```