![Title Image](docs/hero.jpeg "Palmer Penguins Hero")

# Insights for Palmer Archipelago Penguins

[![Images Build and Push](https://github.com/teresaromero/palmer-penguins/actions/workflows/docker.yml/badge.svg?branch=development&event=push)](https://github.com/teresaromero/palmer-penguins/actions/workflows/docker.yml)
[![GitHub license](https://img.shields.io/github/license/teresaromero/palmer-penguins)](https://github.com/teresaromero/palmer-penguins/blob/development/LICENSE.md)

## About the Data

Data has been gathered from different sources listed below:

- [Kaggle Dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data)
- [National Geographic](https://www.nationalgeographic.com/animals/birds/facts/gentoo-penguin)
- [National Geographic](https://www.nationalgeographic.com/animals/birds/facts/adelie-penguin)
- [National Geographic](https://www.nationalgeographic.com/animals/birds/facts/chinstrap-penguin)

## Motivation

The purpose of this project is educational. It is the first of two projects to be completed in the Data Bootcamp at :tangerine: [Core Code School](https://www.corecode.school/bootcamp/big-data-machine-learning).

The requirement for the project is to build a data app with a backend built with Flask, a frontend built with Streamlit, and a database (Postgres or MongoDB).

## Stack

- API - Python, Flask, PyMongo
- Data - MongoDB, Python, Jupyter Notebook, Pandas, mongoshell script
- Streamlit - Python, Pandas, Streamlit API, Streamlit State API
- Other tools - Commitizen, GitHub Projects, GitHub Actions, Okteto

## Services

The project has three main services: Database, API (backend), and Streamlit (frontend). Let's describe each of them:

### :bar_chart: DATA

The data service is a custom MongoDB image that loads the project data into the database during the init phase.

The original csv `source/penguins_lter.csv` is transformed into `database/docker-entrypoint-initdb.d/seed.json` by running `generate-seed-data.py`.

Once the seed exists, building the Mongo image runs the `mongo-init.js` mongoshell script, which creates the admin and API users and sets up the database with its collections.
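As a rough illustration of that first step, a minimal sketch of what `generate-seed-data.py` could do with pandas (the column normalization and output shape are assumptions, not the repo's actual code):

```python
# Hypothetical sketch of generate-seed-data.py: convert the Kaggle CSV
# into a JSON seed file that the MongoDB init phase can load.
import json

import pandas as pd

# Load the raw Kaggle export (path taken from the README).
df = pd.read_csv("source/penguins_lter.csv")

# Normalize column names so they work as MongoDB field names (assumption).
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Replace NaN with None so the output is valid JSON.
df = df.where(pd.notnull(df), None)

# Write one JSON document per row for docker-entrypoint-initdb.d.
records = df.to_dict(orient="records")
with open("database/docker-entrypoint-initdb.d/seed.json", "w") as f:
    json.dump(records, f, indent=2, default=str)
```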

#### Collections

- `kaggle-raw-data` - the seed.json itself
- `ng-species-raw-data` - the species.json collection from web-scraping National Geographic
- `individuals` - collection with each penguin's measurement data; each document has pointers to `islands`, `regions`, `species`, `studynames`
- `islands` - collection with the data regarding the islands
- `regions` - collection with the data regarding the regions
- `species` - collection with the data regarding the species
- `studynames` - collection with the data regarding the study names

These collections are extracted from `kaggle-raw-data` so that extra data can be added to each of them without changing `individuals`, the main collection.
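As a hedged illustration of that extraction step (the connection URI, database name, and field names here are assumptions, not the repo's actual schema), deriving a lookup collection from the raw data with PyMongo could look like:

```python
# Hypothetical sketch: derive the `islands` collection from `kaggle-raw-data`
# so extra per-island data can be added without touching `individuals`.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local URI
db = client["penguins"]  # assumed database name

# Upsert one document per distinct island name found in the raw seed data.
for name in db["kaggle-raw-data"].distinct("island"):
    db["islands"].update_one(
        {"name": name},
        {"$set": {"name": name}},  # room to add extra fields later
        upsert=True,
    )
```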

### :gear: API

The API is the backend service for the Streamlit frontend and the only service that communicates with the database. The sub-repo for the API is structured as follows:

- `main.py`- entry point for the flask server.
- `config.py`- env variables all in the same place.
- `routes`- dir with all the routes, entry point to the API.
- `controllers`- dir with the controllers for each route, responsible for executing the code for that route.
- `libs` - utils used throughout the project for different purposes.
- `decorators` - custom decorator methods.

#### Routes

- `GET - /<collection>` - returns all the documents found for this collection in the database
- `PATCH - /<collection>/<document_id>` - modifies the document `<document_id>` of the collection `<collection>`. The payload should be compliant with the collection's fields.
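A minimal Flask sketch of the parameterized GET route described above (the handler name, URI, and database name are assumptions):

```python
# Hypothetical sketch of the parameterized collection route.
from flask import Flask, jsonify
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/penguins"  # assumed
mongo = PyMongo(app)

@app.route("/<collection>", methods=["GET"])
def list_documents(collection):
    # Return every document in the requested collection;
    # exclude _id since ObjectId is not JSON-serializable.
    docs = list(mongo.db[collection].find({}, {"_id": 0}))
    return jsonify(docs)

if __name__ == "__main__":
    app.run(port=5000)
```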

#### Decorators

- `handle_error` - for each route, this decorator catches the errors and returns a json error response.
- `validate_route`- since the root route is parameterized, this decorator checks that the collection exists in the db; if not, it throws an error before accessing the controller.
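A sketch of how those two decorators might be written (the error payload shape and the collection whitelist are assumptions):

```python
# Hypothetical sketches of the handle_error and validate_route decorators.
from functools import wraps

from flask import jsonify

KNOWN_COLLECTIONS = {"individuals", "islands", "regions", "species", "studynames"}

def handle_error(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # convert any failure into a JSON error
            return jsonify({"error": str(exc)}), 500
    return wrapper

def validate_route(fn):
    @wraps(fn)
    def wrapper(collection, *args, **kwargs):
        # Reject unknown collections before the controller runs.
        if collection not in KNOWN_COLLECTIONS:
            return jsonify({"error": f"unknown collection: {collection}"}), 404
        return fn(collection, *args, **kwargs)
    return wrapper
```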

#### Libs

- `mongo_client`- setup for the mongodb connection using `flask_pymongo`.
- `response`- utils to return different responses.
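A hedged sketch of what these two libs could contain (the helper names and response envelope are assumptions):

```python
# Hypothetical libs sketches: PyMongo setup and response helpers.
from flask import jsonify
from flask_pymongo import PyMongo

# mongo_client: shared connection, bound later with mongo.init_app(app).
mongo = PyMongo()

# response: utils returning consistent JSON envelopes.
def ok(data, status=200):
    """Wrap successful payloads."""
    return jsonify({"data": data}), status

def error(message, status=400):
    """Wrap error messages."""
    return jsonify({"error": message}), status
```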

### :sparkles: STREAMLIT

This is the service where the data is displayed. This sub-repo is structured as follows:

- `main.py` - entry point for the streamlit app
- `utils` - dir with methods used along the project
- `pages`- dir with the pages available in the streamlit app
- `components` - dir with the components used along the project
- `api` - dir with the methods used to call backend to retrieve data
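To make the flow concrete, a minimal sketch of a Streamlit page fetching data through the API (the endpoint path and the `API_URL` default are assumptions):

```python
# Hypothetical sketch of a Streamlit page reading data via the Flask API.
import os

import pandas as pd
import requests
import streamlit as st

API_URL = os.environ.get("API_URL", "http://localhost:5000")  # assumed default

st.title("Palmer Penguins - Individuals")

# Fetch one collection from the backend and render it as a table.
resp = requests.get(f"{API_URL}/individuals")
resp.raise_for_status()
st.dataframe(pd.DataFrame(resp.json()))
```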

#### Features and Screenshots

![Multi-page streamlit app](docs/st_home.jpeg "Multi-page streamlit app")

![Dashboard with data visualizations](docs/st_data_viz.jpeg "Dashboard with data visualizations")

![Read database data](docs/st_datasets.jpeg "Read database data")

![Edit database data](docs/st_data_edit.jpeg "Edit database data")

![Single page with species detail](docs/st_species.jpeg "Single page with species detail")

## Installation

You can clone the repo and run `docker-compose up`.

### .ENV

Env variables needed to run the project

- `MONGO_URI` - uri for MongoDB DB (incl. db-name).
- `MONGO_DBNAME` - database name where all data will be stored.

- `MONGO_ADMIN_USERNAME` - username for the database admin user.
- `MONGO_ADMIN_PASSWORD` - password for the database admin user.

- `MONGO_API_USERNAME` - username for the database user used in the api.
- `MONGO_API_PASSWORD` - password for the database user used in the api.

- `FLASK_DEBUG` - flag to run Flask in debug mode, `False` or `True`.
- `FLASK_ENV` - environment where Flask is running, e.g. `development`.

- `API_URL` - url for the API.
- `API_PORT` - the port where the API will be available.
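These variables would typically be gathered in `config.py` (mentioned above); a minimal sketch, assuming plain `os.environ` access and hypothetical defaults:

```python
# Hypothetical sketch of config.py: all env variables in one place.
import os

MONGO_URI = os.environ["MONGO_URI"]
MONGO_DBNAME = os.environ.get("MONGO_DBNAME", "penguins")  # assumed default

MONGO_API_USERNAME = os.environ.get("MONGO_API_USERNAME")
MONGO_API_PASSWORD = os.environ.get("MONGO_API_PASSWORD")

FLASK_DEBUG = os.environ.get("FLASK_DEBUG", "False") == "True"
API_PORT = int(os.environ.get("API_PORT", "5000"))  # assumed default
```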

## WIP

Some features have not been included in this first version, so here is the WIP and future work to be done on this repo:

- Production pipeline for API and Streamlit
- Refactor MongoDB seed
- Add Auth to Flask API
- Enable PDF download of visualizations
- Add more visualizations

## Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update tests as appropriate.

## License

[MIT](https://choosealicense.com/licenses/mit/)