Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/teresaromero/palmer-penguins
Mid-Bootcamp project for Core Code school Big Data & Machine Learning course.
- Host: GitHub
- URL: https://github.com/teresaromero/palmer-penguins
- Owner: teresaromero
- License: mit
- Created: 2021-07-24T08:43:17.000Z (over 3 years ago)
- Default Branch: development
- Last Pushed: 2021-08-16T09:04:52.000Z (over 3 years ago)
- Last Synced: 2023-04-05T20:56:26.973Z (over 1 year ago)
- Topics: docker, docker-compose, flask, jupyter-notebook, mongodb, mongoshell, pandas, pymongo, python, streamlit, streamlit-application, vega-lite
- Language: Python
- Homepage:
- Size: 4.43 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE.md
README
![Title Image](docs/hero.jpeg "Palmer Penguins Hero")
# Insights for Palmer Archipelago Penguins
[![Images Build and Push](https://github.com/teresaromero/palmer-penguins/actions/workflows/docker.yml/badge.svg?branch=development&event=push)](https://github.com/teresaromero/palmer-penguins/actions/workflows/docker.yml)
[![GitHub license](https://img.shields.io/github/license/teresaromero/palmer-penguins)](https://github.com/teresaromero/palmer-penguins/blob/development/LICENSE.md)

## About the Data
Data has been gathered from different sources listed below:
- [Kaggle Dataset](https://www.kaggle.com/parulpandey/palmer-archipelago-antarctica-penguin-data)
- [National Geographic: Gentoo Penguin](https://www.nationalgeographic.com/animals/birds/facts/gentoo-penguin)
- [National Geographic: Adélie Penguin](https://www.nationalgeographic.com/animals/birds/facts/adelie-penguin)
- [National Geographic: Chinstrap Penguin](https://www.nationalgeographic.com/animals/birds/facts/chinstrap-penguin)

## Motivation
The purpose of this project is educational. It is the first of two projects to be completed in the Data Bootcamp at :tangerine: [Core Code School](https://www.corecode.school/bootcamp/big-data-machine-learning).
The requirement is to build a data app with a backend built with Flask, a frontend built with Streamlit, and a database (PostgreSQL or MongoDB).
## Stack
- API - Python, Flask, PyMongo
- Data - MongoDB, Python, Jupyter Notebook, Pandas, mongoshell script
- Streamlit - Python, Pandas, Streamlit API, Streamlit State API
- Other tools - Commitizen, GitHub Projects, GitHub Actions, Okteto

## Services
The project has three main services: the database, the API (backend), and Streamlit (frontend). Each is described below.
### :bar_chart: DATA
The data service is a custom MongoDB image that loads the dataset into the database during the init phase.
The original CSV `source/penguins_lter.csv` is transformed into `database/docker-entrypoint-initdb.d/seed.json` by running `generate-seed-data.py`.
Once the seed exists, building the Mongo image runs the `mongo-init.js` mongoshell script, which creates the admin and API users and sets up the database with its collections.
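The transformation step is small; here is a minimal sketch of what `generate-seed-data.py` could look like, assuming the seed is a JSON array of row documents (the actual script may differ):

```python
# Hypothetical sketch, not the repo's actual script: read the Kaggle
# CSV with pandas and write one JSON document per row as the seed.
import pandas as pd

df = pd.read_csv("source/penguins_lter.csv")
df.to_json(
    "database/docker-entrypoint-initdb.d/seed.json",
    orient="records",  # a JSON array of row objects; NaN becomes null
)
```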
#### Collections
- `kaggle-raw-data` - the `seed.json` data itself
- `ng-species-raw-data` - the `species.json` data obtained by web-scraping National Geographic
- `individuals` - collection with each penguin's measurements; each document has pointers to `islands`, `regions`, `species`, `studynames`
- `islands` - collection with the data regarding the island
- `regions` - collection with the data regarding the region
- `species` - collection with the data regarding the species
- `studynames` - collection with the data regarding the study names

These collections are extracted from `kaggle-raw-data` so that extra data can be added to each of them without changing the main `individuals` collection.
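For illustration, here is a hedged sketch of how one of these derived collections could be extracted with PyMongo; the connection URI, database name, and `Species` field name are assumptions, not the repo's actual values:

```python
# Hypothetical sketch (URI, db name, and field name are assumed):
# derive a `species` collection from distinct values in the raw data.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["penguins"]

for name in db["kaggle-raw-data"].distinct("Species"):
    db["species"].update_one(
        {"name": name},
        {"$setOnInsert": {"name": name}},
        upsert=True,  # insert once, leave existing documents untouched
    )
```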
### :gear: API
The API is the backend service for the Streamlit frontend and the one that communicates with the database. The sub-repo for the API is structured as follows:
- `main.py` - entry point for the Flask server.
- `config.py` - all environment variables in one place.
- `routes` - dir with all the routes, the entry points to the API.
- `controllers` - dir with the controllers for each route, responsible for executing the code for that route.
- `libs` - utils used throughout the project for different purposes.
- `decorators` - custom decorator methods.

#### Routes
- `GET /<collection>` - returns all the documents found for the given collection in the database.
- `PATCH /<collection>/<id>` - modifies the document `<id>` of the collection `<collection>`. The payload should be compliant with the collection's fields.
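As a rough illustration, a parameterized route like the above could be wired up as in the sketch below; the repo splits this logic across `routes`, `controllers`, and `libs`, and the URI here is an assumption:

```python
# Hypothetical sketch of the parameterized GET route; the repo's
# actual wiring is spread across routes/controllers/libs.
from flask import Flask, jsonify
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/penguins"  # assumed
mongo = PyMongo(app)

@app.route("/<collection>", methods=["GET"])
def get_all(collection):
    # drop Mongo's ObjectId so the documents are JSON-serializable
    docs = list(mongo.db[collection].find({}, {"_id": 0}))
    return jsonify(docs)
```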
#### Decorators
- `handle_error` - for each route, this decorator catches errors and returns a JSON error response.
- `validate_route` - since the root route is parameterized, this decorator checks that the collection exists in the database; if it does not, it throws an error before the controller is reached.
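A minimal sketch of a `handle_error`-style decorator, assuming a broad catch and a simple JSON payload (the repo's version may differ):

```python
# Hypothetical sketch of an error-catching decorator; the repo's
# version may catch narrower exceptions or shape the payload differently.
from functools import wraps

from flask import jsonify

def handle_error(fn):
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:  # broad on purpose for this sketch
            return jsonify({"error": str(exc)}), 500
    return wrapper
```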
#### Libs
- `mongo_client` - setup for the MongoDB connection using `flask_pymongo`.
- `response` - utils to return different responses.
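For illustration, the response utils might look like the following sketch; the helper names and payload shapes are assumptions:

```python
# Hypothetical response helpers; names and payload shapes are assumed.
from flask import jsonify

def ok(data, status=200):
    """Wrap successful payloads in a consistent envelope."""
    return jsonify({"data": data}), status

def error(message, status=400):
    """Wrap error messages in a consistent envelope."""
    return jsonify({"error": message}), status
```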
### :sparkles: STREAMLIT
This is the service where the data is displayed. This sub-repo is structured as follows:
- `main.py` - entry point for the Streamlit app
- `utils` - dir with methods used throughout the project
- `pages` - dir with the pages available in the Streamlit app
- `components` - dir with the components used throughout the project
- `api` - dir with the methods used to call the backend to retrieve data
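A minimal sketch of a Streamlit page that calls the backend and renders the result; the `/individuals` endpoint and the default `API_URL` are assumptions:

```python
# Hypothetical sketch of a Streamlit page; the endpoint name and the
# default API_URL are assumptions.
import os

import pandas as pd
import requests
import streamlit as st

API_URL = os.environ.get("API_URL", "http://localhost:5000")

st.title("Palmer Penguins - Individuals")
resp = requests.get(f"{API_URL}/individuals", timeout=10)
resp.raise_for_status()
st.dataframe(pd.DataFrame(resp.json()))  # render the documents as a table
```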
#### Features and Screenshots

![Multi-page streamlit app](docs/st_home.jpeg "Multi-page streamlit app")
![Dashboard with data visualizations](docs/st_data_viz.jpeg "Dashboard with data visualizations")
![Read database data](docs/st_datasets.jpeg "Read database data")
![Edit database data](docs/st_data_edit.jpeg "Edit database data")
![Single page with species detail](docs/st_species.jpeg "Single page with species detail")
## Installation
You can clone the repo and run `docker-compose up`.
### .ENV
Environment variables needed to run the project:
- `MONGO_URI` - URI for the MongoDB database (incl. db name).
- `MONGO_DBNAME` - database name where all data will be stored.
- `MONGO_ADMIN_USERNAME` - username for the database admin user.
- `MONGO_ADMIN_PASSWORD` - password for the database admin user.
- `MONGO_API_USERNAME` - username for the database user used in the API.
- `MONGO_API_PASSWORD` - password for the database user used in the API.
- `FLASK_DEBUG` - flag to run Flask in debug mode, `False` or `True`.
- `FLASK_ENV` - environment where Flask is running, e.g. `development`.
- `API_URL` - URL for the API.
- `API_PORT` - the port where the API will be available.
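A sketch of a `config.py` that centralizes these variables; the defaults are placeholders, not the repo's actual values:

```python
# Hypothetical config.py; the defaults are placeholders, not the
# repo's actual values.
import os

MONGO_URI = os.environ["MONGO_URI"]            # required, no default
MONGO_DBNAME = os.environ.get("MONGO_DBNAME", "penguins")
FLASK_DEBUG = os.environ.get("FLASK_DEBUG", "False") == "True"
FLASK_ENV = os.environ.get("FLASK_ENV", "development")
API_URL = os.environ.get("API_URL", "http://localhost:5000")
API_PORT = int(os.environ.get("API_PORT", "5000"))
```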
## WIP
Some features have not been included in this first version, so here is the WIP and future work planned for this repo:
- Production pipeline for API and Streamlit
- Refactor MongoDB seed
- Add Auth to Flask API
- Enable PDF download of visualizations
- Add more visualizations

## Contributing
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
## License
[MIT](https://choosealicense.com/licenses/mit/)