https://github.com/piotrlaczkowski/data-science-template

Documentation available at: [dockbook](https://piotrlaczkowski.github.io/data-science-template/)

Host: GitHub
URL: https://github.com/piotrlaczkowski/data-science-template
Owner: piotrlaczkowski
License: mit
Created: 2018-11-02T11:54:06.000Z (almost 7 years ago)
Default Branch: master
Last Pushed: 2018-11-02T11:55:17.000Z (almost 7 years ago)
Last Synced: 2025-01-03T14:26:59.645Z (9 months ago)
Language: Python
Homepage:
Size: 899 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Cookiecutter Template For BackMarket Data Science Projects
**Helping to promote uniform data analysis practices and rules**

*New to Docker? Check out this writeup on containers vs virtual machines and how Docker fits in:*

*https://medium.freecodecamp.org/a-beginner-friendly-introduction-to-containers-vms-and-docker-79a9e3e119b*

*Cookiecutter is a command-line utility that automatically scaffolds new projects for you based on templates (referred to as cookiecutters):*

*http://cookiecutter.readthedocs.io/en/latest/readme.html*

This cookiecutter is used in conjunction with a base development image available on [Docker Hub](https://hub.docker.com/r/manifoldai/docker-ml-dev/) to provide a ready-to-use environment for many Data Science and Machine Learning project use cases.
After running this cookiecutter and the provided start script, a developer will have a local development setup that looks like this:

![docker local dev](https://s3-us-west-1.amazonaws.com/manifold-public-no-vpn/torus_local_dev.png)

By scaffolding your data science projects using this cookiecutter you will get:

- Project Docker image built with your own Dockerfile for project-specific requirements
- Docker Compose configuration that dynamically binds to a free host port and forwards it to the Jupyter server's listening port inside the container
- Shared volume configuration for accessing and executing all your project code inside the controlled container environment
- Ability to edit code using your favorite IDE on your host machine and see the changes reflected in the runtime environment in real time
- Jupyter notebook fully configured with nb-extensions, ready for development and feature engineering
- Common data science and plotting libraries pre-installed in the container environment so you can start working immediately
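
To illustrate what "dynamically binds to a free host port" means in practice, here is a minimal Python sketch of the underlying operating-system mechanism: binding a socket to port 0 asks the OS to pick any unused port. This is only an illustration of the idea, not the template's own code (in the template, Docker performs the port selection), and the function name `find_free_port` is hypothetical.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on `host`.

    Hypothetical helper for illustration only; the template relies on
    Docker to choose the host port, not on this function.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 tells the OS to assign any free ephemeral port.
        sock.bind((host, 0))
        # getsockname() returns (address, port) for an IPv4 socket.
        return sock.getsockname()[1]


if __name__ == "__main__":
    print(f"A free host port right now: {find_free_port()}")
```

Because the port is chosen at bind time, two projects started side by side do not collide on a fixed port, which is why the host port has to be looked up after the container starts.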

There are several downstream benefits to moving to a container-first workflow in terms of model and inference engine deployment and delivery.
By using containers early in the development cycle, you can remove many of the configuration management issues that waste developer time and ultimately lower the quality of deliverables.

## Getting Started
1. Install Docker:
   - For Mac: https://store.docker.com/editions/community/docker-ce-desktop-mac
   - For Windows: https://store.docker.com/editions/community/docker-ce-desktop-windows
   - For Linux: Go to this page and choose the appropriate install for your Linux distro: https://www.docker.com/community-edition
   - Install Docker Compose (https://docs.docker.com/compose/install/#install-compose):
     ```bash
     $ sudo curl -L https://github.com/docker/compose/releases/download/1.21.0/docker-compose-$(uname -s)-$(uname -m) -o /usr/local/bin/docker-compose
     ```
     ```bash
     $ sudo chmod +x /usr/local/bin/docker-compose
     ```
     Test the installation:
     ```bash
     $ docker-compose --version
     docker-compose version 1.21.0, build 1719ceb
     ```
2. Install the Python Cookiecutter package, version 1.4.0 or later (http://cookiecutter.readthedocs.org/en/latest/installation.html):
   ```bash
   $ pip install cookiecutter
   ```
   It is recommended to set up a central virtualenv or conda environment for cookiecutter and any other "system-wide" Python packages you may need.
3. Run the cookiecutter Docker data science template to scaffold your new project:
   ```bash
   $ cookiecutter https://github.com/manifoldai/docker-cookiecutter-data-science.git
   ```
4. Answer all of the cookiecutter prompts for project name, description, license, etc.
5. Run the start script from the root of your new project directory:
   ```bash
   $ ./start.sh
   ```
6. After the project image builds, check which host port is being forwarded to the Jupyter notebook server inside the running container:
   ```bash
   $ docker ps
   ```
7. Using any browser, access your notebook at localhost:{port}
8. Start working!
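
Steps 6 and 7 above involve reading the forwarded host port out of the `PORTS` column that `docker ps` prints. As a rough sketch, the lookup can be automated in Python by parsing an entry such as `0.0.0.0:32768->8888/tcp`. The function name `host_port_for`, the sample strings, and the assumption that Jupyter listens on container port 8888 (Jupyter's usual default) are illustrative and not taken from the template.

```python
import re
from typing import Optional

# Matches one published TCP port entry, e.g. "0.0.0.0:32768->8888/tcp".
# Group 1 is the host port, group 2 is the container port.
PORT_MAPPING = re.compile(r":(\d+)->(\d+)/tcp")


def host_port_for(ports_column: str, container_port: int = 8888) -> Optional[int]:
    """Return the host port forwarded to `container_port`, or None.

    `ports_column` is the text of the PORTS column from `docker ps`.
    The default of 8888 is an assumption about the Jupyter server port.
    """
    for match in PORT_MAPPING.finditer(ports_column):
        host_port, inner_port = int(match.group(1)), int(match.group(2))
        if inner_port == container_port:
            return host_port
    return None


if __name__ == "__main__":
    sample = "0.0.0.0:32768->8888/tcp"  # hypothetical `docker ps` output
    print(f"http://localhost:{host_port_for(sample)}")
```

With the hypothetical sample above, the script prints `http://localhost:32768`. Entries with no host mapping, such as `5432/tcp`, yield `None`.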

For more details on which packages come pre-installed in the base image, see the manifoldai/docker-ml-dev repository page on [Docker Hub](https://hub.docker.com/r/manifoldai/docker-ml-dev/).

### Project Structure
The directory structure of your new project looks like this:

```
├── LICENSE
├── Dockerfile          <- New project Dockerfile that sources from base ML dev image
├── docker-compose.yml  <- Docker Compose configuration file
├── docker_clean_all.sh <- Helper script to remove all containers and images from your system
├── start.sh            <- Script to run docker compose and any other project specific initialization steps
├── Makefile            <- Makefile with commands like `make data` or `make train`
├── README.md           <- The top-level README for developers using this project.
├── data
│   ├── external        <- Data from third party sources.
│   ├── interim         <- Intermediate data that has been transformed.
│   ├── processed       <- The final, canonical data sets for modeling.
│   └── raw             <- The original, immutable data dump.
│
├── docs                <- A default Sphinx project; see sphinx-doc.org for details
│
├── models              <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks           <- Jupyter notebooks. Naming convention is a number (for ordering),
│                          the creator's initials, and a short `-` delimited description, e.g.
│                          `1.0-jqp-initial-data-exploration`.
│
├── references          <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports             <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures         <- Generated graphics and figures to be used in reporting
│
├── requirements.txt    <- The requirements file for reproducing the analysis environment, e.g.
│                          generated with `pip freeze > requirements.txt`
│
├── src                 <- Source code for use in this project.
│   ├── __init__.py     <- Makes src a Python module
│   │
│   ├── data            <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features        <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models          <- Scripts to train models and then use trained models to make
│   │   │                  predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization   <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini             <- tox file with settings for running tox; see tox.testrun.org
```
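
The notebook naming convention described in the project structure (a number for ordering, the creator's initials, and a short `-` delimited description) can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the helper name `is_valid_notebook_name` is hypothetical, and the choices to require lowercase letters and two to four initials are this sketch's own, not rules set by the template.

```python
import re

# <number>-<initials>-<hyphen-delimited description>,
# e.g. "1.0-jqp-initial-data-exploration".
# Assumptions of this sketch: lowercase only, 2 to 4 letters of initials.
NOTEBOOK_NAME = re.compile(
    r"\d+(?:\.\d+)*"            # ordering number such as 1 or 1.0
    r"-[a-z]{2,4}"              # creator's initials
    r"-[a-z0-9]+(?:-[a-z0-9]+)*"  # short description, words joined by "-"
)


def is_valid_notebook_name(stem: str) -> bool:
    """Check a notebook file name (without the .ipynb suffix)."""
    return NOTEBOOK_NAME.fullmatch(stem) is not None


if __name__ == "__main__":
    for name in ("1.0-jqp-initial-data-exploration", "exploration"):
        print(name, is_valid_notebook_name(name))
```

Run as a script, this reports `True` for the example name from the project structure and `False` for the bare name `exploration`. A check like this could be run over the `notebooks` directory before committing, if the team wants the convention enforced.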

## Video Demo
[![Torus Demo Youtube](http://img.youtube.com/vi/RgRmT4W8nTY/0.jpg)](http://www.youtube.com/watch?v=RgRmT4W8nTY)

## Helpful Resources
- Docker command cheatsheet: https://www.docker.com/sites/default/files/Docker_CheatSheet_08.09.2016_0.pdf
- Dockerfile reference: https://docs.docker.com/engine/reference/builder/
- Docker Compose reference: https://docs.docker.com/compose/compose-file/
- Kitematic (a GUI for working with Docker, highly recommended if you are new to Docker): https://kitematic.com/

## Contributing
PRs and feature requests are very welcome!
