https://github.com/hackoregon/2019hackordatasciencetemplate

Template to get the 2019 data science parts of a Hack Oregon project started :)

Host: GitHub
URL: https://github.com/hackoregon/2019hackordatasciencetemplate
Owner: hackoregon
License: mit
Created: 2019-04-14T18:11:25.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2022-12-08T04:59:06.000Z (over 3 years ago)
Last Synced: 2025-02-02T04:23:38.768Z (over 1 year ago)
Language: Jupyter Notebook
Size: 67.4 KB
Stars: 2
Watchers: 4
Forks: 0
Open Issues: 33
Metadata Files:
- Readme: README.md
- License: LICENSE


# Purpose
This is meant for use when you are:
1. setting up a GitHub data science project structure locally
2. extracting and reproducing the software setup from `Google Colaboratory` notebook instances or from Amazon `SageMaker`

# Naming convention for Hack Oregon data science GitHub projects
* `2019-{project-name}-{data-science}`

# Different versions of Data Science Docker templates
This repository contains Dockerfile templates in different flavors for getting started
on the data science parts of a `HackOregon` project.

1) `master` branch contains basic Python-based dependencies
2) `R` branch contains R-based dependencies
3) `MLflow-py` branch contains an experimental Python workflow that uses `MLflow`
4) others coming soon

# What the template does:
1. Sets up a recommended folder structure with `cookiecutter`
2. Sets up library dependencies for extracting documentation as a website
    * `Python`: helps set up `Sphinx` for extracting docstring documentation about the APIs
    * `R`: helps set up `knitr` and `roxygen2` for extracting the comments from
      different parts of the R code
3. Sets up testing infrastructure for validating the correctness of the code
    * `Python`: we recommend using either the `pytest` or the `unittest` framework
4. Reproduces the library setup from cloud-based notebook instances, e.g.
    * `AWS SageMaker`
    * [R usage example with KnitR reports](https://rstudio-pubs-static.s3.amazonaws.com/456313_9f8f6ba90b7a4a70a5f8cef7753d2d19.html)
    * `Google Colaboratory`
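
Items 2 and 3 can be illustrated together. The following is a minimal sketch, assuming a hypothetical helper `normalize_column_names` that might live in `src/features/build_features.py`; the function name and its behavior are illustrative and not part of the template. The docstring uses the reStructuredText field style that Sphinx can render, and the test function uses the `test_` prefix that `pytest` discovers.

```python
def normalize_column_names(columns):
    """Return lower-case, underscore-separated column names.

    :param columns: iterable of raw column names
    :returns: list of cleaned names, in the original order
    """
    # Hypothetical cleaning rule: trim, lower-case, replace spaces with underscores
    return [str(name).strip().lower().replace(" ", "_") for name in columns]


def test_normalize_column_names():
    # pytest collects functions whose names start with ``test_``
    raw = [" Bike Count", "Station ID "]
    assert normalize_column_names(raw) == ["bike_count", "station_id"]
```

Running `pytest` from the project root would pick up the test if it is saved in a file named `test_*.py`.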

# Recommended folder structure
```
├── LICENSE
├── build                   <- All the files needed to build the code dependencies
│   ├── Makefile            <- Makefile with commands like `make data` or `make train`
│   ├── requirements.txt    <- The requirements file for reproducing the analysis environment,
│   │                          generated with `pip freeze > requirements.txt`
│   ├── docker-compose.yml  <- The docker-compose file starting resources
│   └── Dockerfile          <- The Dockerfile that uses the requirements.txt file
│
├── README.md               <- The top-level README for developers using this project
│
├── data                    <- You are encouraged to include links to metadata
│   ├── 1_raw               <- Original raw data dump
│   ├── 2_interim           <- Intermediate data that has been transformed; recommended
│   │                          format for relational data is Parquet
│   └── 3_processed         <- The final, canonical data sets for modeling
│
├── docs                    <- A default Sphinx project; see sphinx-doc.org for details
│
├── models                  <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks               <- Jupyter notebooks. Naming convention is a number (for ordering),
│                              the creator's initials, and a short `-` delimited description,
│                              e.g. `1.0-jqp-initial-data-exploration`
│
├── references              <- Manuals and all other explanatory materials
│
├── reports                 <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures             <- Generated graphics and figures to be used in reporting
│
├── setup.py                <- Makes project pip installable (pip install -e .) so src can be imported
├── src                     <- Source code for use in this project
│   ├── __init__.py         <- Makes src a Python module
│   │
│   ├── data                <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features            <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models              <- Scripts to train models and then use trained models to make
│   │   │                      predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization       <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini                 <- tox file with settings for running tox; see tox.testrun.org
```
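
The notebook naming convention (a number for ordering, the creator's initials, and a short `-` delimited description) can be checked mechanically. The following is a minimal sketch under one assumed reading of that convention; the pattern, the allowed length of the initials, and the helper name `parse_notebook_name` are illustrative assumptions, not part of the template.

```python
import re

# Assumed reading of the convention: <number>-<initials>-<description>,
# e.g. 1.0-jqp-initial-data-exploration
NOTEBOOK_NAME = re.compile(
    r"^(?P<order>\d+(?:\.\d+)?)"                    # ordering number, e.g. 1.0
    r"-(?P<initials>[a-z]{2,4})"                    # initials (length range is an assumption)
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)$"  # `-` delimited description
)


def parse_notebook_name(stem):
    """Split a notebook file stem into its parts, or return None if it does not match."""
    match = NOTEBOOK_NAME.match(stem)
    return match.groupdict() if match else None


parts = parse_notebook_name("1.0-jqp-initial-data-exploration")
```

A check like this could run in a test or a pre-commit hook over the file stems in `notebooks`, flagging names such as `Untitled` that return `None`.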

--------

Project based on the cookiecutter data science project template. #cookiecutterdatascience

# Data storage in our public S3 bucket
* raw-data = `hacko-data-archive`
* clean-data = ? (to be decided in the future)

## Storing non-sensitive data to S3 data buckets
* Have a data science manager (or data scientist) on your project contact Michael to get an AWS account.

## Getting non-sensitive data from S3 data buckets
```
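from sagemaker import get_execution_role

# IAM role of the current SageMaker notebook instance
role = get_execution_role()
bucket = 'hacko-data-archive'
# example data key, change this
data_key = '2018-neighborhood-development/JSON/pdx_bicycle/pdx_bike_counts.csv'

# S3 URI of the input object
data_location = 's3://{}/{}'.format(bucket, data_key)
# NOTE: identical to data_location as written; use a different key before writing output
output_location = 's3://{}/{}'.format(bucket, data_key)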
```
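
Once the S3 location is built, the object can be read into a DataFrame. The following is a minimal sketch, assuming `pandas` and `s3fs` are installed and that the current credentials have read access to the bucket; the helper names `build_s3_uri` and `load_csv_from_s3` are hypothetical.

```python
def build_s3_uri(bucket, key):
    """Join a bucket name and an object key into an s3:// URI."""
    return "s3://{}/{}".format(bucket, key.lstrip("/"))


def load_csv_from_s3(bucket, key):
    """Read a CSV object into a DataFrame (needs network access and credentials)."""
    import pandas as pd  # imported lazily; pandas relies on s3fs for s3:// paths

    return pd.read_csv(build_s3_uri(bucket, key))


# Example key from this README; change it to the object you need
uri = build_s3_uri(
    "hacko-data-archive",
    "2018-neighborhood-development/JSON/pdx_bicycle/pdx_bike_counts.csv",
)
print(uri)
```

Calling `load_csv_from_s3` with the same bucket and key would return the data as a DataFrame, provided the object exists and is readable.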

## SageMaker
We may spin up `SageMaker` instances for projects with large compute and/or data needs.
* Naming convention for notebook instances:
    * `PROJECTNAME_AUTHOR_NAME`

# Past version of the Docker container template
https://github.com/hackoregon/data-science-pet-containers

# Using AWS from the CLI
Put your credentials in
```
~/.aws/credentials
```
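
The AWS CLI credentials file uses an INI layout with one section per profile. The following is a minimal sketch that parses a sample with Python's standard `configparser` to show the expected keys; the values are placeholders, not real credentials.

```python
import configparser

# Placeholder contents in the layout the AWS CLI expects; never commit real keys.
SAMPLE_CREDENTIALS = """
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CREDENTIALS)

# Each section name is a profile; "default" is used when no profile is specified.
profile = parser["default"]
print(sorted(profile.keys()))
```

Running `aws configure` writes a file in this layout interactively, which avoids editing it by hand.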
