{"id":24911568,"url":"https://github.com/hackoregon/2019hackordatasciencetemplate","last_synced_at":"2025-03-28T03:15:03.808Z","repository":{"id":37593887,"uuid":"181349652","full_name":"hackoregon/2019HackORDataScienceTemplate","owner":"hackoregon","description":"Template to get the 2019 data science parts of a Hack Oregon project started :) ","archived":false,"fork":false,"pushed_at":"2022-12-08T04:59:06.000Z","size":69,"stargazers_count":2,"open_issues_count":33,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-02-02T04:23:38.768Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hackoregon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-14T18:11:25.000Z","updated_at":"2019-06-28T01:40:59.000Z","dependencies_parsed_at":"2023-01-24T11:15:30.170Z","dependency_job_id":null,"html_url":"https://github.com/hackoregon/2019HackORDataScienceTemplate","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackoregon%2F2019HackORDataScienceTemplate","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackoregon%2F2019HackORDataScienceTemplate/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackoregon%2F2019HackORDataScienceTemplate/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hackoregon%2F2019HackORDataScienceTemplate/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/owners/hackoregon","download_url":"https://codeload.github.com/hackoregon/2019HackORDataScienceTemplate/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245960813,"owners_count":20700781,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-02-02T04:21:02.170Z","updated_at":"2025-03-28T03:15:03.778Z","avatar_url":"https://github.com/hackoregon.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Purpose\nThis is meant for use when you are:\n1. setting up a GitHub data science project structure locally\n2. extracting and reproducing the software setup from `Google Colaboratory` notebook instances or from Amazon `SageMaker`\n\n# Naming convention for Hack Oregon data science GitHub projects \n* `2019-{project-name}-{data-science}`\n\n\n# Different versions of Data Science Docker templates\nThis repository contains Dockerfile templates in different flavors for getting started\non the data science parts of a `HackOregon` project. \n\n1) `master` branch contains basic Python-based dependencies \n2) `R` branch contains R-based dependencies \n3) `MLflow-py` branch contains an experimental Python workflow that uses `MLflow`\n4) others coming soon \n\n\n# What the template does:\n1. set up a recommended folder structure with `cookiecutter`\n2. 
set up library dependencies for extracting documentation as a website\n\t* `Python`: help set up `Sphinx` for extracting docstring documentation about the APIs \n\t* `R`: help set up `KnitR` and `ROxygen2` for extracting the comments from\n\t\t\tdifferent parts of the R code\n3. set up testing infrastructure for validating the correctness of the code\n\t* `Python`: we recommend using either the `pytest` or `unittest` framework \n4. reproduce the library setup from cloud-based notebook instances, e.g.\n* `AWS SageMaker`\n\t\t* [R usage example with KnitR reports](https://rstudio-pubs-static.s3.amazonaws.com/456313_9f8f6ba90b7a4a70a5f8cef7753d2d19.html)\n* `Google Cloud Colaboratory`\n\n# Recommended folder structure \n```\n    ├── LICENSE\n    ├── build\t\t      \u003c- all the files needed to build the code dependencies\n    │   ├── Makefile \t      \u003c- Makefile with commands like `make data` or `make train`\n    │   ├── requirements.txt  \u003c- The requirements file for reproducing the analysis \n    │   │         \t\t environment, generated with `pip freeze \u003e requirements.txt`\n    │   ├── docker-compose.yml\u003c- The docker-compose file starting resources \n    │   └── Dockerfile \t      \u003c- The dockerfile that uses requirements.txt file.\n    │\n    ├── README.md             \u003c- The top-level README for developers using this project.\n    │\n    ├── data\t\t      \u003c- You are encouraged to include links to metadata\n    │   ├── 1_raw             \u003c-  Original raw data dump.\n    │   ├── 2_interim         \u003c- Intermediate data that has been transformed, \n    │   │         \t\t recommended format for relational data is parquet.\n    │   └── 3_processed       \u003c- The final, canonical data sets for modeling.\n    │\n    ├── docs                  \u003c- A default Sphinx project; see sphinx-doc.org for details\n    │\n    ├── models                \u003c- Trained and serialized models, model predictions, or model summaries\n   
 │\n    ├── notebooks             \u003c- Jupyter notebooks. Naming convention is a number (for ordering),\n    │                            the creator's initials, and a short `-` delimited description, e.g.\n    │                            `1.0-jqp-initial-data-exploration`.\n    │\n    ├── references            \u003c- Manuals, and all other explanatory materials.\n    │\n    ├── reports               \u003c- Generated analysis as HTML, PDF, LaTeX, etc.\n    │   └── figures           \u003c- Generated graphics and figures to be used in reporting\n    │\n    │\n    ├── setup.py              \u003c- makes project pip installable (pip install -e .) so src can be imported\n    ├── src                   \u003c- Source code for use in this project.\n    │   ├── __init__.py       \u003c- Makes src a Python module\n    │   │\n    │   ├── data              \u003c- Scripts to download or generate data\n    │   │   └── make_dataset.py\n    │   │\n    │   ├── features       \u003c- Scripts to turn raw data into features for modeling\n    │   │   └── build_features.py\n    │   │\n    │   ├── models         \u003c- Scripts to train models and then use trained models to make\n    │   │   │                 predictions\n    │   │   ├── predict_model.py\n    │   │   └── train_model.py\n    │   │\n    │   └── visualization  \u003c- Scripts to create exploratory and results oriented visualizations\n    │       └── visualize.py\n    │\n    └── tox.ini            \u003c- tox file with settings for running tox; see tox.testrun.org\n```\n\n--------\n\n\u003cp\u003e\u003csmall\u003eProject based on the \u003ca target=\"_blank\" href=\"https://drivendata.github.io/cookiecutter-data-science/\"\u003ecookiecutter data science project template\u003c/a\u003e. #cookiecutterdatascience\u003c/small\u003e\u003c/p\u003e\n\n# Data storage in our public S3 bucket\nraw-data = `hacko-data-archive`\nclean-data = ?  
# in the future\n\n## Storing non-sensitive data in S3 data buckets \n* Have a data science manager (or data scientist) on your project contact Michael to get an AWS account \n\n## Getting non-sensitive data from S3 data buckets \n```\nfrom sagemaker import get_execution_role\n\n# IAM role attached to the SageMaker notebook instance\nrole = get_execution_role()\nbucket = 'hacko-data-archive'\n# example data key, change this\ndata_key = '2018-neighborhood-development/JSON/pdx_bicycle/pdx_bike_counts.csv'\n\ndata_location = 's3://{}/{}'.format(bucket, data_key)\noutput_location = 's3://{}/{}'.format(bucket, data_key)\n```\n\n## SageMaker \nWe may spin up `sagemaker` instances for projects with big compute and / or data needs.\n* Naming convention for notebook instances: \n\t* `PROJECTNAME_AUTHOR_NAME`\n\n# Past version of the Docker container template \nhttps://github.com/hackoregon/data-science-pet-containers\n\n# Using AWS with the CLI \nPut your credentials in \n```\n~/.aws/credentials\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackoregon%2F2019hackordatasciencetemplate","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhackoregon%2F2019hackordatasciencetemplate","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhackoregon%2F2019hackordatasciencetemplate/lists"}