{"id":13936609,"url":"https://github.com/mbernico/snape","last_synced_at":"2026-01-18T15:33:17.778Z","repository":{"id":83088113,"uuid":"75018425","full_name":"mbernico/snape","owner":"mbernico","description":"Snape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression and then adds in 'realism' features such as complex formating, varying scales, categorical variables, and missing values.","archived":false,"fork":false,"pushed_at":"2020-05-20T20:27:02.000Z","size":191,"stargazers_count":165,"open_issues_count":3,"forks_count":21,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-08-08T23:24:04.182Z","etag":null,"topics":["classification","dataset","python","regression","snape","students"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbernico.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2016-11-28T22:05:27.000Z","updated_at":"2024-08-03T06:56:47.000Z","dependencies_parsed_at":"2023-03-12T17:27:45.703Z","dependency_job_id":null,"html_url":"https://github.com/mbernico/snape","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbernico%2Fsnape","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbernico%2Fsnape/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbernico%2Fsnape/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbernico%2Fsnape/manifests","owner_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mbernico","download_url":"https://codeload.github.com/mbernico/snape/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":226686729,"owners_count":17666928,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","dataset","python","regression","snape","students"],"created_at":"2024-08-07T23:02:50.579Z","updated_at":"2026-01-18T15:33:17.752Z","avatar_url":"https://github.com/mbernico.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![Build status](https://travis-ci.org/mbernico/snape.svg?branch=master)](https://travis-ci.org/mbernico/snape)\n[![Coverage Status](https://coveralls.io/repos/github/mbernico/snape/badge.svg?branch=master)](https://coveralls.io/github/mbernico/snape?branch=master)\n\n# Snape\n\nSnape is a convenient artificial dataset generator that wraps sklearn's make_classification and make_regression\nand then adds in 'realism' features such as complex formatting, varying scales, categorical variables,\nand missing values.\n\n## Motivation\n\nSnape was primarily created for academic and educational settings.  It has been used to create datasets that are unique per\nstudent, per assignment for various homework assignments.  
It has also been used to create class-wide assessments in\nconjunction with 'Kaggle In the Classroom.'\n\nOther users have suggested non-academic use cases as well, including 'interview screening problems,' model comparison,\netc.\n\n## Installation\n\n\n### Via GitHub\n```bash\ngit clone https://github.com/mbernico/snape.git\ncd snape\npython setup.py install\n```\n### Via pip\n*Coming Soon...*\n\n## Quick Start\n\nSnape can run either as a Python module or as a command-line application.\n\n### Command Line Usage\n\n#### Creating a Dataset\n\nFrom the main directory in the git repo:\n```bash\npython snape/make_dataset.py -c example/config_classification.json\n```\nThis uses the configuration file example/config_classification.json to create an artificial dataset called 'my_dataset'\n(the name is specified in the JSON config; more on this later).\n\nThe dataset will consist of three files:\n*  my_dataset_train.csv   (80% of the artificial dataset with all dependent and independent variables)\n*  my_dataset_test.csv    (20% of the artificial dataset with only the independent variables present)\n*  my_dataset_testkey.csv (the same 20% as _test, including the dependent variables)\n\nNote that if a star schema is generated, additional CSV files will be produced: one extra CSV file per dimension. Only the main 'fact table' dataset is split into train and test files.\n\nThe train and test files can be given to a student.  
The student can respond with a file of predictions, which can be\nscored against the testkey as follows:\n\n#### Scoring a Dataset\n\n```bash\npython snape/score_dataset.py -p example/student_predictions.csv -k example/student_testkey.csv\n```\nSnape's score_dataset.py will attempt to detect the problem type and then score it, printing some metrics:\n\n\n```\nProblem Type Detection: binary\n---Binary Classification Score---\n             precision    recall  f1-score   support\n\n          0       0.81      0.99      0.89      1601\n          1       0.50      0.06      0.11       399\n\navg / total       0.75      0.80      0.73      2000\n```\n\n\n### Python Module Usage\n\n\n#### Creating a Dataset\n```python\nfrom snape.make_dataset import make_dataset\n\n# configuration JSON examples can be found in doc\nconf = {\n    \"type\": \"classification\",\n    \"n_classes\": 2,\n    \"n_samples\": 1000,\n    \"n_features\": 10,\n    \"out_path\": \"./\",\n    \"output\": \"my_dataset\",\n    \"n_informative\": 3,\n    \"n_duplicate\": 0,\n    \"n_redundant\": 0,\n    \"n_clusters\": 2,\n    \"weights\": [0.8, 0.2],\n    \"pct_missing\": 0.00,\n    \"insert_dollar\": \"Yes\",\n    \"insert_percent\": \"Yes\",\n    \"n_categorical\": 0,\n    \"star_schema\": \"No\",\n    \"label_list\": []\n}\n\nmake_dataset(config=conf)\n```\n\n\n#### Scoring a Dataset\n\n```python\nfrom snape.score_dataset import score_dataset\n\n# a dataset's testkey can be compared to a prediction file using score_dataset()\nresults = score_dataset(y_file=\"student_testkey.csv\", y_hat_file=\"student_predictions.csv\")\n# results is a tuple of (a_primary_metric, classification_report)\nprint(\"AUC = \" + str(results[0]))\nprint(results[1])\n```\n\n\n## Dataset Generation Config\n\n1.  [Classification JSON](doc/config_classification.json.md)\n2.  
[Regression JSON](doc/config_regression.json.md)\n\n\n## Why Snape?\nSnape is primarily used for creating complex datasets that *challenge* students and teach defense against the dark\narts of machine learning.  :)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbernico%2Fsnape","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbernico%2Fsnape","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbernico%2Fsnape/lists"}