{"id":20644682,"url":"https://github.com/epistasislab/digen","last_synced_at":"2025-04-16T02:09:26.328Z","repository":{"id":44719639,"uuid":"322763051","full_name":"EpistasisLab/digen","owner":"EpistasisLab","description":"Diverse and generative ML benchmarks","archived":false,"fork":false,"pushed_at":"2022-08-09T20:17:50.000Z","size":112063,"stargazers_count":15,"open_issues_count":1,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-16T02:09:08.239Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://epistasislab.github.io/digen/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EpistasisLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-12-19T04:05:59.000Z","updated_at":"2025-03-24T08:35:18.000Z","dependencies_parsed_at":"2022-07-20T14:48:04.671Z","dependency_job_id":null,"html_url":"https://github.com/EpistasisLab/digen","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fdigen","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fdigen/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fdigen/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EpistasisLab%2Fdigen/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EpistasisLab","download_url":"https://codeload.github.com/EpistasisLab/digen/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.c
om","kind":"github","repositories_count":249183105,"owners_count":21226142,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-16T16:17:14.400Z","updated_at":"2025-04-16T02:09:26.307Z","avatar_url":"https://github.com/EpistasisLab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What is DIGEN?\n\nDiverse and Generative ML benchmark (DIGEN) is a modern machine learning benchmark that includes:\n- 40 datasets in tabular numeric format, specially designed to differentiate the performance of some of the leading Machine Learning (ML) methods, and\n- a package for reproducible benchmarking that simplifies comparing the performance of the methods.\n\nDIGEN provides comprehensive information on the datasets, including:\n- ground truth: a mathematical formula showing how the target was generated for each dataset,\n- the results of exploratory analysis, including feature correlations and a histogram showing how the binary endpoint was calculated,\n- multiple statistics on the datasets, including AUROC, AUPRC and F1 scores,\n- Receiver Operating Characteristic (ROC) and Precision-Recall (PRC) charts for tuned ML methods for each dataset, and\n- a boxplot with the projected performance of the leading methods after hyper-parameter tuning (100 runs of each method, each started with a different random seed).\n\nApart from providing a collection of datasets and tuned ML methods, DIGEN provides tools to easily tune and optimize the parameters of any novel ML method, as well as visualize its performance in 
comparison with the leading ones.\nDIGEN also offers tools for reproducibility.\n\n\n# Dependencies\n\nThe following packages are required to use DIGEN:\n\n    pandas\u003e=1.05\n    numpy\u003e=1.19.5\n    optuna\u003e=2.4.0\n    scikit-learn\u003e=0.22.2\n    importlib_resources\n\n\n# Installing DIGEN\n\nThe recommended way to install DIGEN is with pip, e.g. as a user:\n\n    pip install -U digen\n\n\n# Using DIGEN\n\nA non-peer-reviewed preprint is available at https://arxiv.org/pdf/2107.06475.pdf\n\nApart from the datasets, DIGEN provides a comprehensive toolbox for analyzing the performance of a chosen ML method.\nDIGEN uses [Optuna](https://github.com/optuna/optuna), a state-of-the-art framework for optimizing hyper-parameters.\n\nPlease refer to our online documentation at [https://epistasislab.github.io/digen](https://epistasislab.github.io/digen).\n\n\n# Citing DIGEN\n\n\nIf you find this resource helpful, please cite it as follows:\n\n```\n@article{orzechowski2021generative,\n  title={Generative and reproducible benchmarks for comprehensive evaluation of machine learning classifiers},\n  author={Orzechowski, Patryk and Moore, Jason H},\n  journal={arXiv preprint arXiv:2107.06475},\n  year={2021}\n}\n```\n\n# Tutorials\n\nThe [DIGEN Tutorial](https://github.com/EpistasisLab/digen/blob/main/DIGEN%20Tutorial.ipynb) is a great place to start exploring our package.\nFor advanced use, e.g. 
customization, chart manipulation, or additional statistics on the collection, please see our [Advanced Tutorial](https://github.com/EpistasisLab/digen/blob/main/DIGEN%20Advanced.ipynb).\n\n\n# Included ML classifiers\n\nThe following methods are included in our benchmark:\n- Decision Tree\n- Gradient Boosting\n- K-Nearest Neighbors\n- LightGBM\n- Logistic Regression\n- Random Forest\n- SVC\n- XGBoost\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepistasislab%2Fdigen","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fepistasislab%2Fdigen","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fepistasislab%2Fdigen/lists"}