{"id":34036460,"url":"https://github.com/kalininalab/datasail","last_synced_at":"2026-04-08T12:02:36.424Z","repository":{"id":110032620,"uuid":"598109632","full_name":"kalininalab/DataSAIL","owner":"kalininalab","description":"DataSAIL is a tool to split datasets while reducing information leakage.","archived":false,"fork":false,"pushed_at":"2026-04-03T13:22:56.000Z","size":46242,"stargazers_count":49,"open_issues_count":6,"forks_count":4,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-03T17:51:24.864Z","etag":null,"topics":["dataset-split","ilp","ilp-problem","machine-learning","optimization","scip"],"latest_commit_sha":null,"homepage":"https://datasail.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kalininalab.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-02-06T12:24:01.000Z","updated_at":"2026-04-03T13:23:02.000Z","dependencies_parsed_at":"2023-10-24T14:27:03.353Z","dependency_job_id":"d92cf074-76a9-49a9-951a-634eaecdc1f8","html_url":"https://github.com/kalininalab/DataSAIL","commit_stats":null,"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"purl":"pkg:github/kalininalab/DataSAIL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalininalab%2FDataSAIL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalininalab%2FDataSAIL/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/kalininalab%2FDataSAIL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalininalab%2FDataSAIL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kalininalab","download_url":"https://codeload.github.com/kalininalab/DataSAIL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kalininalab%2FDataSAIL/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31554110,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T10:21:54.569Z","status":"ssl_error","status_checked_at":"2026-04-08T10:21:38.171Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-split","ilp","ilp-problem","machine-learning","optimization","scip"],"created_at":"2025-12-13T20:24:47.962Z","updated_at":"2026-04-08T12:02:36.364Z","avatar_url":"https://github.com/kalininalab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataSAIL: Data Splitting Against Information Leaking 
\n\n![testing](https://github.com/kalininalab/datasail/actions/workflows/test.yaml/badge.svg)\n[![docs-image](https://readthedocs.org/projects/glyles/badge/?version=latest)](https://datasail.readthedocs.io/en/latest/index.html)\n[![codecov](https://codecov.io/gh/kalininalab/DataSAIL/branch/main/graph/badge.svg)](https://codecov.io/gh/kalininalab/DataSAIL)\n[![anaconda](https://anaconda.org/kalininalab/datasail/badges/version.svg)](https://anaconda.org/kalininalab/datasail)\n[![update](https://anaconda.org/kalininalab/datasail/badges/latest_release_date.svg)](https://anaconda.org/kalininalab/datasail)\n[![license](https://anaconda.org/kalininalab/datasail/badges/license.svg)](https://anaconda.org/kalininalab/datasail)\n[![downloads](https://anaconda.org/kalininalab/datasail/badges/downloads.svg)](https://anaconda.org/kalininalab/datasail)\n![Python 3](https://img.shields.io/badge/python-3-blue.svg)\n[![DOI](https://zenodo.org/badge/598109632.svg)](https://doi.org/10.5281/zenodo.13938602)\n\nDataSAIL, short for Data Splitting Against Information Leakage, is a versatile tool designed to partition data while \nminimizing similarities between the partitions. Inter-sample similarities can lead to information leakage, resulting \nin an overestimation of the model's performance in certain training regimes.\n\nDataSAIL was initially developed for machine learning workflows involving biological datasets, but its utility extends to\nany type of dataset. It can be used through a command line interface or integrated as a Python package, making it\naccessible and user-friendly. The tool is licensed under the MIT license, ensuring it remains open source and freely\navailable here on GitHub.\n\nDetailed documentation of the package, including explanations, examples, and much more, is available on DataSAIL's [ReadTheDocs page](https://datasail.readthedocs.io/en/latest/index.html). \n\n## Installation\n\nDataSAIL is available for all modern versions of Python (v3.9 or newer). 
We ship two versions of DataSAIL:\n- `DataSAIL`: The full version of DataSAIL, which includes all third-party clustering algorithms and is available on conda for Linux and OSX (called `datasail`).\n- `DataSAIL-lite`: A lightweight version of DataSAIL, which does not include any third-party clustering algorithms and is available on PyPI (called `datasail`) and conda (called `datasail-lite`).\n\n**_NOTE:_** There is a naming inconsistency between the conda and PyPI versions of DataSAIL. The lite version is called `datasail-lite` on conda, while it is called `datasail` on PyPI. This will be fixed in the future, but for now, please be aware of this inconsistency.\n\n## Usage\n\nDataSAIL is installed as a command-line tool. In the conda environment where DataSAIL is installed, you can run \n\n````shell\ndatasail --e-type P --e-data \u003cpath_to_fasta\u003e --e-sim mmseqs --output \u003cpath_to_output_path\u003e --technique C1e\n````\n\nto split a set of proteins that have been clustered using mmseqs. For a full list of arguments, run `datasail -h` and check out [ReadTheDocs](https://datasail.readthedocs.io/), which offers a more detailed explanation of the arguments as well as example notebooks. The runtime largely depends on the number and type of splits to be computed and the size of the dataset. For small datasets (fewer than 10k samples), DataSAIL finishes within minutes. On large datasets (more than 100k samples), it can take several hours to complete.\nRegardless of which installation command was used, DataSAIL can be executed by running\n\n````shell\ndatasail -h\n````\n\nin the command line to see the parameters DataSAIL takes. 
DataSAIL can also be imported directly as a regular package in your Python program using\n\n````python\nfrom datasail.sail import datasail\nsplits = datasail(...)\n````\n\nFor more information about the parameters, please read through the [documentation page](https://datasail.readthedocs.io/en/latest/interfaces/cli.html).\n\n## When to use DataSAIL and when not to use\n\n![splits](docs/imgs/phylOverview_splits.png)\nDataSAIL offers a variety of ways to split one-dimensional and multi-dimensional data. The figure shows examples for a generic protein property prediction task and a protein-ligand interaction prediction dataset.\n\nThe datasplit employed should always reflect the inference reality the model is facing. So, if the model is intended to perform well on unseen data, the validation and test sets should contain data that is new relative to the other splits.\n\nFor more information, please see our [guideline to selecting datasplits]() in the documentation.\n\n## Citation\n\nIf you used DataSAIL to split your data, please cite DataSAIL in your publication.\n````\n@article{joeres2025datasail,\n  title={Data splitting to avoid information leakage with DataSAIL},\n  author={Joeres, Roman and Blumenthal, David B. and Kalinina, Olga V.},\n  journal={Nature Communications},\n  volume={16},\n  pages={3337},\n  year={2025},\n  doi={10.1038/s41467-025-58606-8},\n}\n````\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkalininalab%2Fdatasail","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkalininalab%2Fdatasail","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkalininalab%2Fdatasail/lists"}