{"id":43928354,"url":"https://github.com/insitro/kindel","last_synced_at":"2026-02-06T23:35:37.036Z","repository":{"id":312891645,"uuid":"867788126","full_name":"insitro/kindel","owner":"insitro","description":"KinDEL is a large DNA-encoded library dataset containing two kinase targets (DDR1 and MAPK14) for benchmarking machine learning models.","archived":false,"fork":false,"pushed_at":"2025-09-02T15:52:15.000Z","size":86,"stargazers_count":22,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-09-02T17:39:32.206Z","etag":null,"topics":["dataset","dna-encoded-library","kinase","machine-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2410.08938","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/insitro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-10-04T18:09:52.000Z","updated_at":"2025-09-02T15:52:18.000Z","dependencies_parsed_at":"2025-09-02T17:39:33.632Z","dependency_job_id":"2ca5ffcc-b008-4c44-acf3-c62d75dfbdfb","html_url":"https://github.com/insitro/kindel","commit_stats":null,"previous_names":["insitro/kindel"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/insitro/kindel","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/insitro%2Fkindel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/insitro%2Fkindel/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/insitro%2Fkindel/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/insitro%2Fkindel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/insitro","download_url":"https://codeload.github.com/insitro/kindel/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/insitro%2Fkindel/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29180533,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T23:15:33.022Z","status":"ssl_error","status_checked_at":"2026-02-06T23:15:09.128Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset","dna-encoded-library","kinase","machine-learning"],"created_at":"2026-02-06T23:35:34.296Z","updated_at":"2026-02-06T23:35:37.031Z","avatar_url":"https://github.com/insitro.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv style=\"text-align: center\"\u003e\n\u003ch1\u003eKinDEL: DNA-Encoded Library Dataset For Kinase Inhibitors\u003c/h1\u003e\n\u003c/div\u003e\n\nKinDEL is a large DNA-encoded library dataset containing two kinase\ntargets (DDR1 and MAPK14) for benchmarking machine learning models.\n\nICML poster/paper: https://icml.cc/virtual/2025/poster/45034\n\nPreprint: https://arxiv.org/abs/2410.08938\n\n## 
Usage\n\n### Installation\n\nFirst, create an environment with the dependencies:\n\n```bash\ncurl -fsSL https://pixi.sh/install.sh | bash\npixi install\n```\n\n### Benchmarking\n\nRun the following command to train a model:\n\n```bash\npixi shell  # activate the environment\nPYTHONPATH=. redun -c kindel/.redun run kindel/run.py train \\\n    --model \u003cmodel_name, e.g. xgboost_local\u003e \\\n    --output-dir out \\\n    --targets ddr1 mapk14 \\\n    --splits random disynthon \\\n    --split-indexes 1 2 3 4 5\n```\n\nwhere `\u003cmodel_name\u003e` has to start with one of the following prefixes:\n* xgboost\n* rf\n* knn\n* dnn\n* gin\n* compose\n\n### Collecting results\n\nTo collect the model performance results after training, you\ncan use the `results.py` script, providing the path to the\nmodel output files:\n\n```bash\npython results.py --model-path [path]\n```\n\n### Datasets\n\nAll datasets are located in AWS S3 at the URL: `s3://kin-del-2024/data`.\nYou can preview the data using [42basepairs](https://42basepairs.com/) here: https://42basepairs.com/browse/s3/kin-del-2024\n\nThe recommended **training dataset** is stored in the `{target}_1M.parquet`\nfiles, which contain the top 1M molecules from the DEL screen that were used to\ntrain the ML models in our benchmark.\n\nData splits are stored in the `splits/{target}_{random/disynthon}.parquet` files,\nand the training/validation/testing datasets can be loaded using the\nfollowing code:\n\n```python\nfrom kindel.utils.data import get_training_data\n\ndf_train, df_valid, df_test = get_training_data(target, split_index=split_index)\n```\n\nThe results in the benchmark are calculated for the **held-out testing sets** stored\nin the `heldout/{target}_{on/off}dna.csv` files, which contain Kd measurements\nfor the on- and off-DNA compounds. Using the `in_library` argument, you can specify\nwhether only the in-library set or the extended heldout set is returned. 
This data can be\nloaded using the following code:\n\n```python\nfrom kindel.utils.data import get_testing_data\n\ndata = get_testing_data(target, in_library=True)\nprint(data['on'])\nprint(data['off'])\n```\n\nThe full dataset can be downloaded using the following code:\n\n```python\nfrom kindel.utils.data import download_kindel\n\ndf = download_kindel(target)\n```\n\n### Data structure\n\nAll dataset files contain the following columns:\n- `smiles` - the SMILES representation of the molecule\n- `molecule_hash` - a molecular hash constructed from the synthons that uniquely identifies the molecule\n- `smiles_a` - the SMILES of synthon A\n- `smiles_b` - the SMILES of synthon B\n- `smiles_c` - the SMILES of synthon C\n\nSome compounds in the heldout set do not have synthon SMILES\nstrings or a molecule hash. This means that these compounds\nwere picked from outside the DEL (external compounds in the\nextended set).\n\nBesides the molecular structure information, the heldout datasets\ncontain the `kd` column with the experimental Kd measurements.\nThe DEL compounds in the training dataset files additionally\ncontain the following columns:\n- `seq_target_1`, `seq_target_2`, `seq_target_3` - sequence counts of the molecules bound to the target in triplicate\n- `seq_matrix_1`, `seq_matrix_2`, `seq_matrix_3` - sequence counts of the molecules bound to the control in triplicate\n- `seq_load` - the pre-selection population of the molecule\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finsitro%2Fkindel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Finsitro%2Fkindel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Finsitro%2Fkindel/lists"}