{"id":31805733,"url":"https://github.com/theislab/chemcpa","last_synced_at":"2025-10-11T02:57:50.749Z","repository":{"id":40684354,"uuid":"383423931","full_name":"theislab/chemCPA","owner":"theislab","description":"Code for \"Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution\", NeurIPS 2022.","archived":false,"fork":false,"pushed_at":"2025-02-06T15:11:37.000Z","size":245541,"stargazers_count":123,"open_issues_count":7,"forks_count":30,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-13T07:03:14.151Z","etag":null,"topics":["disentanglement","drug-discovery","genomics","perturbation","single-cell","transfer-learning"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2204.13545","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/theislab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2021-07-06T10:08:11.000Z","updated_at":"2025-08-20T06:16:33.000Z","dependencies_parsed_at":"2023-01-25T14:30:51.269Z","dependency_job_id":"0dadd0c0-63f9-4d65-a569-40c732737d74","html_url":"https://github.com/theislab/chemCPA","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/theislab/chemCPA","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theislab%2FchemCPA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theislab%2FchemCPA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theislab%2Fchem
CPA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theislab%2FchemCPA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/theislab","download_url":"https://codeload.github.com/theislab/chemCPA/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/theislab%2FchemCPA/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279005953,"owners_count":26084009,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-11T02:00:06.511Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["disentanglement","drug-discovery","genomics","perturbation","single-cell","transfer-learning"],"created_at":"2025-10-11T02:57:48.569Z","updated_at":"2025-10-11T02:57:50.740Z","avatar_url":"https://github.com/theislab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution\n\nCode accompanying the [NeurIPS 2022 paper](https://neurips.cc/virtual/2022/poster/53227) ([PDF](https://openreview.net/pdf?id=vRrFVHxFiXJ)).\n\n![architecture of CCPA](docs/chemCPA.png)\n\nOur talk on chemCPA at the M2D2 reading club is available [here](https://m2d2.io/talks/m2d2/predicting-single-cell-perturbation-responses-for-unseen-drugs/).\nA [previous 
version](https://arxiv.org/abs/2204.13545) of this work was a spotlight paper at ICLR MLDD 2022.\nCode for this previous version can be found under the `v1.0` git tag.\n\n## Codebase overview\n\n- `chemCPA/`: contains the code for the model, the data, and the training loop.\n- `embeddings`: There is one folder for each molecular embedding model we benchmarked. Each contains an `environment.yml` with dependencies. We generated the embeddings using the provided notebooks and saved them to disk so they can be loaded during the main training loop.\n- `experiments`: Each folder contains a `README.md` with the experiment description, a `.yaml` file with the seml configuration, and a notebook to analyze the results.\n- `notebooks`: Example analysis notebooks.\n- `preprocessing`: Notebooks for processing the data. For each dataset there is one notebook that loads the raw data.\n- `tests`: A few very basic tests.\n\nAll experiments were run through [seml](https://github.com/TUM-DAML/seml).\nThe entry function is `ExperimentWrapper.__init__` in `chemCPA/seml_sweep_icb.py`.\nFor convenience, we provide a script to run experiments manually for debugging purposes at `chemCPA/manual_seml_sweep.py`.\nThe script expects a `manual_run.yaml` file containing the experiment configuration.\n\nAll notebooks also exist as Python scripts (converted through [jupytext](https://github.com/mwouts/jupytext)) to make them easier to review.\n\n## Getting started\n\n#### Environment\nThe easiest way to get started is to use the Docker image we provide:\n```\ndocker run -it -p 8888:8888 --platform=linux/amd64 registry.hf.space/b1ro-chemcpa:latest\n```\nThis image contains the source code and all dependencies needed to run the experiments.\nBy default it runs a Jupyter server on port 8888.\n\nAlternatively, you may clone this repository and set up your own environment by running:\n\n```\nconda env create -f environment.yml\npip install -e .\n```\n\n#### Datasets\nThe datasets are not included in 
the docker image, but are downloaded automatically when you run the notebooks that require them. Alternatively, the datasets may be downloaded manually using the Python script `raw_data/dataset.py`. Usage:\n```\npython raw_data/dataset.py --list\npython raw_data/dataset.py --dataset \u003cdataset_name\u003e\n```\n\nYou may also use the following links:\n- [weight checkpoints](https://f003.backblazeb2.com/file/chemCPA-models/chemCPA_models.zip)\n- [hyperparameter configuration](https://f003.backblazeb2.com/file/chemCPA-models/finetuning_num_genes.json)\n- [raw datasets](https://dl.fbaipublicfiles.com/dlp/cpa_binaries.tar)\n- [processed datasets](https://f003.backblazeb2.com/file/chemCPA-datasets/)\n- [embeddings](https://drive.google.com/drive/folders/1KzkhYptcW3uT3j4GQpDdAC1DXEuXe49J?usp=share_link)\n\nSome of the notebooks use a *drugbank_all.csv* file, which can be downloaded from [DrugBank](https://go.drugbank.com/) (registration needed).\n\n#### Data preparation\nThe raw data must be processed before the models can be trained.\nThis can be done by running the notebooks inside the `preprocessing/` folder in sequential order.\nAlternatively, you may run\n\n```\npython preprocessing/run_notebooks.py\n```\nA description of the preprocessing steps is given in the `preprocessing/README.md` file and in the headers\nof individual notebooks. 
Section 4 of the paper is also highly relevant.\n\n#### Training the models\nRun \n```\npython chemCPA/train_hydra.py\n```\n\n## Citation\n\nYou can cite our work as:\n\n```\n@inproceedings{hetzel2022predicting,\n  title={Predicting Cellular Responses to Novel Drug Perturbations at a Single-Cell Resolution},\n  author={Hetzel, Leon and Böhm, Simon and Kilbertus, Niki and Günnemann, Stephan and Lotfollahi, Mohammad and Theis, Fabian J},\n  booktitle={NeurIPS 2022},\n  year={2022}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftheislab%2Fchemcpa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftheislab%2Fchemcpa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftheislab%2Fchemcpa/lists"}