{"id":20425069,"url":"https://github.com/databio/region2vec_eval","last_synced_at":"2026-04-19T01:37:40.976Z","repository":{"id":243724239,"uuid":"813100747","full_name":"databio/region2vec_eval","owner":"databio","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-14T18:16:03.000Z","size":183,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-12-09T13:03:41.476Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/databio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-10T13:36:37.000Z","updated_at":"2024-06-14T03:57:58.000Z","dependencies_parsed_at":"2024-06-10T22:14:59.892Z","dependency_job_id":"c3cc0e7b-226b-49ec-b060-8e0151b65dad","html_url":"https://github.com/databio/region2vec_eval","commit_stats":null,"previous_names":["databio/region2vec_eval"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/databio/region2vec_eval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fregion2vec_eval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fregion2vec_eval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fregion2vec_eval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fregion2vec_eval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/databio","download_url":"https://codeload.github.com/databio/region2vec_eval/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/databio%2Fregion2vec_eval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31991720,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T20:23:30.271Z","status":"ssl_error","status_checked_at":"2026-04-18T20:23:29.375Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T07:12:10.460Z","updated_at":"2026-04-19T01:37:40.955Z","avatar_url":"https://github.com/databio.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Methods for evaluating unsupervised vector representations of genomic regions\nThis repository contains code and instructions to reproduce the results presented in the paper. 
The proposed evaluation metrics are implemented in [geniml.eval](https://github.com/databio/geniml/tree/master/geniml/eval).\n\n## Requirements\n- [geniml](https://github.com/databio/geniml)\n- beautifulsoup4\n- python=3.9\n- pybedtools\n- [bedtools](https://bedtools.readthedocs.io/en/latest/content/installation.html)\n\n```\ngit clone git@github.com:databio/geniml.git\ncd geniml\npip install -e .\n```\nAfter installing `geniml`, add the `bedtools` binary to the `PATH` environment variable.\n\n## Preparation\n### Customize the configurations\nChange the constants defined in `config.py`. Below are the descriptions for the constants.\n```yaml\nDATA_URL: link to the dataset\nDATA_FOLDER: folder that stores the downloaded dataset\nTRAIN_SCRIPTS_FOLDER: folder that stores all the generated training scripts\nMODELS_FOLDER: folder that stores all the trained models\nUNIVERSES_FOLDER: folder that stores all the universes\nEVAL_RESULTS_FOLDER: folder that stores all the evaluation results\n```\n### Download the dataset\nRun the following command:\n```bash\npython -m src.download_dataset\n```\nAlternatively, download all the [content](http://big.databio.org/region2vec_eval/tfbs_dataset/) to `DATA_FOLDER`.\n### Prepare universes\nWe provide all seven universes used in the paper at [hg19 universes](http://big.databio.org/region2vec_eval/universes/). Download the universes to `UNIVERSES_FOLDER` specified in `config.py`.\n\nWe used the following code to generate the universes except the DHS universe, which is an external universe. 
You can use the same code to generate universes from your own data; you only need to change `DATA_FOLDER` in `config.py` and the total number of files passed to `-n`.\n```bash\n# The Merge (100) universe\npython -m src.gen_universe -m merge -n 690 -d 100\n# The Merge (1k) universe\npython -m src.gen_universe -m merge -n 690 -d 1000\n# The Merge (10k) universe\npython -m src.gen_universe -m merge -n 690 -d 10000\n# The Tiling (1k) universe\npython -m src.gen_universe -m tile -v hg19 -n 690 -t 1000\n# The Tiling (5k) universe\npython -m src.gen_universe -m tile -v hg19 -n 690 -t 5000\n# The Tiling (25k) universe\npython -m src.gen_universe -m tile -v hg19 -n 690 -t 25000\n```\n### Train embedding models\nYou can download all the trained models from [models](http://big.databio.org/region2vec_eval/tfbs_models/) to `MODELS_FOLDER` (in `config.py`). Note that `Large`, `Medium` and `Small` correspond to `Merge (100)`, `Merge (1k)` and `Merge (10k)`, respectively, in the paper.\n\nWe used the following steps to get all the models.\n\n1. Generate the training scripts:\n    ```bash\n    python -m src.gen_train_scripts\n    ```\n2. Then, go to the `TRAIN_SCRIPTS_FOLDER` (specified in `config.py`) folder, and run all the scripts there to get trained models.\n\n    Note that in `gen_train_scripts.py`, we include seven universes, three initial learning rates, two embedding dimensions, and two context window sizes.\n    Therefore, for each universe, we train 3 × 2 × 2 = 12 Region2Vec models, for a total of 7 × 12 = 84 Region2Vec models.\n\n3. After training Region2Vec models, run the following code to generate base embeddings, namely Binary, PCA-10D, and PCA-100D, for each of the seven universes.\n    ```bash\n    python -m src.get_base_embeddings\n    ```\n\nTo obtain the results in Table S2, run the following command:\n```bash\npython -m src.assess_universe\n```\nNote that we do not assess the original universes. 
Since Region2Vec filters out some low-frequency regions in a universe based on the training data, we focused on the actual universes, which contain only the regions that have embeddings.\n\n## Evaluate region embeddings\nRun the following scripts to obtain the evaluation results.\n```bash\npython -m src.eval_script --type GDSS\npython -m src.eval_script --type NPS\npython -m src.eval_script --type CTS\npython -m src.eval_script --type RCS\n```\n\nTo speed up the process, you can split the universes into batches (Line 209, `eval_script.py`):\n```python\nbatches = [\n        (\"tile1k\", \"tile25k\"),\n        (\"tile5k\", \"Small\"),\n        (\"Large\", \"Medium\",\"dhs\"),\n    ]\n```\nThen, run the evaluation on each batch in parallel. For example,\n```bash\npython -m src.eval_script --type GDSS --batch 0\n```\nwill evaluate models for the Tiling (1k) and Tiling (25k) universes.\n\n## Downstream tasks\nWe designed cell type and antibody type classification tasks for the trained region embeddings. We randomly selected 60% of all the BED files as training files and used the remaining 40% as test files. We repeated this split five times with different random seeds. The file splits are stored in `classification_data`. 
The code that generates the splits can be found in `classification.ipynb`.\n\nRun the classification using the following script:\n```bash\npython -m src.classification\n```\n\n## Analyze results\nWe used the Jupyter notebook `result_analysis.ipynb` to generate all the figures and calculate the results.\n\n## Generate embedding visualizations\nThe visualizations of different sets of region embeddings can be found at [embed_visualization](http://big.databio.org/region2vec_eval/embed_visualization/).\n\nWe used the following command to generate UMAP visualizations of all sets of region embeddings.\n```bash\npython -m src.visualization\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fregion2vec_eval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatabio%2Fregion2vec_eval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatabio%2Fregion2vec_eval/lists"}