{"id":44964518,"url":"https://github.com/bioinfomachinelearning/cryovirusdb","last_synced_at":"2026-02-18T14:09:54.017Z","repository":{"id":212531412,"uuid":"725625222","full_name":"BioinfoMachineLearning/CryoVirusDB","owner":"BioinfoMachineLearning","description":"A dataset of labeled virus particles in cryo-EM micrographs (images) for training and testing machine learning methods of virus particle picking","archived":false,"fork":false,"pushed_at":"2024-01-09T17:09:15.000Z","size":175,"stargazers_count":26,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-09-09T16:34:16.847Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/BioinfoMachineLearning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-11-30T14:38:04.000Z","updated_at":"2024-11-25T19:22:29.000Z","dependencies_parsed_at":"2024-01-09T18:27:36.951Z","dependency_job_id":"ce742167-7a35-427f-bfa4-c0ef004f72c3","html_url":"https://github.com/BioinfoMachineLearning/CryoVirusDB","commit_stats":null,"previous_names":["bioinfomachinelearning/cryovirusdb"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/BioinfoMachineLearning/CryoVirusDB","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BioinfoMachineLearning%2FCryoVirusDB","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BioinfoMachineLearning%2FCryoVirusDB/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BioinfoMachineLearn
ing%2FCryoVirusDB/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BioinfoMachineLearning%2FCryoVirusDB/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/BioinfoMachineLearning","download_url":"https://codeload.github.com/BioinfoMachineLearning/CryoVirusDB/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/BioinfoMachineLearning%2FCryoVirusDB/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29581632,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-18T13:56:48.962Z","status":"ssl_error","status_checked_at":"2026-02-18T13:54:34.145Z","response_time":162,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-18T14:09:53.319Z","updated_at":"2026-02-18T14:09:54.010Z","avatar_url":"https://github.com/BioinfoMachineLearning.png","language":"Python","readme":"# CryoVirusDB: An Annotated Dataset for AI-Based Virus Particle Identification in Cryo-EM Micrographs\nCryoVirusDB is a dataset of labeled virus particles in cryo-EM micrographs (images) for training and testing machine learning methods of virus particle picking. 
This repository contains scripts used to crawl, download, process, annotate, and post-process the CryoVirusDB dataset.\n\n![Picture1](https://github.com/BioinfoMachineLearning/CryoVirusDB/assets/24986485/4def3167-c8c9-4a46-ab94-b6a57146c078)\n\u003csmall\u003e Figure: Conceptual overview of cryo-EM single particle analysis, from particle selection to 3D reconstruction of a virus. (A) Stack of ideal micrographs where the true virus particles are picked (circled in yellow). (B) Virus particles extracted from micrographs with a fixed box size. (C) Multiple 2D classes to facilitate stack cleaning and the removal of false particles. (D) 3D structure of the virus reconstructed from 2D images using a series of computational techniques. \u003c/small\u003e\n\n## Data Download and Extraction (choose one of three options)\n\n## Option 1: Download all data directly from our server\n\nPath to CryoVirusDB Dataset: https://calla.rnet.missouri.edu/CryoVirusDB/\n\nEach EMPIAR ID in CryoVirusDB is available as a compressed file (tar.gz) that can be downloaded by clicking on the file. Once you have downloaded the file, you must extract its contents. On Windows, you can use tools such as WinRAR or 7-Zip to extract the file. 
\\\nOR \\\nTo download and extract a dataset (e.g., 11060) from the command line, use: \\\n`wget https://calla.rnet.missouri.edu/CryoVirusDB/11060.tar.gz` \\\n`tar -zxvf 11060.tar.gz` \n\n\n## Option 2: Use scripts to download all the cryo-EM micrographs from EMPIAR and virus particle coordinates from Zenodo\n`git clone https://github.com/BioinfoMachineLearning/CryoVirusDB.git` \\\n`cd CryoVirusDB/micrographs_download_scripts`\n### Requirements\n- Python 3.8+\n- Required packages: pandas, wget, openpyxl\n\n### Install Dependencies\n```bash\npip install pandas wget openpyxl\n```\n\n### Usage\n\n### Download Specific EMPIAR ID\n```bash\npython download_micrographs_from_EMPIAR.py --emd_id 10192 -o Data_Downloads\n```\n\n### Download All EMPIAR IDs\n```bash\npython download_micrographs_from_EMPIAR.py --all -o Data_Downloads\n```\n#### Command Line Arguments\n\n| Argument | Description | Required |\n|----------|-------------|----------|\n| `--emd_id` | EMPIAR ID to download (e.g., 10192) | Yes (unless `--all`) |\n| `--all` | Download all EMPIAR IDs from catalogue | Yes (unless `--emd_id`) |\n| `-o`, `--output` | Output directory path | Yes |\n| `--catalogue` | Path to catalogue Excel file | No (default: `micrographs_download_catalogue.xlsx`) |\n\n\nThese commands download all the motion-corrected micrographs from EMPIAR. Next, retrieve the virus particle labels from Zenodo: https://zenodo.org/record/10397742\n\n\n## Option 3: Download a lightweight version of the data (CryoVirusDB_Lite) if storage space is limited\nIf storage space is a concern, researchers can opt for a more lightweight version of CryoVirusDB called CryoVirusDB_Lite.  \nCryoVirusDB_Lite includes truncated versions of the original micrographs and particle ground truth files, for a total storage size of 76 GB, making it easier to store and transfer. 
This version includes an 8-bit representation of micrographs in JPG format, along with the necessary particle coordinate files for 9 cryo-EM virus datasets.\n\nPath to CryoVirusDB_Lite Dataset: https://calla.rnet.missouri.edu/CryoVirusDB_Lite/ \\\nThe steps to download and extract the data files are identical to the instructions provided in Option 1.\n\n\n## CryoVirusDB Dataset Directory Structure:\n\n![Picture5](https://github.com/BioinfoMachineLearning/CryoVirusDB/assets/24986485/b0b24c85-476d-43dd-b4e6-d77685f058fe)\n\n\nCryoVirusDB is an expert-labeled dataset containing coordinates of accurately selected virus particles in cryo-EM micrographs. CryoVirusDB comprises 9,941 micrographs featuring 9 different viruses along with the coordinates of 339,398 virus particles in total. We anticipate that CryoVirusDB will enhance the capabilities of deep learning in accurately identifying virus particles in cryo-EM micrographs, thereby facilitating the subsequent 2D-3D reconstruction process. \n\n\n## Data Records\n\n\nEach data folder (named after the corresponding EMPIAR dataset ID) includes the following expert-labeled data: motion-corrected micrographs, ground truth, and a particle stack. 
\n\n\n## CryoVirusDB Statistics\nStatistics of true virus particles for each EMPIAR dataset in CryoVirusDB: \n\n| **SN** | **EMPIAR ID** | **Virus Type**                      | **Number of\u003cbr\u003eMicrographs** | **Micrograph size** | **Particle\u003cbr\u003eDiameter (px)** | **Number of True\u003cbr\u003eVirus Particles** |\n| ------ | ------------- | ----------------------------------- | ---------------------------- | ------------------- | ----------------------------- | ------------------------------------- |\n| 1      | 10192         | Feline calicivirus                  | 1000                         | (4096, 4096)        | 470                           | 9660                                  |\n| 2      | 11060         | Nudaurelia capensis omega virus     | 1276                         | (4096, 4096)        | 516                           | 11916                                 |\n| 3      | 10203         | Macrobrachium rosenbergii nodavirus | 1000                         | (3838, 3710)        | 377                           | 16601                                 |\n| 4      | 10033         | Human parechovirus 3                | 1000                         | (4096, 4096)        | 350                           | 55732                                 |\n| 5      | 10652         | Coxsackievirus                      | 1127                         | (3838, 3710)        | 374                           | 11144                                 |\n| 6      | 10341         | Bovine enterovirus                  | 1274                         | (4096, 4096)        | 376                           | 22694                                 |\n| 7      | 10193         | Feline calicivirus                  | 1000                         | (4096, 4096)        | 516                           | 96126                                 |\n| 8      | 10205         | Cowpea mosaic virus                 | 1000                         | (4096, 4096)        | 310                 
          | 81037                                 |\n| 9      | 10555         | Nudaurelia capensis omega virus     | 1264                         | (4096, 4096)        | 564                           | 34488                                 |\n|        |               | **Total**                           | **9941**                     |                     |                               | **339398**                            |\n\n## Data Usage for ML-Based Applications:\n\nResearchers can use CryoVirusDB to train and test machine learning and deep learning methods for automated cryo-EM virus particle picking. \n\nUse the motion-corrected 2D images (micrographs) as input. The virus particle coordinates for each micrograph are located in the 'ground_truth' \u003e\u003e 'particle_coordinates' folder. For convenience, each micrograph and its particle coordinate file share the same base file name. \n\n### Example: \nFor EMPIAR 11060, the motion-corrected micrograph is: 11060\u003e\u003emicrographs\u003e\u003emicrograph1.mrc \nand the corresponding particle coordinate information is found here: 11060\u003e\u003eground_truth\u003e\u003eparticle_coordinates\u003e\u003emicrograph1.csv\n\nThe particle stack is: 11060\u003e\u003eparticles_stack\u003e\u003emicrograph1.mrc \nand the corresponding STAR file for all virus particles in EMPIAR 11060 is stored at: 11060\u003e\u003eground_truth\u003e\u003eempiar-micrograph1.star \n\n\n-----\n\n## Rights and Permissions\nOpen Access \\\nThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. 
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.\n\n\n**Link to CryoVirusDB paper**: https://www.biorxiv.org/content/10.1101/2023.12.25.573312v1\n\n## Cite this work\nIf you use the code or data associated with this research work or otherwise find this data useful, please cite:\n\n### CryoVirusDB\n@article{Gyawali2023.12.25.573312, \\\n\tauthor = {Rajan Gyawali and Ashwin Dhakal and Liguo Wang and Jianlin Cheng}, \\\n\ttitle = {CryoVirusDB: A Labeled Cryo-EM Image Dataset for AI-Driven Virus Particle Picking}, \\\n\tyear = {2023}, \\\n\tdoi = {10.1101/2023.12.25.573312}, \\\n\tpublisher = {Cold Spring Harbor Laboratory}, \\\n\tjournal = {bioRxiv}, \\\n\tURL = {https://www.biorxiv.org/content/10.1101/2023.12.25.573312v1}\n}\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbioinfomachinelearning%2Fcryovirusdb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbioinfomachinelearning%2Fcryovirusdb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbioinfomachinelearning%2Fcryovirusdb/lists"}