{"id":23505682,"url":"https://github.com/gamcil/cblaster_benchmarks","last_synced_at":"2025-05-08T22:44:52.540Z","repository":{"id":113168289,"uuid":"389836646","full_name":"gamcil/cblaster_benchmarks","owner":"gamcil","description":"Scripts used in benchmarking MultiGeneBlast and cblaster","archived":false,"fork":false,"pushed_at":"2021-07-27T04:52:24.000Z","size":6270,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-07T20:14:35.931Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gamcil.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-27T03:29:34.000Z","updated_at":"2021-07-27T07:00:59.000Z","dependencies_parsed_at":"2023-07-16T21:33:22.111Z","dependency_job_id":null,"html_url":"https://github.com/gamcil/cblaster_benchmarks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamcil%2Fcblaster_benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamcil%2Fcblaster_benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamcil%2Fcblaster_benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gamcil%2Fcblaster_benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gamcil","download_url":"https://codeloa
d.github.com/gamcil/cblaster_benchmarks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252949281,"owners_count":21830153,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-25T09:38:59.114Z","updated_at":"2025-05-07T20:14:40.746Z","avatar_url":"https://github.com/gamcil.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# cblaster_benchmarks\nScripts used in benchmarking MultiGeneBlast and cblaster.\n\nThese benchmarks compare the prediction of characterised fungal BGCs stored in the MIBiG database against a local database consisting of Aspergillus genomes stored in the GenBank database.\nAll performance benchmarks were generated using the perf command line utility on a workstation computer with an Intel(R) Xeon(R) E5-2630 CPU (2.40 GHz) and 64 GB of RAM.\n\n| File | Description |\n| ---- | ----------- |\n| ``run_cblaster.sh`` | Runner script for timed cblaster searches |\n| ``run_mgb.sh`` | Runner script for timed MultiGeneBlast searches |\n| ``extract_gene_counts.sh`` | Script for extracting number of genes per query from MultiGeneBlast output |\n| ``extract_search_times.py`` | Script for extracting elapsed times from perf stat output |\n| ``perf/*`` | Output of perf stat for database creation and search |\n| ``output/*`` | Search output files for both tools |\n| ``times.csv`` | Search times (s) extracted from perf output using ``extract_search_times.py`` |\n| ``plot_times.R`` | Script used to plot ``times.csv`` |\n\n## Download Aspergillus genomes from the NCBI using 
datasets\nThe NCBI datasets tool was used to retrieve annotated Aspergillus genome assemblies:\n\n\tdatasets download genome taxon \"Aspergillus\" \\\n\t\t--assembly-source genbank \\\n\t\t--exclude-gff3 \\\n\t\t--exclude-protein \\\n\t\t--exclude-rna \\\n\t\t--exclude-seq \\\n\t\t--include-gbff \\\n\t\t--filename genomes.zip \\\n\t\t--dehydrated\n\nThese were then hydrated to obtain 151 GenBank assemblies:\n\n\tunzip genomes.zip -d genomes\n\tdatasets rehydrate --directory genomes/\n\nIn some cases, gene clusters are directly deposited as separate records in the Nucleotide database and do not have a corresponding genome assembly.\nThus, we also retrieved these records using:\n\n\tesearch -db nuccore \\\n\t\t-query '\"aspergillus\"[orgn] AND \"gene cluster\"[title]' |\\\n\tefetch -format gb\n\nThis resulted in an additional 90 records.\n\n## Download Aspergillus BGCs from the MIBiG database\nAs of version 2.0, the MIBiG database contains 88 clusters found in Aspergillus genomes.\nThe GenBank files for each of these clusters were retrieved from MIBiG to be used as queries:\n\n\twget https://dl.secondarymetabolites.org/mibig/mibig_gbk_2.0.tar.gz\n\ttar xzvf mibig_gbk_2.0.tar.gz\n\nDuplicates, as well as files containing fewer than two genes, were discarded from the dataset, resulting in a final set of 80 query clusters:\n\n\tBGC0000004 BGC0000006 BGC0000007 BGC0000008\n\tBGC0000009 BGC0000010 BGC0000011 BGC0000013\n\tBGC0000022 BGC0000045 BGC0000057 BGC0000088\n\tBGC0000101 BGC0000129 BGC0000152 BGC0000156\n\tBGC0000160 BGC0000161 BGC0000170 BGC0000292\n\tBGC0000293 BGC0000355 BGC0000356 BGC0000361\n\tBGC0000372 BGC0000442 BGC0000627 BGC0000673\n\tBGC0000682 BGC0000684 BGC0000686 BGC0000811\n\tBGC0000818 BGC0000900 BGC0000901 BGC0000959\n\tBGC0000977 BGC0000983 BGC0001037 BGC0001067\n\tBGC0001084 BGC0001118 BGC0001122 BGC0001123\n\tBGC0001143 BGC0001238 BGC0001239 BGC0001290\n\tBGC0001304 BGC0001306 BGC0001371 BGC0001399\n\tBGC0001400 BGC0001403 BGC0001445 
BGC0001446\n\tBGC0001475 BGC0001515 BGC0001516 BGC0001517\n\tBGC0001518 BGC0001544 BGC0001547 BGC0001616\n\tBGC0001621 BGC0001668 BGC0001679 BGC0001699\n\tBGC0001708 BGC0001712 BGC0001718 BGC0001722\n\tBGC0001839 BGC0001857 BGC0001874 BGC0001988\n\tBGC0001990 BGC0001995 BGC0001996 BGC0001998\n\nNotably, the clusters for cyclopiazonic acid in A. oryzae (BGC0000977) and notoamide A in A. sp. MF297-2 (BGC0001084) were not caught by the queries above for building the search database.\nThese were manually obtained from the NCBI and added to the database (NCBI accessions AB506492.1 and HM622670.1, respectively).\nAdditionally, the regions containing clusters for ferrichrome in A. oryzae (BGC0000900) and A. niger (BGC0000901), and squalestatin S1 in A. sp. Z5 (BGC0001839), lacked any sequence feature annotations on the NCBI; annotated MIBiG GenBank files were used instead.\n\nIn total, 243 sequence records were used to build each search database.\n\n## Setting up cblaster environment\nA Conda virtual environment was created with the following tools:\n\n\tpython=3.9.6\n\tcblaster==1.3.8\n\thmmer==3.3.2\n\tdiamond==2.0.11\n\n## Setting up MultiGeneBlast environment\nA Conda virtual environment was created with the following tools:\n\n\tpython=2.7\n\tpysvg==0.2.2\n\tmuscle==3.8.1551\n\nMultiGeneBlast v1.1.13 was downloaded from the SourceForge repository and extracted to a folder.\nEach MultiGeneBlast function was called from the specific script file involved (makedb.py to construct a database, multigeneblast.py to search).\nThe scripts also had to be run from within the MultiGeneBlast source code directory.\nBinaries for various dependencies used by MultiGeneBlast are distributed with the source code.\nThese binaries required the installation of 32-bit libraries (libbzip2) in order for MultiGeneBlast to run successfully.\n\n## Benchmarking of database construction\nAll retrieved files were then used to construct search databases.\nA separate directory was used for each 
tool.\n\ncblaster makedb was run with default settings, using all available cores and no sequence batching:\n\n\tcblaster makedb -n database genomes/*.gbk\n\ncblaster extracted 1828599 genes from the 243 sequence records and created cblaster databases (FASTA, SQLite3 and DIAMOND) in 121.42 seconds.\n\nMultiGeneBlast was also run using default settings:\n\n\tpython2.7 makedb.py database genomes/*.gbk\n\nFor the same dataset, MultiGeneBlast took a total of 2801.12 seconds.\n\n## Benchmarking of local searches\nCluster sequence records retrieved from MIBiG were then searched against the created databases.\n\ncblaster searches were run using ``run_cblaster.sh``.\nClustering parameters were loosened given the total variation between separate query clusters.\ncblaster completed all 88 searches in 584.0136618 seconds (~9.73 minutes).\n\nMultiGeneBlast was run using ``run_mgb.sh``.\nNotably, the -distancekb argument was set to 30 (kilobases) in order to match the corresponding --gap argument in cblaster.\nAdditionally, MultiGeneBlast searches using GenBank files as queries require a sequence range to be set using the -from and -to arguments, whereas cblaster does not.\nThese were set to 1 and 100000, respectively, in order to capture the entire length of each cluster.\n\nMultiGeneBlast completed all searches in 13130.21488 seconds (~3.65 hours).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgamcil%2Fcblaster_benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgamcil%2Fcblaster_benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgamcil%2Fcblaster_benchmarks/lists"}