{"id":20531786,"url":"https://github.com/mrzresearcharena/blast","last_synced_at":"2026-01-19T20:32:57.307Z","repository":{"id":159887396,"uuid":"267204682","full_name":"mrzResearchArena/BLAST","owner":"mrzResearchArena","description":"Easy Way to Generate PSSM from the FASTA Sequences","archived":false,"fork":false,"pushed_at":"2023-12-14T17:00:19.000Z","size":8912,"stargazers_count":2,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-04-14T06:12:23.037Z","etag":null,"topics":["bioinformatics","bioinformatics-software","computational-biology"],"latest_commit_sha":null,"homepage":"http://rafsanjani.pythonanywhere.com/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrzResearchArena.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-05-27T02:47:28.000Z","updated_at":"2023-07-18T09:11:49.000Z","dependencies_parsed_at":"2023-12-14T17:52:39.432Z","dependency_job_id":"47cacc64-5c94-476c-a4c3-f62ba19533de","html_url":"https://github.com/mrzResearchArena/BLAST","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mrzResearchArena/BLAST","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrzResearchArena%2FBLAST","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrzResearchArena%2FBLAST/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrzResearchArena%2FBLAST/releases","manifests_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrzResearchArena%2FBLAST/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrzResearchArena","download_url":"https://codeload.github.com/mrzResearchArena/BLAST/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrzResearchArena%2FBLAST/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28583853,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-19T19:46:29.903Z","status":"ssl_error","status_checked_at":"2026-01-19T19:45:54.560Z","response_time":67,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","bioinformatics-software","computational-biology"],"created_at":"2024-11-16T00:09:50.404Z","updated_at":"2026-01-19T20:32:57.292Z","avatar_url":"https://github.com/mrzResearchArena.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# BLAST: Basic Local Alignment Search Tool\n\nI describe the procedure for generating position-specific scoring matrices (PSSMs) from FASTA sequences. It uses asynchronous parallel processing, which can process up to n sequences at a time. I spent a considerable amount of time on PSSM generation. It is a hard and tedious procedure, but I have made it easy so that other researchers can use it efficiently. 
People can use it for PSSM generation; however, I have not benchmarked it yet.\n\n\u0026nbsp;\n\u0026nbsp;\n\n\n### Step 0: Graphical Representation: How can we generate PSSMs using asynchronous parallel processing?\n\n\u003cimg src=\"https://github.com/mrzResearchArena/BLAST/blob/master/asyn-PSSM.jpeg\" class=\"center\" title=\"asyn-PSSM\" width=\"850\" height=\"450\" /\u003e\n\n\u0026nbsp;\n\u0026nbsp;\n\n### Step 1: Download the BLAST Tool:\n\nPlease find the latest version of the BLAST tool on the NCBI website (https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Then download the file that matches your operating system (OS). As I am a Linux user, I downloaded \"ncbi-blast-...-linux.tar.gz\". Please don't worry about the exact version number; it changes over time.\n\n\u0026nbsp;\n\n```console\nuser@machine:~$ wget 'https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-2.10.1+-x64-linux.tar.gz' ### Fetch from website\nuser@machine:~$ tar -xvzf ncbi-blast-2.10.1+-x64-linux.tar.gz                                                           ### Extract the tool after the download\n```\n\n\u0026nbsp;\n\u0026nbsp;\n\n\n### Step 2: Download the Non-redundant (NR) Proteins Database:\n\nWe can download the `nr` database from the official website (https://ftp.ncbi.nlm.nih.gov/blast/db/); the download options are given below.\n\n\u0026nbsp;\n\n#### Option-1: Downloading Process:\n\n```console\nuser@machine:~$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz'\nuser@machine:~$ wget 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nr.*.tar.gz.md5'\n```\n\n#### Option-2: Downloading Process (aka Update the Current Database):\n```console\nuser@machine:~$ /home/user/ncbi-blast-2.10.1+/bin/update_blastdb.pl --decompress nr [*]\n```\n\n\n###### Notes:\n1. 
The database had 39 segments, and the initial (compressed) file size was approximately 100GB; it grew to around 450GB after extraction (until 2020).\n   - The database had 54 segments (last update: August 1, 2021).\n   - The database now has 78 segments; the initial (compressed) file size was approximately 193GB, and we got around xxGB after extraction (last update: July 12, 2023).\n2. The number of segments changes frequently.\n3. We can get the `update_blastdb.pl` script from the BLAST tool.\n\n\u0026nbsp;\n\u0026nbsp;\n\n\n### Step 3: Update the Non-redundant (NR) Proteins Database (Optional):\nWhen the `nr` database becomes outdated, there is no need to download it again; update the existing copy instead.\n\n\u0026nbsp;\n\n#### Update Process:\n```console\nuser@machine:~$ /home/user/ncbi-blast-2.10.1+/bin/update_blastdb.pl --decompress nr [*]\n```\n\n###### Notes:\n1. We can get the `update_blastdb.pl` script from the BLAST tool.\n2. Please run the script from the `nr` directory (or folder); otherwise, it won't work.\n\n\n\u0026nbsp;\n\u0026nbsp;\n\n### Step 4: Extract the Non-redundant (NR) Proteins Database:\n\n#### Extract `*.tar.gz` Files:\n```bash\nn=38   ### Index of the last segment: with N segments numbered from 0, set n to N-1.\n\nfor i in $(seq 0 1 $n); do\n    if [ $i -lt 10 ]; then\n        tar -xvzf nr.0$i.tar.gz\n    else\n        tar -xvzf nr.$i.tar.gz\n    fi\ndone\n```\n\n###### Notes:\n1. Why did I use 38 in the loop? Because I had 39 segments in the `nr` directory (or folder), numbered from 00 to 38.\n2. Please check how many segments you have, then update the value of `n`. We can find it from `nr.pal` in the `nr` directory (or folder).\n3. Please run the script from the `nr` directory (or folder); otherwise, it won't work.\n4. 
The updated decompression script is available at this [URL](https://github.com/mrzResearchArena/BLAST/blob/master/decompress-NR.sh).\n\n\u0026nbsp;\n\u0026nbsp;\n\n### Step 5: Split a Multiple-Sequence FASTA File into Single FASTA Files:\n\n```python\nfrom Bio import SeqIO # Install (if you don't have it): pip install biopython\n\nFile = '/home/user/Bioinformatics/multiSequences.fa'\n\nC = 1 # Sequential counter used as the output file name.\nfor record in SeqIO.parse(File, 'fasta'):\n    # The with-block closes each output file as soon as the record is written.\n    with open(str(C) + '.fasta', 'w') as openFile:\n        SeqIO.write(record, openFile, 'fasta')\n    C += 1\n#end-for\n```\n\n\u0026nbsp;\n\n###### Notes:\n1. I renamed each FASTA sequence because it makes the run easier to track.\n2. I used sequential numbers rather than the original sequence names.\n3. Renaming the sequences is optional.\n4. The updated FASTA splitting script is available at this [URL](https://github.com/mrzResearchArena/BLAST/blob/master/splitFASTA.py).\n5. We can also use Colab to split a multiple-sequence FASTA file into single sequences [[Updated Implementation](https://github.com/mrzResearchArena/BLAST/blob/master/Split-FASTA-using-BioPython-Colab.ipynb)].\n\n\n\u0026nbsp;\n\u0026nbsp;\n\n### Step 6: Tracking the Original Sequence Name (Optional):\n\n```python\nFile = '/home/rafsanjani/Downloads/TS88.fa'\n\nfrom Bio import SeqIO\n\nC = 1\n\nprint('Original-Sequence-Name, Renamed-Sequence, Corresponding-PSSM')\nfor record in SeqIO.parse(File, 'fasta'):\n    print('{}, {}.fasta, {}.fasta.pssm'.format(record.id, C, C))\n    C += 1\n#end-for\n```\n\n\u0026nbsp;\n\n###### Notes:\n1. Because I rename the sequences, this listing lets you trace each renamed file back to its original sequence.\n2. 
The updated script is available at this [URL](https://github.com/mrzResearchArena/BLAST/blob/master/renamedSequencesOrder.py).\n\n\u0026nbsp;\n\u0026nbsp;\n\n### Step 7: Implementation/Generate PSSMs:\n```python\n###\ndatabase = '/home/learning/mrzResearchArena/NR/nr'   # Please set the path where the \"nr\" database is located.\nPSSM = '/home/learning/mrzResearchArena/PSSM'        # Please set the directory that holds the single FASTA files; the PSSMs are written there too.\ncore = 8                                             # Number of worker processes, e.g., multiprocessing.cpu_count()\n###\n\n###\nimport multiprocessing\nimport time\nimport glob\nimport os\nos.chdir(PSSM)\n###\n\n###\ndef runPSIBLAST(file):\n    # os.system() does not raise when the command fails; it returns the exit status, so check that instead.\n    status = os.system('/home/learning/ncbi-blast-2.10.1+/bin/psiblast -query {} -db {} -out {}.out -num_iterations 3 -out_ascii_pssm {}.pssm -inclusion_ethresh 0.001 -comp_based_stats 0 -num_threads 1'.format(file, database, file, file))\n    if status != 0:\n        print('PSI-BLAST failed for the sequence {}!'.format(file))\n        return '{}, is error.'.format(file)\n\n    return '{}, is done.'.format(file)\n#end-def\n###\n\n###\nbegin   = time.time()\npool    = multiprocessing.Pool(processes=core)\nresults = [ pool.apply_async(runPSIBLAST, args=(file,)) for file in glob.glob('*.fasta') ] # for x in range(1, 10)\n###\n\n###\noutputs = [result.get() for result in results]\nend = time.time()\n###\n\n###\nprint(sorted(outputs))\nprint()\nprint('Time elapsed: {} seconds.'.format(end - begin))\n###\n```\n\n\u0026nbsp;\n\n###### Notes:\n1. 
The updated script is available at this [URL](https://github.com/mrzResearchArena/BLAST/blob/master/asynParallel.py).\n\n\u0026nbsp;\n\u0026nbsp;\n\n\n**Acknowledgement:** I would like to thank Professor [Iman Dehzangi](https://scholar.google.com/citations?user=RkamSRYAAAAJ\u0026hl=en), who initially helped me with PSSM generation.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrzresearcharena%2Fblast","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrzresearcharena%2Fblast","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrzresearcharena%2Fblast/lists"}