{"id":22155936,"url":"https://github.com/citiususc/bigbwa","last_synced_at":"2025-07-26T07:32:34.595Z","repository":{"id":28984959,"uuid":"32511528","full_name":"citiususc/BigBWA","owner":"citiususc","description":"BigBWA is a new tool that uses the Big Data technology Hadoop to boost the performance of the Burrows–Wheeler aligner (BWA).","archived":false,"fork":false,"pushed_at":"2022-07-12T10:16:12.000Z","size":6839,"stargazers_count":31,"open_issues_count":3,"forks_count":7,"subscribers_count":9,"default_branch":"master","last_synced_at":"2024-04-16T11:35:30.822Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/citiususc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-03-19T09:14:16.000Z","updated_at":"2024-04-11T08:18:09.000Z","dependencies_parsed_at":"2022-09-03T17:31:25.893Z","dependency_job_id":null,"html_url":"https://github.com/citiususc/BigBWA","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigBWA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigBWA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigBWA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/citiususc%2FBigBWA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/citiususc","download_url":"https://codeload.github.com/citiususc/BigBWA/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"
https://github.com","kind":"github","repositories_count":227660764,"owners_count":17800418,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-02T02:33:39.464Z","updated_at":"2024-12-02T02:33:40.091Z","avatar_url":"https://github.com/citiususc.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"# What's BigBWA about? #\n\n**BigBWA** is a tool to run the Burrows-Wheeler Aligner ([BWA][1]) on a [Hadoop][2] cluster. The current version of BigBWA (2.1, November 2016) supports the following BWA algorithms:\n\n\n* **BWA-MEM**\n* **BWA-backtrack**\n* **BWA-SW**\n\nAll of them work with paired-end and single-end reads.\n\nIf you use **BigBWA**, please cite this article:\n\n\u003e José M. Abuin, Juan C. Pichel, Tomás F. Pena and Jorge Amigo. [\"BigBWA: approaching the Burrows–Wheeler aligner to Big Data technologies\"][4]. Bioinformatics 31(24), pp. 4003-4005, 2015.\n\nA version for [Apache Spark][6] is available [here][7].\n\n# Structure #\nSince version 2.0, the project follows a standard Maven structure. The source code is in the *src/main* folder, which contains two subfolders:\n\n* **java** - The Java code is stored here.\n* **native** - The BWA native code (C) and the JNI glue logic are stored here.\n\n# Getting started #\n\n## Requirements\nThe requirements to build **BigBWA** are the same as those to build BWA, with the only exception that the *JAVA_HOME* environment variable must be defined. If it is not, you can define it in the */src/main/native/Makefile.common* file. 
\n\nThe *-fPIC* flag must also be included in the *Makefile* of the BWA version being used. To do this, append the option to the *CFLAGS* variable in the BWA Makefile. For bwa-0.7.15, the original Makefile contains:\n\n\tCFLAGS=\t\t-g -Wall -Wno-unused-function -O2\n\nand after the change it should be:\n\n\tCFLAGS=\t\t-g -Wall -Wno-unused-function -O2 -fPIC\n\nAdditionally, since **BigBWA** has been built with Maven since version 2.0, Maven must also be installed on the user's computer.\n\n## Building\nThe default way to build **BigBWA** is:\n\n\tgit clone https://github.com/citiususc/BigBWA.git\n\tcd BigBWA\n\tmvn package\n\t\t\nThis will create the *target* folder, which will contain the *jar* file needed to run **BigBWA**:\n\n* **BigBWA-2.1.jar** - jar file to launch with Hadoop.\n\n## Configuring\nSince version 2.0, there is no need to configure any Hadoop parameter. The only requirement is that the YARN containers have at least 7500MB of memory available (for the human genome case).\n\n## Running BigBWA ##\n**BigBWA** requires a working Hadoop cluster. Users should take into account that at least 7500MB of free memory per map task is required, because each map task loads the BWA index into memory. Note that **BigBWA** uses disk space in the Hadoop *tmp* directory.\n\nHere is an example of how to run **BigBWA** using the BWA-MEM algorithm with paired-end reads. This example assumes that the index is stored on all the cluster nodes at */Data/HumanBase/*. 
The index can be obtained with BWA, using \"bwa index\".\n\nFirst, we get the input FASTQ reads from the [1000 Genomes Project][3] FTP server:\n\n\twget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12750/sequence_read/ERR000589_1.filt.fastq.gz\n\twget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12750/sequence_read/ERR000589_2.filt.fastq.gz\n\t\nNext, the downloaded files should be uncompressed:\n\n\tgzip -d ERR000589_1.filt.fastq.gz\n\tgzip -d ERR000589_2.filt.fastq.gz\n\t\nand prepared to be used by BigBWA:\n\n\tpython src/utils/Fq2FqBigDataPaired.py ERR000589_1.filt.fastq ERR000589_2.filt.fastq ERR000589.fqBDP\n\n\thdfs dfs -copyFromLocal ERR000589.fqBDP ERR000589.fqBDP\n\t\nFinally, we can execute **BigBWA** on the Hadoop cluster:\n\n\tyarn jar BigBWA-2.1.jar com.github.bigbwa.BigBWA -D mapreduce.input.fileinputformat.split.minsize=123641127\n\t-D mapreduce.input.fileinputformat.split.maxsize=123641127\n\t-D mapreduce.map.memory.mb=7500\n\t-w \"-R @RG\\tID:foo\\tLB:bar\\tPL:illumina\\tPU:illumina\\tSM:ERR000589 -t 2\"\n\t-m -p --index /Data/HumanBase/hg19 -r ERR000589.fqBDP ExitERR000589\n\nOptions:\n\n* **-m** - Use the BWA-MEM alignment algorithm.\n* **-p** - Use paired-end reads.\n* **-r** - Merge all the final results in a reducer phase.\n* **-w \"args\"** - Can be used to pass arguments directly to BWA (e.g. \"-t 4\" to specify the number of threads to use per instance of BWA).\n* **--index index_prefix** - Prefix of the BWA index. 
The index must be available in all the cluster nodes at the same location.\n* The last two arguments are the input and output HDFS files.\n\n\nIf you want to check all the available options, execute the command:\n\n\tyarn jar BigBWA-2.1.jar com.github.bigbwa.BigBWA -h\n\t\nThe commands are:\n\n    BigBWA performs genomic alignment using bwa in a Hadoop/YARN cluster\n     usage: yarn jar --class com.github.bigbwa.BigBWA BigBWA-2.1.jar\n           [-a | -b | -m] [-h] [-i \u003cIndex prefix\u003e]   [-n \u003cNumber of\n           partitions\u003e] [-p | -s] [-r]  [-w \u003c\"BWA arguments\"\u003e]\n           \u003cFASTQ file\u003e \u003cSAM file output\u003e\n    Help options: \n      -h, --help                                       Shows this help\n    \n    Input FASTQ reads options: \n      -p, --paired                                     Paired reads will be used as input FASTQ reads\n      -s, --single                                     Single reads will be used as input FASTQ reads\n    \n    BWA algorithm options: \n      -a, --aln                                        The ALN algorithm will be used\n      -b, --bwasw                                      The bwasw algorithm will be used\n      -m, --mem                                        The MEM algorithm will be used\n    \n    Index options: \n      -i, --index \u003cIndex prefix\u003e                       Prefix for the index created by bwa to use - setIndexPath(string)\n    \n    Spark options: \n      -n, --partitions \u003cNumber of partitions\u003e          Number of partitions to divide input - setPartitionNumber(int)\n    \n    Reducer options: \n      -r, --reducer                                    The program is going to merge all the final results in a reducer phase\n    \n    BWA arguments options: \n      -w, --bwa \u003c\"BWA arguments\"\u003e                      Arguments passed directly to BWA\n\nAfter the execution, to move the output to the local filesystem use: \n\n\thdfs dfs 
-copyToLocal ExitERR000589/part-r-00000 ./\n\t\nIf no reducer is used, the output will be split into several pieces. To put them together, users can use one of our Python utilities or \"samtools merge\":\n\n\thdfs dfs -copyToLocal ExitERR000589/Output* ./\n\tpython src/utils/FullSam.py ./ ./OutputFile.sam\n\t\n## Frequently asked questions (FAQs)\n\n1. [I cannot build the tool because *jni_md.h* or *jni.h* is missing.](#building1)\n\n#### \u003ca name=\"building1\"\u003e\u003c/a\u003e1. I cannot build the tool because *jni_md.h* or *jni.h* is missing.\nYou need to set your *JAVA_HOME* environment variable correctly, or you can define it in *Makefile.common*.\n\n[1]: https://github.com/lh3/bwa\n[2]: https://hadoop.apache.org/\n[3]: http://www.1000genomes.org/\n[4]: http://dx.doi.org/10.1093%2Fbioinformatics%2Fbtv506\n[6]: http://spark.apache.org/\n[7]: https://github.com/citiususc/SparkBWA\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcitiususc%2Fbigbwa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcitiususc%2Fbigbwa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcitiususc%2Fbigbwa/lists"}