Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mskcc/marianas
Software for processing molecular barcoding (UMI)-based NGS data
https://github.com/mskcc/marianas
Last synced: about 1 month ago
JSON representation
Software for processing molecular barcoding (UMI)-based NGS data
- Host: GitHub
- URL: https://github.com/mskcc/marianas
- Owner: mskcc
- License: apache-2.0
- Created: 2020-05-22T00:33:23.000Z (over 4 years ago)
- Default Branch: master
- Last Pushed: 2020-05-22T01:15:09.000Z (over 4 years ago)
- Last Synced: 2024-03-15T08:38:53.361Z (9 months ago)
- Language: Java
- Size: 13.3 MB
- Stars: 8
- Watchers: 6
- Forks: 2
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Marianas: Deep Next Generation Sequencing, Molecular Barcoding
This software was developed at the Innovation Lab, Center for Molecular Oncology, Memorial Sloan Kettering Cancer Center (MSKCC).
Marianas is a Java software that processes NGS sequencing data where reads contain molecular barcodes (UMIs). This processing involves 3 distinct steps at various points in the workflow:
1. UMI clipping: Identify UMIs and bridge bases. ADD UMI pair to fastq read name, discard bridge bases
2. Collapsing: Process standard bam file containing aligned, UMI-clipped reads. Identify PCR duplicates by using mapping position and UMIs. Correct PCR and sequencing errors by taking consensus among PCR duplicates. Create collapsed fastq files
3. Bam separation: From the collapsed bam, create separate bam files contianing duplex-only and simplex-only reads## Java
Java 1.8 or above is required.
## Dependencies (bundled with the release jar)
1. BioinfoUtils
2. HTSJDK
3. Google Guava
4. Apache Commons IO## Using Marianas
**1. UMI clipping**
java -server -Xms8g -Xmx8g -cp Marianas.jar org.mskcc.marianas.umi.duplex.fastqprocessing.ProcessLoopUMIFastq read1.fastq.gz read2.fastq.gz umi-length
This will create a pair of _umi-clipped.fastq.gz files in the current working directory
**2. Collapsing and making consensus reads**
Collapsing involves 3 sub-steps: First pass that collapses the "left" read cluster of the read pairs, followed by a sort step on the first pass output, followed by second pass that collapses the "right" read cluster of the read pairs.
A.
java -server -Xms8g -Xmx8g -cp Marianas.jar org.mskcc.marianas.umi.duplex.DuplexUMIBamToCollapsedFastqFirstPass bam-file pileup-file minMappingQuality minBaseQuality UMIMismatches UMIWobble minConsensusPercent reference-fasta-fileThe pileup file is generated by running Waltz bam metrics module. The values currently used for some of the parameters are as follows:
minMappingQuality: 1
minBaseQuality: 20
UMIMismatches: 0
UMIWobble: 1
minConsensusPercent: 90This will produce some QC files as well as first-pass.txt in the current working directory. first-pass.txt contains the collapsed seuquences and base qualities for the first pass. This file must be sorted before running second pass.
B.
sort -S 8G -k 6,6n -k 8,8n first-pass.txt > first-pass.mate-position-sorted.txt
C.
java -server -Xms8g -Xmx8g -cp Marianas.jar org.mskcc.marianas.umi.duplex.DuplexUMIBamToCollapsedFastqSecondPass standard-bam-file pileup-file minMappingQuality minBaseQuality mismatches wobble minConsensusPercent reference-fasta-file first-pass.mate-position-sorted.txt
This will produce collapsed_R1_.fastq and collapsed_R2_.fastq files along with some QC files.
**3. Separating simplex and duplex bams from original collapsed bam**
java -server -Xms8g -Xmx8g -cp Marianas.jar org.mskcc.marianas.umi.duplex.postprocessing.SeparateBams collapsed.bam
This will produce collapsed-duplex.bam and collapsed-simplex.bam (based on the input bam file name) in the current working directory.
**4. Polishing**
A. Building frequency tables
java -server -Xms8g -Xmx8g -cp Marianas.jar org.mskcc.marianas.polishing.NoiseFrequencyBuilder pileup-directory
where pileup-directory contains Waltz pileups for the chosen set of normals. This will create 2 files: af-frequencies.txt and count-frequencies.txt. This step is executed only when there is a change in the chosen set of normals.
B. Polishing variants
java -server -Xms8g -Xmx8g -cp Marianas.jar org.mskcc.marianas.polishing.Polisher maf-file depth-column-name alt-count-column-name af-frequencies-file count-frequencies-file
These columns are required in the input maf file: Chromosome, Start_Position, Variant_Type, Reference_Allele, Tumor_Seq_Allele2.
The command will create a -polished.maf file that has 3 additional columns at the end:
Polishing_Position_Average_Coverage: Average coverage at the mutation position in the normal set
Polishing_P_Value: Polishing p-value that can be used to filter out low confidence calls
Fragment_Count->Samples_Map: A set of key-value pairs for the mutation where key is the alt count and value is the number of normals with that alt countPolishing will only be performed for point substitutions and single base deletions. The above fields will contain a "-" for other types of mutations.