https://github.com/wckdouglas/sequencing_tools
A python package storing different tools for different NGS operations
https://github.com/wckdouglas/sequencing_tools
bam-files fastq ngs ngs-analysis rna rna-seq-analysis
Last synced: 3 months ago
JSON representation
A python package storing different tools for different NGS operations
- Host: GitHub
- URL: https://github.com/wckdouglas/sequencing_tools
- Owner: wckdouglas
- License: mit
- Created: 2017-04-25T21:38:35.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2021-09-13T00:20:24.000Z (almost 4 years ago)
- Last Synced: 2025-02-03T09:45:43.671Z (4 months ago)
- Topics: bam-files, fastq, ngs, ngs-analysis, rna, rna-seq-analysis
- Language: HTML
- Homepage: https://sequencing-tools.readthedocs.io/en/latest/
- Size: 4.83 MB
- Stars: 7
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Sequencing tools #
[](https://github.com/wckdouglas/sequencing_tools/actions/workflows/miniconda_CI.yml)[](https://sequencing-tools.readthedocs.io/en/latest/?badge=latest)[](https://hub.docker.com/repository/docker/wckdouglas/sequencing_tools/general)
This repo stores different utils for TGIRT-seq/NGS pipelines. This consist of several [ready-to-use scripts](#scripts) and two [module](#modules) for operations on fastq files and BAM alignments.
---
## Installation ##
```
git clone https://github.com/wckdouglas/sequencing_tools.git
cd sequencing_tools
pip install -r requirements.txt
python setup.py install --user
```Prerequisits:
* pysam>0.11.0
* numpy>=1.12.1
* cython>=0.25
* matplotlib>=2.0.0
* scipy>=0.19.0
* seaborn>=0.7.1
* networkx
* ujson
* six
* pytest---
Ready-to-use scripts
* [umi clipper](#clip)
* [Paired end fastq merger](#merge_pe)
* [Unique BAM splitter](#split_bam)
* [Multimap BAM reducer](#fix_multi)
* [BAM to BED converter](#b2b)
* [UMI tagger](#bam_tag)
* [BED demultiplexer](#dedup_bed)
* [FastQ deinterleave](#deinterleaved)
* [Poisson UMI](#poisson_umi)
* [BAM base counter](#base_count)UMI clipper
Clipping UMI from paired end fastq
```
usage: seqtools clipFQ [-h] [-o OUT_FILE] -1 FASTQ1 -2 FASTQ2 [-x IDXBASE]
[-q BARCODECUTOFF] [-a MISMATCH] [-r {read1,read2}]Clip the barcode sequence and attached to the front of seq id
optional arguments:
-h, --help show this help message and exit
-o OUT_FILE, --out_file OUT_FILE
Interleaved Fastq files (default: -)
-1 FASTQ1, --fastq1 FASTQ1
Paired end Fastq file 1 with four line/record
-2 FASTQ2, --fastq2 FASTQ2
Paired end Fastq file 2 with four line/record
-x IDXBASE, --idxBase IDXBASE
how many base in 5' end as index? (default:
XXXXXXXXXXXXX) X as umi bases, can also add constant
regions at the back, such as XXXXCATGC, if CATGC is
the constant region
-q BARCODECUTOFF, --barcodeCutOff BARCODECUTOFF
Average base calling quality for barcode sequence
(default=20)
-a MISMATCH, --mismatch MISMATCH
Allow how many mismatch in constant region (deflaut:
1)
-r {read1,read2}, --read {read1,read2}
Which read is the UMI on?
```Paired end fastq merger
Merging paired end fastq to a fragment, only output reads that are successfully merged.
```
usage: seqtools mergepe [-h] [-i INTERLEAVED] [-o OUTFILE] [-m MIN_LEN]
[-e ERROR] [-a]Merging interleaved, paired-end fastq file and output overlapped regions only
with error correction using cutadapt module to find overlapping regionsoptional arguments:
-h, --help show this help message and exit
-i INTERLEAVED, --interleaved INTERLEAVED
Interleaved Fastq files (default: -)
-o OUTFILE, --outfile OUTFILE
Merged fastq file (default: -)
-m MIN_LEN, --min_len MIN_LEN
Minimum length of sequence to output (default: 18)
-e ERROR, --error ERROR
Maximum error rate of alignment (default: 0.1)
-a, --all Output all bases (default: only overlapping regions)
```Unique bam splitter
This script split bowtie2/hisat2 bam file to uniquely mapped or multiple mapped alignments
```
usage: seqtools splitBam [-h] [-a {bowtie2,hisat2}] [-i INBAM] -o OUTPREFIXSplitting bam file to uniquely mapped and multiple mapped (only support
bowtie2 and hisat2)optional arguments:
-h, --help show this help message and exit
-a {bowtie2,hisat2}, --aligner {bowtie2,hisat2}
PRogram generating the bam file (default: bowtie2)
-i INBAM, --inBam INBAM
bam file name, or stdin (default: stdin)
-o OUTPREFIX, --outprefix OUTPREFIX
output prefix ($OUTPUTPREFIX.unique.bam,
$OUTPUTPREFIX.multi.bam)
```Multimap BAM reducer
This scripts process multiply mapped reads and output ribosomal reads or the shortest read pairs from the alignments
input bam file should be generated from split_uniq_bam.py```
usage: seqtools filterMulti [-h] -i INFILE -o OUTFILE [-b] [-z]Process multiply mapped bam and get either shortest insert size or ribosomal
readsoptional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Input bam/sam file (use < - > for stdin)
-o OUTFILE, --outfile OUTFILE
Output bam/sam file (use < - > for stdout)
-b, --bam_in Provide this flag if bam instead of sam is used as
input
-z, --bam_out Provide this flag if bam is needed for output
```BAM to bed converter
```
usage: seqtools bam2bed [-h] -i IN_BAM [-o OUT_BED] [-m MIN_SIZE] [-M MAX_SIZE]
[-t TAG] [-a]Making paired-end bam into bed file for every fragment
optional arguments:
-h, --help show this help message and exit
-i IN_BAM, --in_bam IN_BAM
BAM file name, or stdin (-) ** name sorted
-o OUT_BED, --out_bed OUT_BED
BED file output (default: - )
-m MIN_SIZE, --min_size MIN_SIZE
minimum fragment size to report
-M MAX_SIZE, --max_size MAX_SIZE
minimum fragment size to report
-t TAG, --tag TAG tag to extract
-a, --all output supplementary as single fragment
```UMI tagger
Adding UMI sequence to RX tag in bam. Alignments must be from fastq files generated by [clip_fastq.py](#clip)
```
usage: seqtools bamTag [-h] -i IN_BAM [-o OUT_BAM] [-t TAG] [-d DELIM]
[-f FRAGMENT]Putting UMI as RX tag in bam, enables picard Markduplicates with
BARCODE_TAG=RXoptional arguments:
-h, --help show this help message and exit
-i IN_BAM, --in_bam IN_BAM
BAM file name, or stdin (-)
-o OUT_BAM, --out_bam OUT_BAM
BAM file output (default: - )
-t TAG, --tag TAG Tag id (default: RX )
-d DELIM, --delim DELIM
Deliminator separating read id and bc (default: _ )
-f FRAGMENT, --fragment FRAGMENT
after splitting read name using {delim}, which fragment
is UMI? can use -1 as last piece (default: 0)
```Deuplicate BED
Deduplicate fragments with UMI in a bed file.
```
usage: seqtools dedup [-h] [-i INFILE] [-o OUTFILE] [-t THRESHOLD]
[-d DELIM] [-f F]Demultiplexing UMI bed file with the follwing columns: 1. chrom name 2. start
3. end 4. {$UMI}_{$READ_ID} 5. score 6.strand The program internally used
hamming distance matrix of the barcodes to generate connected network and
identified UMI clustersoptional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
input bedfile, can be "-" or "/dev/stdin" for stdin
(default: -).Format should be sorted BED: 1. chrom 2.
start 3. end 4. {barcode}_{id} 6. strand
-o OUTFILE, --outfile OUTFILE
output bedfile, can be "-" or "/dev/stdout" for stdin
(default: -).
-t THRESHOLD, --threshold THRESHOLD
How many error between barcodes can be tolerated?
(default = 1)
-d DELIM, --delim DELIM
Deliminator separating read id and bc (default: _ )
-f F after splitting read name using {delim}, which
fragmnet is UMI? can use -1 as last piece (default: 0)
```Deinterleaved fastq
Deinterleaved an interleaved fastq file.
```
usage: seqtools deinterleave [-h] [-i INFQ] -1 READ1 [-2 READ2]Deinterleaving an interleaved fastq file
optional arguments:
-h, --help show this help message and exit
-i INFQ, --infq INFQ interleaved fq (defualt: -)
-1 READ1, --read1 READ1
read 1 fastq output
-2 READ2, --read2 READ2
read 2 fastq output
```Adjust fragment counts for UMI saturations
UMI saturation can be a problem for highly-expressed geens, this tool adjust the fragment count by implementing a [poisson model](https://www.ncbi.nlm.nih.gov/pubmed/21562209).

where *m* is the diversity of UMI (*4^nt*, *nt* is number of bases as UMI), *k* is the number of different UMI being detected for the fragment and *n* is the true number of fragments that we are interested at.
Input for this tool is a 6-columns BED file with each fragment named as {UMI}_{READ_ID}
```
usage: seqtools calcUMI [-h] -i IN_BED [-o OUT_BED] [--umi UMI]Adjusting fragment count for UMI saturation, see paper: Counting individual
DNA molecules by the stochastic attachment of diverse labels.optional arguments:
-h, --help show this help message and exit
-i IN_BED, --in_bed IN_BED
BED file name, or stdin (-) ** name sorted
-o OUT_BED, --out_bed OUT_BED
BED file output (default: - )
--umi UMI Number of nucleotide as umi (default: 6)
```Extracting base count from BAM along the genome/genomic regions
```
usage: seqtools pileup [-h] -i BAM -f FASTA [-b BASES] [-q QUAL]
[-c CROP] [-r BED] [--no_indel]
[--min_coverage MIN_COVERAGE]
[--concensus_fasta CONCENSUS_FASTA]Pileup whole genome, only output bases where coverage > 0
optional arguments:
-h, --help show this help message and exit
-i BAM, --bam BAM Input bam file (indexed)
-f FASTA, --fasta FASTA
reference fasta file
-b BASES, --bases BASES
number of bases to look at every iteration (default:
100000)
-q QUAL, --qual QUAL base quality to filter (defulat: 30)
-c CROP, --crop CROP Crop how many bases from ends (defulat: 0)
-r BED, --bed BED bed file for regions (default: whole genome)
--no_indel Not considering alignments with Indel
--min_coverage MIN_COVERAGE
Minimum coverage to output
--concensus_fasta CONCENSUS_FASTA
Generate concensus fasta (only work if bed file is
provided)
```---
Modules
Detail usage can be found at [readthedocs](https://sequencing-tools.readthedocs.io/en/latest/)
### viz_tools.cor_plot ####
A specialized paired correlation plot
```
from sequencing_tools.viz_tools import cor_plot
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_style('white')d = np.random.rand(100,10)
d = pd.DataFrame(d)
fig = plt.figure()
cor_plot(d, fig)
fig.savefig('cor.png',bbox_inches='tight')
```