# Bioinformatics one-liners

[![DOI](https://zenodo.org/badge/3882/stephenturner/oneliners.svg)](https://zenodo.org/badge/latestdoi/3882/stephenturner/oneliners)

Useful bash one-liners for bioinformatics (and [some, more generally useful](#etc)).

## Contents

- [Basic awk & sed](#basic-awk--sed)
- [awk & sed for bioinformatics](#awk--sed-for-bioinformatics)
- [sort, uniq, cut, etc.](#sort-uniq-cut-etc)
- [find, xargs, and GNU parallel](#find-xargs-and-gnu-parallel)
- [seqtk](#seqtk)
- [GFF3 Annotations](#gff3-annotations)
- [Other generally useful aliases for your .bashrc](#other-generally-useful-aliases-for-your-bashrc)
- [Etc.](#etc)

## Basic awk & sed

[[back to top](#contents)]

Extract fields 2, 4, and 5 from file.txt:

awk '{print $2,$4,$5}' file.txt

Print each line where the 5th field is equal to ‘abc123’:

awk '$5 == "abc123"' file.txt

Print each line where the 5th field is *not* equal to ‘abc123’:

awk '$5 != "abc123"' file.txt

Print each line whose 7th field matches the regular expression:

awk '$7 ~ /^[a-f]/' file.txt

Print each line whose 7th field *does not* match the regular expression:

awk '$7 !~ /^[a-f]/' file.txt

Get unique entries in file.txt based on column 2 (takes only the first instance):

awk '!arr[$2]++' file.txt
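
This works because `arr[$2]` counts how many times the value in column 2 has been seen; `!arr[$2]++` is true only on the first sighting, and awk's default action is to print the line. A quick demo with made-up input:

printf 'a 1\nb 1\nc 2\n' | awk '!arr[$2]++'   # prints "a 1" and "c 2"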

Print rows where column 3 is larger than column 5 in file.txt:

awk '$3>$5' file.txt

Sum column 1 of file.txt:

awk '{sum+=$1} END {print sum}' file.txt

Compute the mean of column 2:

awk '{x+=$2}END{print x/NR}' file.txt

Replace all occurrences of `foo` with `bar` in file.txt:

sed 's/foo/bar/g' file.txt

Trim leading spaces and tabs in file.txt:

sed 's/^[ \t]*//' file.txt

Trim trailing spaces and tabs in file.txt:

sed 's/[ \t]*$//' file.txt

Trim leading and trailing spaces and tabs in file.txt:

sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

Delete blank lines in file.txt:

sed '/^$/d' file.txt

Delete everything after and including a line containing `EndOfUsefulData`:

sed -n '/EndOfUsefulData/,$!p' file.txt
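
Equivalently, and a little easier to read, delete from the matching line to the end of the file:

sed '/EndOfUsefulData/,$d' file.txt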

Remove duplicate lines while preserving their order:

awk '!visited[$0]++' file.txt

## awk & sed for bioinformatics

[[back to top](#contents)]

Return all lines on chromosome 1 with position between 1MB and 2MB in file.txt (assumes the chromosome is in column 1 and the position in column 3; the same idea can be used to keep only variants above a given allele frequency):

awk '$1=="1" && $3>=1000000 && $3<=2000000' file.txt

Basic sequence statistics. Print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, its frequency, and its percentage of the total in myfile.fq:

cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
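
A commented sketch of the same logic, expanded for readability:

awk '(NR%4==2) {                  # line 2 of every 4-line FASTQ record is the sequence
    total++                       # count every read
    count[$1]++                   # tally identical sequences
} END {
    for (read in count) {
        if (count[read] > max) { max = count[read]; maxRead = read }
        if (count[read] == 1) unique++
    }
    print total, unique, unique*100/total, maxRead, max, max*100/total
}' myfile.fq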

Convert .bam back to .fastq:

samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq

Keep only top bit scores in blast hits (best bit score only):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt

Keep only top bit scores in blast hits (5 less than the top):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt

Split a multi-FASTA file into individual FASTA files:

awk '/^>/{s=++d".fa"} {print > s}' multi.fa
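
A variant that names each output file after its sequence header instead (assumes headers are unique and filename-safe; with very many sequences you may need gawk, which manages open file handles for you):

awk '/^>/{s=substr($1,2)".fa"} {print > s}' multi.fa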

Output sequence name and its length for every sequence within a fasta file:

cat file.fa | awk '/^>/ {if (n++) print len; printf substr($0,2,100) "\t"; len=0; next} {len+=length($0)} END {print len}'

Convert a FASTQ file to FASTA:

sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa

Extract every 4th line starting at the second line (extract the sequence from FASTQ file):

sed -n '2~4p' file.fq

Print everything except the first line:

awk 'NR>1' input.txt

Print rows 20-80:

awk 'NR>=20&&NR<=80' input.txt

Calculate the sum of columns 2 and 3 and append it to the end of each row:

awk '{print $0,$2+$3}' input.txt

Calculate the mean length of reads in a fastq file:

awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq

Convert a VCF file to a BED file:

sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'

Create a tab-delimited transcript-to-gene mapping table from a GENCODE GTF. `substr(x,s,n)` returns _n_ characters of string _x_ starting at position _s_; here it strips the surrounding quotes and trailing semicolon.

bioawk -c gff '$feature=="transcript" {print $group}' gencode.v28.annotation.gtf | awk -F ' ' '{print substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' > txp2gene.tsv
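
A quick demonstration of the `substr` trimming on one GTF-style token:

echo '"ENST00000456328.2";' | awk '{print substr($1,2,length($1)-3)}'   # ENST00000456328.2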

Extract a specific read from a FASTQ file by read name:

zcat a.fastq.gz | awk 'BEGIN{RS="@";FS="\n"}; $1~/readsName/{print $2; exit}'

Count the number of missing genotypes per site in a VCF/BCF file:

bcftools query -f '[%GT\t]\n' a.bcf | awk '{miss=0};{for (x=1; x<=NF; x++) if ($x=="./.") {miss+=1}};{print miss}' > nmiss.count

Interleave paired-end fastq files:

paste <(paste - - - - < reads-1.fastq) \
<(paste - - - - < reads-2.fastq) \
| tr '\t' '\n' \
> reads-int.fastq

Decouple an interleaved fastq file:

paste - - - - - - - - < reads-int.fastq \
| tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
| cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
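
Both of these rely on `paste` with one `-` per column, which reads that many lines at a time and joins them with tabs; an easy way to see the mechanics:

seq 8 | paste - - - -   # two tab-separated rows: 1 2 3 4 and 5 6 7 8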

## sort, uniq, cut, etc.

[[back to top](#contents)]

Number each line in file.txt:

cat -n file.txt

Count the number of unique lines in file.txt:

cat file.txt | sort -u | wc -l

Find lines shared by 2 files (assumes lines within file1 and file2 are unique; pipe to `wc -l` to count the _number_ of lines shared):

sort file1 file2 | uniq -d

# Safer
sort -u file1 > a
sort -u file2 > b
sort a b | uniq -d

# Use comm (inputs must be sorted)
comm -12 <(sort file1) <(sort file2)
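
`comm` can also take set differences: column 1 holds lines unique to file1, column 2 lines unique to file2, column 3 lines in both, and the numeric flags suppress columns:

comm -23 <(sort file1) <(sort file2)   # lines only in file1
comm -13 <(sort file1) <(sort file2)   # lines only in file2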

Sort numerically by column 9 (`-k9`), using general numeric sort (`-g`) to handle values in scientific notation such as 1e-10:

sort -gk9 file.txt

Find the most common strings in column 2:

cut -f2 file.txt | sort | uniq -c | sort -k1nr | head

Pick 10 random lines from a file:

shuf file.txt | head -n 10

Print all possible 3mer DNA sequence combinations:

echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
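
Brace expansion takes the Cartesian product, so a k-mer list just needs k copies of the base set; for example, verify that there are 64 3-mers:

echo {A,C,T,G}{A,C,T,G}{A,C,T,G} | tr ' ' '\n' | wc -l   # 64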

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to split them into separate /1 and /2 files (assuming the /1 reads precede the /2 reads):

cat interleaved.fq |paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq

Take a fasta file containing a bunch of short scaffolds (e.g., labeled `>Scaffold12345`) and write a new fasta with them removed:

samtools faidx genome.fa && grep -v Scaffold genome.fa.fai | cut -f1 | xargs -n1 samtools faidx genome.fa > genome.noscaffolds.fa

Display hidden control characters:

python3 -c "print(open('file.txt').readlines())"

## find, xargs, and GNU parallel

[[back to top](#contents)]

*Download GNU parallel at https://www.gnu.org/software/parallel/.*

Search for .bam files anywhere in the current directory recursively:

find . -name "*.bam"

Delete all .bam files (Irreversible: use with caution! Confirm list BEFORE deleting):

find . -name "*.bam" | xargs rm

Find the largest file in a directory:

find . -type f -printf '%s %p\n' | sort -nr | head -1 | cut -d' ' -f2-

Count the number of lines in all files with a specific extension in a directory:

find . -type f -name '*.txt' -exec wc -l {} \; | awk '{total += $1} END {print total}'

Find all directories in the current directory that contain files with the extension ".log" and compress them into a tar archive:

find . -type f -name "*.log" -printf "%h\0" | sort -uz | xargs -0 tar -czvf logs.tar.gz

Rename all .txt files to .bak (backup *.txt before doing something else to them, for example):

find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh

Chastity filter raw Illumina data (grep reads containing `:N:`, append (-A) the three lines after the match containing the sequence and quality info, and write a new filtered fastq file):

find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"

Run FASTQC in parallel 12 jobs at a time:

find *.fq | parallel -j 12 "fastqc {} --outdir ."

Index your bam files in parallel, but only echo the commands (`--dry-run`) rather than actually running them:

find *.bam | parallel --dry-run 'samtools index {}'
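
parallel can also take its arguments after `:::` instead of from stdin; this is equivalent to the find-based version above:

parallel --dry-run 'samtools index {}' ::: *.bam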

## seqtk

[[back to top](#contents)]

*Download seqtk at https://github.com/lh3/seqtk. Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.*

Convert FASTQ to FASTA:

seqtk seq -a in.fq.gz > out.fa

Convert ILLUMINA 1.3+ FASTQ to FASTA, masking bases with quality lower than 20 as lowercase (first command) or as `N` (second):

seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa

Fold long FASTA/Q lines and remove FASTA/Q comments:

seqtk seq -Cl60 in.fa > out.fa

Convert multi-line FASTQ to 4-line FASTQ:

seqtk seq -l0 in.fq > out.fq

Reverse complement FASTA/Q:

seqtk seq -r in.fq > out.fq

Extract sequences with names in file `name.lst`, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq

Extract sequences in regions contained in file `reg.bed`:

seqtk subseq in.fa reg.bed > out.fa

Mask regions in `reg.bed` to lowercase:

seqtk seq -M reg.bed in.fa > out.fa

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
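
For files too large to hold in memory, seqtk's two-pass mode (`-2`) runs slower but uses far less RAM (again, keep the seed identical across mates):

seqtk sample -2 -s100 read1.fq 10000 > sub1.fq
seqtk sample -2 -s100 read2.fq 10000 > sub2.fq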

Trim low-quality bases from both ends using the Phred algorithm:

seqtk trimfq in.fq > out.fq

Trim 5bp from the left end of each read and 10bp from the right end:

seqtk trimfq -b 5 -e 10 in.fa > out.fa

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to split them into separate /1 and /2 files (assuming the /1 reads precede the /2 reads):

seqtk seq -l0 -1 interleaved.fq > deinterleaved_1.fq
seqtk seq -l0 -2 interleaved.fq > deinterleaved_2.fq

## GFF3 Annotations

[[back to top](#contents)]

Print all sequences annotated in a GFF3 file.

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq

Determine all feature types annotated in a GFF3 file.

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq

Determine the number of genes annotated in a GFF3 file.

grep -c $'\tgene\t' yourannots.gff3

Extract all gene IDs from a GFF3 file.

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)'

Print length of each gene in a GFF3 file.

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)'
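
The same computation in awk, selecting on the feature-type column directly (assumes a well-formed, tab-delimited GFF3):

awk -F '\t' '$3=="gene" {print $5-$4+1}' yourannots.gff3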

FASTA header lines to GFF format (assuming the length is in the header as an appended "\_length" as in [Velvet](http://www.ebi.ac.uk/~zerbino/velvet/) assembled transcripts):

grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' > file.gff

## Other generally useful aliases for your .bashrc

[[back to top](#contents)]

Get a prompt that looks like `user@hostname:/full/path/cwd$ `:

export PS1="\u@\h:\w\\$ "

Never type `cd ../../..` again (or use [autojump](https://github.com/joelthelion/autojump), which enables you to navigate the filesystem faster):

alias ..='cd ..'
alias ...='cd ../../'
alias ....='cd ../../../'
alias .....='cd ../../../../'
alias ......='cd ../../../../../'

Browse 'up' and 'down':

alias u='clear; cd ../; pwd; ls -lhGgo'
alias d='clear; cd -; ls -lhGgo'

Ask before removing or overwriting files:

alias mv="mv -i"
alias cp="cp -i"
alias rm="rm -i"

My favorite `ls` aliases:

alias ls="ls -1p --color=auto"
alias l="ls -lhGgo"
alias ll="ls -lh"
alias la="ls -lhGgoA"
alias lt="ls -lhGgotr"
alias lS="ls -lhGgoSr"
alias l.="ls -lhGgod .*"
alias lhead="ls -lhGgo | head"
alias ltail="ls -lhGgo | tail"
alias lmore='ls -lhGgo | more'

Use `cut` on space- or comma-delimited files:

alias cuts="cut -d \" \""
alias cutc="cut -d \",\""
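
For example, pull the third column out of a comma-delimited file (hypothetical data.csv):

cutc -f3 data.csv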

Pack and unpack tar.gz files:

alias tarup="tar -zcf"
alias tardown="tar -zxf"

Or use a generalized `extract` function:

# as suggested by Mendel Cooper in "Advanced Bash Scripting Guide"
extract () {
    if [ -f "$1" ]; then
        case "$1" in
            *.tar.bz2)  tar xvjf "$1"    ;;
            *.tar.gz)   tar xvzf "$1"    ;;
            *.tar.xz)   tar Jxvf "$1"    ;;
            *.bz2)      bunzip2 "$1"     ;;
            *.rar)      unrar x "$1"     ;;
            *.gz)       gunzip "$1"      ;;
            *.tar)      tar xvf "$1"     ;;
            *.tbz2)     tar xvjf "$1"    ;;
            *.tgz)      tar xvzf "$1"    ;;
            *.zip)      unzip "$1"       ;;
            *.Z)        uncompress "$1"  ;;
            *.7z)       7z x "$1"        ;;
            *)          echo "don't know how to extract '$1'..." ;;
        esac
    else
        echo "'$1' is not a valid file!"
    fi
}
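
Usage:

extract reads.tar.gz   # equivalent to tar xvzf reads.tar.gz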

Use `mcd` to create a directory and `cd` to it simultaneously:

function mcd { mkdir -p "$1" && cd "$1";}

Go up to the parent directory and list its contents:

alias u="cd ..;ls"

Make grep pretty:

alias grep="grep --color=auto"

Refresh your `.bashrc`:

alias refresh="source ~/.bashrc"

Edit your `.bashrc`:

alias eb="vi ~/.bashrc"

Common typos:

alias mf="mv -i"
alias mroe="more"
alias c='clear'
alias emacs='vim'

Show your `$PATH` in a prettier format:

alias showpath='echo $PATH | tr ":" "\n" | nl'

Use [pandoc](http://johnmacfarlane.net/pandoc/) to convert a markdown file to PDF:

# USAGE: mdpdf document.md document.md.pdf
alias mdpdf="pandoc -s -V geometry:margin=1in -V documentclass:article -V fontsize=12pt"

Find text in any file (`ft "mytext" *.txt`):

function ft { find . -name "$2" -exec grep -il "$1" {} \;; }

## Etc.

[[back to top](#contents)]

Run the last command as root:

sudo !!

Place the argument of the most recent command on the shell:

ALT+.   # or press ESC followed by .

Type a partial command, kill it, check something you forgot, yank the command back, and resume typing:

CTRL-u [...] CTRL-y

Jump to a directory, execute a command, and jump back to the current directory:

(cd /tmp && ls)

Stopwatch (`Enter` or `ctrl-d` to stop):

time read

Create a script of the last executed command:

echo "!!" > foo.sh

Reuse _all_ parameters of the previous command line:

!*

List or delete all files in a folder that don't match a certain file extension (e.g., list things that are _not_ compressed; remove anything that is _not_ a `.foo` or `.bar` file):

shopt -s extglob   # extended globbing must be enabled for !() patterns
ls !(*.gz)
rm !(*.foo|*.bar)

Insert the last command without the last argument:

!:-

Rapidly invoke an editor to write a long, complex, or tricky command:

fc

Print a specific line (e.g. line 42) from a file:

sed -n '42p' file.txt

Terminate a frozen SSH session (enter a new line, type the `~` key then the `.` key):

[ENTER]~.

Remove blank lines from a file using grep and save output to new file:

grep . filename > newfilename

Find large files (e.g., >500M):

find . -type f -size +500M

Exclude a column with cut (e.g., all but the 5th field in a tab-delimited file):

cut -f5 --complement file.txt

Find files containing text (`-l` outputs only the file names, `-i` ignores case, `-r` descends into subdirectories):

grep -lir "some text" *