# Bioinformatics one-liners

[![DOI](https://zenodo.org/badge/3882/stephenturner/oneliners.svg)](https://zenodo.org/badge/latestdoi/3882/stephenturner/oneliners)

Useful bash one-liners for bioinformatics (and [some, more generally useful](#etc)).

## Contents

- [Basic awk & sed](#basic-awk--sed)
- [awk & sed for bioinformatics](#awk--sed-for-bioinformatics)
- [sort, uniq, cut, etc.](#sort-uniq-cut-etc)
- [find, xargs, and GNU parallel](#find-xargs-and-gnu-parallel)
- [seqtk](#seqtk)
- [GFF3 Annotations](#gff3-annotations)
- [Other generally useful aliases for your .bashrc](#other-generally-useful-aliases-for-your-bashrc)
- [Etc.](#etc)

## Basic awk & sed

[[back to top](#contents)]

Extract fields 2, 4, and 5 from file.txt:

awk '{print $2,$4,$5}' file.txt

Print each line where the 5th field is equal to ‘abc123’:

awk '$5 == "abc123"' file.txt

Print each line where the 5th field is *not* equal to ‘abc123’:

awk '$5 != "abc123"' file.txt

Print each line whose 7th field matches the regular expression:

awk '$7 ~ /^[a-f]/' file.txt

Print each line whose 7th field *does not* match the regular expression:

awk '$7 !~ /^[a-f]/' file.txt

Get unique entries in file.txt based on column 2 (takes only the first instance):

awk '!arr[$2]++' file.txt
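
This works because `arr[$2]` counts how many times the value in column 2 has been seen; `!arr[$2]++` is true only on the first sighting, and awk's default action is to print the line. A quick demo with made-up input:

printf 'a 1\nb 1\nc 2\n' | awk '!arr[$2]++'   # prints "a 1" and "c 2"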

Print rows where column 3 is larger than column 5 in file.txt:

awk '$3>$5' file.txt

Sum column 1 of file.txt:

awk '{sum+=$1} END {print sum}' file.txt

Compute the mean of column 2:

awk '{x+=$2}END{print x/NR}' file.txt

Replace all occurrences of `foo` with `bar` in file.txt:

sed 's/foo/bar/g' file.txt

Trim leading spaces and tabs in file.txt:

sed 's/^[ \t]*//' file.txt

Trim trailing spaces and tabs in file.txt:

sed 's/[ \t]*$//' file.txt

Trim leading and trailing spaces and tabs in file.txt:

sed 's/^[ \t]*//;s/[ \t]*$//' file.txt

Delete blank lines in file.txt:

sed '/^$/d' file.txt

Delete everything after and including a line containing `EndOfUsefulData`:

sed -n '/EndOfUsefulData/,$!p' file.txt
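
Equivalently, and a little easier to read, delete from the matching line to the end of the file:

sed '/EndOfUsefulData/,$d' file.txt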

Remove duplicate lines while preserving their order:

awk '!visited[$0]++' file.txt

## awk & sed for bioinformatics

[[back to top](#contents)]

Return all lines on chromosome 1 with position between 1MB and 2MB in file.txt (assumes the chromosome is in column 1 and the position in column 3; the same idea can be used to keep only variants above a given allele frequency):

awk '$1=="1" && $3>=1000000 && $3<=2000000' file.txt

Basic sequence statistics. Print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, its frequency, and its percentage of the total in myfile.fq:

cat myfile.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
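
A commented sketch of the same logic, expanded for readability:

awk '(NR%4==2) {                  # line 2 of every 4-line FASTQ record is the sequence
    total++                       # count every read
    count[$1]++                   # tally identical sequences
} END {
    for (read in count) {
        if (count[read] > max) { max = count[read]; maxRead = read }
        if (count[read] == 1) unique++
    }
    print total, unique, unique*100/total, maxRead, max, max*100/total
}' myfile.fq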

Convert .bam back to .fastq:

samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq

Keep only top bit scores in blast hits (best bit score only):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt

Keep only top bit scores in blast hits (5 less than the top):

awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt

Split a multi-FASTA file into individual FASTA files:

awk '/^>/{s=++d".fa"} {print > s}' multi.fa
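
A variant that names each output file after its sequence header instead (assumes headers are unique and filename-safe; with very many sequences you may need gawk, which manages open file handles for you):

awk '/^>/{s=substr($1,2)".fa"} {print > s}' multi.fa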

Output sequence name and its length for every sequence within a fasta file:

cat file.fa | awk '/^>/ {if (n++) print len; printf substr($0,2,100) "\t"; len=0; next} {len+=length($0)} END {print len}'

Convert a FASTQ file to FASTA:

sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa

Extract every 4th line starting at the second line (extract the sequence from FASTQ file):

sed -n '2~4p' file.fq

Print everything except the first line:

awk 'NR>1' input.txt

Print rows 20-80:

awk 'NR>=20&&NR<=80' input.txt

Calculate the sum of columns 2 and 3 and append it to the end of each row:

awk '{print $0,$2+$3}' input.txt

Calculate the mean length of reads in a fastq file:

awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq

Convert a VCF file to a BED file:

sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'

Create a tab-delimited transcript-to-gene mapping table from a GENCODE GTF. `substr(x,s,n)` returns _n_ characters of string _x_ starting at position _s_; here it strips the surrounding quotes and trailing semicolon.

bioawk -c gff '$feature=="transcript" {print $group}' gencode.v28.annotation.gtf | awk -F ' ' '{print substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' > txp2gene.tsv
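
A quick demonstration of the `substr` trimming on one GTF-style token:

echo '"ENST00000456328.2";' | awk '{print substr($1,2,length($1)-3)}'   # ENST00000456328.2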

Extract a specific read from a FASTQ file by read name:

zcat a.fastq.gz | awk 'BEGIN{RS="@";FS="\n"}; $1~/readsName/{print $2; exit}'

Count the number of missing genotypes per site in a VCF/BCF file:

bcftools query -f '[%GT\t]\n' a.bcf | awk '{miss=0};{for (x=1; x<=NF; x++) if ($x=="./.") {miss+=1}};{print miss}' > nmiss.count

Interleave paired-end fastq files:

paste <(paste - - - - < reads-1.fastq) \
<(paste - - - - < reads-2.fastq) \
| tr '\t' '\n' \
> reads-int.fastq

Decouple an interleaved fastq file:

paste - - - - - - - - < reads-int.fastq \
| tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
| cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
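
Both of these rely on `paste` with one `-` per column, which reads that many lines at a time and joins them with tabs; an easy way to see the mechanics:

seq 8 | paste - - - -   # two tab-separated rows: 1 2 3 4 and 5 6 7 8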

## sort, uniq, cut, etc.

[[back to top](#contents)]

Number each line in file.txt:

cat -n file.txt

Count the number of unique lines in file.txt:

cat file.txt | sort -u | wc -l

Find lines shared by 2 files (assumes lines within file1 and file2 are unique; pipe to `wc -l` to count the _number_ of lines shared):

sort file1 file2 | uniq -d

# Safer
sort -u file1 > a
sort -u file2 > b
sort a b | uniq -d

# Use comm (inputs must be sorted)
comm -12 <(sort file1) <(sort file2)
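
`comm` can also take set differences: column 1 holds lines unique to file1, column 2 lines unique to file2, column 3 lines in both, and the numeric flags suppress columns:

comm -23 <(sort file1) <(sort file2)   # lines only in file1
comm -13 <(sort file1) <(sort file2)   # lines only in file2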

Sort numerically by column 9 (`-k9`), using general numeric sort (`-g`) to handle values in scientific notation such as 1e-10:

sort -gk9 file.txt

Find the most common strings in column 2:

cut -f2 file.txt | sort | uniq -c | sort -k1nr | head

Pick 10 random lines from a file:

shuf file.txt | head -n 10

Print all possible 3mer DNA sequence combinations:

echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
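
Brace expansion takes the Cartesian product, so a k-mer list just needs k copies of the base set; for example, verify that there are 64 3-mers:

echo {A,C,T,G}{A,C,T,G}{A,C,T,G} | tr ' ' '\n' | wc -l   # 64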

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to split them into separate /1 and /2 files (assuming the /1 reads precede the /2 reads):

cat interleaved.fq |paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq

Take a fasta file containing a bunch of short scaffolds (e.g., labeled `>Scaffold12345`) and write a new fasta with them removed:

samtools faidx genome.fa && grep -v Scaffold genome.fa.fai | cut -f1 | xargs -n1 samtools faidx genome.fa > genome.noscaffolds.fa

Display hidden control characters:

python3 -c "print(open('file.txt').readlines())"

## find, xargs, and GNU parallel

[[back to top](#contents)]

*Download GNU parallel at https://www.gnu.org/software/parallel/.*

Search for .bam files anywhere in the current directory recursively:

find . -name "*.bam"

Delete all .bam files (Irreversible: use with caution! Confirm list BEFORE deleting):

find . -name "*.bam" | xargs rm

Find the largest file in a directory:

find . -type f -printf '%s %p\n' | sort -nr | head -1 | cut -d' ' -f2-

Count the number of lines in all files with a specific extension in a directory:

find . -type f -name '*.txt' -exec wc -l {} \; | awk '{total += $1} END {print total}'

Find all directories in the current directory that contain files with the extension ".log" and compress them into a tar archive:

find . -type f -name "*.log" -printf "%h\0" | sort -uz | xargs -0 tar -czvf logs.tar.gz

Rename all .txt files to .bak (backup *.txt before doing something else to them, for example):

find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh

Chastity filter raw Illumina data (grep reads containing `:N:`, append (-A) the three lines after the match containing the sequence and quality info, and write a new filtered fastq file):

find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"

Run FASTQC in parallel 12 jobs at a time:

find *.fq | parallel -j 12 "fastqc {} --outdir ."

Index your bam files in parallel, but only echo the commands (`--dry-run`) rather than actually running them:

find *.bam | parallel --dry-run 'samtools index {}'
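
parallel can also take its arguments after `:::` instead of from stdin; this is equivalent to the find-based version above:

parallel --dry-run 'samtools index {}' ::: *.bam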

## seqtk

[[back to top](#contents)]

*Download seqtk at https://github.com/lh3/seqtk. Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also be optionally compressed by gzip.*

Convert FASTQ to FASTA:

seqtk seq -a in.fq.gz > out.fa

Convert ILLUMINA 1.3+ FASTQ to FASTA, masking bases with quality lower than 20 as lowercase (first command) or as `N` (second):

seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa

Fold long FASTA/Q lines and remove FASTA/Q comments:

seqtk seq -Cl60 in.fa > out.fa

Convert multi-line FASTQ to 4-line FASTQ:

seqtk seq -l0 in.fq > out.fq

Reverse complement FASTA/Q:

seqtk seq -r in.fq > out.fq

Extract sequences with names in file `name.lst`, one sequence name per line:

seqtk subseq in.fq name.lst > out.fq

Extract sequences in regions contained in file `reg.bed`:

seqtk subseq in.fa reg.bed > out.fa

Mask regions in `reg.bed` to lowercase:

seqtk seq -M reg.bed in.fa > out.fa

Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
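
For files too large to hold in memory, seqtk's two-pass mode (`-2`) runs slower but uses far less RAM (again, keep the seed identical across mates):

seqtk sample -2 -s100 read1.fq 10000 > sub1.fq
seqtk sample -2 -s100 read2.fq 10000 > sub2.fq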

Trim low-quality bases from both ends using the Phred algorithm:

seqtk trimfq in.fq > out.fq

Trim 5bp from the left end of each read and 10bp from the right end:

seqtk trimfq -b 5 -e 10 in.fa > out.fa

Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to split them into separate /1 and /2 files (assuming the /1 reads precede the /2 reads):

seqtk seq -l0 -1 interleaved.fq > deinterleaved_1.fq
seqtk seq -l0 -2 interleaved.fq > deinterleaved_2.fq

## GFF3 Annotations

[[back to top](#contents)]

Print all sequences annotated in a GFF3 file.

cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq

Determine all feature types annotated in a GFF3 file.

grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq

Determine the number of genes annotated in a GFF3 file.

grep -c $'\tgene\t' yourannots.gff3

Extract all gene IDs from a GFF3 file.

grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)'

Print length of each gene in a GFF3 file.

grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)'
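
The same computation in awk, selecting on the feature-type column directly (assumes a well-formed, tab-delimited GFF3):

awk -F '\t' '$3=="gene" {print $5-$4+1}' yourannots.gff3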

FASTA header lines to GFF format (assuming the length is in the header as an appended "\_length" as in [Velvet](http://www.ebi.ac.uk/~zerbino/velvet/) assembled transcripts):

grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' > file.gff

## Other generally useful aliases for your .bashrc

[[back to top](#contents)]

Get a prompt that looks like `user@hostname:/full/path/cwd$ `:

export PS1="\u@\h:\w\\$ "

Never type `cd ../../..` again (or use [autojump](https://github.com/joelthelion/autojump), which enables you to navigate the filesystem faster):

alias ..='cd ..'
alias ...='cd ../../'
alias ....='cd ../../../'
alias .....='cd ../../../../'
alias ......='cd ../../../../../'

Browse 'up' and 'down':

alias u='clear; cd ../; pwd; ls -lhGgo'
alias d='clear; cd -; ls -lhGgo'

Ask before removing or overwriting files:

alias mv="mv -i"
alias cp="cp -i"
alias rm="rm -i"

My favorite `ls` aliases:

alias ls="ls -1p --color=auto"
alias l="ls -lhGgo"
alias ll="ls -lh"
alias la="ls -lhGgoA"
alias lt="ls -lhGgotr"
alias lS="ls -lhGgoSr"
alias l.="ls -lhGgod .*"
alias lhead="ls -lhGgo | head"
alias ltail="ls -lhGgo | tail"
alias lmore='ls -lhGgo | more'

Use `cut` on space- or comma-delimited files:

alias cuts="cut -d \" \""
alias cutc="cut -d \",\""
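
For example, pull the third column out of a comma-delimited file (hypothetical data.csv):

cutc -f3 data.csv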

Pack and unpack tar.gz files:

alias tarup="tar -zcf"
alias tardown="tar -zxf"

Or use a generalized `extract` function:

# as suggested by Mendel Cooper in "Advanced Bash Scripting Guide"
extract () {
    if [ -f "$1" ]; then
        case "$1" in
            *.tar.bz2)  tar xvjf "$1"    ;;
            *.tar.gz)   tar xvzf "$1"    ;;
            *.tar.xz)   tar Jxvf "$1"    ;;
            *.bz2)      bunzip2 "$1"     ;;
            *.rar)      unrar x "$1"     ;;
            *.gz)       gunzip "$1"      ;;
            *.tar)      tar xvf "$1"     ;;
            *.tbz2)     tar xvjf "$1"    ;;
            *.tgz)      tar xvzf "$1"    ;;
            *.zip)      unzip "$1"       ;;
            *.Z)        uncompress "$1"  ;;
            *.7z)       7z x "$1"        ;;
            *)          echo "don't know how to extract '$1'..." ;;
        esac
    else
        echo "'$1' is not a valid file!"
    fi
}
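
Usage:

extract reads.tar.gz   # equivalent to tar xvzf reads.tar.gz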

Use `mcd` to create a directory and `cd` to it simultaneously:

function mcd { mkdir -p "$1" && cd "$1";}

Go up to the parent directory and list its contents:

alias u="cd ..;ls"

Make grep pretty:

alias grep="grep --color=auto"

Refresh your `.bashrc`:

alias refresh="source ~/.bashrc"

Edit your `.bashrc`:

alias eb="vi ~/.bashrc"

Common typos:

alias mf="mv -i"
alias mroe="more"
alias c='clear'
alias emacs='vim'

Show your `$PATH` in a prettier format:

alias showpath='echo $PATH | tr ":" "\n" | nl'

Use [pandoc](http://johnmacfarlane.net/pandoc/) to convert a markdown file to PDF:

# USAGE: mdpdf document.md document.md.pdf
alias mdpdf="pandoc -s -V geometry:margin=1in -V documentclass:article -V fontsize=12pt"

Find text in any file (`ft "mytext" *.txt`):

function ft { find . -name "$2" -exec grep -il "$1" {} \;; }

## Etc.

[[back to top](#contents)]

Run the last command as root:

sudo !!

Place the argument of the most recent command on the shell:

ALT+.   # or press ESC followed by .

Type a partial command, kill it, check something you forgot, yank the command back, and resume typing:

CTRL-u [...] CTRL-y

Jump to a directory, execute a command, and jump back to the current directory:

(cd /tmp && ls)

Stopwatch (`Enter` or `ctrl-d` to stop):

time read

Create a script of the last executed command:

echo "!!" > foo.sh

Reuse _all_ parameters of the previous command line:

!*

List or delete all files in a folder that don't match a certain file extension (e.g., list things that are _not_ compressed; remove anything that is _not_ a `.foo` or `.bar` file):

shopt -s extglob   # extended globbing must be enabled for !() patterns
ls !(*.gz)
rm !(*.foo|*.bar)

Insert the last command without the last argument:

!:-

Rapidly invoke an editor to write a long, complex, or tricky command:

fc

Print a specific line (e.g. line 42) from a file:

sed -n '42p' file.txt

Terminate a frozen SSH session (enter a new line, type the `~` key then the `.` key):

[ENTER]~.

Remove blank lines from a file using grep and save output to new file:

grep . filename > newfilename

Find large files (e.g., >500M):

find . -type f -size +500M

Exclude a column with cut (e.g., all but the 5th field in a tab-delimited file):

cut -f5 --complement file.txt

Find files containing text (`-l` outputs only the file names, `-i` ignores case, `-r` descends into subdirectories):

grep -lir "some text" *