# Bioinformatics one-liners
[![DOI](https://zenodo.org/badge/3882/stephenturner/oneliners.svg)](https://zenodo.org/badge/latestdoi/3882/stephenturner/oneliners)
Useful bash one-liners for bioinformatics (and [some, more generally useful](#etc)).
## Contents
- [Basic awk & sed](#basic-awk--sed)
- [awk & sed for bioinformatics](#awk--sed-for-bioinformatics)
- [sort, uniq, cut, etc.](#sort-uniq-cut-etc)
- [find, xargs, and GNU parallel](#find-xargs-and-gnu-parallel)
- [seqtk](#seqtk)
- [GFF3 Annotations](#gff3-annotations)
- [Other generally useful aliases for your .bashrc](#other-generally-useful-aliases-for-your-bashrc)
- [Etc.](#etc)

## Basic awk & sed
[[back to top](#contents)]
Extract fields 2, 4, and 5 from file.txt:
awk '{print $2,$4,$5}' file.txt
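If the file is comma-delimited rather than whitespace-delimited, set the field separator (`file.csv` here is just a stand-in name):
awk -F',' '{print $2,$4,$5}' file.csv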
Print each line where the 5th field is equal to ‘abc123’:
awk '$5 == "abc123"' file.txt
Print each line where the 5th field is *not* equal to ‘abc123’:
awk '$5 != "abc123"' file.txt
Print each line whose 7th field matches the regular expression:
awk '$7 ~ /^[a-f]/' file.txt
Print each line whose 7th field *does not* match the regular expression:
awk '$7 !~ /^[a-f]/' file.txt
Get unique entries in file.txt based on column 2 (takes only the first instance):
awk '!arr[$2]++' file.txt
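The same command annotated, since the idiom is terse:
# arr[$2] is 0 (false) the first time a value of column 2 appears, so the
# line prints; the ++ then marks that value as seen, making the test false
# for all later occurrences.
awk '!arr[$2]++' file.txt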
Print rows where column 3 is larger than column 5 in file.txt:
awk '$3>$5' file.txt
Sum column 1 of file.txt:
awk '{sum+=$1} END {print sum}' file.txt
Compute the mean of column 2:
awk '{x+=$2}END{print x/NR}' file.txt
Replace all occurrences of `foo` with `bar` in file.txt:
sed 's/foo/bar/g' file.txt
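To edit the file in place with GNU sed, keeping a backup copy (the `.bak` suffix is arbitrary):
sed -i.bak 's/foo/bar/g' file.txt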
Trim leading whitespace (spaces and tabs) in file.txt:
sed 's/^[ \t]*//' file.txt
Trim trailing whitespace in file.txt:
sed 's/[ \t]*$//' file.txt
Trim leading and trailing whitespace in file.txt:
sed 's/^[ \t]*//;s/[ \t]*$//' file.txt
Delete blank lines in file.txt:
sed '/^$/d' file.txt
Delete everything after and including a line containing `EndOfUsefulData`:
sed -n '/EndOfUsefulData/,$!p' file.txt
Remove duplicate lines while preserving order:
awk '!visited[$0]++' file.txt
## awk & sed for bioinformatics
[[back to top](#contents)]
Return all lines in file.txt where the chromosome (column 1) is 1 and the position (column 3) is between 1 Mb and 2 Mb. The same idea can be used to return only variants above a specific allele frequency:
cat file.txt | awk '$1=="1"' | awk '$3>=1000000' | awk '$3<=2000000'
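Equivalently, a single awk applies all three filters at once (same column assumptions):
awk '$1=="1" && $3>=1000000 && $3<=2000000' file.txt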
Basic sequence statistics. Print the total number of reads, the number of unique reads, the percentage of unique reads, the most abundant sequence, its frequency, and its percentage of the total, for file.fq:
cat file.fq | awk '((NR-2)%4==0){read=$1;total++;count[read]++}END{for(read in count){if(!max||count[read]>max) {max=count[read];maxRead=read};if(count[read]==1){unique++}};print total,unique,unique*100/total,maxRead,count[maxRead],count[maxRead]*100/total}'
Convert .bam back to .fastq:
samtools view file.bam | awk 'BEGIN {FS="\t"} {print "@" $1 "\n" $10 "\n+\n" $11}' > file.fq
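Recent samtools versions also provide a dedicated subcommand, which (unlike the awk approach) reverse-complements reads that aligned to the reverse strand:
samtools fastq file.bam > file.fq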
Keep only top bit scores in blast hits (best bit score only):
awk '{ if(!x[$1]++) {print $0; bitscore=($14-1)} else { if($14>bitscore) print $0} }' blastout.txt
Keep only top bit scores in blast hits (5 less than the top):
awk '{ if(!x[$1]++) {print $0; bitscore=($14-6)} else { if($14>bitscore) print $0} }' blastout.txt
Split a multi-FASTA file into individual FASTA files:
awk '/^>/{s=++d".fa"} {print > s}' multi.fa
Output sequence name and its length for every sequence within a fasta file:
cat file.fa | awk '$0 ~ ">" {print c; c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }'
Convert a FASTQ file to FASTA:
sed -n '1~4s/^@/>/p;2~4p' file.fq > file.fa
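The `first~step` address used above is a GNU sed extension; with only POSIX tools available, an awk equivalent (same 4-line FASTQ assumption):
awk 'NR%4==1{sub(/^@/,">"); print} NR%4==2' file.fq > file.fa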
Extract every 4th line starting at the second line (extract the sequence from FASTQ file):
sed -n '2~4p' file.fq
Print everything except the first line:
awk 'NR>1' input.txt
Print rows 20-80:
awk 'NR>=20&&NR<=80' input.txt
Calculate the sum of columns 2 and 3 and append it to the end of each row:
awk '{print $0,$2+$3}' input.txt
Calculate the mean length of reads in a fastq file:
awk 'NR%4==2{sum+=length($0)}END{print sum/(NR/4)}' input.fastq
Convert a VCF file to a BED file:
sed -e 's/chr//' file.vcf | awk '{OFS="\t"; if (!/^#/){print $1,$2-1,$2,$4"/"$5,"+"}}'
Create a tab-delimited transcript-to-gene mapping table from a GENCODE GTF. `substr(x,s,n)` returns _n_ characters from string _x_ starting at position _s_; here it strips the quotes and trailing semicolon.
bioawk -c gff '$feature=="transcript" {print $group}' gencode.v28.annotation.gtf | awk -F ' ' '{print substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' > txp2gene.tsv
Extract a specific read from a FASTQ file by read name:
zcat a.fastq.gz | awk 'BEGIN{RS="@";FS="\n"}; $1~/readsName/{print $2; exit}'
Count missing genotypes per site in a VCF/BCF file:
bcftools query -f '[%GT\t]\n' a.bcf | awk '{miss=0};{for (x=1; x<=NF; x++) if ($x=="./.") {miss+=1}};{print miss}' > nmiss.count
Interleave paired-end fastq files:
paste <(paste - - - - < reads-1.fastq) \
<(paste - - - - < reads-2.fastq) \
| tr '\t' '\n' \
    > reads-int.fastq
Decouple an interleaved fastq file:
paste - - - - - - - - < reads-int.fastq \
| tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
## sort, uniq, cut, etc.
[[back to top](#contents)]
Number each line in file.txt:
cat -n file.txt
Count the number of unique lines in file.txt:
cat file.txt | sort -u | wc -l
Find lines shared by 2 files (assumes lines within file1 and file2 are unique; pipe to `wc -l` to count the _number_ of lines shared):
sort file1 file2 | uniq -d

# Safer
sort -u file1 > a
sort -u file2 > b
sort a b | uniq -d

# Use comm (both files must be sorted)
comm -12 file1 file2
Sort numerically by column 9 (`-g` is general numeric sort, which also handles scientific notation; `-k9` selects the column):
sort -gk9 file.txt
Find the most common strings in column 2:
cut -f2 file.txt | sort | uniq -c | sort -k1nr | head
Pick 10 random lines from a file:
shuf file.txt | head -n 10
Print all possible 3mer DNA sequence combinations:
echo {A,C,T,G}{A,C,T,G}{A,C,T,G}
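Pipe through `tr` to get one k-mer per line:
echo {A,C,T,G}{A,C,T,G}{A,C,T,G} | tr ' ' '\n'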
Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to split them into separate /1 and /2 files (assuming the /1 reads precede the /2 reads):
cat interleaved.fq |paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > deinterleaved_1.fq) | cut -f 5-8 | tr "\t" "\n" > deinterleaved_2.fq
Take a fasta file with a bunch of short scaffolds, e.g., labeled `>Scaffold12345`, remove them, and write a new fasta without them:
samtools faidx genome.fa && grep -v Scaffold genome.fa.fai | cut -f1 | xargs -n1 samtools faidx genome.fa > genome.noscaffolds.fa
Display hidden control characters:
python -c "f = open('file.txt', 'r'); f.seek(0); file = f.readlines(); print file"
## find, xargs, and GNU parallel
[[back to top](#contents)]
*Download GNU parallel at <https://www.gnu.org/software/parallel/>.*
Search for .bam files anywhere in the current directory recursively:
find . -name "*.bam"
Delete all .bam files (Irreversible: use with caution! Confirm list BEFORE deleting):
find . -name "*.bam" | xargs rm
Find the largest file in a directory:
find . -type f -printf '%s %p\n' | sort -nr | head -1 | cut -d' ' -f2-
Count the number of lines in all files with a specific extension in a directory:
find . -type f -name '*.txt' -exec wc -l {} \; | awk '{total += $1} END {print total}'
Find all directories in the current directory that contain files with the extension ".log" and compress them into a tar archive:
find . -type f -name "*.log" -printf "%h\0" | sort -uz | xargs -0 tar -czvf logs.tar.gz
Rename all .txt files to .bak (backup *.txt before doing something else to them, for example):
find . -name "*.txt" | sed "s/\.txt$//" | xargs -i echo mv {}.txt {}.bak | sh
Chastity filter raw Illumina data (grep reads containing `:N:`, append (-A) the three lines after the match containing the sequence and quality info, and write a new filtered fastq file):
find *fq | parallel "cat {} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' > {}.filt.fq"
Run FASTQC in parallel 12 jobs at a time:
find *.fq | parallel -j 12 "fastqc {} --outdir ."
Index your bam files in parallel, but only echo the commands (`--dry-run`) rather than actually running them:
find *.bam | parallel --dry-run 'samtools index {}'
## seqtk
[[back to top](#contents)]
*Download seqtk at <https://github.com/lh3/seqtk>. Seqtk is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. It seamlessly parses both FASTA and FASTQ files, which can also optionally be gzip-compressed.*
Convert FASTQ to FASTA:
seqtk seq -a in.fq.gz > out.fa
Convert Illumina 1.3+ FASTQ to FASTA, masking bases with quality lower than 20 as lowercase (the first command) or as `N` (the second):
seqtk seq -aQ64 -q20 in.fq > out.fa
seqtk seq -aQ64 -q20 -n N in.fq > out.fa
Fold long FASTA/Q lines and remove FASTA/Q comments:
seqtk seq -Cl60 in.fa > out.fa
Convert multi-line FASTQ to 4-line FASTQ:
seqtk seq -l0 in.fq > out.fq
Reverse complement FASTA/Q:
seqtk seq -r in.fq > out.fq
Extract sequences with names in file `name.lst`, one sequence name per line:
seqtk subseq in.fq name.lst > out.fq
Extract sequences in regions contained in file `reg.bed`:
seqtk subseq in.fa reg.bed > out.fa
Mask regions in `reg.bed` to lowercases:
seqtk seq -M reg.bed in.fa > out.fa
Subsample 10000 read pairs from two large paired FASTQ files (remember to use the same random seed to keep pairing):
seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
Trim low-quality bases from both ends using the Phred algorithm:
seqtk trimfq in.fq > out.fq
Trim 5bp from the left end of each read and 10bp from the right end:
seqtk trimfq -b 5 -e 10 in.fa > out.fa
Untangle an interleaved paired-end FASTQ file. If a FASTQ file has paired-end reads intermingled and you want to split them into separate /1 and /2 files (assuming the /1 reads precede the /2 reads):
seqtk seq -l0 -1 interleaved.fq > deinterleaved_1.fq
seqtk seq -l0 -2 interleaved.fq > deinterleaved_2.fq
## GFF3 Annotations
[[back to top](#contents)]
Print all sequences annotated in a GFF3 file.
cut -s -f 1,9 yourannots.gff3 | grep $'\t' | cut -f 1 | sort | uniq
Determine all feature types annotated in a GFF3 file.
grep -v '^#' yourannots.gff3 | cut -s -f 3 | sort | uniq
Determine the number of genes annotated in a GFF3 file.
grep -c $'\tgene\t' yourannots.gff3
Extract all gene IDs from a GFF3 file.
grep $'\tgene\t' yourannots.gff3 | perl -ne '/ID=([^;]+)/ and printf("%s\n", $1)'
Print length of each gene in a GFF3 file.
grep $'\tgene\t' yourannots.gff3 | cut -s -f 4,5 | perl -ne '@v = split(/\t/); printf("%d\n", $v[1] - $v[0] + 1)'
FASTA header lines to GFF format (assuming the length is in the header as an appended "\_length" as in [Velvet](http://www.ebi.ac.uk/~zerbino/velvet/) assembled transcripts):
grep '>' file.fasta | awk -F "_" 'BEGIN{i=1; print "##gff-version 3"}{ print $0"\t BLAT\tEXON\t1\t"$10"\t95\t+\t.\tgene_id="$0";transcript_id=Transcript_"i;i++ }' > file.gff
## Other generally useful aliases for your .bashrc
[[back to top](#contents)]
Get a prompt that looks like `user@hostname:/full/path/cwd/:$ `
export PS1="\u@\h:\w\\$ "
Never type `cd ../../..` again (or use [autojump](https://github.com/joelthelion/autojump), which enables you to navigate the filesystem faster):
alias ..='cd ..'
alias ...='cd ../../'
alias ....='cd ../../../'
alias .....='cd ../../../../'
alias ......='cd ../../../../../'
Browse 'up' and 'down':
alias u='clear; cd ../; pwd; ls -lhGgo'
alias d='clear; cd -; ls -lhGgo'
Ask before removing or overwriting files:
alias mv="mv -i"
alias cp="cp -i"
alias rm="rm -i"My favorite `ls` aliases:
alias ls="ls -1p --color=auto"
alias l="ls -lhGgo"
alias ll="ls -lh"
alias la="ls -lhGgoA"
alias lt="ls -lhGgotr"
alias lS="ls -lhGgoSr"
alias l.="ls -lhGgod .*"
alias lhead="ls -lhGgo | head"
alias ltail="ls -lhGgo | tail"
alias lmore='ls -lhGgo | more'
Use `cut` on space- or comma-delimited files:
alias cuts="cut -d \" \""
alias cutc="cut -d \",\""
Pack and unpack tar.gz files:
alias tarup="tar -zcf"
alias tardown="tar -zxf"
Or use a generalized `extract` function:
# as suggested by Mendel Cooper in "Advanced Bash Scripting Guide"
extract () {
    if [ -f "$1" ] ; then
        case "$1" in
            *.tar.bz2)  tar xvjf "$1"    ;;
            *.tar.gz)   tar xvzf "$1"    ;;
            *.tar.xz)   tar Jxvf "$1"    ;;
            *.bz2)      bunzip2 "$1"     ;;
            *.rar)      unrar x "$1"     ;;
            *.gz)       gunzip "$1"      ;;
            *.tar)      tar xvf "$1"     ;;
            *.tbz2)     tar xvjf "$1"    ;;
            *.tgz)      tar xvzf "$1"    ;;
            *.zip)      unzip "$1"       ;;
            *.Z)        uncompress "$1"  ;;
            *.7z)       7z x "$1"        ;;
            *)          echo "don't know how to extract '$1'..." ;;
        esac
    else
        echo "'$1' is not a valid file!"
    fi
}
Use `mcd` to create a directory and `cd` to it simultaneously:
function mcd { mkdir -p "$1" && cd "$1";}
Go up to the parent directory and list its contents:
alias u="cd ..;ls"
Make grep pretty:
alias grep="grep --color=auto"
Refresh your `.bashrc`:
alias refresh="source ~/.bashrc"
Edit your `.bashrc`:
alias eb="vi ~/.bashrc"
Common typos:
alias mf="mv -i"
alias mroe="more"
alias c='clear'
alias emacs='vim'
Show your `$PATH` in a prettier format:
alias showpath='echo $PATH | tr ":" "\n" | nl'
Use [pandoc](http://johnmacfarlane.net/pandoc/) to convert a markdown file to PDF:
# USAGE: mdpdf document.md document.md.pdf
alias mdpdf="pandoc -s -V geometry:margin=1in -V documentclass:article -V fontsize=12pt"
Find text in any file (`ft "mytext" *.txt`):
function ft { find . -name "$2" -exec grep -il "$1" {} \;; }
## Etc.
[[back to top](#contents)]
Run the last command as root:
sudo !!
Place the argument of the most recent command on the shell:
'ALT+.' or '<ESC> .'
Type partial command, kill this command, check something you forgot, yank the command, resume typing:
'CTRL+u' [...] 'CTRL+y'
Jump to a directory, execute a command, and jump back to the current directory:
(cd /tmp && ls)
Stopwatch (`Enter` or `ctrl-d` to stop):
time read
Create a script of the last executed command:
echo "!!" > foo.sh
Reuse _all_ parameters of the previous command line:
!*
List or delete all files in a folder that don't match a certain file extension (uses bash extended globbing, enabled with `shopt -s extglob`). E.g., list things that are _not_ compressed; remove anything that is _not_ a `.foo` or `.bar` file:
ls !(*.gz)
rm !(*.foo|*.bar)
Insert the last command without the last argument:
!:-
Rapidly invoke an editor to write a long, complex, or tricky command:
fc
Print a specific line (e.g., line 42) from a file:
sed -n 42p file.txt
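On a large file, quitting right after the print saves sed from scanning the rest:
sed -n '42{p;q}' file.txt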
Terminate a frozen SSH session (enter a new line, type the `~` key then the `.` key):
[ENTER]~.
Remove blank lines from a file using grep and save output to new file:
grep . filename > newfilename
Find large files (e.g., >500M):
find . -type f -size +500M
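Add an `-exec` to also show the sizes in human-readable form:
find . -type f -size +500M -exec ls -lh {} \;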
Exclude a column with cut (e.g., all but the 5th field in a tab-delimited file):
cut -f5 --complement file.txt
Find files containing text (`-l` outputs only the file names, `-i` ignores case, `-r` descends into subdirectories):
grep -lir "some text" *