{"id":20109442,"url":"https://github.com/crazyhottommy/bioinformatics-one-liners","last_synced_at":"2025-04-05T23:05:46.652Z","repository":{"id":53903713,"uuid":"95034236","full_name":"crazyhottommy/bioinformatics-one-liners","owner":"crazyhottommy","description":"Bioinformatics one liners from Ming Tang","archived":false,"fork":false,"pushed_at":"2020-10-04T22:24:02.000Z","size":79,"stargazers_count":482,"open_issues_count":0,"forks_count":132,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-03-29T22:04:40.674Z","etag":null,"topics":["bash","bioinformatics"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crazyhottommy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-06-21T18:29:48.000Z","updated_at":"2025-03-26T16:17:40.000Z","dependencies_parsed_at":"2022-08-13T03:50:48.933Z","dependency_job_id":null,"html_url":"https://github.com/crazyhottommy/bioinformatics-one-liners","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fbioinformatics-one-liners","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fbioinformatics-one-liners/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fbioinformatics-one-liners/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crazyhottommy%2Fbioinformatics-one-liners/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crazyhottommy","downlo
ad_url":"https://codeload.github.com/crazyhottommy/bioinformatics-one-liners/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247411226,"owners_count":20934653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","bioinformatics"],"created_at":"2024-11-13T18:08:14.411Z","updated_at":"2025-04-05T23:05:46.625Z","avatar_url":"https://github.com/crazyhottommy.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# bioinformatics-one-liners\nmy collection of bioinformatics one liners that are useful in my day-to-day work\n\n### I came across the bioinformatics one-liners on the [biostar](https://www.biostars.org/p/142545/) forum and gathered them here.\nI also added some of my own tricks\n\n05/21/2015.\n\n\n\n####  get the sequence length distribution from a fastq file using awk\n\n```bash\nzcat file.fastq.gz | awk 'NR%4 == 2 {lengths[length($0)]++} END {for (l in lengths) {print l, lengths[l]}}'  \n```\n\n#### add barcode to 10x single cell R1 read\n\n```bash\ncat test.fq | awk 'NR%4 == 2 {$0=\"xxx\"$0}{print}'\n@D00365:1187:HMM2FBCX2:1:1103:1258:2132 1:N:0:CGTGCAGA\nxxxTATTACCAGATGAGAGCATGGTTAGG\n+\nDDDDDIIIIIIIHIIIIIIIIIIIII\n@D00365:1187:HMM2FBCX2:1:1103:1472:2136 1:N:0:CGTGCAGA\nxxxAACCATGAGTGTCCCGCTGGCATCGC\n+\nDDDADGHHIIHIIGIHHHFCHHIIII\n@D00365:1187:HMM2FBCX2:1:1103:1822:2139 1:N:0:CGTGCAGA\nxxxGTGCATATCATGTAGCGTATTATACT\n+\nDDDDDIIIIIIIIIIIIIIIIIIIII\n@D00365:1187:HMM2FBCX2:1:1103:1943:2145 
1:N:0:CGTGCAGA\nxxxGATTCAGTCTCCAACCTCTCCTTTGT\n+\nDDDDDHIIIIIIIIIIHIIIIHIIII\n@D00365:1187:HMM2FBCX2:1:1103:1917:2147 1:N:0:CGTGCAGA\nxxxCCTTCGACAAGTTGTCAGGTGCGGTC\n+\nDDDDDHIIIIIIIIIIIIIIGIIHHH\n```\n#### Reverse complement a sequence (I use that a lot when I need to design primers)\n\n```\necho 'ATTGCTATGCTNNNT' | rev | tr 'ACTG' 'TGAC'\n```\n\n#### split a multifasta file into single ones with csplit:\n\n```bash\ncsplit -z -q -n 4 -f sequence_ sequences.fasta /\\\u003e/ {*}  \n```\n#### Split a multi-FASTA file into individual FASTA files with awk\n\n```bash\nawk '/^\u003e/{s=++d\".fa\"} {print \u003e s}' multi.fa\n```\n\n#### linearize multiline fasta\n\n```bash\ncat file.fasta | awk '/^\u003e/{if(N\u003e0) printf(\"\\n\"); ++N; printf(\"%s\\t\",$0);next;} {printf(\"%s\",$0);}END{printf(\"\\n\");}'\nawk 'BEGIN{RS=\"\u003e\"}NR\u003e1{sub(\"\\n\",\"\\t\"); gsub(\"\\n\",\"\"); print RS$0}' file.fa\n```\n#### fastq2fasta\n\n```bash\nzcat file.fastq.gz | paste - - - - | perl -ane 'print \"\u003e$F[0]\\n$F[2]\\n\";' | gzip -c \u003e file.fasta.gz\n```\n####  bam2bed\n\n```bash\nsamtools view file.bam | perl -F'\\t' -ane '$strand=($F[1]\u002616)?\"-\":\"+\";$length=1;$tmp=$F[5];$tmp =~ s/(\\d+)[MD]/$length+=$1/eg;print \"$F[2]\\t$F[3]\\t\".($F[3]+$length).\"\\t$F[0]\\t0\\t$strand\\n\";' \u003e file.bed\n```\n\n#### bam2wig\n\n```bash\nsamtools mpileup -BQ0 file.sorted.bam | perl -pe '($c, $start, undef, $depth) = split;if ($c ne $lastC || $start != $lastStart+1) {print \"fixedStep chrom=$c start=$start step=1 span=1\\n\";}$_ = $depth.\"\\n\";($lastC, $lastStart) = ($c, $start);' | gzip -c \u003e file.wig.gz\n```\n\n#### Number of reads in a fastq file\n\n```bash\ncat file.fq | echo $((`wc -l`/4))\n```\n#### Single line fasta file to multi-line fasta of 60 characters per line\n\n```bash\nawk -v FS= '/^\u003e/{print;next}{for (i=0;i\u003c=NF/60;i++) {for (j=1;j\u003c=60;j++) printf \"%s\", $(i*60 +j); print \"\"}}' file\n\nfold -w 60 file\n```\n\n#### Sequence length 
of every entry in a multifasta file\n\n```bash\nawk '/^\u003e/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen = seqlen +length($0)}END{print seqlen}' file.fa\n```\n#### Reproducible subsampling of a FASTQ file. srand() sets the seed for the random number generator, which keeps the subsampling the same when the script is run multiple times. 0.01 is the fraction of reads to output (about 1%).\n\n```bash\ncat file.fq | paste - - - - | awk 'BEGIN{srand(1234)}{if(rand() \u003c 0.01) print $0}' | tr '\\t' '\\n' \u003e out.fq\n```\n#### or look at Heng Li's seqtk \n\n#### Deinterleaving a FASTQ:\n\n```bash\ncat file.fq | paste - - - - - - - - | tee \u003e(cut -f1-4 | tr '\\t' '\\n' \u003e out1.fq) | cut -f5-8 | tr '\\t' '\\n' \u003e out2.fq\n```\n\n#### Using mpileup for a whole genome can take forever. Handling each chromosome separately and running them in parallel on several cores will speed up your pipeline. Using xargs you can easily do this.  \n#### Example usage of xargs (-P is the number of parallel processes started - don't use more than the number of cores you have available):\n\n```bash\nsamtools view -H yourFile.bam | grep \"\\@SQ\" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} -n 1 -P 24 sh -c \"samtools mpileup -BQ0 -d 100000 -uf yourGenome.fa -r {} yourFile.bam | bcftools view -vcg - \u003e tmp.{}.vcf\"\n```\n\n#### To merge the results afterwards, you might want to do something like this:\n\n```bash\nsamtools view -H yourFile.bam | grep \"\\@SQ\" | sed 's/^.*SN://g' | cut -f 1 | perl -ane 'system(\"cat tmp.$F[0].vcf \u003e\u003e yourFile.vcf\");'\n```\n\n#### split large file by id/label/column\n\n```bash\nawk '{print \u003e\u003e $1; close($1)}' input_file\n```\n#### split a bed file by chromosome:\n\n```bash\ncat nexterarapidcapture_exome_targetedregions_v1.2.bed | sort -k1,1 -k2,2n | sed 's/^chr//' | awk '{close(f);f=$1}{print \u003e f\".bed\"}'\n\n#or\nawk '{print $0 \u003e\u003e $1\".bed\"}' example.bed\n```\n\n#### sort vcf file with 
header\n\n```bash\ncat my.vcf | awk '$0~\"^#\" { print $0; next } { print $0 | \"sort -k1,1V -k2,2n\" }'\n```\n#### Rename a file, bash string manipulation\n\n```bash\nfor file in *gz\ndo zcat \"$file\" \u003e \"${file/bed.gz/bed}\"\ndone\n```\n\n#### gnu sed print invisible characters\n\n```bash\ncat my_file | sed -n 'l'\ncat -A my_file\n```\n\n#### exit a dead ssh session\n`~.`\n\n#### copy large files, copy the from_dir directory inside the to_dir directory\n\n```bash\nrsync -av from_dir  to_dir\n\n## copy every file inside the from_dir to to_dir\nrsync -av from_dir/ to_dir\n\n##re-copy the files avoiding completed ones:\n\nrsync -avhP /from/dir /to/dir\n```\n\n#### make directory using the current date\n\n```bash\nmkdir $(date +%F)\n```\n#### all the folders' size in the current folder (GNU du)\n\n```bash\ndu -h --max-depth=1\n```\n\n### this one is a bit different, try it and see the difference\n`du -ch`\n\n#### the total size of current directory\n`du -sh .`\n\n#### disk usage\n`df -h`\n\n#### the column names of the file, install csvkit https://csvkit.readthedocs.org/en/0.9.1/\n`csvcut -n`\n\n#### open top with human readable size in Mb, Gb. 
install htop for better visualization\n`top -M`\n\n#### how much memory is used, in GB\n`free -g`\n\n#### print out unique rows based on the first and second column\n`awk '!a[$1,$2]++' input_file`\n\n`sort -u -k1,2 file`\nIt sorts on the first and second columns and keeps one row per unique key.\n\n#### do not wrap the lines using less\n`less -S`\n\n#### pretty output\n```bash\nfold -w 60\ncat file.txt | column -t | less -S\n```\n#### pass tab as delimiter http://unix.stackexchange.com/questions/46910/is-it-a-bug-for-join-with-t-t\n`-t $'\\t'`\n\n#### awk with the first line printed always\n`awk ' NR ==1 || ($10 \u003e 1 \u0026\u0026 $11 \u003e 0 \u0026\u0026 $18 \u003e 0.001)'  input_file`\n\n#### delete blank lines with sed\n`sed '/^$/d'`\n\n#### delete the last line\n`sed '$d'`\n\nawk to join files based on several columns\n\nmy [github repo](https://github.com/crazyhottommy/scripts-general-use/blob/master/Shell/Awk_anotates_vcf_with_bed.ipynb)\n\n```\n### select lines from a file based on columns in another file\n## http://unix.stackexchange.com/questions/134829/compare-two-columns-of-different-files-and-print-if-it-matches\nawk -F\"\\t\" 'NR==FNR{a[$1$2$3]++;next};a[$1$2$3] \u003e 0' file2 file1 \n\n```\n\nFinally learned about the !$ in unix: take the last thing (word) from the previous command.   \n`echo hello, world; echo !$` gives 'world'\n\n\nCreate a script of the last executed command:  \n`echo \"!!\" \u003e foo.sh`\n\nReuse all parameters of the previous command line:  \n`!*`\n\nfind bam in current folder (search recursively) and copy it to a new directory using 5 CPUs    \n`find . -name \"*bam\" | xargs -P5 -I{} rsync -av {} dest_dir`\n\n`ls -X`  will group files by extension.\n\nloop through all the chromosomes\n\n```bash\nfor i in {1..22} X Y \ndo\n  echo $i\ndone\n```\n\nfor i in `{01..22}` will expand to 01 02 ...\n\n\nchange every other newline to tab:\n\n`paste` is used to concatenate corresponding lines from files: paste file1 file2 file3 .... 
If one of the \"file\" arguments is \"-\", then lines are read from standard input. If there are 2 \"-\" arguments, then paste takes 2 lines from stdin. And so on.\n\n```bash\ncat test.txt  \n0    ATTTTATTNGAAATAGTAGTGGG\n0    CTCCCAAAATACTAAAATTATAA\n1    TTTTAGTTATTTANGAGGTTGAG\n1    CNTAATCTTAACTCACTACAACC\n2    TTATAATTTTAGTATTTTGGGAG\n2    CATATTAACCAAACTAATCTTAA\n3    GGTTAATATGGTGAAATTTAAT\n3    ACCTCAACCTCNTAAATAACTAA\n\ncat test.txt| paste - -                               \n0    ATTTTATTNGAAATAGTAGTGGG    0    CTCCCAAAATACTAAAATTATAA\n1    TTTTAGTTATTTANGAGGTTGAG    1    CNTAATCTTAACTCACTACAACC\n2    TTATAATTTTAGTATTTTGGGAG    2    CATATTAACCAAACTAATCTTAA\n3    GGTTAATATGGTGAAATTTAAT     3    ACCTCAACCTCNTAAATAACTAA\n```\n\nORS: output record separator in `awk`\n`var=condition?value_if_true:value_if_false` is the ternary operator.\n\n```bash\ncat test.txt| awk 'ORS=NR%2?\"\\t\":\"\\n\"'          \n\n0    ATTTTATTNGAAATAGTAGTGGG    0    CTCCCAAAATACTAAAATTATAA\n1    TTTTAGTTATTTANGAGGTTGAG    1    CNTAATCTTAACTCACTACAACC\n2    TTATAATTTTAGTATTTTGGGAG    2    CATATTAACCAAACTAATCTTAA\n3    GGTTAATATGGTGAAATTTAAT     3    ACCTCAACCTCNTAAATAACTAA\n\n```\n\n#### awk\nWe can also use the concept of a conditional operator in a print statement of the form print CONDITION ? PRINT_IF_TRUE_TEXT : PRINT_IF_FALSE_TEXT. For example, in the code below, we identify sequences with lengths \u003e 14:\n\n```bash\ncat data/test.tsv\nblah_C1\tACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG\nblah_C2\tACTTTATATATT\nblah_C3\tACTTATATATATATA\nblah_C4\tACTTATATATATATA\nblah_C5\tACTTTATATATT\t\n\nawk '{print (length($2)\u003e14) ? 
$0\"\u003e14\" : $0\"\u003c=14\";}' data/test.tsv\nblah_C1\tACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG\u003e14\nblah_C2\tACTTTATATATT\u003c=14\nblah_C3\tACTTATATATATATA\u003e14\nblah_C4\tACTTATATATATATA\u003e14\nblah_C5\tACTTTATATATT\u003c=14\n\nawk 'NR==3{print \"\";next}{printf $1\"\\t\"}{print $1}' data/test.tsv\nblah_C1\tblah_C1\nblah_C2\tblah_C2\n\nblah_C4\tblah_C4\nblah_C5\tblah_C5\n\n```\nYou can also use getline to load the contents of another file in addition to the one you are reading, for example, in the statement given below, the while loop will load each line from test.tsv into k until no more lines are to be read:\n```bash\nawk 'BEGIN{while((getline k \u003c\"data/test.tsv\")\u003e0) print \"BEGIN:\"k}{print}' data/test.tsv\nBEGIN:blah_C1\tACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG\nBEGIN:blah_C2\tACTTTATATATT\nBEGIN:blah_C3\tACTTATATATATATA\nBEGIN:blah_C4\tACTTATATATATATA\nBEGIN:blah_C5\tACTTTATATATT\nblah_C1\tACTGTCTGTCACTGTGTTGTGATGTTGTGTGTG\nblah_C2\tACTTTATATATT\nblah_C3\tACTTATATATATATA\nblah_C4\tACTTATATATATATA\nblah_C5\tACTTTATATATT\n```\n#### merge multiple fasta sequences in two files into a single file line by line\nsee [post](https://www.biostars.org/p/204336/#204380)  \n\n`linearize.awk:`  \n\n```bash\n/^\u003e/ {printf(\"%s%s\\t\",(N\u003e0?\"\\n\":\"\"),$0);N++;next;} {printf(\"%s\",$0);} END {printf(\"\\n\");}\n```\n\n```bash\npaste \u003c(awk -f linearize.awk file1.fa ) \u003c(awk -f linearize.awk file2.fa  )| tr \"\\t\" \"\\n\"\n```\n\n#### grep fastq reads containing a pattern but maintain the fastq format\n\n```bash\ngrep -A 2 -B 1 'AAGTTGATAACGGACTAGCCTTATTTT' file.fq | sed '/^--$/d' \u003e out.fq\n\n# or\nzcat reads.fq.gz \\\n| paste - - - - \\\n| awk -v FS=\"\\t\" -v OFS=\"\\n\" '$2 ~ \"AAGTTGATAACGGACTAGCCTTATTTT\" {print $1, $2, $3, $4}' \\\n| gzip \u003e filtered.fq.gz\n```\n\n#### count how many columns of a tsv files: \n```bash\ncat file.tsv | head -1 | tr \"\\t\" \"\\n\" | wc -l  \ncsvcut -n -t  file.tsv (from csvkit)\nawk '{print NF; 
exit}' file.tsv\nawk -F \"\\t\" 'NR == 1 {print NF}' file.tsv\n```\n\n#### combine info to the fasta header\n\n[from biostar post](https://www.biostars.org/p/212379/#212393)\n```bash\ncat myfasta.txt \n\u003eBlap_contig79\nMSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\u003eBluc_contig23663\nMSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\u003eBlap_contig7988\nMSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\u003eBluc_contig1223663\nMSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\ncat my_info.txt \ninfo1\ninfo2\ninfo3\ninfo4\n\npaste \u003c(cat my_info.txt) \u003c(cat myfasta.txt| paste - - | cut -c2-) | awk '{printf(\"\u003e%s_%s\\n%s\\n\",$1,$2,$3);}'\n\u003einfo1_Blap_contig79\nMSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\u003einfo2_Bluc_contig23663\nMSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\u003einfo3_Blap_contig7988\nMSTDVDAKTRSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\u003einfo4_Bluc_contig1223663\nMSTNVDAKARSKERASIAAFYVGRNIFVTGGTGFLGKVLIEKLLRSCPDVGEIFILMRPKAGLSI\n\n```\n\n#### count how many columns in a tsv file\n\n```bash\ncat file.tsv | head -1 | tr \"\\t\" \"\\n\" | wc -l  \n\n##(from csvkit)\ncsvcut -n -t file.\n\n## emulate csvcut -n -t\nless files.tsv | head -1| tr \"\\t\" \"\\n\" | nl\n\nawk -F \"\\t\" 'NR == 1 {print NF}' file.tsv\nawk '{print NF; exit}'\n```\n#### change fasta header\n\nsee https://www.biostars.org/p/53212/\n\nThe fasta header is like `\u003e7 dna:chromosome chromosome:GRCh37:7:1:159138663:1`\nconvert to `\u003e7`: \n\n```bash\ncat Homo_sapiens_assembly19.fasta | gawk '/^\u003e/ { b=gensub(\" dna:.+\", \"\", \"g\", $0); print b; next} {print}' \u003e Homo_sapiens_assembly19_reheader.fasta\n```\n### mkdir and cd into that dir shortcut\n\n```bash\nmkdir blah \u0026\u0026 cd $_\n```\n### cut out columns based on column names in another 
file\n\nhttp://crazyhottommy.blogspot.com/2016/10/cutting-out-500-columns-from-26g-file.html\n\n```bash\n#! /bin/bash\n\nset -e\nset -u\nset -o pipefail\n\n#### Author: Ming Tang (Tommy)\n#### Date 09/29/2016\n#### I got the idea from this stackOverflow post http://stackoverflow.com/questions/11098189/awk-extract-columns-from-file-based-on-header-selected-from-2nd-file\n\n# show help\nshow_help(){\ncat \u003c\u003c EOF\n  This is a wrapper extracting columns of a (big) dataframe based on a list of column names in another\n  file. The column names must be one per line. The output will be stdout. For small files \u003c 2G, one \n  can load it into R and do it easily, but when the file is big \u003e 10G, R is quite cumbersome. \n  Using unix commands on the other hand is better because files do not have to be loaded into memory at once.\n  e.g. subset a 26G size file for 700 columns takes around 30 mins. Memory footprint is very low ~4MB.\n\n  usage: ${0##*/} -f \u003c a dataframe  \u003e -c \u003c colNames\u003e -d \u003cdelimiter of the file\u003e\n        -h display this help and exit.\n\t\t-f the file you want to extract columns from. must contain a header with column names.\n\t\t-c a file with one column name per line.\n\t\t-d delimiter of the dataframe: , or \\t. default is tab.  \n\t\t\n\t\te.g. 
\n\t\t\n\t\tfor tsv file:\n\t\t\t${0##*/} -f mydata.tsv -c colnames.txt -d $'\\t' or simply omit the -d, default is tab.\n\t\t\n\t\tfor csv file: Note you have to specify -d , if your file is csv, otherwise all columns will be cut out.\n\t\t\t${0##*/} -f mydata.csv -c colnames.txt -d ,\n        \nEOF\n}\n\n## if there are no arguments provided, show help\nif [[ $# == 0 ]]; then show_help; exit 1; fi\n\nwhile getopts \":hf:c:d:\" opt; do\n  case \"$opt\" in\n    h) show_help;exit 0;;\n    f) File2extract=$OPTARG;;\n    c) colNames=$OPTARG;;\n    d) delim=$OPTARG;;\n    '?') echo \"Invalid option $OPTARG\"; show_help \u003e\u00262; exit 1;;\n  esac\ndone\n\t\n\n## set up the default delimiter to be tab, Note the way I specify tab \n\ndelim=${delim:-$'\\t'}\n\n## get the positions of the columns in the data frame that match the column names in the colNames file.\n## change the output to 2,5,6,22,... and get rid of the last comma  so cut -f can be used\n \ncols=$(head -1 \"${File2extract}\" | tr \"${delim}\" \"\\n\" | grep -nf \"${colNames}\" | sed 's/:.*$//' | tr \"\\n\" \",\" | sed 's/,$//')\n\n## cut out the columns \ncut -d\"${delim}\" -f\"${cols}\" \"${File2extract}\"\n```\nor use [csvtk](https://github.com/shenwei356/csvtk) from Shen Wei:  \n\n```bash\ncsvtk cut -t -f $(paste -s -d , list.txt) data.tsv\n```\n#### merge all bed files and add a column for the filename.\n\n```bash\nawk '{print $0 \"\\t\" FILENAME}' *bed \n```\n\n### add or remove chr from the start of each line\n\n```bash\n# add chr\nsed 's/^/chr/' my.bed\n\n# or\nawk 'BEGIN {OFS = \"\\t\"} {$1=\"chr\"$1; print}'\n\n# remove chr\nsed 's/^chr//' my.bed\n```\n### check if a tsv file has the same number of columns in all rows\n\n```bash\n# prints 1 when every row has the same number of columns\nawk '{print NF}' test.tsv | sort -nu | wc -l\n```\n\n### Parallelized samtools mpileup \n\nhttps://www.biostars.org/p/134331/\n\n```bash\nBAM=\"yourFile.bam\"\nREF=\"reference.fasta\"\nsamtools view -H $BAM | grep \"\\@SQ\" | sed 's/^.*SN://g' | cut -f 1 | xargs -I {} 
-n 1 -P 24 sh -c \"samtools mpileup -BQ0 -d 100000 -uf $REF -r \\\"{}\\\" $BAM | bcftools call -cv \u003e \\\"{}\\\".vcf\"\n```\n### convert multiple lines to a single line\n\nThis is better than `tr \"\\n\" \"\\t\"` because sometimes I do not want to convert the last newline to tab.\n\n```bash\ncat myfile.txt | paste -s \n```\n\n### merge multiple files with same header by keeping the header of the first file\nI usually do it in R, but I like this quick solution.\n\nhttps://stackoverflow.com/questions/16890582/unixmerge-multiple-csv-files-with-same-header-by-keeping-the-header-of-the-firs\n\n```bash\nawk 'FNR==1 \u0026\u0026 NR!=1{next;}{print}' *.csv \n\n# or\n\nawk '\n    FNR==1 \u0026\u0026 NR!=1 { while (/^\u003cheader\u003e/) getline; }\n    1 {print}\n' file*.txt \u003eall.txt\n```\n\n### insert a field into the first line\n\n```bash\ncut -f1-4 F5.hg38.enhancers.expression.usage.matrix | head\nCNhs11844\tCNhs11251\tCNhs11282\tCNhs10746\nchr10:100006233-100006603\t1\t0\t0\nchr10:100008181-100008444\t0\t0\t0\nchr10:100014348-100014634\t0\t0\t0\nchr10:100020065-100020562\t0\t0\t0\nchr10:100043485-100043744\t0\t0\t0\nchr10:100114218-100114567\t0\t0\t0\nchr10:100148595-100148922\t0\t0\t0\nchr10:100182422-100182522\t0\t0\t0\nchr10:100184498-100184704\t0\t0\t0\n\nsed '1 s/^/enhancer\\t/' F5.hg38.enhancers.expression.usage.matrix | cut -f1-4 | head\nenhancer\tCNhs11844\tCNhs11251\tCNhs11282\nchr10:100006233-100006603\t1\t0\t0\nchr10:100008181-100008444\t0\t0\t0\nchr10:100014348-100014634\t0\t0\t0\nchr10:100020065-100020562\t0\t0\t0\nchr10:100043485-100043744\t0\t0\t0\nchr10:100114218-100114567\t0\t0\t0\nchr10:100148595-100148922\t0\t0\t0\nchr10:100182422-100182522\t0\t0\t0\nchr10:100184498-100184704\t0\t0\t0\n\n```\n### extract PASS calls from vcf file\n\n```\ncat my.vcf | awk -F '\\t' '{if($0 ~ /^#/) print; else if($7 == \"PASS\") print}' \u003e my_PASS.vcf\n\n```\n\n### replace a pattern in a specific column\n\n```\n## column5 \nawk '{gsub(pattern,replace,$5)}1' 
in.file\n\n## http://bioinf.shenwei.me/csvtk/usage/#replace\ncsvtk replace -f 5 -p pattern -r replacement \n\n```\n### move a process to a screen session\n\nhttps://www.linkedin.com/pulse/move-running-process-screen-bruce-werdschinski/\n\n```\n1. Suspend: Ctrl+z\n2. Resume: bg\n3. Disown: disown %1\n4. Launch screen\n5. Find pid: pgrep BLAH\n6. Reparent process: reptyr ###\n```\n\n### count unique values in a column and put the count in a new column\n\nhttps://www.unix.com/unix-for-beginners-questions-and-answers/270526-awk-count-unique-element-array.html\n\n```\n# input\nblabla_1 A,B,C,C\nblabla_2 A,E,G\nblabla_3 R,Q,A,B,C,R,Q\n\n# output\nblabla_1 3\nblabla_2 3\nblabla_3 5\n\n\nawk '{split(x,C); n=split($2,F,/,/); for(i in F) if(C[F[i]]++) n--; print $1, n}' file\n\n```\n\n### get the promoter regions from a gtf file\n\nhttps://twitter.com/David_McGaughey/status/1106371758142173185\n\nCreate TSS bed from GTF in one line: \n```bash\nzcat gencode.v29lift37.annotation.gtf.gz | awk '$3==\"gene\" {print $0}' | grep protein_coding | awk -v OFS=\"\\t\" '{if ($7==\"+\") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' \u003e tss.bed\n```\nor 5kb flanking tss\n\n```bash\nzcat gencode.v29lift37.annotation.gtf.gz | awk '$3==\"gene\" {print $0}' | grep protein_coding | awk -v OFS=\"\\t\" '{if ($7==\"+\") {print $1, $4, $4+5000} else {print $1, $5-5000, $5}}' \u003e promoters.bed\n```\ncaveat: some genes are near the ends of chromosomes, so adding or subtracting 5000 may run past the chromosome boundary; use [`bedtools slop`](https://bedtools.readthedocs.io/en/latest/content/tools/slop.html) with a genome size file to avoid that.\n\ndownload `fetchChromSizes` from http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/\n\n```bash\nfetchChromSizes hg19 \u003e chrom_size.txt\n\nzcat gencode.v29lift37.annotation.gtf.gz | awk '$3==\"gene\" {print $0}' |  awk -v OFS=\"\\t\" '{if ($7==\"+\") {print $1, $4, $4+1} else {print $1, $5-1, $5}}' | bedtools slop -i - -g chrom_size.txt -b 5000 \u003e promoter_5kb.bed\n```\n\n### 
reverse one column of a txt file\n\nreverse column 3 and put it in column 5\n```bash\nawk -v OFS=\"\\t\" '{\"echo \"$3 \"| rev\" | getline $5}{print $0}' \n\n#or use perl to reverse the third column in place\nperl -lane 'BEGIN{$,=\"\\t\"}{$rev=reverse $F[2];print $F[0],$F[1],$rev,$F[3]}'\n```\n\n### get the full path of a file\n\n```bash\nrealpath file.txt\nreadlink -f file.txt \n```\n\n### pugz unzip in parallel\n\nhttps://github.com/Piezoid/pugz\n\nUnlike the pigz program, which does single-threaded decompression (see https://github.com/madler/pigz/blob/master/pigz.c#L232), pugz found a way to do truly parallel decompression.\n\n### run singularity on a multi-user HPC\n\n```bash\n#! /bin/bash\nset -euo pipefail\n\nmodule load singularity\n# Need a unique /tmp for this job for /tmp/rstudio-rsession \u0026 /tmp/rstudio-server\nWORKDIR=/liulab/${USER}/singularity_images\nmkdir -m 700 -p ${WORKDIR}/tmp2\nmkdir -m 700 -p ${WORKDIR}/tmp\n\nPASSWORD='xyz' singularity exec --bind \"${WORKDIR}/tmp2:/var/run/rstudio-server\" --bind \"${WORKDIR}/tmp:/tmp\" --bind=\"/liulab/${USER}\" geospatial_4.0.2.simg rserver --www-port 8888 --auth-none=0  --auth-pam-helper-path=pam-helper  --www-address=127.0.0.1\n```\n\n### add ServerAliveInterval 60 to avoid dropping from your ssh session\n\nAdd the following at the top of your `~/.ssh/config` to keep the ssh session from dropping\n\n```\nHost *\n ServerAliveInterval 60\n \n```\nI also use `screen`/`tmux` and [mosh](https://mosh.org/).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrazyhottommy%2Fbioinformatics-one-liners","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrazyhottommy%2Fbioinformatics-one-liners","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrazyhottommy%2Fbioinformatics-one-liners/lists"}