awesome-bioinformatics-formats

Curated list of bioinformatics formats and publications
https://github.com/kmhernan/awesome-bioinformatics-formats

EDAM
Formats
- General
  - HDF5 - [[edam:format_3590](http://edamontology.org/format_3590)] HDF5 is a data model, library, and file format for storing and managing data.
  - tiledb - TileDB manages massive dense and sparse multi-dimensional array data that frequently arise in important scientific applications.
  - NetCDF - [[edam:format_3650](http://edamontology.org/format_3650)] Network Common Data Form is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages.
  - SQLite - [[edam:format_3621](http://edamontology.org/format_3621)] SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.
- Genomic Intervals
  - segmentation file - A tab-delimited text file that lists loci and their associated numeric copy-number values.
  - bedGraph - [[edam:format_3583](http://edamontology.org/format_3583)] The bedGraph format allows display of continuous-valued data in track format.
  - bigBed - [[edam:format_3004](http://edamontology.org/format_3004)] Binary and indexed form of [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1).
  - interval list - The intervals are given in the form `<chr> <start> <stop> + <target_name>`, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
  - tabix - [[edam:format_3616](http://edamontology.org/format_3616)] An *index* file format for genomic intervals (can be used on BED, GTF, VCF, etc.).
- Genomic Features
  - GFF2 - [[edam:format_1974](http://edamontology.org/format_1974)] Version 2 of the General Feature Format, a file format for describing genes and other features of DNA, RNA and protein sequences.
  - GFF3 - [[edam:format_1975](http://edamontology.org/format_1975)] Version 3 of the General Feature Format, a file format for describing genes and other features of DNA, RNA and protein sequences.
  - genePred - [[edam:format_3011](http://edamontology.org/format_3011)] A table format commonly used for gene prediction tracks.
  - GTF - [[edam:format_2306](http://edamontology.org/format_2306)] GTF stands for Gene transfer format. It borrows from [GFF2](http://gmod.org/wiki/GFF2), but has additional structure that warrants a separate definition and format name.
- Genotype Data
  - MAF - A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations.
  - oxford-bgen - Binary version of the native [Oxford gen format](https://www.cog-genomics.org/plink2/formats#gen). Operations on bgen files are generally faster and more descriptive than on plain gen files, and the file size of bgen files is also smaller -- UK Biobank genotypes are in bgen format. Latest bgen version is 1.3.
  - pileup - [[edam:format_3015](http://edamontology.org/format_3015)] Describes the base-pair information at each chromosomal position. This format facilitates SNP/indel calling and quick visual inspection of alignments.
  - GDS - Genomic Data Structure is a storage format for bioinformatics data similar to [NetCDF](https://www.unidata.ucar.edu/software/netcdf/).
  - BCF - [[edam:format_3020](http://edamontology.org/format_3020)] Binary and compressed [VCF](http://samtools.github.io/hts-specs/VCFv4.3.pdf) format.
  - GVF - [[edam:format_3019](http://edamontology.org/format_3019)] The Genome Variation Format (GVF) is a very simple file format for describing sequence_alteration features at nucleotide resolution relative to a reference genome.
  - plink2-pgen - PLINK2 binary genotype table capable of representing mixed-phase, multiallelic, and mixed-hardcall/dosage/missing genotype data.
  - VCF - [[edam:format_3016](http://edamontology.org/format_3016)] Variant Call Format.
  - plink-ped - [[edam:format_3288](http://edamontology.org/format_3288)] PLINK plain-text genotype format. It has mostly been replaced by bed/bim/fam, but remains useful for inspecting SNPs in plain text (PLINK bed is binary) and for retaining more information than a VCF holds (e.g., additional information about individuals).
- Molecular Structural Data
  - CTfile - The CTfile Formats document fully describes the formats for CTfiles (chemical table files) including: Molfiles, RGfiles, Rxnfiles, SDfiles, RDfiles, XDfiles.
  - mmCIF - [[edam:format_1477](http://edamontology.org/format_1477)] Another format for PDB molecular structures.
  - PDB - [[edam:format_1476](http://edamontology.org/format_1476)] The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. This representation was created in the 1970s, and a large amount of software has been written to use it.
- Medical Imaging Data
  - DICOM - [[edam:format_3548](http://edamontology.org/format_3548)] Medical image format corresponding to the Digital Imaging and Communications in Medicine (DICOM) standard.
  - NIfTI - [[edam:format_3549](http://edamontology.org/format_3549)] Medical image and metadata format of the Neuroimaging Informatics Technology Initiative.
- Dense Genomic Data
  - bigWig - [[edam:format_3006](http://edamontology.org/format_3006)] The bigWig format is useful for dense, continuous data and is a binary form of [wiggle](https://genome.ucsc.edu/goldenPath/help/wiggle.html).
  - Genomedata - A format for efficient storage of multiple tracks of numeric data anchored to a genome.
  - GenomicsDB - GenomicsDB is an open-source library and set of tools focused on optimizing sparse array storage specifically for genomic data.
  - wiggle - [[edam:format_3005](http://edamontology.org/format_3005)] ASCII format for dense, continuous data.
- Unaligned Sequencing Data
  - FASTA - [[edam:format_1929](http://edamontology.org/format_1929)] ASCII format for storing nucleotide/amino acid sequences.
  - FASTQ - [[edam:format_1930](http://edamontology.org/format_1930)] Extends FASTA by combining the sequence and its quality information in the same file.
- Aligned Sequencing Data
  - CALF - The Compact ALignment Format records the base qualities and mapping qualities of the aligned reads, and unaligned reads data.
  - CRAM - [[edam:format_3462](http://edamontology.org/format_3462)] Further lossless compression of [SAM](http://samtools.github.io/hts-specs/SAMv1.pdf) format.
  - MAF - [[edam:format_3008](http://edamontology.org/format_3008)] The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read.
  - SAM - [[edam:format_2573](http://edamontology.org/format_2573)] The Sequence Alignment/Map format.
  - BAM - [[edam:format_2572](http://edamontology.org/format_2572)] Binary [SAM](http://samtools.github.io/hts-specs/SAMv1.pdf) format. _Note: it is becoming more common to store unaligned reads in BAM as well; such a file is called a uBAM._
- Miscellaneous
  - BIOM - [[edam:format_3746](http://edamontology.org/format_3746)] The Biological Observation Matrix (BIOM) file format (canonically pronounced biome) is designed to be a general-use format for representing biological sample by observation contingency tables.
  - Pairs - The specification for the text contact list file format for chromosome conformation experiments (e.g., Hi-C).
  - SBML - [[edam:format_2585](http://edamontology.org/format_2585)] Systems Biology Markup Language (SBML), the standard XML format for models of biological processes such as metabolism, cell signaling, and gene regulation.
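
The interval list entry under Genomic Intervals notes that its coordinates are 1-based, whereas BED-style formats use 0-based, half-open coordinates. As a rough illustration of what that difference means in practice, the Python sketch below converts one interval list data line into a BED-style record. The names `BedRecord` and `interval_list_line_to_bed` are hypothetical, and the sketch assumes a single tab-separated data line with exactly the five fields shown in that entry; any header lines would need to be skipped beforehand.

```python
from typing import NamedTuple


class BedRecord(NamedTuple):
    """Hypothetical container for a BED-style interval."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    name: str
    strand: str


def interval_list_line_to_bed(line: str) -> BedRecord:
    """Convert one interval list data line to a BED-style record.

    Assumes five tab-separated fields: chr, start, stop, strand, target name.
    """
    chrom, start, stop, strand, name = line.rstrip("\n").split("\t")
    # 1-based inclusive [start, stop] becomes 0-based half-open [start - 1, stop):
    # only the start shifts, the stop value is numerically unchanged.
    return BedRecord(chrom, int(start) - 1, int(stop), name, strand)


record = interval_list_line_to_bed("chr1\t100\t200\t+\ttarget_1\n")
print(record)
```

Because only the start coordinate shifts, the interval length can be computed as `end - start` in the converted record, while in the 1-based form it is `stop - start + 1`.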
Review Papers and Blogs
- Miscellaneous
  - hts-specs - Specifications of SAM/BAM and related high-throughput sequencing file formats.