An open API service indexing awesome lists of open source software.

awesome-bioinformatics-formats

Curated list of bioinformatics formats and publications
https://github.com/kmhernan/awesome-bioinformatics-formats

Last synced: 2 days ago
JSON representation

  • EDAM

  • Formats

    • General

      • HDF5 - [[edam:format_3590](http://edamontology.org/format_3590)] HDF5 is a data model, library, and file format for storing and managing data.
      • tiledb - TileDB manages massive dense and sparse multi-dimensional array data that frequently arise in important scientific applications.
      • NetCDF - [[edam:format_3650](http://edamontology.org/format_3650)] Network Common Data Form is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages.
      • SQLite - [[edam:format_3621](http://edamontology.org/format_3621)] SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.
      • tiledb - TileDB manages massive dense and sparse multi-dimensional array data that frequently arise in important scientific applications.
      • NetCDF - [[edam:format_3650](http://edamontology.org/format_3650)] Network Common Data Form is a set of interfaces for array-oriented data access and a freely distributed collection of data access libraries for C, Fortran, C++, Java, and other languages.
    • Genomic Intervals

      • segmentation file - A tab-delimited text file that lists loci and associated numeric values associated with copy number.
      • bedGraph - [[edam:format_3583](http://edamontology.org/format_3583)] The bedGraph format allows display of continuous-valued data in track format.
      • bigBed - [[edam:format_3004](http://edamontology.org/format_3004)] Binary and indexed form of [BED](https://genome.ucsc.edu/FAQ/FAQformat.html#format1).
      • interval list - The intervals are given in the form `<chr> <start> <stop> + <target_name>`, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
      • segmentation file - A tab-delimited text file that lists loci and associated numeric values associated with copy number.
      • tabix - [[edam:format_3616](http://edamontology.org/format_3616)] An *index* file format for genomic intervals (can be used on bed, gtf, vcf, etc).
      • interval list - The intervals are given in the form `<chr> <start> <stop> + <target_name>`, with fields separated by tabs, and the coordinates are 1-based (first position in the genome is position 1, not position 0).
      • tabix - [[edam:format_3616](http://edamontology.org/format_3616)] An *index* file format for genomic intervals (can be used on bed, gtf, vcf, etc).
    • Genomic Features

      • GFF2 - [[edam:format_1974](http://edamontology.org/format_1974)] The general feature format is a file format used for describing genes and other features of DNA, RNA and protein sequences version 2.
      • GFF3 - [[edam:format_1975](http://edamontology.org/format_1975)] The general feature format is a file format used for describing genes and other features of DNA, RNA and protein sequences version 3.
      • genePred - [[edam:format_3011](http://edamontology.org/format_3011)] a table format commonly used for gene prediction tracks.
      • GFF2 - [[edam:format_1974](http://edamontology.org/format_1974)] The general feature format is a file format used for describing genes and other features of DNA, RNA and protein sequences version 2.
      • GFF3 - [[edam:format_1975](http://edamontology.org/format_1975)] The general feature format is a file format used for describing genes and other features of DNA, RNA and protein sequences version 3.
      • GTF - [[edam:format_2306](http://edamontology.org/format_2306)] GTF stands for Gene transfer format. It borrows from [GFF2](http://gmod.org/wiki/GFF2), but has additional structure that warrants a separate definition and format name.
      • genePred - [[edam:format_3011](http://edamontology.org/format_3011)] a table format commonly used for gene prediction tracks.
      • GTF - [[edam:format_2306](http://edamontology.org/format_2306)] GTF stands for Gene transfer format. It borrows from [GFF2](http://gmod.org/wiki/GFF2), but has additional structure that warrants a separate definition and format name.
    • Genotype Data

      • MAF - A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations.
      • oxford-bgen - Binary version of the native [Oxford gen format](https://www.cog-genomics.org/plink2/formats#gen). Operations on bgen files are generally faster and more descriptive than on plain gen files, and the file size of bgen files is also smaller -- UK Biobank genotypes are in bgen format. Latest bgen version is 1.3.
      • pileup - [[edam:format_3015](http://edamontology.org/format_3015)] Describes the base-pair information at each chromosomal position. This format facilitates SNP/indel calling and brief alignment viewing by eyes.
      • GDS - Genomic Data Structure is a storage format for bioinformatics data similar to [NetCDF](https://www.unidata.ucar.edu/software/netcdf/).
      • BCF - [[edam:format_3020](http://edamontology.org/format_3020)] Binary and compressed [VCF](http://samtools.github.io/hts-specs/VCFv4.3.pdf) format.
      • GVF - [[edam:format_3019](http://edamontology.org/format_3019)] The Genome Variation Format (GVF) is a very simple file format for describing sequence_alteration features at nucleotide resolution relative to a reference genome.
      • oxford-bgen - Binary version of the native [Oxford gen format](https://www.cog-genomics.org/plink2/formats#gen). Operations on bgen files are generally faster and more descriptive than on plain gen files, and the file size of bgen files is also smaller -- UK Biobank genotypes are in bgen format. Latest bgen version is 1.3.
      • pileup - [[edam:format_3015](http://edamontology.org/format_3015)] Describes the base-pair information at each chromosomal position. This format facilitates SNP/indel calling and brief alignment viewing by eyes.
      • plink2-pgen - PLINK2 binary genotype table capable of representing mixed-phase, multiallelic, and mixed-hardcall/dosage/missing genotype data.
      • VCF - [[edam:format_3016](http://edamontology.org/format_3016)] Variant Call Format.
      • plink-ped - [[edam:format_3288](http://edamontology.org/format_3288)] PLINK plain-text genotype format. Mostly has been replaced by bed/bim/fam, but is useful if someone wants to actually look at the SNPs in plain-text since Plink bed is in binary, and also wants to retain more information than from a VCF (eg. additional individual information).
      • BCF - [[edam:format_3020](http://edamontology.org/format_3020)] Binary and compressed [VCF](http://samtools.github.io/hts-specs/VCFv4.3.pdf) format.
      • VCF - [[edam:format_3016](http://edamontology.org/format_3016)] Variant Call Format.
    • Molecular Structural Data

      • CTfile - The CTfile Formats document fully describes the formats for CTfiles (chemical table files) including: Molfiles, RGfiles, Rxnfiles, SDfiles, RDfiles, XDfiles.
      • mmCIF - [[edam:format_1477](http://edamontology.org/format_1477)] Another format for PDB molecular structures.
      • CTfile - The CTfile Formats document fully describes the formats for CTfiles (chemical table files) including: Molfiles, RGfiles, Rxnfiles, SDfiles, RDfiles, XDfiles.
      • mmCIF - [[edam:format_1477](http://edamontology.org/format_1477)] Another format for PDB molecular structures.
      • PDB - [[edam:format_1476](http://edamontology.org/format_1476)] The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. This representation was created in the 1970's and a large amount of software using it has been written.
      • PDB - [[edam:format_1476](http://edamontology.org/format_1476)] The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. This representation was created in the 1970's and a large amount of software using it has been written.
    • Medical Imaging Data

      • DICOM - [[edam:format_3548](http://edamontology.org/format_3548)] Medical image format corresponding to the Digital Imaging and Communications in Medicine (DICOM) standard.
      • NifTI - [[edam:format_3549](http://edamontology.org/format_3549)] Medical image and metadata format of the Neuroimaging Informatics Technology Initiative.
      • DICOM - [[edam:format_3548](http://edamontology.org/format_3548)] Medical image format corresponding to the Digital Imaging and Communications in Medicine (DICOM) standard.
      • NifTI - [[edam:format_3549](http://edamontology.org/format_3549)] Medical image and metadata format of the Neuroimaging Informatics Technology Initiative.
    • Dense Genomic Data

      • bigWig - [[edam:format_3006](http://edamontology.org/format_3006)] The bigWig format is useful for dense, continuous data and is a binary form of [wiggle](https://genome.ucsc.edu/goldenPath/help/wiggle.html).
      • Genomedata - a format for efficient storage of multiple tracks of numeric data anchored to a genome.
      • GenomicsDB - GenomicsDB is an open sourced library and tools with a focus on optimizing sparse array storage specifically for genomic data.
      • wiggle - [[edam:format_3005](http://edamontology.org/format_3005)] ASCII format for dense, continuous data.
    • Unaligned Sequencing Data

      • FASTA - [[edam:format_1929](http://edamontology.org/format_1929)] ASCII format for storing nucleotide/amino acid sequences.
      • FASTQ - [[edam:format_1930](http://edamontology.org/format_1930)] Designed as a replacement for FASTA, combining the sequence and quality information in the same file.
    • Aligned Sequencing Data

      • CALF - The Compact ALignment Format records the base qualities and mapping qualities of the aligned reads, and unaligned reads data.
      • CRAM - [[edam:format_3462](http://edamontology.org/format_3462)] Further lossless compression of [SAM](http://samtools.github.io/hts-specs/SAMv1.pdf) format.
      • MAF - [[edam:format_3008](http://edamontology.org/format_3008)] The multiple alignment format stores a series of multiple alignments in a format that is easy to parse and relatively easy to read.
      • SAM - [[edam:format_2573](http://edamontology.org/format_2573)] The Sequence Alignment/Map format.
      • BAM - [[edam:format_2572](http://edamontology.org/format_2572)] Binary [SAM](http://samtools.github.io/hts-specs/SAMv1.pdf) format. _note: it is becoming more common to store unaligned reads in a BAM called a uBAM_
      • CALF - The Compact ALignment Format records the base qualities and mapping qualities of the aligned reads, and unaligned reads data.
      • CRAM - [[edam:format_3462](http://edamontology.org/format_3462)] Further lossless compression of [SAM](http://samtools.github.io/hts-specs/SAMv1.pdf) format.
    • Miscellaneous

      • BIOM - [[edam:format_3746](http://edamontology.org/format_3746)] The Biological Observation Matrix (BIOM) file format (canonically pronounced biome) is designed to be a general-use format for representing biological sample by observation contingency tables.
      • Pairs - The specification for the text contact list file format for chromosome conformation experiments (e.g., Hi-C).
      • SBML - [[edam:format_2585](http://edamontology.org/format_2585)] Systems Biology Markup Language (SBML), the standard XML format for models of biological processes such as for example metabolism, cell signaling, and gene regulation.
      • SBML - [[edam:format_2585](http://edamontology.org/format_2585)] Systems Biology Markup Language (SBML), the standard XML format for models of biological processes such as for example metabolism, cell signaling, and gene regulation.
      • BIOM - [[edam:format_3746](http://edamontology.org/format_3746)] The Biological Observation Matrix (BIOM) file format (canonically pronounced biome) is designed to be a general-use format for representing biological sample by observation contingency tables.
  • Review Papers and Blogs