{"id":13751464,"url":"https://github.com/bigdatagenomics/adam","last_synced_at":"2025-10-19T19:22:23.897Z","repository":{"id":11968913,"uuid":"14541530","full_name":"bigdatagenomics/adam","owner":"bigdatagenomics","description":"ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.","archived":false,"fork":false,"pushed_at":"2025-05-06T16:47:40.000Z","size":13054,"stargazers_count":1023,"open_issues_count":37,"forks_count":314,"subscribers_count":96,"default_branch":"master","last_synced_at":"2025-05-12T13:07:31.421Z","etag":null,"topics":["avro","big-data","bioinformatics","genomics","java","parquet","python","r","scala","spark"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bigdatagenomics.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2013-11-19T23:47:57.000Z","updated_at":"2025-05-08T21:43:49.000Z","dependencies_parsed_at":"2023-01-13T16:44:31.747Z","dependency_job_id":"96757a58-e3eb-4902-8267-97966f2640f0","html_url":"https://github.com/bigdatagenomics/adam","commit_stats":{"total_commits":1647,"total_committers":88,"mean_commits":18.71590909090909,"dds":0.7091681845780207,"last_synced_commit":"360e5b456b2bd31fb16ce0b89ae56a68b615e984"},"previous_names":[],"tags_count":67,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigdatagenomics%2Fa
dam","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigdatagenomics%2Fadam/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigdatagenomics%2Fadam/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bigdatagenomics%2Fadam/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bigdatagenomics","download_url":"https://codeload.github.com/bigdatagenomics/adam/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253745152,"owners_count":21957317,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["avro","big-data","bioinformatics","genomics","java","parquet","python","r","scala","spark"],"created_at":"2024-08-03T09:00:45.731Z","updated_at":"2025-10-19T19:22:18.863Z","avatar_url":"https://github.com/bigdatagenomics.png","language":"Scala","funding_links":[],"categories":["Scala","[](https://github.com/josephmisiti/awesome-machine-learning/blob/master/README.md#scala)Scala","Packages","Ranked by starred repositories"],"sub_categories":["General-Purpose Machine Learning","Bioinformatics"],"readme":"ADAM\n====\n\n[![Maven Central](https://img.shields.io/maven-central/v/org.bdgenomics.adam/adam-parent-spark3_2.12.svg?maxAge=600)](http://search.maven.org/#search%7Cga%7C1%7Corg.bdgenomics.adam)\n[![API Documentation](http://javadoc.io/badge/org.bdgenomics.adam/adam-core-spark3_2.12.svg?color=brightgreen\u0026label=scaladoc)](http://javadoc.io/doc/org.bdgenomics.adam/adam-core-spark3_2.12)\n\n# Introduction\n\nADAM is a 
library and command line tool that enables the use of [Apache\nSpark](https://spark.apache.org) to parallelize genomic data analysis across\ncluster/cloud computing environments. ADAM uses a set of schemas to describe\ngenomic sequences, reads, variants/genotypes, and features, and can be used\nwith data in legacy genomic file formats such as SAM/BAM/CRAM, BED/GFF3/GTF,\nand VCF, as well as data stored in the columnar\n[Apache Parquet](https://parquet.apache.org) format. On a single node, ADAM\nprovides competitive performance to optimized multi-threaded tools, while\nenabling scale out to clusters with more than a thousand cores. ADAM's APIs\ncan be used from Scala, Java, Python, R, and SQL.\n\n## Why ADAM?\n\nOver the last decade, DNA and RNA sequencing has evolved from an expensive,\nlabor intensive method to a cheap commodity. The consequence of this is\ngeneration of _massive amounts of genomic and transcriptomic data_. Typically,\ntools to process and interpret these data are developed with a focus on\nexcellence of the results generated, not on __scalability__ and\n__interoperability__. A typical _sequencing workflow_ consists of a suite\nof tools from quality control, mapping, mapped read preprocessing, to variant\ncalling or quantification, depending on the application at hand. Concretely,\nthis usually means that such a workflow is implemented as tools glued together\nby scripts or workflow descriptions, with data written to files at each step.\nThis approach entails three main bottlenecks: \n\n  1. __scaling the workflow__ comes down to scaling each of the individual\n     tools,\n  2. the __stability of the workflow__ heavily depends on the consistency of\n     intermediate file formats, and\n  3. 
__writing to and reading from disk__ is a major slow-down.\n\nWe propose here a transformative solution for these problems: replacing\nad-hoc workflows with the [ADAM framework](http://bdgenomics.org/), developed\nin the [Apache Spark](http://spark.apache.org/) ecosystem.\n\nADAM brings the high-performance in-memory cluster computing functionality\nof Apache Spark to genomic data, ensuring efficient and fault-tolerant\ndistribution based on data parallelism, without the intermediate disk\noperations required in traditional distributed approaches.\n\nFurthermore, the ADAM and Apache Spark approach comes with an additional\nbenefit. Typically, the endpoint of a sequencing pipeline is a file with\nprocessed data for a single sample, e.g., variants for DNA sequencing or read\ncounts for RNA sequencing. The real endpoint, however, of a sequencing\nexperiment initiated by an investigator is __interpretation__ of these data\nin a certain context. This usually translates into (statistical) analysis of\nmultiple samples, connection with (clinical) metadata, and interactive\nvisualization, using data science tools such as R, Python, Tableau, and\nSpotfire. 
In addition to scalable distributed processing, Apache Spark also\nallows __interactive data analysis__ in the form of analysis notebooks\n(Spark Notebook, Jupyter, or Zeppelin), or direct connection to the data in\nR and Python.\n\n# Getting Started\n\n## Installing ADAM via Conda\n\nADAM is available in Conda via Bioconda, https://bioconda.github.io\n\n```bash\n$ conda install adam\n```\n\n## Installing ADAM via Homebrew\n\nADAM is available in Homebrew via Brewsci/bio, https://github.com/brewsci/homebrew-bio\n\n```bash\n$ brew install brewsci/bio/adam\n```\n\n## Installing ADAM via Docker\n\nADAM is available in Docker via BioContainers, https://biocontainers.pro\n\n```bash\n$ docker pull quay.io/biocontainers/adam:{tag}\n```\n\nFind `{tag}` on the tag search page, https://quay.io/repository/biocontainers/adam?tab=tags\n\n## Building from Source\n\nYou will need to have [Apache Maven](http://maven.apache.org/) version 3.3.9 or\nlater installed in order to build ADAM.\n\n```bash\n$ git clone https://github.com/bigdatagenomics/adam.git\n$ cd adam\n$ mvn install\n```\n\n### Installing Spark\n\nYou'll need to have a Spark release on your system and the `$SPARK_HOME` environment variable pointing at it;\nprebuilt binaries can be downloaded from the [Spark website](http://spark.apache.org/downloads.html).\n\nAs of ADAM version 0.37.0, Spark version 3.2.0 or later is required.\n\n\n# Documentation\n\nADAM's documentation is available at http://adam.readthedocs.io.\n\nADAM's core API documentation is available at http://javadoc.io/doc/org.bdgenomics.adam/adam-core-spark3_2.12.\n\n# The ADAM/Big Data Genomics Ecosystem\n\nADAM builds upon the open source [Apache Spark](https://spark.apache.org),\n[Apache Avro](https://avro.apache.org), and [Apache\nParquet](https://parquet.apache.org) projects. 
Additionally, ADAM can be\ndeployed for both interactive and production workflows using a variety of\nplatforms.\n\nThere are a number of tools built using ADAM's core APIs:\n\n* [Avocado](https://github.com/bigdatagenomics/avocado) - Avocado is a distributed\n  variant caller built on top of ADAM for germline and somatic calling.\n* [Cannoli](https://github.com/bigdatagenomics/cannoli) - ADAM\n  [Pipe](http://adam.readthedocs.io/en/latest/api/pipes/) API wrappers for bioinformatics\n  tools (e.g.,\n  [BWA](https://github.com/lh3/bwa),\n  [bowtie2](http://bowtie-bio.sourceforge.net/bowtie2/index.shtml),\n  [FreeBayes](https://github.com/ekg/freebayes)).\n* [DECA](https://github.com/bigdatagenomics/deca) - DECA is a reimplementation of the\n  XHMM copy number variant caller on top of ADAM.\n* [Gnocchi](https://github.com/bigdatagenomics/gnocchi) - Gnocchi provides primitives\n  for running GWAS/eQTL tests on large genotype/phenotype datasets using ADAM.\n* [Lime](https://github.com/bigdatagenomics/lime) - Lime provides a\n  parallel implementation of genomic set-theoretic primitives using the ADAM\n  [region join](http://adam.readthedocs.io/en/latest/api/joins/) API.\n* [Mango](https://github.com/bigdatagenomics/mango) - Mango is a library for\n  visualizing large-scale genomics data with interactive latencies.\n\nFor more, please see our [awesome list of applications](https://github.com/bigdatagenomics/awesome-adam) that extend ADAM.\n\n\n# Connecting with the ADAM team\n\nThe best way to reach the ADAM team is to post in our [Gitter\nchannel](https://gitter.im/bigdatagenomics/adam) or to open an issue on our\n[GitHub repository](https://github.com/bigdatagenomics/adam/issues). For more\ncontact methods, please see [our support page](https://github.com/bigdatagenomics/adam/blob/master/SUPPORT.md).\n\n\n# License\n\nADAM is released under the [Apache License, Version 2.0](LICENSE.txt).\n\n\n# Citing ADAM\n\nADAM has been described in two manuscripts. 
The first, [a tech\nreport](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-207.pdf),\ncame out in 2013 and described the rationale behind using schemas for genomics,\nand presented an early implementation of some of the preprocessing algorithms.\nTo cite this paper, please cite:\n\n```\n@techreport{massie13,\n  title={{ADAM}: Genomics Formats and Processing Patterns for Cloud Scale Computing},\n  author={Massie, Matt and Nothaft, Frank and Hartl, Christopher and Kozanitis, Christos and Schumacher, Andr{\\'e} and Joseph, Anthony D and Patterson, David A},\n  year={2013},\n  institution={UCB/EECS-2013-207, EECS Department, University of California, Berkeley}\n}\n```\n\nThe second, [a conference paper](http://dl.acm.org/ft_gateway.cfm?ftid=1586788\u0026id=2742787),\nappeared in the SIGMOD 2015 Industrial Track. This paper described how ADAM's\ndesign was influenced by database systems, expanded upon the concept of a stack\narchitecture for scientific analyses, presented more results comparing ADAM to\nstate-of-the-art single node genomics tools, and demonstrated how the\narchitecture generalized beyond genomics. To cite this paper, please cite:\n\n```\n@inproceedings{nothaft15,\n  title={Rethinking Data-Intensive Science Using Scalable Analytics Systems},\n  author={Nothaft, Frank A and Massie, Matt and Danford, Timothy and Zhang, Zhao and Laserson, Uri and Yeksigian, Carl and Kottalam, Jey and Ahuja, Arun and Hammerbacher, Jeff and Linderman, Michael and Franklin, Michael and Joseph, Anthony D. 
and Patterson, David A.},\n  booktitle={Proceedings of the 2015 International Conference on Management of Data (SIGMOD '15)},\n  year={2015},\n  organization={ACM}\n}\n```\n\nWe prefer that you cite both papers, but if you can only cite one paper, we\nprefer that you cite the SIGMOD 2015 manuscript.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigdatagenomics%2Fadam","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbigdatagenomics%2Fadam","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbigdatagenomics%2Fadam/lists"}