{"id":13735232,"url":"https://github.com/google/nucleus","last_synced_at":"2025-09-30T09:30:44.560Z","repository":{"id":41203512,"uuid":"126868526","full_name":"google/nucleus","owner":"google","description":"Python and C++ code for reading and writing genomics data.","archived":true,"fork":false,"pushed_at":"2021-12-09T21:37:35.000Z","size":6423,"stargazers_count":787,"open_issues_count":3,"forks_count":125,"subscribers_count":50,"default_branch":"v0.6.0","last_synced_at":"2025-01-07T16:08:38.135Z","etag":null,"topics":["bioinformatics","dna","genomics","tensorflow"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/google.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-03-26T17:58:07.000Z","updated_at":"2025-01-02T20:58:07.000Z","dependencies_parsed_at":"2022-07-16T18:17:08.618Z","dependency_job_id":null,"html_url":"https://github.com/google/nucleus","commit_stats":null,"previous_names":[],"tags_count":18,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fnucleus","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fnucleus/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fnucleus/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/google%2Fnucleus/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/google","download_url":"https://codeload.github.com/google/nucleus/tar.gz/refs/heads/v0.6.0","host":{"name":"GitHub","url":"https://github.com","kind":
"github","repositories_count":234722055,"owners_count":18876896,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","dna","genomics","tensorflow"],"created_at":"2024-08-03T03:01:04.527Z","updated_at":"2025-09-30T09:30:39.521Z","avatar_url":"https://github.com/google.png","language":"C++","readme":"# Nucleus\n\nNucleus is a library of Python and C++ code designed to make it easy to read,\nwrite and analyze data in common genomics file formats like SAM and VCF. In\naddition, Nucleus enables painless integration with the TensorFlow machine\nlearning framework, as anywhere a genomics file is consumed or produced, a\nTensorFlow tfrecords file may be used instead.\n\n## Tutorial\n\nPlease check out our tutorial on\n[using Nucleus and TensorFlow for DNA sequencing error correction](https://colab.research.google.com/github/google/nucleus/blob/master/nucleus/examples/dna_sequencing_error_correction.ipynb).\nIt's a Python notebook that really demonstrates the power of Nucleus at\nintegrating information from multiple file types (BAM, VCF and Fasta) and\nturning it into a form usable by TensorFlow.\n\n## Poll\n\nWhich of these would most increase your usage of Nucleus? 
(Click on an option to\nvote on it.)\n\n[![](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Better%20TensorFlow%20integration)](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Better%20TensorFlow%20integration/vote)\n[![](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Spark%20integration)](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Spark%20integration/vote)\n[![](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Beam%20integration)](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Beam%20integration/vote)\n[![](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Improved%20documentation)](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Improved%20documentation/vote)\n[![](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Support%20for%20more%20file%20formats)](https://api.gh-polls.com/poll/01CQSHKQZMV3F2JZ72YYQ28Q4F/Support%20for%20more%20file%20formats/vote)\n\n## Installation\n\nNucleus currently only works on modern Linux systems using Python 3. It must be\ninstalled using a version of `pip` less than 21. To determine the version of pip\ninstalled on your system, run\n\n```\npip --version\n```\n\nTo install Nucleus, run\n\n```shell\npip install --user google-nucleus\n```\n\nNote that each version of Nucleus works with a specific TensorFlow version. 
Check the [releases](https://github.com/google/nucleus/releases) page for specifics.\n\nYou can ignore any \"Failed building wheel for google-nucleus\" error messages -- these are expected\nand won't prevent Nucleus from installing successfully.\n\nIf you are using Python 2, instead run\n\n```shell\npip install --user google-nucleus==0.3.2\n```\n\n## Documentation\n\n*   [Overview](https://github.com/google/nucleus/blob/master/docs/overview.md).\n*   [Summary of example programs](https://github.com/google/nucleus/blob/master/docs/examples.md).\n*   [Python API Reference](https://github.com/google/nucleus/blob/master/docs/source/doc_index.md).\n\n## Building from source\n\nFor Ubuntu 20, building from source is easy. Simply type\n\n```shell\nsource install.sh\n```\n\nThis will call `build_clif.sh`, which will build CLIF from scratch as well.\n\nFor all other systems, you will need to first install CLIF by following the\ninstructions at\n[https://github.com/google/clif#installation](https://github.com/google/clif#installation)\nbefore running install.sh. You'll need to run this command with Python 3.8.\nIf you don't want to build CLIF binaries on your own, you can consider\nusing pre-built CLIF binaries (see\n[an example here](https://github.com/google/nucleus/blob/v0.5.6/install.sh#L143-L152)). Note that we don't plan to update these pre-built CLIF binaries, so we\nrecommend building CLIF binaries from scratch.\n\nNote that install.sh extensively depends on apt-get, so it is unlikely to run\nwithout extensive modifications on non-Debian-based systems.\n\nNucleus depends on TensorFlow. By default, install.sh will install a CPU-only\nversion of a stable TensorFlow release (currently 2.6). If that isn't what you\nwant, there are several other options that can be enabled with a simple edit to\n`install.sh`.\n\nRunning `install.sh` will build all of Nucleus's programs and libraries. You can\nfind the generated binaries under `bazel-bin/nucleus`. 
If in addition to\nbuilding Nucleus you would like to run its tests, execute\n\n```shell\nbazel test -c opt $BAZEL_FLAGS nucleus/...\n```\n\n## Version\n\nThis is Nucleus 0.6.0. Nucleus follows\n[semantic versioning](https://semver.org/).\n\nNew in 0.6.0:\n\n*   Upgrade to support TensorFlow 2.6.0 specifically.\n*   Upgrade to Python 3.8.\n\nNew in 0.5.9:\n\n*   Upgrade to support TensorFlow 2.5.0 specifically.\n\nNew in 0.5.8:\n\n*   Update `util/vis.py` to use updated channel names.\n*   Support `MED_DP` (median DP) field for a `VariantCall`.\n\nNew in 0.5.7:\n\n*   Add automatic pileup curation functionality in `util/vis.py`.\n*   Upgrade protobuf settings to support TensorFlow 2.4.0 specifically.\n\nNew in 0.5.6:\n\n*   Upgrade to protobuf 3.9.2 to support TensorFlow 2.3.0 specifically.\n\nNew in 0.5.5:\n\n*   Upgrade protobuf settings to support TensorFlow 2.2.0 specifically.\n\nNew in 0.5.4:\n\n*   Upgrade to protobuf 3.8.0 to support TensorFlow 2.1.0.\n*   Add explicit .close() method to TFRecordWriter.\n\nNew in 0.5.3:\n\n*   Fixes memory leaks in message_module.cc.\n*   Updates setup.py to install .egg-info directory for pip 20.2+ compatibility.\n*   Pins TensorFlow to 2.0.0 for protobuf version compatibility.\n*   Pins setuptools to 49.6.0 to avoid breaking changes of setuptools 50.\n\nNew in 0.5.2:\n\n*   Upgrades htslib dependency from 1.9 to 1.10.2.\n*   More informative error message for failed SAM header parsing.\n*   `util/vis.py` now supports saving images to Google Cloud Storage.\n\nNew in 0.5.1:\n\n*   Added new utilities for working with DeepVariant pileup images and variant\n    protos.\n\nNew in 0.5.0:\n\n*   Fixed bug preventing Nucleus from working with TensorFlow 2.0.\n*   Added util.vis routines for visualizing DeepVariant pileup examples.\n*   FASTA reader now supports keep\\_true\\_case option for keeping the original\n    casing.\n*   VCF writer now supports writing headerless VCF files.\n*   SAM reader now supports optional fields of type 
'B'.\n*   variant\\_utils now supports gVCF files.\n*   Numerous minor bug fixes.\n\nNew in 0.4.1:\n\n*   Pip package is slightly more robust.\n\nNew in 0.4.0:\n\n*   The Nucleus pip package now works with Python 3.\n\nNew in 0.3.0:\n\n*   Reading of VCF, SAM, and most other genomics files is now twice as fast.\n*   Read range and end calculations are now done in C++ for speed.\n*   VcfReader can now read \"headerless\" VCF files.\n*   variant\\_utils.major\\_allele\\_frequency now 5x faster.\n*   Memory leaks fixed in TFRecordReader/Writer and gfile\\_cc.\n\nNew in 0.2.3:\n\n*   Nucleus no longer depends on any specific version of TensorFlow's python\n    code. This should make it easier to use Nucleus with for example TensorFlow\n    2.0.\n*   Added BCF support to VcfWriter.\n*   Fixed memory leaks in VcfWriter::Write.\n*   Added print\\_tfrecord example program.\n\nNew in 0.2.2:\n\n*   Faster SAM file querying and read overlap calculations.\n*   Writing protocol buffers to files uses less memory.\n*   Smaller pip package.\n*   nucleus/util:io\\_utils refactored into nucleus/io:tfrecord and\n    nucleus/io:sharded\\_file\\_utils.\n*   Alleles coming from VCF files are now always normalized as uppercase.\n\nNew in 0.2.1:\n\n*   Upgrades htslib dependency from 1.6 to 1.9.\n*   Minor VCF parsing fixes.\n*   Added new example program, apply\\_genotyping\\_prior.\n*   Slightly more robust pip package.\n\nNew in 0.2.0:\n\n*   Support for reading and writing BedGraph files.\n*   Support for reading and writing GFF files.\n*   Support for reading and writing CRAM files.\n*   Support for writing SAM/BAM files.\n*   Support for reading unindexed FASTA files.\n*   Iteration support for indexed FASTA files.\n*   Ability to read VCF files from memory.\n*   Python API documentation.\n*   Python 3 compatibility.\n*   Added universal file converter example program.\n\n## License\n\nNucleus is licensed under the terms of the [Apache 2 license](LICENSE).\n\n## 
Support\n\nThe\n[Genomics team in Google Brain](https://research.google.com/teams/brain/genomics/)\nactively supports Nucleus and is always interested in improving its quality. If\nyou run into an issue, please report the problem on our\n[Issue tracker](https://github.com/google/nucleus/issues). Be sure to add enough\ndetail to your report that we can reproduce the problem and fix it. We encourage\nincluding links to snippets of BAM/VCF/etc. files that provoke the bug, if\npossible. Depending on the severity of the issue we may patch Nucleus\nimmediately with the fix or roll it into the next release.\n\n## Contributing\n\nInterested in contributing? See [CONTRIBUTING](CONTRIBUTING.md).\n\n## History\n\nNucleus grew out of the [DeepVariant](https://github.com/google/deepvariant)\nproject.\n\n## Disclaimer\n\nThis is not an official Google product.\n","funding_links":[],"categories":["TensorFlow Tools, Libraries, and Frameworks","Software packages","Ranked by starred repositories"],"sub_categories":["Data wrangling"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fnucleus","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogle%2Fnucleus","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogle%2Fnucleus/lists"}