{"id":34027943,"url":"https://github.com/brenodupin/gdt","last_synced_at":"2026-04-07T15:34:06.606Z","repository":{"id":301810154,"uuid":"1000978876","full_name":"brenodupin/gdt","owner":"brenodupin","description":"GDT is a tool for gene dictionary management, works with annotated genomes, handles .gdict and GFF3 files, and provides both programmatic and interactive interface.","archived":false,"fork":false,"pushed_at":"2026-02-20T15:48:08.000Z","size":34894,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-03-14T06:20:33.574Z","etag":null,"topics":["bioinformatics","gene-naming","genome-annotation","gff3","gff3-format","pip"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brenodupin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-12T16:05:10.000Z","updated_at":"2026-02-20T15:48:26.000Z","dependencies_parsed_at":"2025-08-26T15:59:24.733Z","dependency_job_id":"754f48c5-b2b6-47c4-b4e6-de3af04e091c","html_url":"https://github.com/brenodupin/gdt","commit_stats":null,"previous_names":["brenodupin/gdt"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/brenodupin/gdt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brenodupin%2Fgdt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brenodupin%2Fgdt/
tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brenodupin%2Fgdt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brenodupin%2Fgdt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brenodupin","download_url":"https://codeload.github.com/brenodupin/gdt/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brenodupin%2Fgdt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31518621,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-07T03:10:19.677Z","status":"ssl_error","status_checked_at":"2026-04-07T03:10:13.982Z","response_time":105,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","gene-naming","genome-annotation","gff3","gff3-format","pip"],"created_at":"2025-12-13T17:02:52.195Z","updated_at":"2026-04-07T15:34:06.600Z","avatar_url":"https://github.com/brenodupin.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://github.com/brenodupin/gdt/releases/download/v1.0.0/GDT_logo_dark_mode.png\"\u003e\n    \u003csource media=\"(prefers-color-scheme: light)\" 
srcset=\"https://github.com/brenodupin/gdt/releases/download/v1.0.0/GDT_logo_light_mode.png\"\u003e\n    \u003cimg src=\"https://github.com/brenodupin/gdt/releases/download/v1.0.0/GDT_logo_light_mode.png\" width=\"50%\" alt=\"GDT Logo\"\u003e\n  \u003c/picture\u003e\n\n$${\\color{#E0AF68}{\\LARGE\\textsf{🧬 Standardizing gene names across organelle genomes 🧬}}}$$  \n\u003cbr\u003e\n[![Published in STAR Protocols](https://img.shields.io/badge/published%20in-STAR%20Protocols-blue)](https://doi.org/10.1016/j.xpro.2025.104187)\n[![License](https://img.shields.io/badge/license-MIT-purple)](https://github.com/brenodupin/gdt/blob/master/LICENSE)\n![Build Status](https://img.shields.io/badge/tests-in_development-yellow)\n[![Checked with mypy](https://www.mypy-lang.org/static/mypy_badge.svg)](https://mypy-lang.org/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Linting: Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/charliermarsh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n[![DOI](https://zenodo.org/badge/1000978876.svg)](https://zenodo.org/badge/latestdoi/1000978876)\n\u003c/div\u003e\n\n\n# Table of Contents\n\n- [Overview](#overview)\n- [Getting Started](#getting-started)\n  - [Requirements](#requirements)\n  - [Installation](#installation)\n- [GDICT Format](#gdict-format)\n  - [tl;dr](#tldr)\n    - [Quick Overview](#quick-overview)\n    - [Basic Format](#basic-format)\n    - [Entry Types](#entry-types)\n    - [Label Convention](#label-convention)\n    - [Complete Specification](#complete-specification)\n  - [Creation Process](#creation-process)\n  - [Update of GFF Versions](#update-of-gff-versions)\n- [CLI commands](#cli-commands)\n  - [`filter`](#filter)\n  - [`stripped`](#stripped)\n  - [`standardize`](#standardize)\n- [Library usage](#library-usage)\n- [Project structure](#project-structure)\n\n\n# Overview\n\nGDT (Gene Dictionary Tool) is a 
protocol for the creation and implementation of a gene dictionary across any type of annotated genome. This Python package offers a suite of functionalities that enables the manipulation and integration of .gdict files into other pipelines seamlessly.\n\n# Getting Started\n\n## Requirements\n\n### `gdt` Library\n- [Python](https://www.python.org/) `(\u003e=3.12)`\n- [pandas](https://pandas.pydata.org/) `(\u003e=1.5.3,\u003c3.0.0)`\n\n### Notebooks\n- [Python](https://www.python.org/) `(\u003e=3.12)`\n- [gdt](https://github.com/brenodupin/gdt) `(\u003e=1.0.0)`\n- [pandas](https://pandas.pydata.org/) `(\u003e=1.5.3,\u003c3.0.0)`\n- [biopython](https://biopython.org) `(\u003e=1.80)`\n\n## Installation\n### `gdt`\nYou can install the library with pip:\n```shell\npip install gdt\n```\n### Notebooks\nTo run the Jupyter notebooks, you need to install gdt and biopython:\n```shell\npip install gdt biopython\n```\n\n## GDICT Format\n### tl;dr\n\nGDICT (`.gdict`) is a plain-text file that stores a `GeneDict` with a human-readable, easily editable, and machine-parsable structure. `.gdict` files are read by `gdt.read_gdict()` and written by `gdt.GeneDict.to_gdict()`. A GDICT file contains gene nomenclature data (i.e., gene identifiers) and associated metadata (gene names, database cross-references and comments added by the user).\n\n#### Quick Overview\n- **File extension**: `.gdict`\n- **Structure**: Header + labeled sections with gene data\n- **Encoding**: UTF-8\n- **Current version**: 0.0.2\n\n#### Basic Format\n```\n#! version 0.0.2\n#! 
Optional metadata lines\n\n[LABEL]\ngene description #gd SOURCE\ngene-identifier #gn SOURCE1 SOURCE2\ngene-identifier #dx SOURCE:GeneID\n```\n\n#### Entry Types\n- **`#gd`** - Gene descriptions (names from NCBI Gene, etc.)\n- **`#gn`** - Gene identifiers from genome annotations  \n- **`#dx`** - Database cross-references with GeneIDs\n\n#### Label Convention\n\nWe propose a label naming convention that is based on the [HGNC](https://www.genenames.org/) human mitochondrial gene nomenclature, but adapted to accommodate other organelles and genetic compartments. The labels are structured as `\u003cprefix\u003e-\u003csymbol\u003e`, where `\u003cprefix\u003e` is a three-letter code representing the genetic compartment, and `\u003csymbol\u003e` is the gene name or identifier.\n\n**The GDT library will not enforce any label naming convention** (even our own!), helping you rename and remove labels with the `gdt.GeneDict.rename_labels()` and `gdt.GeneDict.remove_labels()` methods, respectively.\n\n#### Complete Specification\nYou can read more about it at the [GDICT File Specification](https://github.com/brenodupin/gdt/blob/master/GDICT_FILE_SPECIFICATION.md)\n\n### Creation Process\n\nThe process of creating a GDICT file is not fully automated because it requires extensive user input and curation. To facilitate this process, we provide two Jupyter notebooks that guide the user through the steps of creating a GDICT file from scratch or from an existing stripped GDICT file. These notebooks are designed to be run interactively, allowing the user to make decisions and curate the entries as needed.  
\nWe provide our GDICT files (also in stripped form) for most organelle genomes (publicly available at NCBI), which can be used as a starting point for creating new GDICT files.\n\nA more detailed description of the process can be found in the preprint: [Protocol for GDT, Gene Dictionary Tool, to create and implement a gene dictionary across annotated genomes](https://doi.org/10.1101/2025.06.15.659783)\n\n### Update of GFF Versions\n\nWe have written a guide to update an existing GDICT after a new version of a GFF (in your dataset) is released. The guide can be found in the [GFF Version Update Guide](GFF_Update_Guide.md).\n\n## CLI commands\n\nThe flags below work on all commands:\n\n|       flag      |   description   |\n|-----------------|-----------------|\n| `-h`, `--help`      | Show the help message and exit. | \n| `--debug`         | Enable TRACE level in file, and DEBUG on console.\u003cbr\u003eDefault: DEBUG level on file and INFO on console. |\n| `--log LOG`       | Path to the log file. If not provided, a default log file will be created. |\n| `--quiet`         | Suppress console output. |\n| `--version`      | Show the version of the gdt package. |\n\n### `gdt-cli filter`\nThe filter command filters GFF3 files that are indexed via a TSV file. It may create `AN_missing_dbxref.txt` and/or `AN_missing_gene_dict.txt` based on the provided .gdict file.\n\n|       flag      |   description   |\n|-----------------|-----------------|\n| `--tsv TSV`       | TSV file with indexed GFF3 files to filter. |\n| `--AN-column AN_COLUMN` | Column name for NCBI Accession Number inside the TSV. Default: AN |\n| `--gdict GDICT`       | GDICT file to use for filtering. If not provided, an empty GeneDict (i.e., GDICT file) will be used. |\n| `--keep-orfs`     | Keep ORFs. Default: exclude ORFs. |\n| `--workers WORKERS` | Number of workers to use. Default: 0 (use all available cores) |\n| `--gff-suffix GFF_SUFFIX` | Suffix for GFF files. 
Default: '.gff3' |\n| `--query-string QUERY_STRING` | Query string used by pandas to filter features in the GFF. Default: 'type in ('gene', 'tRNA', 'rRNA')' |\n\nUsage example: \n```shell\ngdt-cli filter --tsv fungi_mt_model2.tsv --gdict fungi_mt_model2_stripped.gdict --debug\n```\n\n### `gdt-cli stripped`\nThe stripped command filters out GeneGeneric (#gn) and Dbxref (#dx) entries from a GDICT file, keeping only GeneDescription (#gd) entries and their metadata.\n\n|       flag      |   description   |\n|-----------------|-----------------|\n| `--gdict_in GDT_IN`, `-gin GDICT_IN` | Input GDICT file to be stripped. |\n| `--gdict_out GDT_OUT`, `-gout GDICT_OUT` | New GDICT file to create. |\n| `--overwrite`     | Overwrite output file, if it already exists. Default: False |\n\nUsage example: \n```shell\ngdt-cli stripped --gdict_in ../GeneDictionaries/Result/Fungi_mt.gdict --gdict_out fungi_mt_model2_stripped.gdict --overwrite\n```\n\n### `gdt-cli standardize`\nThe standardize command standardizes gene names across features in GFF3 files using a GDICT file.\nThe command has two modes: a single GFF3 file with `--gff`, or a TSV file with indexed GFF3 files with `--tsv`.\n\n|       flag      |   description   |\n|-----------------|-----------------|\n| `--gff GFF`       | GFF3 file to standardize. |\n|\u003cimg width=200/\u003e |\u003cimg width=500/\u003e|\n| `--tsv TSV`       | TSV file with indexed GFF3 files to standardize. |\n| `--AN-column AN_COLUMN` | Column name for NCBI Accession Number inside the TSV. Default: AN |\n| `--gff-suffix GFF_SUFFIX` | Suffix for GFF files. Default: '.gff3' |\n|\u003cimg width=200/\u003e |\u003cimg width=500/\u003e|\n| `--gdict GDICT`       | GDICT file to use for standardization. |\n| `--query-string QUERY_STRING` | Query string used by pandas to filter features in the GFF. Default: 'type in ('gene', 'tRNA', 'rRNA')' |\n| `--check`         | Just check for standardization issues, do not modify the GFF3 file. 
Default: False |\n| `--second-place`  | Add gdt_tag pair to the second place in the GFF3 file, after the ID. Default: False (add to the end of the attributes field). |\n| `--gdt-tag GDT_TAG` | Tag to use for the GDT key/value pair in the GFF3 file. Default: 'gdt_label='. |\n| `--error-on-missing` | Raise an error if a feature is missing in the GDT file. Default: False (just log a warning and skip the feature). |\n| `--save-copy`     | Save a copy of the original GFF3 file with a .original suffix. Default: False (modify in place). |\n\nUsage example:\n```shell\ngdt-cli standardize --gff sandbox/fungi_mt/HE983611.1.gff3 --gdict sandbox/fungi_mt/misc/gdt/fungi_mt_pilot_07.gdict --save-copy\n```\n```shell\ngdt-cli standardize --tsv sandbox/fungi_mt/fungi_mt.tsv --gdict sandbox/fungi_mt/misc/gdt/fungi_mt_pilot_07.gdict --second-place --debug --log test1.log\n```\n\n## Library usage\nYou can use the library in your own Python scripts. The main interface is the `GeneDict` class, where you can load a GDT file and use its methods to manipulate it.\n\nSince `GeneDict` inherits from `collections.UserDict`, it behaves like a dictionary, allowing you to manipulate its entries using standard dictionary methods. The metadata are stored as attributes of the `GeneDict` object, which can be accessed directly.\nThey are:\n- `version`: The version of the GDT file. 
(\"0.0.2\")\n- `header`: A list of strings containing the header lines from the GDT file.\n- `info`: An instance of `GeneDictInfo` containing metadata about its entries (This information is only calculated when `update_info()` is called, or when `lazy_info` is set to `False` at start).\n    \n   - `labels`: The number of unique gene labels in the GDT file.\n   - `total_entries`: The total number of entries in the GDT file.\n   - `gene_descriptions`: The number of gene description entries (#gd) in the GDT file.\n   - `gene_generics`: The number of gene generic entries (#gn) in the GDT file.\n   - `dbxref_GeneIDs`: The number of dbxref entries (#dx) that contain GeneID in the GDT file.\n\nTo read a GDT file, you can use the `read_gdict()` function, which returns a `GeneDict` object. You can then manipulate it as needed and save it back to a GDT file using the `to_gdict()` method.\n\n```python\nimport gdt\n\n# Read a GDT file\ngene_dict = gdt.read_gdict(\"path/to/your.gdict\")\n# Check the type of the object\nprint(type(gene_dict))  # \u003cclass 'gdt.gdict.GeneDict'\u003e\n# Access metadata\nprint(gene_dict.version)  # \"0.0.2\"\ngene_dict.update_info()  # Update the info attribute with metadata\nprint(gene_dict.info.labels)  # Number of unique gene labels\nprint(gene_dict.info.total_entries)  # Total number of entries\n\n# Manipulate the GeneDict as needed\n# For example, you can access a specific entry by its key\nprint(gene_dict[\"gene-ATP8\"])  # Access the entry for 'gene-ATP8'\n\n# Save the GeneDict back to a GDT file\ngene_dict.to_gdict(\"path/to/your_output.gdict\", overwrite=True)\n```\nAll GDT functions and classes are documented with docstrings, so you can use the `help()` function to get more information about them. 
Full documentation of the library is being built with Sphinx and will be available in the `docs` folder.\n\n## Project structure\n\nWe follow a project structure inspired by [cookiecutter-bioinformatics-project](https://github.com/maxplanck-ie/cookiecutter-bioinformatics-project), with some modifications to better suit our needs. Below is an overview of the project structure:\n\n```\n├── CITATION.cff        \u003c- Contains metadata on how the project might eventually be published. \n├── LICENSE\n├── README.md           \u003c- The top-level README for developers using this project. \n│\n├── docs                \u003c- A default Sphinx project; see sphinx-doc.org for details\n│\n│\n├── img                 \u003c- A place to store images associated with the project/pipeline, e.g. \n│                         a figure of the pipeline DAG. \n│\n├── notebooks           \u003c- Jupyter or Rmd notebooks.\n│\n├── resources           \u003c- Place for data.\n│   ├── stripped        \u003c- Stripped down GDICT files, from our protocol, containing only the #gd entries.\n│   └── pilot           \u003c- Complete GDICT files, containing all entries (#gd, #gn, #dx) from our protocol.\n│\n├── example             \u003c- Example data.\n│ \n├── sandbox             \u003c- A place to test scripts and ideas. By default excluded from the git repository.\n│ \n├── pyproject.toml      \u003c- Makes project pip installable (pip install -e .) 
so src can be imported.\n│\n├─ src                  \u003c- Source code for use in this project.\n│  └─ gdt               \u003c- Package containing the main library code.\n│     ├── __init__.py   \u003c- Makes src/gdt a package.\n│     ├── cli.py        \u003c- Contains the command line interface for the gdt package.\n│     ├── gdt_impl.py   \u003c- Contains the main implementation of the GeneDict class and its methods.\n│     ├── gff3_utils.py \u003c- Contains utility functions for working with GFF files.\n│     └── log_setup.py  \u003c- Contains the logger configuration for the gdt package.\n│\n├── tox.ini             \u003c- tox file with settings for running tox; see tox.readthedocs.io \n│\n├── ruff.toml           \u003c- ruff configuration file for linting; see https://docs.astral.sh/ruff/configuration/\n│\n├── uv.lock             \u003c- uv configuration file for versioning; see https://docs.astral.sh/uv/concepts/projects/sync/\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrenodupin%2Fgdt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrenodupin%2Fgdt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrenodupin%2Fgdt/lists"}