{"id":29829824,"url":"https://github.com/y9c/metagene","last_synced_at":"2025-07-29T09:41:10.662Z","repository":{"id":295380005,"uuid":"512110621","full_name":"y9c/metagene","owner":"y9c","description":"Metagene Profiling Analysis and Visualization","archived":false,"fork":false,"pushed_at":"2025-07-07T03:15:25.000Z","size":24515,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-07T03:16:58.011Z","etag":null,"topics":["bioinformatics","epitranscriptomics","metagene","rna"],"latest_commit_sha":null,"homepage":"https://metagene.yech.science/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/y9c.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-07-09T06:22:36.000Z","updated_at":"2025-07-07T03:15:28.000Z","dependencies_parsed_at":"2025-05-25T08:44:43.964Z","dependency_job_id":null,"html_url":"https://github.com/y9c/metagene","commit_stats":null,"previous_names":["y9c/metagene"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/y9c/metagene","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/y9c%2Fmetagene","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/y9c%2Fmetagene/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/y9c%2Fmetagene/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/y9c%2Fmetagene/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/y9c","download_
url":"https://codeload.github.com/y9c/metagene/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/y9c%2Fmetagene/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267665326,"owners_count":24124516,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","epitranscriptomics","metagene","rna"],"created_at":"2025-07-29T09:40:59.837Z","updated_at":"2025-07-29T09:41:10.656Z","avatar_url":"https://github.com/y9c.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Metagene\n\n[![Pypi Releases](https://img.shields.io/pypi/v/metagene.svg)](https://pypi.python.org/pypi/metagene)\n[![Downloads](https://static.pepy.tech/badge/metagene)](https://pepy.tech/project/metagene)\n\n**Metagene Profiling Analysis and Visualization**\n\nThis tool analyzes metagene profiles, the distribution of genomic features relative to gene regions (5'UTR, CDS, 3'UTR), and creates publication-ready metagene profile plots.\n\n\n## Installation\n\nInstall metagene using pip:\n\n```bash\npip install metagene\n```\nMinimum Python version: 3.12\n\n## Quick Start\n\n### Command Line Interface\n\nBasic metagene analysis using a built-in reference:\n\n```bash\n# Using built-in human genome reference 
(GRCh38)\nmetagene -i sites.tsv.gz -r GRCh38 --with-header -m 1,2,3 -w 5 \\\n         -o output.tsv -s scores.tsv -p plot.png\n```\n\nUsing a custom GTF file:\n\n```bash\n# Using custom GTF annotation\nmetagene -i sites.bed -g custom.gtf.gz -m 1,2,3 -w 5 \\\n         -o output.tsv -s scores.tsv -p plot.png\n```\n\n### Python API\n\n```python\nfrom metagene import (\n    load_sites, load_reference, map_to_transcripts, \n    normalize_positions, plot_profile\n)\n\n# Load your genomic sites\nsites_df = load_sites(\"sites.tsv.gz\", with_header=True, meta_col_index=[0, 1, 2])\n\n# Load reference genome annotation\nreference_df = load_reference(\"GRCh38\")  # or load_gtf(\"custom.gtf.gz\")\n\n# Perform metagene analysis\nannotated_df = map_to_transcripts(sites_df, reference_df)\ngene_bins, gene_stats, gene_splits = normalize_positions(\n    annotated_df, split_strategy=\"median\", bin_number=100\n)\n\n# Generate plot\nplot_profile(gene_bins, gene_splits, \"metagene_plot.png\")\n\nprint(f\"Analyzed {gene_bins['count'].sum()} sites\")\nprint(f\"Gene splits - 5'UTR: {gene_splits[0]:.3f}, CDS: {gene_splits[1]:.3f}, 3'UTR: {gene_splits[2]:.3f}\")\nprint(f\"Gene statistics - 5'UTR: {gene_stats['5UTR']}, CDS: {gene_stats['CDS']}, 3'UTR: {gene_stats['3UTR']}\")\n```\n\n## Input Formats\n\n### TSV Format\n```\nref\tpos\tstrand\tscore\tpvalue\nchr1\t1000000\t+\t0.85\t0.001\nchr1\t2000000\t-\t0.72\t0.005\n```\n\n### BED Format\n```\nchr1\t999999\t1000000\tscore1\t0.85\t+\nchr1\t1999999\t2000000\tscore2\t0.72\t-\n```\n\n### Column Specification\n- Use `-m/--meta-columns` to specify coordinate columns (1-based indexing)\n- Use `-w/--weight-columns` to specify score/weight columns\n- Use `-H/--with-header` if your file has a header line\n\n## Built-in References\n\nMetagene includes pre-processed gene annotations for major model organisms:\n\n| Species             | Assembly    | Reference                                  |\n| ------------------- | ----------- | 
------------------------------------------ |\n| **Human**           | GRCh38/hg38 | `GRCh38`, `hg38`                           |\n|                     | GRCh37/hg19 | `GRCh37`, `hg19`                           |\n| **Mouse**           | GRCm39/mm39 | `GRCm39`, `mm39`                           |\n|                     | GRCm38/mm10 | `GRCm38`, `mm10`                           |\n|                     | mm9/NCBIM37 | `mm9`, `NCBIM37`                           |\n| **Arabidopsis**     | TAIR10      | `TAIR10`                                   |\n| **Rice**            | IRGSP-1.0   | `IRGSP-1.0`                                |\n| **Model Organisms** | Various     | `dm6`, `ce11`, `WBcel235`, `sacCer3`, etc. |\n\n### Managing References\n\nList all available references:\n```bash\nmetagene --list\n```\n\nThis will show all 23+ available references organized by species:\n```\nHuman:\n  GRCh37          - Human genome GRCh37 (Ensembl release 75)\n  GRCh38          - Human genome GRCh38 (Ensembl release 110)\n  hg19            - Human genome hg19 (UCSC 2021)\n  hg38            - Human genome hg38 (UCSC 2022)\n\nMouse:\n  GRCm38          - Mouse genome GRCm38 (Ensembl release 102)\n  GRCm39          - Mouse genome GRCm39 (Ensembl release 110)\n  mm10            - Mouse genome mm10 (UCSC 2021)\n  mm39            - Mouse genome mm39 (UCSC 2024)\n  mm9             - Mouse genome mm9 (UCSC 2020)\n\n... 
and more\n```\n\nDownload a specific reference:\n```bash\nmetagene --download GRCh38\n```\n\nDownload all references (requires ~10GB disk space):\n```bash\nmetagene --download all\n```\n\n\n## CLI Options\n\n```\nUsage: metagene [OPTIONS]\n\n  Run metagene analysis on genomic sites.\n\nOptions:\n  --version                       Show the version and exit.\n  -i, --input PATH                Input file path (BED, GTF, TSV or CSV, etc.)\n  -o, --output PATH               Output file path (TSV, CSV)\n  -s, --output-score PATH         Output file for binned score statistics\n  -p, --output-figure PATH        Output file for metagene plot\n  -r, --reference TEXT            Built-in reference genome to use (e.g.,\n                                  GRCh38, GRCm39)\n  -g, --gtf PATH                  GTF/GFF file path for custom reference\n  --region     Region to analyze (default: all)\n  -b, --bins INTEGER              Number of bins for analysis (default: 100)\n  -H, --with-header               Input file has header line\n  -S, --separator TEXT            Separator for input file (default: tab)\n  -m, --meta-columns TEXT         Input column indices (1-based) for genomic\n                                  coordinates. 
The columns should contain\n                                  Chromosome,Start,End,Strand or\n                                  Chromosome,Site,Strand\n  -w, --weight-columns TEXT       Input column indices (1-based) for\n                                  weight/score values\n  -n, --weight-names TEXT         Names for weight columns\n  --score-transform \n                                  Transform to apply to scores (default: none)\n  --normalize                     Normalize scores by transcript length\n  --list                          List all available built-in references and\n                                  exit\n  --download TEXT                 Download a specific reference (e.g., GRCh38)\n                                  or 'all' for all references\n  -h, --help                      Show this message and exit.\n```\n\n## API Reference (Core Functions)\n\n- `load_sites(file, with_header=False, meta_col_index=[0,1,2])` - Load genomic sites\n- `load_reference(name)` - Load built-in reference genome\n- `load_gtf(file)` - Load custom GTF annotation  \n- `map_to_transcripts(sites, reference)` - Annotate sites with gene information\n- `normalize_positions(annotated_sites, strategy=\"median\")` - Normalize to relative positions\n- `plot_profile(data, gene_splits, output_file)` - Generate metagene plot\n\n\n## Demo\n\n![Metagene Profile](docs/fig_metagene.svg)\n\nThe plot shows the distribution of genomic sites across normalized gene regions:\n- **5'UTR** (0.0 - first split): 5' untranslated region\n- **CDS** (first split - second split): Coding sequence  \n- **3'UTR** (second split - 1.0): 3' untranslated region\n\n## TODO:\n\n- [ ] How to process 100k sites on the human genome in less than 10s?\n- [ ] The core function should be moved into 
[variant](https://github.com/y9c/variant)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fy9c%2Fmetagene","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fy9c%2Fmetagene","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fy9c%2Fmetagene/lists"}