https://github.com/higlass/gene_annotations
- Host: GitHub
- URL: https://github.com/higlass/gene_annotations
- Owner: higlass
- Created: 2020-09-13T02:41:21.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2024-02-22T04:34:21.000Z (over 2 years ago)
- Last Synced: 2025-03-25T19:40:43.002Z (about 1 year ago)
- Language: Python
- Size: 20.5 KB
- Stars: 6
- Watchers: 3
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
## Installation
This repository just contains standalone scripts. Make sure to install requirements before running:
```
pip install -r requirements.txt
```
## Expected format
HiGlass expects the gene annotations file to have the following format:
```
# 1: chr (chr1)
# 2: txStart (52301201) [9]
# 3: txEnd (52317145) [10]
# 4: geneName (ACVRL1) [2]
# 5: citationCount (123) [16]
# 6: strand (+) [8]
# 7: refseqId (NM_000020)
# 8: geneId (94) [1]
# 9: geneType (protein-coding)
# 10: geneDesc (activin A receptor type II-like 1)
# 11: cdsStart (52306258)
# 12: cdsEnd (52314677)
# 13: exonStarts (52301201,52306253,52306882,52307342,52307757,52308222,52309008,52309819,52312768,52314542,)
# 14: exonEnds (52301479,523063
```
This bed-like format then needs to be aggregated using `clodius aggregate bedfile` in order to limit the amount of data displayed at once and to enable searching by gene name.
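As a rough illustration of the column layout above, the following sketch splits one line into named fields. It assumes the file is tab-delimited and that the exon columns are comma-separated lists with a trailing comma, as the example values suggest. `HGBED_FIELDS` and `parse_hgbed_line` are hypothetical names and not part of this repository.

```python
# Hypothetical helper, not part of this repository: names the 14 columns
# listed above and splits one tab-delimited line into a dict.
HGBED_FIELDS = [
    "chr", "txStart", "txEnd", "geneName", "citationCount", "strand",
    "refseqId", "geneId", "geneType", "geneDesc", "cdsStart", "cdsEnd",
    "exonStarts", "exonEnds",
]
INT_FIELDS = {"txStart", "txEnd", "citationCount", "cdsStart", "cdsEnd"}
LIST_FIELDS = {"exonStarts", "exonEnds"}


def parse_hgbed_line(line):
    """Return a dict mapping column names to parsed values."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(HGBED_FIELDS):
        raise ValueError(
            f"expected {len(HGBED_FIELDS)} columns, got {len(values)}"
        )
    record = dict(zip(HGBED_FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    for name in LIST_FIELDS:
        # Drop the empty string left behind by the trailing comma.
        record[name] = [int(x) for x in record[name].split(",") if x]
    return record


# Toy line with made-up values, for illustration only.
example = "\t".join([
    "chr1", "100", "900", "GENE1", "5", "+", "NM_0", "1",
    "protein-coding", "toy gene", "150", "850", "100,500,", "200,900,",
])
record = parse_hgbed_line(example)
print(record["geneName"], record["exonStarts"], record["exonEnds"])
```

A check like this can help catch column-count or delimiter problems before running `clodius aggregate bedfile`.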
## Example 1: From UCSC GTF file
1. Download the UCSC `gtfToGenePred` binary from http://hgdownload.soe.ucsc.edu/admin/exe/
2. Get the GTF and chromsizes files for an assembly and convert the GTF to genePred format (`-N` skips the download when an up-to-date local copy already exists, and `-P .` saves into the current directory):
```
wget -NP . https://hgdownload.soe.ucsc.edu/goldenPath/danRer10/bigZips/genes/danRer10.refGene.gtf.gz
wget -NP . https://hgdownload.soe.ucsc.edu/goldenPath/danRer10/bigZips/danRer10.chrom.sizes
gtfToGenePred -genePredExt -geneNameAsName2 danRer10.refGene.gtf.gz danRer10.refGene.genepred
```
3. Convert to higlass-compatible format:
```
cat danRer10.refGene.genepred | python genepredext_to_hgbed.py | python exonU.py - > danRer10.refGene.hgbed
clodius aggregate bedfile --chromsizes-filename danRer10.chrom.sizes danRer10.refGene.hgbed
```
4. Load the aggregated file in either HiGlass or Resgen with `filetype:beddb` and `datatype:gene-annotations`.
## Example 2: From NCBI GFF
Find the genome information page for sacCer3 at https://www.ncbi.nlm.nih.gov/assembly/GCF_000146045.2/.
Download the gff file by clicking on "Download Assembly" and selecting "Genomic GFF".
Convert to higlass-compatible format using these commands:
```
gzcat GCF_000146045.2_R64_genomic.gff.gz \
| python scripts/gff_to_jsonl.py - \
| python scripts/gjsonl_to_chromsizes.py - > sacCer3.chrom.sizes
gzcat GCF_000146045.2_R64_genomic.gff.gz \
| python scripts/gff_to_jsonl.py - \
| python scripts/gjsonl_to_hgbed.py - > sacCer3.hgbed
clodius aggregate bedfile sacCer3.hgbed \
--delimiter $'\t' \
--chromsizes-filename sacCer3.chrom.sizes
```
The `sacCer3.chrom.sizes` file just contains the names of the chromosomes and their sizes.
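For illustration, a chromsizes file of this kind can be read with a few lines of Python. This sketch assumes one chromosome per line, with the name and size separated by whitespace (typically a tab). `read_chromsizes` is a hypothetical helper name and not one of the scripts in this repository.

```python
def read_chromsizes(lines):
    """Return {chromosome name: size} from an iterable of text lines."""
    sizes = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, size = line.split()[:2]
        sizes[name] = int(size)
    return sizes


# Usage sketch (assumes the file produced by the commands above exists):
#
#   with open("sacCer3.chrom.sizes") as f:
#       sizes = read_chromsizes(f)
```

Reading the file this way is a quick check that every chromosome named in the hgbed file also has a size.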
View in higlass:
```
higlass-manage view sacCer3.hgbed.beddb --datatype gene-annotations
```
Note that this process omits all RNAs and takes the union of all exons in a gene to represent it as if it were just one transcript.
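The exon union mentioned above can be sketched as a standard interval merge. This is a minimal illustration under the assumption that exons are half-open `[start, end)` intervals and that touching intervals are merged. It is not necessarily how the scripts in this repository implement it, and `union_intervals` is a hypothetical name.

```python
def union_intervals(starts, ends):
    """Merge overlapping or touching [start, end) intervals.

    Returns a sorted list of (start, end) tuples covering the same
    positions as the input intervals.
    """
    merged = []
    for start, end in sorted(zip(starts, ends)):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


# Exons from two made-up transcripts of the same gene.
print(union_intervals([10, 15, 40], [20, 30, 50]))
```

With the made-up exons above, the first two intervals overlap and collapse into one, while the third stays separate.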