{"id":18712628,"url":"https://github.com/tariqdaouda/pygeno","last_synced_at":"2025-10-09T20:37:13.988Z","repository":{"id":41281036,"uuid":"24479570","full_name":"tariqdaouda/pyGeno","owner":"tariqdaouda","description":"Personalized Genomics and Proteomics. Main diet: Ensembl, side dishes: SNPs","archived":false,"fork":false,"pushed_at":"2025-10-07T02:52:00.000Z","size":11092,"stargazers_count":325,"open_issues_count":14,"forks_count":49,"subscribers_count":23,"default_branch":"bloody","last_synced_at":"2025-10-07T04:23:27.594Z","etag":null,"topics":["bed","bioinformatics","biology","cancer","cancer-genomes","cancer-genomics","csv-parser","ensembl","genome","genome-annotation","genome-browser","genome-sequencing","genomes","genomics","gtf","medical","medicine","proteomics","snps","vcf"],"latest_commit_sha":null,"homepage":"http://pygeno.iric.ca","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tariqdaouda.png","metadata":{"files":{"readme":"README.rst","changelog":"CHANGELOG.rst","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2014-09-26T00:04:58.000Z","updated_at":"2025-08-17T22:33:33.000Z","dependencies_parsed_at":"2022-08-10T01:43:30.302Z","dependency_job_id":"2aef060b-f862-4cf3-870d-f6201b83e538","html_url":"https://github.com/tariqdaouda/pyGeno","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/tariqdaouda/pyGeno","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tariqdaouda%2FpyGeno","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tariqdaouda%2FpyGeno/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/re
positories/tariqdaouda%2FpyGeno/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tariqdaouda%2FpyGeno/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tariqdaouda","download_url":"https://codeload.github.com/tariqdaouda/pyGeno/tar.gz/refs/heads/bloody","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tariqdaouda%2FpyGeno/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001978,"owners_count":26083259,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bed","bioinformatics","biology","cancer","cancer-genomes","cancer-genomics","csv-parser","ensembl","genome","genome-annotation","genome-browser","genome-sequencing","genomes","genomics","gtf","medical","medicine","proteomics","snps","vcf"],"created_at":"2024-11-07T12:43:14.015Z","updated_at":"2025-10-09T20:37:13.964Z","avatar_url":"https://github.com/tariqdaouda.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"CODE FREEZE:\n============\n\nPyGeno has long been limited by its backend. We are now ready to take it to the next level.\n\nWe are working on a major port of pyGeno to the open-source multi-model database ArangoDB. 
PyGeno's code on both the master and bloody branches is frozen until we are finished. No pull request will be merged until then, and we won't implement any new features.\n\npyGeno: A Python package for precision medicine and proteogenomics\n==================================================================\n\n.. image:: http://depsy.org/api/package/pypi/pyGeno/badge.svg\n   :alt: depsy\n   :target: http://depsy.org/package/python/pyGeno\n\n.. image:: https://pepy.tech/badge/pygeno\n   :alt: downloads\n   :target: https://pepy.tech/project/pygeno\n\n.. image:: https://pepy.tech/badge/pygeno/month\n   :alt: downloads_month\n   :target: https://pepy.tech/project/pygeno/month\n\n.. image:: https://pepy.tech/badge/pygeno/week\n   :alt: downloads_week\n   :target: https://pepy.tech/project/pygeno/week\n\n.. image:: http://bioinfo.iric.ca/~daoudat/pyGeno/_static/logo.png\n   :alt: pyGeno's logo\n   \n\npyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.\n\npyGeno is developed by `Tariq Daouda`_ at the *Institute for Research in Immunology and Cancer* (IRIC_). Its logo is the work of the freelance designer `Sawssan Kaddoura`_.\nFor the latest news about pyGeno, you can follow me on Twitter `@tariqdaouda`_.\n\n.. _Tariq Daouda: http://wwww.tariqdaouda.com\n.. _IRIC: http://www.iric.ca\n.. _Sawssan Kaddoura: http://sawssankaddoura.com\n\nClick here for the `full documentation`_.\n\n.. _full documentation: http://pygeno.iric.ca/\n\n.. _@tariqdaouda: https://www.twitter.com/tariqdaouda\n\nCiting pyGeno:\n--------------\nPlease cite this paper_.\n\n.. _paper: http://f1000research.com/articles/5-381/v1\n\nInstallation:\n-------------\n\nIt is recommended to install pyGeno within a `virtual environment <http://virtualenv.readthedocs.org/>`_. To set one up, you can use:\n\n.. 
code:: shell\n\n        virtualenv ~/.pyGenoEnv\n        source ~/.pyGenoEnv/bin/activate\n\npyGeno can be installed through pip:\n\n.. code:: shell\n\t\n\tpip install pyGeno #for the latest stable version\n\nOr from GitHub, for the latest developments:\n\n.. code:: shell\n\n\tgit clone https://github.com/tariqdaouda/pyGeno.git\n\tcd pyGeno\n\tpython setup.py develop\n\n.. _`virtual environement`: http://virtualenv.readthedocs.org/\n\nA brief introduction\n--------------------\n\npyGeno is a personal bioinformatics database that runs directly in Python, on your laptop, and does not depend\nupon any REST API. pyGeno is here to make extracting data such as gene sequences a breeze, and is designed to\ncope with huge queries. The most exciting feature of pyGeno is that it lets you work seamlessly with both reference and **Personalized Genomes**.\n\nPersonalized Genomes are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter.\npyGeno takes care of applying the filter and inserting the polymorphisms in the right places, so you get\ndirect access to the DNA and protein sequences of your patients.\n\n.. 
code:: python\n\n\tfrom pyGeno.Genome import *\n\t\n\tg = Genome(name = \"GRCh37.75\")\n\tprot = g.get(Protein, id = 'ENSP00000438917')[0]\n\t#print the protein sequence\n\tprint prot.sequence\n\t#print the protein's gene biotype\n\tprint prot.gene.biotype\n\t#print the protein's transcript sequence\n\tprint prot.transcript.sequence\n\t\n\t#fancy queries\n\tfor exon in g.get(Exon, {\"CDS_start \u003e\": x1, \"CDS_end \u003c=\" : x2, \"chromosome.number\" : \"22\"}) :\n\t\t#print the exon's coding sequence\n\t\tprint exon.CDS\n\t\t#print the exon's transcript sequence\n\t\tprint exon.transcript.sequence\n\t\n\t#You can do the same for your subject-specific genomes\n\t#by combining a reference genome with polymorphisms\n\tg = Genome(name = \"GRCh37.75\", SNPs = [\"STY21_RNA\"], SNPFilter = MyFilter())\n\nAnd if you ever get lost, there's a built-in **help()** function for each object type:\n\n.. code:: python\n\n\tfrom pyGeno.Genome import *\n\t\n\tprint Exon.help()\n\nThis should output:\n\n.. code::\n\t\n\tAvailable fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand\n\n\t\nCreating a Personalized Genome:\n-------------------------------\nPersonalized Genomes are a powerful feature that allows you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.\n\n.. 
code:: python\n  \n  from pyGeno.Genome import Genome\n  #the name of the snp set is defined inside the datawrap's manifest.ini file\n  dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')\n  #you can also define a filter (ex: a quality filter) for the SNPs\n  dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())\n  #and even mix several snp sets  \n  dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())\n\nFiltering SNPs:\n---------------\npyGeno lets you select the polymorphisms that end up in the final sequences. It supports SNPs, insertions and deletions.\n\n.. code:: python\n\t\n\tfrom pyGeno.SNPFiltering import SNPFilter, SequenceSNP\n\n\tclass QMax_gt_filter(SNPFilter) :\n\t\t\n\t\tdef __init__(self, threshold) :\n\t\t\tself.threshold = threshold\n\t\t\n\t\t#Here SNPs is a dictionary: SNPSet Name =\u003e polymorphism  \n\t\t#This filter ignores deletions and insertions\n\t\t#but applies all SNPs\n\t\tdef filter(self, chromosome, **SNPs) :\n\t\t\tsources = {}\n\t\t\talleles = []\n\t\t\tfor snpSet, snp in SNPs.iteritems() :\n\t\t\t\tpos = snp.start\n\t\t\t\tif snp.alt[0] == '-' :\n\t\t\t\t\tpass\n\t\t\t\telif snp.ref[0] == '-' :\n\t\t\t\t\tpass\n\t\t\t\telse :\n\t\t\t\t\tsources[snpSet] = snp\n\t\t\t\t\talleles.append(snp.alt) #if not an indel append the polymorphism\n\t\t\t\t\n\t\t\t#appends the reference allele to the lot\n\t\t\trefAllele = chromosome.refSequence[pos]\n\t\t\talleles.append(refAllele)\n\t\t\tsources['ref'] = refAllele\n\t\n\t\t\t#optionally, we keep a record of the polymorphisms that were used during the process\n\t\t\treturn SequenceSNP(alleles, sources = sources)\n\t\t\nThe filter function can also be made more specific by using arguments that have the same names as the SNPSets:\n\n.. 
code:: python\n\n\tdef filter(self, chromosome, dummySRY = None) :\n\t\tif dummySRY.Qmax_gt \u003e self.threshold :\n\t\t\t#other possibilities of return are SequenceInsert(\u003cbases\u003e), SequenceDelete(\u003clength\u003e)\n\t\t\treturn SequenceSNP(dummySRY.alt)\n\t\treturn None #None means keep the reference allele\n\nTo apply the filter, simply specify it when loading the genome.\n\n.. code:: python\n\n\tpersGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))\n\nTo include several SNPSets, use a list.\n\n.. code:: python\n\n\tpersGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())\n\nGetting an arbitrary sequence:\n------------------------------\nYou can ask for any sequence of any chromosome:\n\n.. code:: python\n\t\n\tchr12 = myGenome.get(Chromosome, number = \"12\")[0]\n\tprint chr12.sequence[x1:x2]\n\t# for the reference sequence\n\tprint chr12.refSequence[x1:x2]\n\nBatteries included (bootstrapping):\n-----------------------------------\n\npyGeno's database is populated by importing datawraps.\npyGeno comes with a few datawraps. To get the list, you can use:\n\n.. code:: python\n\t\n\timport pyGeno.bootstrap as B\n\tB.printDatawraps()\n\n.. code::\n\n\tAvailable datawraps for boostraping\n\t\n\tSNPs\n\t~~~~|\n\t    |~~~:\u003e Human_agnostic.dummySRY.tar.gz\n\t    |~~~:\u003e Human.dummySRY_casava.tar.gz\n\t    |~~~:\u003e dbSNP142_human_common_all.tar.gz\n\t\n\t\n\tGenomes\n\t~~~~~~~|\n\t       |~~~:\u003e Human.GRCh37.75.tar.gz\n\t       |~~~:\u003e Human.GRCh37.75_Y-Only.tar.gz\n\t       |~~~:\u003e Human.GRCh38.78.tar.gz\n\t       |~~~:\u003e Mouse.GRCm38.78.tar.gz\n\nTo get a list of remote datawraps that pyGeno can download for you, do:\n\n.. code:: python\n\n\tB.printRemoteDatawraps()\n\nImporting whole genomes is a demanding process that takes more than an hour and requires (according to tests) \nat least 3GB of memory. 
Depending on your configuration, more might be required.\n\nThat being said, importing a datawrap is a one-time operation, and once the import is complete the datawrap\ncan be discarded without consequences.\n\nThe bootstrap module also has some handy functions for importing built-in packages.\n\nSome of them are just for playing around with pyGeno (**Fast importation** and **Small memory requirements**):\n\n.. code:: python\n\t\n\timport pyGeno.bootstrap as B\n\n\t#Imports only the Y chromosome from the human reference genome GRCh37.75\n\t#Very fast, requires even less memory. No download required.\n\tB.importGenome(\"Human.GRCh37.75_Y-Only.tar.gz\")\n\t\n\t#A dummy datawrap for humans SNPs and Indels in pyGeno's AgnosticSNP  format. \n\t# This one has one SNP at the beginning of the gene SRY\n\tB.importSNPs(\"Human.dummySRY_casava.tar.gz\")\n\nAnd for more **Serious Work**, the whole reference genome.\n\n.. code:: python\n\n\t#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.\n\tB.importGenome(\"Human.GRCh38.78.tar.gz\")\n\t\nImporting a custom datawrap:\n----------------------------\n\n.. code:: python\n\n  from pyGeno.importation.Genomes import *\n  importGenome('GRCh37.75.tar.gz')\n\nTo import a patient's specific polymorphisms:\n\n.. code:: python\n\n  from pyGeno.importation.SNPs import *\n  importSNPs('patient1.tar.gz')\n\nFor a list of datawraps available for download, please have a look here_.\n\nYou can easily make your own datawraps with any tar.gz compressor.\nFor more details on how datawraps are made, you can check the wiki_ or have a look inside the folder bootstrap_data.\n\n.. _here: http://pygeno.iric.ca/datawraps.html\n.. _wiki: https://github.com/tariqdaouda/pyGeno/wiki/How-to-create-a-pyGeno-friendly-package-to-import-your-data%3F\n\nInstantiating a genome:\n-----------------------\n.. 
code:: python\n\t\n\tfrom pyGeno.Genome import Genome\n\t#the name of the genome is defined inside the package's manifest.ini file\n\tref = Genome(name = 'GRCh37.75')\n\nPrinting all the proteins of a gene:\n------------------------------------\n.. code:: python\n\n  from pyGeno.Genome import Genome\n  from pyGeno.Gene import Gene\n  from pyGeno.Protein import Protein\n\nOr simply:\n\n.. code:: python\n\n  from pyGeno.Genome import *\n\nthen:\n\n.. code:: python\n\n  ref = Genome(name = 'GRCh37.75')\n  #get returns a list of elements\n  gene = ref.get(Gene, name = 'TPST2')[0]\n  for prot in gene.get(Protein) :\n  \tprint prot.sequence\n\nMaking queries, get() vs iterGet():\n-----------------------------------\niterGet is a faster version of get that returns an iterator instead of a list.\n\nMaking queries, syntax:\n-----------------------\npyGeno's get function relies on the expressive query syntax of rabaDB.\n\nThese are all possible query formats:\n\n.. code:: python\n\n  ref.get(Gene, name = \"SRY\")\n  ref.get(Gene, { \"name like\" : \"HLA\"})\n  chr12.get(Exon, { \"start \u003e=\" : 12000, \"end \u003c\" : 12300 })\n  ref.get(Transcript, { \"gene.name\" : 'SRY' })\n\nCreating indexes to speed up queries:\n-------------------------------------\n.. code:: python\n\n  from pyGeno.Gene import Gene\n  #creating an index on gene names if it does not already exist\n  Gene.ensureGlobalIndex('name')\n  #removing the index\n  Gene.dropIndex('name')\n\nFind in sequences:\n------------------\n\nInternally, pyGeno uses a binary representation for nucleotides and amino acids to deal with polymorphisms. \nFor example, both \"AGC\" and \"ATG\" will match the following sequence \"...AT/GCCG...\".\n\n.. 
code:: python\n\n\t#returns the position of the first occurrence\n\ttranscript.find(\"AT/GCCG\")\n\t#returns the positions of all occurrences\n\ttranscript.findAll(\"AT/GCCG\")\n\t\n\t#similarly, you can also do\n\ttranscript.findIncDNA(\"AT/GCCG\")\n\ttranscript.findAllIncDNA(\"AT/GCCG\")\n\ttranscript.findInUTR3(\"AT/GCCG\")\n\ttranscript.findAllInUTR3(\"AT/GCCG\")\n\ttranscript.findInUTR5(\"AT/GCCG\")\n\ttranscript.findAllInUTR5(\"AT/GCCG\")\n\t\n\t#same for proteins\n\tprotein.find(\"DEV/RDEM\")\n\tprotein.findAll(\"DEV/RDEM\")\n\t\n\t#and for exons\n\texon.find(\"AT/GCCG\")\n\texon.findAll(\"AT/GCCG\")\n\texon.findInCDS(\"AT/GCCG\")\n\texon.findAllInCDS(\"AT/GCCG\")\n\t#...\n\n\t\nProgress Bar:\n-------------\n.. code:: python\n\n  from pyGeno.tools.ProgressBar import ProgressBar\n  pg = ProgressBar(nbEpochs = 155)\n  for i in range(155) :\n  \tpg.update(label = '%d' %i) # or simply pg.update() \n  pg.close()\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftariqdaouda%2Fpygeno","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftariqdaouda%2Fpygeno","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftariqdaouda%2Fpygeno/lists"}