{"id":15906031,"url":"https://github.com/lucacappelletti94/keras_synthetic_genome_sequence","last_synced_at":"2025-06-15T07:39:16.192Z","repository":{"id":62573979,"uuid":"236455993","full_name":"LucaCappelletti94/keras_synthetic_genome_sequence","owner":"LucaCappelletti94","description":"Python package to lazily generate synthetic genomic sequences for training od Keras models.","archived":false,"fork":false,"pushed_at":"2020-04-12T16:14:44.000Z","size":881,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-05T11:13:29.005Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LucaCappelletti94.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-27T09:26:55.000Z","updated_at":"2020-04-12T16:14:48.000Z","dependencies_parsed_at":"2022-11-03T18:43:52.926Z","dependency_job_id":null,"html_url":"https://github.com/LucaCappelletti94/keras_synthetic_genome_sequence","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LucaCappelletti94/keras_synthetic_genome_sequence","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fkeras_synthetic_genome_sequence","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fkeras_synthetic_genome_sequence/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fkeras_synthetic_genome_sequence/releases","manifests_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fkeras_synthetic_genome_sequence/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LucaCappelletti94","download_url":"https://codeload.github.com/LucaCappelletti94/keras_synthetic_genome_sequence/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LucaCappelletti94%2Fkeras_synthetic_genome_sequence/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259941043,"owners_count":22935291,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-06T13:20:40.713Z","updated_at":"2025-06-15T07:39:16.162Z","avatar_url":"https://github.com/LucaCappelletti94.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"keras_synthetic_genome_sequence\n=========================================================================================\n|travis| |sonar_quality| |sonar_maintainability| |codacy|\n|code_climate_maintainability| |pip| |downloads|\n\nPython package to lazily generate synthetic genomic sequences for training of Keras models.\n\nHow do I install this package?\n----------------------------------------------\nAs usual, just download it using pip:\n\n.. 
code:: shell\n\n    pip install keras_synthetic_genome_sequence\n\nTests Coverage\n----------------------------------------------\nSince different coverage tools sometimes\nreport slightly different results, here are three of them:\n\n|coveralls| |sonar_coverage| |code_climate_coverage|\n\n\nUsage examples\n-------------------------\nTo use GapSequence to train your Keras model, you\nwill need to obtain statistical metrics for the\nbiological gaps you intend to mimic in your synthetic gaps.\n\nTo achieve this, the package offers a utility called\nget_gaps_statistics, which allows you to obtain the\nmean and covariance of gaps in a given genomic assembly.\n\nThe genomic assembly is automatically downloaded from UCSC\nusing `ucsc_genomes_downloader \u003chttps://github.com/LucaCappelletti94/ucsc_genomes_downloader\u003e`__.\nThe gaps it contains are then filtered by the given\nmax_gap_size and extracted, and their windows are\nexpanded to the given window_size. The filter is useful because\nyou might want to limit the gap size to a relatively small\nvalue (gaps can reach tens of thousands\nof nucleotides, for instance in the telomeres).\n\nLet's start by listing all the important parameters:\n\n.. code:: python\n\n    assembly = \"hg19\"\n    window_size = 200\n    batch_size = 128\n\nNow we can start by retrieving the gaps statistics:\n\n.. 
code:: python\n\n    from keras_synthetic_genome_sequence.utils import get_gaps_statistics\n\n    number, mean, covariance = get_gaps_statistics(\n        assembly=assembly,\n        max_gap_size=100,\n        window_size=window_size\n    )\n\n    print(\"I have identified {number} gaps!\".format(number=number))\n\nNow you must choose the ground truth on which to apply the\nsynthetic gaps, for instance the regions without gaps in\nthe genomic assembly hg19, chromosome chr1.\nThese regions will have to be tessellated into smaller\nchunks that match the window_size you have chosen for\nthe gap statistics.\nWe can retrieve these regions as follows:\n\n.. code:: python\n\n    from ucsc_genomes_downloader import Genome\n    from ucsc_genomes_downloader.utils import tessellate_bed\n\n    genome = Genome(assembly, chromosomes=[\"chr1\"])\n    ground_truth = tessellate_bed(genome.filled(), window_size=window_size)\n\nThe resulting pandas DataFrame has a BED-like format\nand looks as follows:\n\n+----+---------+--------------+------------+\n|    | chrom   |   chromStart |   chromEnd |\n+====+=========+==============+============+\n|  0 | chr1    |        10000 |      10200 |\n+----+---------+--------------+------------+\n|  1 | chr1    |        10200 |      10400 |\n+----+---------+--------------+------------+\n|  2 | chr1    |        10400 |      10600 |\n+----+---------+--------------+------------+\n|  3 | chr1    |        10600 |      10800 |\n+----+---------+--------------+------------+\n|  4 | chr1    |        10800 |      11000 |\n+----+---------+--------------+------------+\n\nNow we are ready to actually create the GapSequence:\n\n.. 
code:: python\n\n    from keras_synthetic_genome_sequence import GapSequence\n\n    gap_sequence = GapSequence(\n        assembly=assembly,\n        bed=ground_truth,\n        gaps_mean=mean,\n        gaps_covariance=covariance,\n        batch_size=batch_size\n    )\n\nNow, given a model whose input and output\nshape is (batch_size, window_size, 4),\nwe can train it as follows:\n\n.. code:: python\n\n    # build_my_denoiser is a placeholder for your own model-building function\n    model = build_my_denoiser()\n    model.fit_generator(\n        gap_sequence,\n        steps_per_epoch=gap_sequence.steps_per_epoch,\n        epochs=2,\n        shuffle=True\n    )\n\nHappy denoising!\n\nComparison between biological and synthetic distributions\n----------------------------------------------------------\nThe following images show the biological and synthetic distributions\nof gaps in the hg19, hg38, mm9 and mm10 genomic assemblies, considering\ngaps with length up to 100 nucleotides and a total window size of 1000.\nThe threshold used to convert the multivariate Gaussian distribution to integers\nis 0.4, the default value used within the Python package.\n\n.. image:: https://github.com/LucaCappelletti94/keras_synthetic_genome_sequence/blob/master/distributions/hg19.png?raw=true\n.. image:: https://github.com/LucaCappelletti94/keras_synthetic_genome_sequence/blob/master/distributions/hg38.png?raw=true\n.. image:: https://github.com/LucaCappelletti94/keras_synthetic_genome_sequence/blob/master/distributions/mm9.png?raw=true\n.. image:: https://github.com/LucaCappelletti94/keras_synthetic_genome_sequence/blob/master/distributions/mm10.png?raw=true\n\n\n.. |travis| image:: https://travis-ci.org/LucaCappelletti94/keras_synthetic_genome_sequence.png\n   :target: https://travis-ci.org/LucaCappelletti94/keras_synthetic_genome_sequence\n   :alt: Travis CI build\n\n.. 
|sonar_quality| image:: https://sonarcloud.io/api/project_badges/measure?project=LucaCappelletti94_keras_synthetic_genome_sequence\u0026metric=alert_status\n    :target: https://sonarcloud.io/dashboard/index/LucaCappelletti94_keras_synthetic_genome_sequence\n    :alt: SonarCloud Quality\n\n.. |sonar_maintainability| image:: https://sonarcloud.io/api/project_badges/measure?project=LucaCappelletti94_keras_synthetic_genome_sequence\u0026metric=sqale_rating\n    :target: https://sonarcloud.io/dashboard/index/LucaCappelletti94_keras_synthetic_genome_sequence\n    :alt: SonarCloud Maintainability\n\n.. |sonar_coverage| image:: https://sonarcloud.io/api/project_badges/measure?project=LucaCappelletti94_keras_synthetic_genome_sequence\u0026metric=coverage\n    :target: https://sonarcloud.io/dashboard/index/LucaCappelletti94_keras_synthetic_genome_sequence\n    :alt: SonarCloud Coverage\n\n.. |coveralls| image:: https://coveralls.io/repos/github/LucaCappelletti94/keras_synthetic_genome_sequence/badge.svg?branch=master\n    :target: https://coveralls.io/github/LucaCappelletti94/keras_synthetic_genome_sequence?branch=master\n    :alt: Coveralls Coverage\n\n.. |pip| image:: https://badge.fury.io/py/keras-synthetic-genome-sequence.svg\n    :target: https://badge.fury.io/py/keras-synthetic-genome-sequence\n    :alt: Pypi project\n\n.. |downloads| image:: https://pepy.tech/badge/keras-synthetic-genome-sequence\n    :target: https://pepy.tech/badge/keras-synthetic-genome-sequence\n    :alt: Pypi total project downloads\n\n.. |codacy| image:: https://api.codacy.com/project/badge/Grade/7f2c4e2947834c05b5a869a9445482d0\n    :target: https://www.codacy.com/manual/LucaCappelletti94/keras_synthetic_genome_sequence?utm_source=github.com\u0026amp;utm_medium=referral\u0026amp;utm_content=LucaCappelletti94/keras_synthetic_genome_sequence\u0026amp;utm_campaign=Badge_Grade\n    :alt: Codacy Maintainability\n\n.. 
|code_climate_maintainability| image:: https://api.codeclimate.com/v1/badges/b89f6bd0ddc58cc93e89/maintainability\n    :target: https://codeclimate.com/github/LucaCappelletti94/keras_synthetic_genome_sequence/maintainability\n    :alt: Maintainability\n\n.. |code_climate_coverage| image:: https://api.codeclimate.com/v1/badges/b89f6bd0ddc58cc93e89/test_coverage\n    :target: https://codeclimate.com/github/LucaCappelletti94/keras_synthetic_genome_sequence/test_coverage\n    :alt: Code Climate Coverage\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucacappelletti94%2Fkeras_synthetic_genome_sequence","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucacappelletti94%2Fkeras_synthetic_genome_sequence","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucacappelletti94%2Fkeras_synthetic_genome_sequence/lists"}