{"id":24667617,"url":"https://github.com/torkamanilab/genome-tiling","last_synced_at":"2025-03-21T12:16:54.640Z","repository":{"id":236679363,"uuid":"792941402","full_name":"TorkamaniLab/Genome-Tiling","owner":"TorkamaniLab","description":"A Set of R Tools for Splitting up Chromosomes into Regions of High and Low Recombination","archived":false,"fork":false,"pushed_at":"2024-04-28T01:49:31.000Z","size":41532,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-01-26T08:17:52.733Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TorkamaniLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-28T01:35:15.000Z","updated_at":"2024-04-28T03:23:04.000Z","dependencies_parsed_at":"2024-04-28T07:06:01.866Z","dependency_job_id":null,"html_url":"https://github.com/TorkamaniLab/Genome-Tiling","commit_stats":null,"previous_names":["torkamanilab/genome-tiling"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TorkamaniLab%2FGenome-Tiling","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TorkamaniLab%2FGenome-Tiling/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TorkamaniLab%2FGenome-Tiling/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TorkamaniLab%2FGenome-Tiling/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/TorkamaniLab","download_url":"https://codeload.github.com/TorkamaniLab/Genome-Tiling/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244795520,"owners_count":20511521,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-26T08:17:58.091Z","updated_at":"2025-03-21T12:16:54.608Z","avatar_url":"https://github.com/TorkamaniLab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Genome Tiling Using Box Scores (2024-04-27)\n\nDeveloped by Doug Evans, Torkamani Lab, Scripps Research\n\nThe files contained in this archive provide a demonstration of genome tiling\nand are pre-coded for chromosome 9 from the 1000G consortium (VCF not included).\n\nWarning: This code is not well documented or verified. Users are advised to read through\nand understand this code before attempting to integrate it into serious applications.\n\nReferences to M and V generally refer to mountains (or peaks) and valleys. 
\nMountains are regions of the genome with high correlation (low recombination) and high boxscores, while \nvalleys are regions of the genome with low correlation (high recombination) and low boxscores.\n\n### Files included in the repo\n- README.ipynb\\\n  This file\n  \n- BoxScore.Version.2.0.0\\\n  Program for computing boxscores; outputs to a file\n\n- chr9.corboxscores.corcut.p45.4000.500.txt\\\n  A real example of a boxscore file, produced by BoxScore.Version.2.0.0 for 1000G chr9\n  \n- FindMVMs.Version.2.0.0\\\n  Program for splitting boxscores based on regions of low and high correlation\\\n  Outputs a file of genomic positions for peaks and valleys of boxscores\\\n  Also outputs a file of genomic positions in an attempt to break up large peaks\n\n- chr9.p45.mnts.valleys.tsv\\\n  A real example of a mountains and valleys file\n  \n- chr9.p45.4500.mnts.minima.tsv\\\n  A real example of a file with additional break points (minima) for some cases where mountains are too big\n  \n- All.chr9.v5b.linenum.AF.txt\\\n  A file derived from the chr9 1000G VCF (see the BoxScore file for details on how to create this file)\n\n- ChIPseq_Peaks.YFP_HumanPRDM9.antiGFP.protocolN.p10e-5.sep250.Annotated.txt\\\n  A file containing a list of putative PRDM9 binding sites. 
This is for reference only and is not used for tiling.\n\n- ALL.chr9.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz\\\n  ALL.chr9.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz.tbi\\\n  These are the VCF files that the code, as written, uses as examples, but they are not included in the repo\\\n  They are available at: http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3\n\n### The intention of this code\n\nProbably the easiest way to understand the code and files in this repo is to understand the application for which they were intended.\n\nThe Torkamani lab wished to develop a tool for performing reference-free genomic imputation based on deep learning models we would develop.\nIf successful, deep learning models would impute missing genotypes based on subregions (tiles) of the genome, quickly, cheaply, and\nmore accurately than existing imputation tools.\n\nThe reason for creating tiles was to attempt to identify and isolate regions of the genome where recombination rates were low, and to train\na model for each highly correlated region (or tile). Also, we needed small tiles because the deep learning algorithm was constrained by the \nnumber of variants that could be accommodated comfortably in GPU VRAM. The goal was to try to split the genome into regions of about 4500 \nvariants per tile.\n\nTo achieve this, we devised a scoring method (boxscores), where we computed a correlation matrix of common variants (MAF \u003e .05)\nand counted the number of upstream and downstream elements (cells in the matrix) whose correlation exceeded a threshold of 0.45 (absolute value).\nThe size of the box determines how far upstream and downstream of a pair of genomic loci we need to compute the correlation matrix. The score\nfor each locus is the number of cells in that matrix with correlations above 0.45. If no loci upstream or downstream correlate with a locus,\nthe boxscore is near zero, and this is referred to as a valley. 
The opposite is true for highly correlated variants upstream and downstream\nof a pair of variant loci, which have high boxscores.\n\nThe easiest way to envision this is as a box sliding along the top of the diagonal of a larger thresholded correlation matrix, summing (integrating) the cells it covers.\nThe catch is that we cannot process a whole VCF at once, so we break the matrix up into smaller windows referred to as blocks. Each block is a large correlation matrix and\nshould be much bigger than the boxsize. Setting blocks too large, however, increases memory needs and adds unnecessary computational overhead.\n\nFinally, box scores cannot be computed fully at the beginning and end of the chromosome, because the box crowds the boundary. So a number\nof variants (boxsize) are not computed at the upper and lower boundaries of the chromosome and are padded with a preset value (-1) in the scores file\nfor easy detection by downstream tools.\n\nOnce a box score file is created by the BoxScore program, the FindMVMs program can be used to discover valleys, which are then used to split up the chromosome into blocks.\nThe first step is simply to look for very low boxscore values and tag them as valleys. Valleys and mountains can then be annotated, along with the\nvariant count in each mountain and valley. \n\nUnfortunately, we were not always successful in getting tiles small enough to fit into GPU VRAM, so a second bit\nof code is used to split large mountains up further by binning and finding local minima. If no local minima of suitable size can be found, then\ndownstream applications will be forced to break up the remaining large mountains manually into smaller tiles.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftorkamanilab%2Fgenome-tiling","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftorkamanilab%2Fgenome-tiling","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftorkamanilab%2Fgenome-tiling/lists"}