{"id":20386986,"url":"https://github.com/cmdcolin/pairwise_indexed_paf","last_synced_at":"2026-04-23T03:31:21.854Z","repository":{"id":191537659,"uuid":"684862745","full_name":"cmdcolin/pairwise_indexed_paf","owner":"cmdcolin","description":"An experimental demo of \"pairwise indexing\" the PAF format using tabix","archived":false,"fork":false,"pushed_at":"2023-08-30T04:14:25.000Z","size":4,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-04T23:30:43.237Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cmdcolin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-08-30T02:18:44.000Z","updated_at":"2023-08-30T05:09:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"9bf1910b-abd9-4719-a746-4d024407e9d3","html_url":"https://github.com/cmdcolin/pairwise_indexed_paf","commit_stats":null,"previous_names":["cmdcolin/pairwise_indexed_paf"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cmdcolin/pairwise_indexed_paf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fpairwise_indexed_paf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fpairwise_indexed_paf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fpairwise_indexed_paf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fpairwise_indexed_paf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/cmdcolin","download_url":"https://codeload.github.com/cmdcolin/pairwise_indexed_paf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Fpairwise_indexed_paf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32164855,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-23T02:19:40.750Z","status":"ssl_error","status_checked_at":"2026-04-23T02:17:55.737Z","response_time":53,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T02:42:00.994Z","updated_at":"2026-04-23T03:31:21.828Z","avatar_url":"https://github.com/cmdcolin.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pairwise indexed PAF proposal\n\n## Prepare data\n\n```bash\ngit clone git@github.com:cmdcolin/pairwise_indexed_paf\ncd pairwise_indexed_paf\n./process.sh in.paf \u003e out.paf\nbgzip out.paf\ntabix -s1 -b3 -e4 out.paf.gz\n```\n\n### Basic concept\n\nPAF files are difficult to \"index\" by traditional tools like genome browsers\nwhich like to load a subset of the data. 
Whole genome alignments in PAF format\nbetween eukaryotic genomes are frequently hundreds of megabytes with CIGAR\nstring data included, and even with gzipping, the genome browser has to\nuncompress it in memory to process it.\n\nIn my view, a genome browser should be able to \"query the PAF in both\ndirections\". Therefore, if a user is browsing e.g. BRCA1 on the human genome,\nthey should be able to load a small amount of the PAF file to find the matching\nposition on mouse. Similarly, if they are on mouse, they should be able to go in\nthe other direction. Even though a PAF has specific notions of \"query\" and\n\"target\", I still want to be able to navigate in both \"directions\".\n\nThis repository proposes two processes to aid in subsetting large whole-genome\nalignments using simple tabix tools.\n\n### Strategy 1. Create two copies of the PAF data in a single file, with separate \"tabix name spaces\"\n\n1. Create a copy of the PAF with the letter 'q' prepended to all lines\n2. Create another copy of the PAF, with the query and target swapped (e.g.\n   columns 6-9 become columns 1-4 and vice versa), with the letter 't'\n   prepended to all lines\n3. Concatenate the outputs of steps 1 and 2 into a single file\n4. Sort by columns 1 and 3, then tabix index\n\nThis creates a single file where a user can query in either direction. They will\nknow which \"direction\" they are querying, so they can prepend the letter q or t to\nthe refName they are querying.\n\n### Strategy 2. Create an \"overview\" file, with a reduction in the granularity of the CIGAR string (not implemented here yet)\n\nIf we are trying to look at a \"whole genome overview dotplot\" for example, the\nindex will not help us (the index primarily helps load a small amount of data when\nviewing a particular region) because we have to load the entire dataset anyway. But we\ncan create a \"reduced\" version of the PAF that essentially deletes single\nbasepair indels from the CIGAR string, retaining some of the larger features.\n\nBut we cannot just delete from the CIGAR string and expect the coordinates to\nstill match up. This is why one strategy I am currently considering is to split\nfeatures when there is a \"large enough\" CIGAR feature (a large 100kb insertion or\ndeletion, for example), and then delete the CIGAR string entirely from all\nfeatures. You could try to retain the CIGAR, but you may then be misreporting the exact\nper-base location of certain events, which is risky in terms of data accuracy.\n\n### Footnote\n\nPAF is a strictly pairwise format; however, doing a similar thing with MAF may also\nbe desirable. It might be that putting all the re-ordered MAF data in a single\ne.g. tabix file may be an overload, but splitting it into N files, one for each element\nof the multiple alignment, may be reasonable.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdcolin%2Fpairwise_indexed_paf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcmdcolin%2Fpairwise_indexed_paf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdcolin%2Fpairwise_indexed_paf/lists"}