{"id":16618544,"url":"https://github.com/brentp/genoiser","last_synced_at":"2025-06-12T21:06:11.642Z","repository":{"id":66472180,"uuid":"131044171","full_name":"brentp/genoiser","owner":"brentp","description":"use the noise","archived":false,"fork":false,"pushed_at":"2020-04-15T20:06:44.000Z","size":2039,"stargazers_count":15,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-14T04:12:08.106Z","etag":null,"topics":["bioinformatics","genomics","high-throughput-sequencing","nim","nim-lang"],"latest_commit_sha":null,"homepage":"","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brentp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-04-25T17:54:57.000Z","updated_at":"2021-11-26T08:08:03.000Z","dependencies_parsed_at":"2023-04-13T11:38:19.589Z","dependency_job_id":null,"html_url":"https://github.com/brentp/genoiser","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/brentp/genoiser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fgenoiser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fgenoiser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fgenoiser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brentp%2Fgenoiser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brentp","download_url":"https://codeload.github.com/brentp/genoiser/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/brentp%2Fgenoiser/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259529883,"owners_count":22872091,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genomics","high-throughput-sequencing","nim","nim-lang"],"created_at":"2024-10-12T02:20:37.456Z","updated_at":"2025-06-12T21:06:11.607Z","avatar_url":"https://github.com/brentp.png","language":"Nim","funding_links":[],"categories":[],"sub_categories":[],"readme":"#### genoiser: the noise is the signal\n\ngiven a lot of alignment files, `genoiser` helps to find the areas of the genome that have noise signals\nthat result in bad variant (small, SV, ME) calls. It finds per-sample noise and then aggregates across\nsamples and reports the number of samples at each base in the genome that had a given noise signal. \nThese regions can be used as black-list regions instead of or in addition to LCRs.\n\n[mosdepth](https://github.com/brentp/mosdepth) uses chromosome-sized arrays of\nint32's to track sequencing depth. This is [fast and flexible](https://brentp.github.io/post/arrays/).\n\ngiven `mosdepth` as a special-case for *depth*, `genoiser` is a general case for user-defined functions.\n`mosdepth` could be implemented with `genoiser`.\n\nAn added benefit is the reduction of memory; `mosdepth` allocates an int32 array the size of each\nchromosome--meaning about 1GB of memory for chromosome 1. `genoiser` can use smaller-sized chunks to\ntile across each chromosome. 
This is important because it uses 1 array for each user-defined function.\nIt defaults to 8 megabase chunks as that is the smallest size with no noticeable effect on performance\nin our tests. Chunk sizes down to 100KB have only a minor effect on performance.\n\nThe idea of `genoiser` is that it handles all accounting; a user\nsimply defines a [nim](https://nim-lang.org) function that takes an alignment and then\nindicates which genomic positions to increment. For example, to calculate depth, this user\nfunction would increment from start to end:\n\n```Nim\nproc depthfun*(aln:Record, posns:var seq[mrange]) =\n  ## depthfun is an example of a `fun` that can be sent to `genoiser`.\n  ## it increments from aln.start to aln.stop of passing reads.\n  var f = aln.flag\n  if f.unmapped or f.secondary or f.qcfail or f.dup: return\n  posns.add((aln.start, aln.stop, 1))\n```\n\nThe `posns` value is sent to the function by `genoiser` and the user-defined function\ncan add to it as many elements as desired. In this case it increments from `aln.start`\nto `aln.stop` by `1`. It can increment by any integer value.\n\nThe user could also choose to increment any soft- or hard-clip location:\n\n```Nim\nproc softfun*(aln:Record, posns:var seq[mrange]) =\n  ## softfun is an example of a `fun` that can be sent to `genoiser`.\n  ## it marks positions where there are soft or hard clips\n  var f = aln.flag\n  if f.unmapped or f.secondary or f.supplementary or f.qcfail or f.dup: return\n  var cig = aln.cigar\n  if cig.len == 1: return\n  var pos = aln.start\n\n  for op in cig:\n    if op.op == CigarOp.soft_clip or op.op == CigarOp.hard_clip:\n      # for this function, we want the exact break-points, not the span of the event,\n      # so we increment the position and the one that follows it.\n      posns.add((pos, pos+1, 1))\n    if op.consumes.reference:\n      pos += op.len\n```\n\n### Utility\n\nThis library provides the machinery. 
Other command-line tools will use this for more obviously useful things.\n\n\n## Speed\n\nfor maximum speed, compile with `nim c -d:release --passC:-flto --passL:-s --gc:markAndSweep src/genoiser.nim`\n\n## CLI\n\nThe command-line interface allows running pre-specified noise filters in 2 steps. The first step calculates the \"noise\" in each sample.\n\n```\ngenoiser per-sample --fasta $reference results/$sample /path/to/$sample.bam # or cram\n```\n\nThis will use a single thread and it will take about 1 hour for 30X bams and a bit more for crams. It will output 5 files per sample\nat the specified prefix, in this case `results/$sample`.\n\nAfter all samples have been run, the user can `aggregate` the signal across samples. The recommended commands are:\n\n```\nchroms=\"$(seq 1 22) X Y\"\n\n########\n## soft\n########\n# count number of samples at each site where more than 1 and less than 15% of reads were soft-clipped.\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value / depth) \u003c 0.15 \u0026 (value \u003e 1)' results/*.genoiser.{}.soft.bed \u003e soft.{}.bed\"\n\n# count number of samples at each site where more than 2 reads were soft-clipped.\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value \u003e 2)' results/*.genoiser.{}.soft.bed \u003e high-soft.{}.bed\"\n\n\n###########\n## mq0\n###########\n\n# count number of samples at each site where more than 2 reads had MQ0.\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value \u003e 2)' results/*.genoiser.{}.mq0.bed \u003e mq0.{}.bed\"\n\n\n############\n## weird\n############\n# count number of samples at each site where more than 1 and less than 15% of reads were weird.\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value / depth) \u003c 0.15 \u0026 (value \u003e 1)' results/*.genoiser.{}.weird.bed \u003e weird.{}.bed\"\n\necho $chroms | tr ' 
' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value \u003e 2)' results/*.genoiser.{}.weird.bed \u003e high-weird.{}.bed\"\n\n############\n## mismatches\n############\n\n# count number of samples at each site where more than 10 reads with 4 or more mismatches overlapped.\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e 'value \u003e 10' results/*.genoiser.{}.mismatches.bed \u003e mismatches.{}.bed\"\n\n\n##################\n# interchromosomal\n##################\n\n# count number of samples at each site where more than 1 and less than 15% of reads were interchromosomal.\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value / depth) \u003c 0.15 \u0026 (value \u003e 1)' results/*.genoiser.{}.interchromosomal.bed \u003e interchromosomal.{}.bed\"\n\necho $chroms | tr ' ' '\\n' \\\n    | gargs -d -v -p 5 \"genoiser aggregate -t 10 -e '(value \u003e 2)' results/*.genoiser.{}.interchromosomal.bed \u003e high-interchromosomal.{}.bed\"\n\n```\nwhere `gargs` is available as a static binary from [here](https://github.com/brentp/gargs/releases).\n\nThis will output 1 file per chromosome, per metric. The resulting files are the desired output.\n\nNote that this will use `5*10` threads. Adjust the `-p` and `-t` values to match the CPUs you have available.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrentp%2Fgenoiser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrentp%2Fgenoiser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrentp%2Fgenoiser/lists"}