{"id":20386918,"url":"https://github.com/cmdcolin/oddgenes","last_synced_at":"2025-04-04T08:08:16.237Z","repository":{"id":39583907,"uuid":"85263957","full_name":"cmdcolin/oddgenes","owner":"cmdcolin","description":"A small database of weird gene annotations","archived":false,"fork":false,"pushed_at":"2025-01-24T19:22:57.000Z","size":1496,"stargazers_count":202,"open_issues_count":0,"forks_count":12,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-03-28T07:08:08.422Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Raku","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cmdcolin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-03-17T02:50:12.000Z","updated_at":"2025-02-17T11:47:09.000Z","dependencies_parsed_at":"2023-02-12T20:45:49.607Z","dependency_job_id":"3f27edc7-7169-4566-b5ec-f4125baebe6e","html_url":"https://github.com/cmdcolin/oddgenes","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Foddgenes","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Foddgenes/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Foddgenes/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cmdcolin%2Foddgenes/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cmdcolin","download_url":"https://codeload.github.com/cmdcolin/oddgenes/tar.gz
/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247142066,"owners_count":20890652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-15T02:41:23.630Z","updated_at":"2025-04-04T08:08:16.209Z","avatar_url":"https://github.com/cmdcolin.png","language":"Raku","funding_links":[],"categories":[],"sub_categories":[],"readme":"# oddgenes\n\nA list of weird gene annotations or things that break bioinformatics assumptions\n\nSee also https://github.com/cmdcolin/oddbiology/ for more weird bio\n\n## Gene structures\n\n### 1bp length exon\n\nEvidence given for a 1bp length exon in Arabidopsis and different splicing\nmodels are discussed\n\nhttp://www.nature.com/articles/srep18087\n\nAnother 1bp exon is discussed here\nhttps://journals.plos.org/plosone/article?id=10.1371/journal.pone.0177959\n\nMicroexons in general are an interesting topic and are \"involved in important\nbiological processes in brain development and human cancers\" (ref\nhttps://www.cell.com/molecular-therapy-family/nucleic-acids/fulltext/S2162-2531(23)00013-6)\nyet are commonly misannotated (e.g. 
in plants\nhttps://www.nature.com/articles/s41467-022-28449-8)\n\nSee also cryptic splicing\n\n### 0bp length exon\n\nThe phenomenon of recursive splicing can remove sequences progressively inside\nan intron, so there can exist \"0bp exons\" that are just the splice-site\nsequences pasted together.\n\n\"To identify potential zero nucleotide exon-type ratchet points, we parsed the\nRNA-Seq alignments to identify novel splice junctions where the reads mapped to\nan annotated 5' splice site and an unannotated 3' splice site, and the genomic\nsequence at the 3' splice site junction was AG/GT\"\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4529404/\n\n### Very large introns\n\nSatellite DNA study uncovers megabase scale introns\nhttps://www.biorxiv.org/content/early/2018/12/11/493254\n\nAn example in this paper, kl-3, spans 4.3 million bp\n\nIn human, an example is Dystrophin. \"Dystrophin is coded for by the DMD gene –\nthe largest known human gene, covering 2.4 megabases (0.08% of the human genome)\nat locus Xp21. The primary transcript in muscle measures about 2,100 kilobases\nand takes 16 hours to transcribe; the mature mRNA measures 14.0 kilobases\"\nhttps://en.wikipedia.org/wiki/Dystrophin\n\nNote: these large introns require very large amounts of DNA to be transcribed\ninto RNA, before just removing most of the transcribed RNA via intron splicing,\nwhich is sort of \"wasteful\" on a molecular level\n\n\n### Large number of exons\n\nIn human, the TTN (titin) gene has ~364 exons, which is almost double the next highest, NEB (nebulin), at ~184 exons\n\nhttps://www.ncbi.nlm.nih.gov/gene?Db=gene\u0026Cmd=DetailsSearch\u0026Term=7273\n\n### Small introns\n\n\"A 2015 study suggests that the shortest known metazoan intron length is 30 base\npairs (bp) belonging to the human MST1L gene\n(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4675715/). 
The shortest known\nintrons belong to the heterotrich ciliates, such as Stentor coeruleus, in which\nmost (\u003e 95%) introns are 15 or 16 bp long\n(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5659724/)\"\nhttps://en.wikipedia.org/wiki/Intron#Distribution\n\nA novel splicing factor may be involved in small introns\nhttps://www.news-medical.net/news/20240215/Novel-splicing-mechanism-for-short-introns-discovered.aspx\n\n### Very large proteins\n\nAn alga described in 2024 encodes a protein, PKZILLA-1, that has a mass of\n4.7 megadaltons and contains 140 enzyme domains\nhttps://cen.acs.org/biological-chemistry/PKZILLA-proteins-smash-protein-size/102/web/2024/08\n\nIn human, the titin protein (in muscle) has a mass of almost 4 megadaltons\n\n![](https://s7d1.scene7.com/is/image/CENODS/20240808lnp2-titinpkzilla?$responsive$\u0026wid=700\u0026qlt=90,0\u0026resMode=sharp2)\n\nThe DMD gene above, despite being large on the genome, only encodes a 70\nkilo-dalton protein (not megadalton!)\nhttps://pmc.ncbi.nlm.nih.gov/articles/PMC49288/\n\n### Backsplicing and circRNAs\n\nThe process of \"backsplicing\" circularizes RNAs. There can be alternative\nbacksplicing too\n\nSee https://academic.oup.com/nar/article/48/4/1779/5715065\n\n### Very large number of isoforms in Dscam\n\n\"Dscam has 24 exons; exon 4 has 12 variants, exon 6 has 48 variants, exon 9 has\n33 variants, and exon 17 has two variants. The combination of exons 4, 6, and 9\nleads to 19,008 possible isoforms with different extracellular domains (due to\ndifferences in Ig2, Ig3 and Ig4). 
With two different transmembrane domains from\nexon 17, the total possible protein products could reach 38,016 isoforms\"\n\nRef https://en.wikipedia.org/wiki/DSCAM\nhttps://www.wikigenes.org/e/gene/e/35652.html\n\n### Translational frameshift/Ribosomal frameshift/Programmed ribosomal frameshift\n\nRef https://en.wikipedia.org/wiki/Translational_frameshift\n\nhttps://www.sciencedirect.com/topics/neuroscience/ribosomal-frameshifting\n\nSARS-CoV-2 uses ribosomal frameshifting, and this video shows a 3D animation of\nthe process, showing how a 'pseudoknot' in the RNA contributes to it\nhttps://www.youtube.com/watch?v=gLcueW61QMU\n\nAnother lecture explaining frameshift in viruses\nhttps://youtu.be/b5BX5A3dGUQ?t=2980\n\n### Ribosome hopping\n\n\"Ribosome hopping involves ribosomes skipping over large portions of an mRNA\nwithout translating them\" Ref https://pubmed.ncbi.nlm.nih.gov/24711422/\n\n### Internal Ribosome Entry Sites (IRES)\n\n\"Eukaryotic mRNAs are typically monocistronic and translate only a single Open\nReading Frame. Some viruses can reinitiate translation after translation\ntermination using an IRES\" Ref\nhttps://en.wikipedia.org/wiki/Internal_ribosome_entry_site\n\n### Stop codon readthrough/translational readthrough\n\n\"Stop codon suppression or translational readthrough occurs when in translation\na stop codon is interpreted as a sense codon, that is, when a (standard) amino\nacid is 'encoded' by the stop codon. Mutated tRNAs can be the cause of\nreadthrough, but also certain nucleotide motifs close to the stop codon.\nTranslational readthrough is very common in viruses and bacteria, and has also\nbeen found as a gene regulatory principle in humans, yeasts, bacteria and\ndrosophila.[28][29] This kind of endogenous translational readthrough\nconstitutes a variation of the genetic code, because a stop codon codes for an\namino acid. 
In the case of human malate dehydrogenase, the stop codon is read\nthrough with a frequency of about 4%.[30] The amino acid inserted at the stop\ncodon depends on the identity of the stop codon itself: Gln, Tyr, and Lys have\nbeen found for the UAA and UAG codons, while Cys, Trp, and Arg for the UGA codon\nhave been identified by mass spectrometry.[31] Extent of readthrough in mammals\nhave widely variable extents, and can broadly diversify the proteome and affect\ncancer progression.[32] \"\n\nhttps://en.wikipedia.org/wiki/Stop_codon#Translational_readthrough\n\n### Stop codon re-assignment: selenocysteine\n\nThe amino acid selenocysteine is coded for by the \"opal\" (UGA) stop codon\n(https://en.wikipedia.org/wiki/Selenocysteine)\n\nIt is present in all domains of life, including humans\n\nAs of 2021, 136 human proteins (in 37 families) are known to contain\nselenocysteine\n\nSelenocysteine can be coded via a SECIS sequence\nhttps://en.wikipedia.org/wiki/SECIS_element and the resulting products are called\n[selenoproteins](https://en.wikipedia.org/wiki/Selenoprotein)\n\n### Stop codon re-assignment: pyrrolysine\n\nPyrrolysine is similarly coded for by the \"amber\" (UAG) stop codon\n(https://en.wikipedia.org/wiki/Pyrrolysine); it is not present in humans\n\n\"It is encoded in mRNA by the UAG codon, which in most organisms is the 'amber'\nstop codon. This requires only the presence of the pylT gene, which encodes an\nunusual transfer RNA (tRNA) with a CUA anticodon, and the pylS gene, which\nencodes a class II aminoacyl-tRNA synthetase that charges the pylT-derived tRNA\nwith pyrrolysine. 
\"\n\nThere are several other stop codon modifications described here\nhttps://www.nature.com/articles/nrg3963\n\n### Stop codons can also be removed by RNA editing\n\nas in the case of mammalian apoliprotein B, B100 isoform.\n\n\"A posttranscriptional modification of the apoB mRNA by conversion of cytidine\ninto uridine at nucleotide position 6666 changes the genomically encoded\nglutamine codon CAA at amino acid residue 2153 into a translational stop codon\nUAA.\"\n\nhttps://pubmed.ncbi.nlm.nih.gov/8409768/\n\n### Stop codons can be added by polyadenylation\n\nThere is a stop codon not in the genome, but one is added post-transcriptionally\nby polyadenylation\n\nNoted in vertebrate mitochondrial section here\nhttps://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG2\n\n### Ciliates with \"No stop codons\"\n\n\"Flexibility in the nuclear genetic code has been demonstrated in ciliates that\nreassign standard stop codons to amino acids...Surprisingly, in two of these\nspecies, we find efficient translation of all 64 codons as standard amino acids\nand recognition of either one or all three stop codons\"\n\nTermination is therefore \"context dependent\" rather than a specific 3 letter\nsequence https://pubmed.ncbi.nlm.nih.gov/27426948/\n\n### Readthrough transcription\n\nSee also this Ensembl blog on annotating readthrough transcription which joins\nmultiple genes\nhttp://www.ensembl.info/2019/02/11/annotating-readthrough-transcription-in-ensembl/\n\nRNA-seq often makes extremely compelling cases for two-or-more different genes\nto be conjoined by splicing\n\nSome algorithms e.g. 
mikado\nhttps://academic.oup.com/gigascience/article/7/8/giy093/5057872 try to avoid\nthis, calling it artifactual fusion/chimera that can be due to some tandem\nduplication, but it does seem to be very prevalent in real data sets\n\n### Non-canonical splice sites\n\nThe standard splice site recognition sequence is a GU in RNA (or GT in DNA) on\nthe 5' end and AG on the 3' (remember, goes 5' to 3'). This recognition motif\naccounts for the large majority of splicing. If a different sequence is used it\nis said that a different spliceosome complex, the \"minor spliceosome\", is being\nused\n\nhttps://en.wikipedia.org/wiki/Minor_spliceosome\n\n### Cryptic splice sites\n\nSome exons harbor internal splice sites (e.g. they get split) that might be\nunused or underused and are so-called \"cryptic splice sites\"\n\nReview article https://academic.oup.com/nar/article/39/14/5837/1382796\n\nThe snaptron project from Ben Langmead analyzed huge amounts of public RNA-seq\ndata and found many types of cryptic splicing http://snaptron.cs.jhu.edu/\n\n### Wobble splicing\n\nNAGNAG, GYNGYN, repeats of the splicing signal cause modified transcriptional\nbehavior\n\n\"Another mechanism introducing small variations to protein isoforms is wobble\nsplicing. Here, a GYN repeat at the donor splice site (5’ splice site; Y stands\nfor C or T and N stands for A, C, G, or T) or an NAG repeat at the acceptor\nsplice site (3’ splice site) leads to subtle length variations in the spliced\ntranscripts and finally to alternative isoforms differing in few amino acids.\"\nref https://onlinelibrary.wiley.com/doi/full/10.1002/bies.201900066?af=R\n\n### Intron retention\n\nIntron retention (IR) is a phenomenon where intron sequence is preserved, or\ndoesn't get spliced out, in mature RNA\n\nIt can occur in both abnormal and normal biological conditions. 
Transcripts with\nIR often undergo nonsense-mediated decay.\n\n### Self-splicing RNA\n\nNormally RNA is spliced by a specialized protein complex called a spliceosome.\nThere is also self-splicing RNA where the splicing is done by the RNA itself\n\nThe Group 1 intron type is a \"self splicing\" function of RNA not\nrequiring an external spliceosome\nhttps://en.wikipedia.org/wiki/Group_I_catalytic_intron\n\nGroup 2 and group 3 with similar but different mechanisms also exist\n\n### Bulge helix bulge introns (archaeal tRNA)\n\nThere are some small intron types called \"bulge-helix-bulge\" in archaea (and\nother organisms)\n\n![](img/bhb.jpg)\n\nFrom https://www.embopress.org/doi/full/10.1038/embor.2008.101\n\nThe figure above shows that the orange part is excised as an intron for the tRNA\n\n### Twintron\n\nA twintron is essentially an intron-within-an-intron, and has similar qualities\nto the 0bp splicing mentioned above. A twintron may be defined as one where the\ninternal intron has to be spliced first before the outer one is (it may be\nreferred to as a nested intron if the internal intron does not need to be\nspliced out before the outer one)\n\nSee https://en.wikipedia.org/wiki/Twintron\n\n![](img/twintron.png)\n\nFigure from https://doi.org/10.1080/15476286.2015.1103427 showing twintron\nconformations with a) spliceosome type introns (the spliceosome is a protein\ncomplex that performs splicing) b) ribosomal type introns (e.g. 
self splicing\nRNA) and c) tRNA/bulge helix bulge type introns\n\n### Introns in viruses\n\nIntrons were actually first discovered in viruses before eukaryotes, and the\nWikipedia article on introns details this\n\nhttps://en.wikipedia.org/wiki/Intron#Discovery_and_etymology (see also\nhttps://www.proquest.com/docview/303935681/)\n\n### Nuclear mitochondrial (NUMT) insertions\n\nPieces of the mitochondrial genome can be inserted into the autosomes in\neukaryotes\n\nhttps://en.wikipedia.org/wiki/Nuclear_mitochondrial_DNA_segment\n\n### Codon tables\n\nMany eukaryotes use the \"standard genetic code\" for translating codons into\namino acids but frequent changes occur across the domains of life. The NCBI\n\"genetic code\" table lists several of these and contains recent additions for\nparticular species\n\nhttps://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG31\n\nOne article explains how alternative genetic codes work\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6207430/\n\n### Untranslated regions\n\nThe 5' and 3' UTRs (untranslated regions) are the parts of the mRNA at the start\nand end of the transcript, respectively, that are retained in the mature RNA but\nnot translated into protein\n\nThis blog post by Ensembl shows how they annotate UTRs, including a 19kb 3' UTR\nin Grin2b\nhttp://www.ensembl.info/2018/08/17/ensembl-insights-how-are-utrs-annotated/\n\nUTRs have many important functions and are often targets of miRNA binding,\nwhich leads to degradation.\n\n### Polyadenylation\n\nPolyadenylation is the addition of a string of \"A\"s to the pre-mRNA on the 3'\nend of the transcript (the \"A\"s are not part of the genome). 
There is a \"poly-A\nsignal\" in the genome that is recognized by the \"RNA cleavage complex\" and after\nit is cleaved, the poly-A tail is added\nhttps://en.wikipedia.org/wiki/Polyadenylation\n\nA survey of poly-A using Oxford Nanopore found a transcript isoform with a 450bp\npoly-A tail ENST00000581230, with intron retention being a possible correlate of\nhaving a longer poly-A tails\nhttps://www.biorxiv.org/content/early/2018/11/09/459529.article-info\n\n\"Intronic polyadenylation\" can also occur, which leads to different isoforms\n(the wording intronic polyadenylation is maybe a bit odd, but my understanding\nis that the \"transcription stops\" at a poly-A site inside an intron essentially)\n\n![](img/ipa.png)\n\nFigure showing \"intronic polyadenylation\" (IpA) creating a different isoform\nfrom https://www.nature.com/articles/s41467-018-04112-z\n\nIn mammalian mitochondria, some messages are polyadenylated after a U residue\nwhich is the U in a UAA stop codon -- the post-transcriptional polyadenylation\ncompletes the stop codon\n\n### Circular chromosomes\n\nCircularized chromosomes should be unsurprising to anyone working with plasmids\nand many prokaryotic genomes but for gene annotation formats which use linear\ncoordinates, representing anything wrapping around the origin is challenging.\n\nMany genomic viewers do not do this well. For GFF format this is done by making\nthe end go past the end of the genome. Below, the genome is 6407 bp in length,\nbut the CDS feature extends past this and sets Is_circular=true\n\n```\n##gff-version 3.2.1\n# organism Enterobacteria phage f1\n# Note Bacteriophage f1, complete genome.\nJ02448  GenBank region  1      6407    .       +       .       ID=J02448;Name=J02448;Is_circular=true;\nJ02448  GenBank CDS     6006   7238    .       
+       0       ID=geneII;Name=II;Note=protein II;\n```\n\n### Dynamic DNA structures in vivo\n\nThe replication of the 2 micron plasmid found in Saccharomyces cerevisiae relies\non a programmed DNA rearrangement; in any population of cells two different\nstates of the 2 micron plasmid can be expected and these will interconvert in\nlater generations. Reference: https://pubmed.ncbi.nlm.nih.gov/23541845/\n\n### Overlapping genes\n\nIt is possible for gene sequences to overlap, on different strands\n(sense-antisense) or same strand, possibly in alternate coding frames\n\nhttps://en.wikipedia.org/wiki/Overlapping_gene\n\nSome articles\n\n- The novel EHEC gene asa overlaps the TEGT transporter gene in antisense and is\n  regulated by NaCl and growth phase\n  https://www.ncbi.nlm.nih.gov/m/pubmed/30552341/\n- Overlapping genes in natural and engineered genomes\n  https://www.nature.com/articles/s41576-021-00417-w\n- Uncovering de novo gene birth in yeast using deep transcriptomics\n  https://www.nature.com/articles/s41467-021-20911-3\n\n## Flybase\n\n### Chimeric genes\n\nThe gene Jingwei is a chimera (or fusion) of two genes, alcohol dehydrogenase\nand yellow emperor. Many chimeras are damaging but this has been selected for\n\nhttp://www.pnas.org/content/101/46/16246\n\nTwo Cytochrome P450 genes don't confer any insecticide resistance on their\nown, but a chimeric P450 does https://pubmed.ncbi.nlm.nih.gov/22949643/\n\n## Wormbase\n\n### Adding leader sequence to mRNA\n\n\"About 70% of C. elegans mRNAs are trans-spliced to one of two 22 nucleotide\nspliced leaders. SL1 is used to trim off the 5' ends of pre-mRNAs and replace\nthem with the SL1 sequence. 
This processing event is very closely related to\ncis-splicing, or intron removal.\"\n\nThe region that is spliced out is called an outron\n\nhttp://www.wormbook.org/chapters/www_transsplicingoperons/transsplicingoperons.html\n\n### Polycistronic transcripts/operons\n\nAlthough prevalent in bacteria, operons are not common in eukaryotes. However,\nthey are common in C. elegans specifically. \"A characteristic feature of the\nworm genome is the existence of genes organized into operons. These\npolycistronic gene clusters contain two or more closely spaced genes, which are\noriented in a head to tail direction. They are transcribed as a single\npolycistronic mRNA and separated into individual mRNAs by the process of\ntrans-splicing\"\n\nhttp://www.wormbook.org/chapters/www_overviewgenestructure.2/genestructure.html\n\nAnother paper says \"Once considered rare in eukaryotes, polycistronic mRNA expression has been identified in kinetoplastids and, more\nrecently, green algae, red algae, and certain fungi. This study provides comprehensive evidence supporting the\nexistence of polycistronic mRNA expression in the apicomplexan parasite Cryptosporidium parvum\"\n\nhttps://www.biorxiv.org/content/10.1101/2025.01.17.633476v1.full.pdf\n\n### Trans-splicing of exons on different strands\n\nA pre-mRNA from both strands of DNA eri6 and eri7 are combined to create eri-6/7\n\nSource http://forums.wormbase.org/index.php?topic=1225.0\nhttp://www.ncbi.nlm.nih.gov/pmc/articles/PMC2756026/\n\n### Exon shared across different genes\n\nAn example from drosophila, C. 
elegans, and rat shows a 5' exon\nbeing shared between two genes\n\n![](img/twosplice.png)\n\nSource http://forums.wormbase.org/index.php?topic=1225.0\nhttps://www.fasebj.org/doi/full/10.1096/fj.00-0313rev\n\nAn example here shows 5'UTR exons shared across different olfactory receptor\ngenes (\"Some OR genes share 5'UTR exons\")\n\nhttps://www.biorxiv.org/content/biorxiv/early/2019/09/19/774612.full.pdf\n\n## Evolution\n\n### Possible adaptive bacteria-\u003eeukaryote HGT\n\nA possible horizontal gene transfer from bacteria to eukaryotes is found in an\ninsect that feeds on coffee beans. Changes that the gene had to undergo are\ncovered (added poly-A tail, Shine-Dalgarno sequence deleted)\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3306691/\n\nalso https://www.cell.com/cell/fulltext/S0092-8674(19)30097-2\n\n### Transgenerational epigenetic inheritance\n\nThis phenomenon of epigenetic modifications being passed down across generations\ngarners a lot of media attention and scientific attention. 
The idea of it being\ninfluenced by what \"one does in life\" such as experiencing famine is also very\ninteresting.\n\nhttps://en.wikipedia.org/wiki/Transgenerational_epigenetic_inheritance\n\nThere are skeptics also\nhttp://www.wiringthebrain.com/2018/07/calibrating-scientific-skepticism-wider.html\nbut the science is hopefully what speaks for itself\n\n## Codon usage\n\n### Alternative start codons\n\n\"The most common start codons for known Escherichia coli genes are AUG (83% of\ngenes), GUG (14%) and UUG (3%)\"\n\n\"Here, we systematically quantified translation initiation of green fluorescent\nprotein (GFP) from all 64 codons and nanoluciferase from 12 codons on plasmids\ndesigned to interrogate a range of translation initiation conditions.\"\n\nhttps://www.sciencedaily.com/releases/2017/02/170221080506.htm\n\nTesting in eukaryotes has also revealed alternative starts being viable\nhttps://en.wikipedia.org/wiki/Start_codon#Eukaryotes\n\n## Molecular\n\n### 4-base/quaternary/quadruplet codons\n\n3-base codon system is assumed by many, but engineered tRNAs can decode 4-base\ncodons with potential applications for using amino acids outside the 20\ncanonical ones\n\nreview https://elifesciences.org/articles/78869\n\nevolving improved 4-base efficiency\nhttps://www.nature.com/articles/s41467-021-25948-y\n\n### Complex DNA structures\n\nThe standard DNA double stranded helix is called B-DNA\n\n\"There are also triple-stranded DNA forms and quadruplex forms such as the\nG-quadruplex and the i-motif. 
\"\nhttps://en.wikipedia.org/wiki/Nucleic_acid_double_helix\n\n### Triplex DNA\n\nhttps://en.wikipedia.org/wiki/Triple-stranded_DNA\n\n### Polytene chromosome\n\nSome organisms, famously insects in their salivary glands, create many copies of\ngenes through multiple phases of incomplete DNA replication\nhttps://en.wikipedia.org/wiki/Polytene_chromosome\n\n![](img/polytene.png)\n\nFigure source https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5768140/\n\n\"Polytene chromosomes are produced by endoreplication, in which chromosomal DNA\nundergoes mitotic replication, but the strands do not separate. Ten rounds of\nendoreplication produces 2^10 = 1,024 DNA strands, which when arranged alongside\nof each other produce distinctive banding patterns. Endoreplication occurs in\ncells of the larval salivary glands of many species of Diptera, and increases\nproduction of mRNA for Glue Protein that the larvae use to anchor themselves to\nthe walls of (for example) culture vials.\" from\nhttps://www.mun.ca/biology/scarr/Polytene_Chromosomes.html\n\n### Endoreplication\n\nThe above section about polytene chromosomes mentions endoreplication but this\ncan also affect many other contexts and was mentioned as an issue in genome\nassembly of some plants. A talk given about vanilla bean found a lot of\nendoreplication during their genome assembly which leads to very uneven\ncoverage. They tried to select tissue samples that had the least amount of\nendoreplication.\nhttps://plan.core-apps.com/pag_2023/abstract/e26dbeb1-df8f-4c57-a062-dcaf881b79f4\n\n### Endo-(poly)ploidy\n\nDifferent cells may have different numbers of copies of chromosomes and it also\noccurs in some human cell types: \"polyploid cells can exist in otherwise diploid\norganisms (endopolyploidy). In humans, polyploid cells are found in critical\ntissues, such as liver and placenta. 
A general term often used to describe the\ngeneration of polyploid cells is endoreplication, which refers to multiple\ngenome duplications without intervening division/cytokinesis\"\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC4442802/\n\n### Programmed DNA elimination\n\n\"While we commonly assume the genome to be largely identical across different\ncells of a multicellular organism, a number of species undergo a developmentally\nregulated elimination process by which the genome in somatic cells is reduced,\nwhile the germline genome remains intact. This process, called Programmed DNA\nElimination (PDE), affects a number of species including copepod crustaceans,\nlamprey fish, single-celled ciliates and nematode worms (though not C.\nelegans!).\"\n\nFrom ISMB2023 video \"Deciphering developmentally programmed DNA elimination in\nMesorhabditis nematodes\" https://www.youtube.com/watch?v=2x6ElKeISRY\n\nSee also the term \"internal eliminated sequences\" (IES)\n\n### Range of ploidy\n\nWikipedia lists this table with examples of organisms with different ploidy\nhttps://en.wikipedia.org/wiki/Polyploidy#Types\n\n- haploid (one set; 1x), for example male European fire ants\n- diploid (two sets; 2x), for example humans\n- triploid (three sets; 3x), for example sterile saffron crocus, or seedless\n  watermelons, also common in the phylum Tardigrada[7]\n- tetraploid (four sets; 4x), for example, Plains viscacha rat, Salmonidae\n  fish,[8] the cotton Gossypium hirsutum[9]\n- pentaploid (five sets; 5x), for example Kenai Birch (Betula kenaica)\n- hexaploid (six sets; 6x), for example some species of wheat,[10] kiwifruit[11]\n- heptaploid or septaploid (seven sets; 7x)\n- octaploid or octoploid, (eight sets; 8x), for example Acipenser (genus of\n  sturgeon fish), dahlias\n- decaploid (ten sets; 10x), for example certain strawberries\n- dodecaploid or duodecaploid (twelve sets; 12x), for example the plants Celosia\n  argentea and Spartina anglica [12] or the amphibian Xenopus 
ruwenzoriensis.\n- tetratetracontaploid (forty-four sets; 44x), for example black mulberry[13]\n\n### DNA modifications\n\nThere are many chemical modifications that can happen to DNA, leading to an\n\"extended alphabet\" with functional changes.\n\nA common DNA modification is called methylation. The most common is a 5mC\nmodification, a methylation of the letter C, and is mostly found in a CpG (a C\nfollowed by a G in the genome)\n\nMany other modifications exist, see https://dnamod.hoffmanlab.org/\n\n## RNA world\n\n### RNA modifications\n\nhttps://www.hindawi.com/journals/jna/2011/408053/tab1/\n\nupdated link on hindawi should point here http://mods.rna.albany.edu/mods/ (this\nlink now dead too, see maybe http://genesilico.pl/modomics/modifications)\n\n### RNA editing\n\nRNA editing is a post-transcriptional modification to the mRNA, which can change\nwhat we would see when the RNA is sequenced. A-to-I editing is common in some\nspecies, which would make the RNA, when sequenced, appear to have a G instead of\nan A. 
If the genome was sequenced, it would not show a SNP but the RNA-seq would\nappear to have A-\u003eG.\n\nRNA editing can be conditional; mammalian apolipoprotein B is synthesized as a\n48 kilodalton form or a 100 kilodalton form; the latter is created by editing\nout a stop codon to enable read through\n\nOther editing occurs also https://en.wikipedia.org/wiki/RNA_editing\n\nEditing in some ciliate mitochondria adds information to messages and can\nincrease the length of the final message by over 2-fold.\n\n### Post-Transcriptional Exon Shuffling (PTES)\n\nWhile the exon structure of most mRNAs follows the linear sequence of the\ntranscribed DNA, there are a few cases where mature mRNAs contain exons in a\nnon-linear order.\n\nAl-Balool and Weber _et al_ (2011) validated several cases of PTES in human\ngenes that are evolutionarily conserved, including _MAN1A2_, _PHC3_, _TLE4_, and\n_CDK13_: https://genome.cshlp.org/content/21/11/1788.short\n\n### Maternal RNAs being passed down\n\nMaternal RNAs can show activity in the zygote (e.g.\nhttps://en.wikipedia.org/wiki/Maternal_to_zygotic_transition) which can lead to\ncomplex transgenerational effects\n\n### Lowly expressed RNA has large effects\n\nA lncRNA VELUCT almost flies under the radar in a lung cancer screen due to\nbeing very lowly expressed such that it is \"below the detection limit in total\nRNA from NCI-H460 cells by RT-qPCR as well as RNA-Seq\", however this study\nconfirms it as a factor in experiments\n\nhttps://www.ncbi.nlm.nih.gov/pubmed/28160600?dopt=Abstract\n\nNote that X inactivation relies on relatively lowly expressed RNA also\nhttps://twitter.com/mitchguttman/status/1454256452990734336\n\n### X chromosome inactivation\n\nX chromosome inactivation is produced by a non-coding transcript called Xist\nthat is transcribed on the X that is being inactivated. The Xist transcript\n\"coats\" the X chromosome with itself. 
An anti-sense transcript called Tsix\nregulates Xist\n\nhttps://en.wikipedia.org/wiki/XIST\n\nhttps://en.wikipedia.org/wiki/X-inactivation#Xist_and_Tsix_RNAs\n\nhttps://www.youtube.com/watch?v=y3ST0whbA4k (great series from iBiology on X\nchromosome inactivation)\n\n### Types of RNA\n\nThere are many types of RNA, some more weird and exotic than others; a large\nlist is at https://en.wikipedia.org/wiki/List_of_RNAs\n\nSome are named based on where they are expressed or active\n\nOthers are uniquely shaped. There are also circular RNAs, for example\nhttps://en.wikipedia.org/wiki/Circular_RNA\n\nSmall and long non-coding RNAs often fold into important structural shapes\n\n## Proteins\n\n### Removal of start amino acid in proteins\n\nThis is probably obvious to many people who work on proteins but while the\ngenome has almost all genes starting with a start codon which produces\nmethionine, this is often post-translationally removed\nhttps://en.m.wikipedia.org/wiki/Methionyl_aminopeptidase\n\n### Inteins\n\nAn intein is like an intron but for a protein, a segment of protein that is\nspliced out https://en.wikipedia.org/wiki/Intein\n\nSee section here\nhttps://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#pathological-cases\n\n### Polyprotein\n\nViral sequences can create a polyprotein which is fully transcribed and\ntranslated before being cleaved by a protease. In some viruses (such as\ncoronaviruses) their translation involves ribosomal frameshifting.\n\nDengue, HIV, flu, etc. 
use this\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC6040172/\nhttps://www.sciencedirect.com/science/article/abs/pii/S0959440X15000597\n\n### Interesting PDB entries\n\nFrom another repo\nhttps://github.com/molstar/molstar/blob/master/docs/docs/misc/interesting-pdb-entries.md\n\n## Transposons\n\n### Cross-species BovB transposon transfers\n\nOr \"How a quarter of the cow genome came from snakes\"\nhttp://phenomena.nationalgeographic.com/2013/01/01/how-a-quarter-of-the-cow-genome-came-from-snakes/\n\nSource http://www.pnas.org/content/110/3/1012.full\n\n### LINE1 important for embryonic development\n\nTransposon activity can mutate DNA as it will insert itself into the genome. The\ngenome has functions for keeping transposons inactive. However, evidence shows\nthat the LINE1 is important for embryonic development.\n\nhttps://www.ucsf.edu/news/2018/06/410781/not-junk-jumping-gene-critical-early-embryo\n\n## Immunity\n\n### VDJ Recombination\n\nVDJ recombination is a process of somatic recombination that is done in immune\ncells. It recognizes certain \"recombination signal sequences\". Different gene\nsegments of class \"V\", class \"D\", and class \"J\" exons (sometimes the exons are\nreferred to as \"genes\" themselves in literature) are somatically rearranged into\ncoherent genes that are then transcribed to create immune diversity. Splicing at\nthe DNA level is not precise, with terminal transferase adding random\nnucleotides to further diversify the sequences\n\nhttps://en.wikipedia.org/wiki/V(D)J_recombination\n\n### MHC region\n\nThe MHC region is a very polymorphic region of the genome on chr6. 
I'm not\npersonally familiar with all the intricacies of MHC beyond that it is a unique\ncontributor of some additional hg38 alternative loci/contigs due to its high\ndiversity\n\n- https://en.wikipedia.org/wiki/Major_histocompatibility_complex\n\n- https://en.wikipedia.org/wiki/Human_leukocyte_antigen\n\n## Structural variations\n\n### Tandem duplication\n\nA tandem duplication can be seen as a piece of DNA that is copied side by side\nin the genome. But why would this occur?\n\nSome biological factors can include\n\n- replication slippage\n- retrotransposition\n- unequal crossing over (UCO)\n- imperfect repair of double-strand breaks by nonhomologous end joining (NHEJ)\n  (specifically generates 1-100bp range indels according to the article)\n\nRef https://academic.oup.com/mbe/article/24/5/1190/1038942\n\n## Pseudogenes\n\n### A pseudogene that can protect against cancer in Elephants\n\nThe LIF gene has many copies in elephants, but many are non-functional. One copy\ncan be \"turned back on\" and play a role in cancer protection. 
The authors call this a\n\"zombie gene\"\n\nhttps://www.cell.com/cell-reports/fulltext/S2211-1247(18)31145-8\n\nhttps://www.sciencealert.com/lif6-pseudogene-elephant-tumour-suppression-solution-petos-paradox\n\n## Regulation\n\n### Intron mediated enhancement (IME)\n\nIt has been shown that some intron sequences can enhance expression similar to\nhow promoter sequences work\nhttps://en.wikipedia.org/wiki/Intron-mediated_enhancement\n\nThe first intron of the UBQ10 gene in Arabidopsis exhibits IME, and \"the\nsequences responsible for increasing mRNA accumulation are redundant and\ndispersed throughout the UBQ10 intron\"\nhttp://www.plantcell.org/content/early/2017/04/03/tpc.17.00020.full.pdf+html\n\nThe classic peppered moth phenotype is caused by an intronic TE insertion\nhttps://www.nature.com/articles/nature17951 (may not be strictly IME, I'm\npersonally not sure)\n\n### Bidirectional promoters\n\nWikipedia\nhttps://en.wikipedia.org/wiki/Promoter_(genetics)#Bidirectional_(mammalian)\n\n\"Bidirectional promoters are a common feature of mammalian genomes. About 11% of\nhuman genes are bidirectionally paired.\"\n\n\"The two genes are often functionally related, and modification of their shared\npromoter region allows them to be co-regulated and thus co-expressed\"\n\n## Chromosomal abnormalities\n\n### Uniparental disomy (UPD)\n\nA child can inherit both copies of a chromosome from one parent, instead of the\n\"usual\" one copy from mom, one from dad\n\n\"UPD arises usually from the failure of the two members of a chromosome pair to\nseparate properly into two daughter cells during meiosis in the parent’s\ngermline (nondisjunction). The resulting abnormal gametes contain either two\ncopies of a chromosome (disomic) or no copy of that chromosome (nullisomic),\ninstead of the normal single copy of each chromosome (haploid). This leads to a\nconception with either three copies of one chromosome (trisomy) or a single copy\nof a chromosome (monosomy). 
If a second event occurs by either the loss of one\nof the extra chromosomes in a trisomy or the duplication of the single\nchromosome in a monosomy, the karyotypically normal cell may have a growth\nadvantage as compared to the aneuploid cells. UPD results primarily from one of\nthese “rescue” events\"\n\nhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC3111049/\n\n### Mosaic loss of Y chromosome\n\nOlder men can have a mosaic loss of the Y chromosome\nhttps://en.wikipedia.org/wiki/Mosaic_loss_of_chromosome_Y\n\nhttps://www.karger.com/Article/FullText/508564 (found from\nhttps://www.biostars.org/p/9482437/)\n\nLoss of Y may be associated with cardiac issues\nhttps://www.science.org/doi/10.1126/science.abn3100\n\n### Mosaic loss of X chromosome\n\nSimilar to the above but for X\nhttps://www.cancer.gov/news-events/press-releases/2024/genetic-factors-predict-x-chromosome-loss\n\n### Ring chromosome\n\nIn organisms with normally linear chromosomes, circular or \"ring\" chromosomes\ncan form from aberrant processes https://en.wikipedia.org/wiki/Ring_chromosome\n\n![](https://upload.wikimedia.org/wikipedia/commons/d/da/NLM_ring_chromosome.jpg)\n\nThere are also smaller fragments that can be circularized, called \"small\nsupernumerary ring chromosomes\" (sSRC), or their linear counterpart, \"small\nsupernumerary marker chromosomes\" (sSMC)\nhttps://en.wikipedia.org/wiki/Small_supernumerary_marker_chromosome\n\n## File formats\n\n### Non-ACGT letters in fasta files\n\nFor example, the latest human genome downloaded from NCBI contains a number of\nnon-ACGT letters in the form of IUPAC codes\nhttps://www.bioinformatics.org/sms/iupac.html These represent ambiguous bases.\n\nHere is the incidence of non-ACGTN IUPAC letters in the entire human genome\nGRCh38.p14 from\nhttps://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz\n(same for the \"analysis set\" files 
in\nhttps://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/)\n\n```\n{\n  'B' =\u003e 2,\n  'K' =\u003e 8,\n  'Y' =\u003e 36,\n  'M' =\u003e 8,\n  'R' =\u003e 29,\n  'W' =\u003e 15,\n  'S' =\u003e 5\n};\n```\n\nDid you expect that in your bioinformatics software? Note that the mouse genome\n(GRCm38.p5), as far as I could tell, does not contain any non-ACGTN IUPAC letters\n\nSee [count_fasta_letters.pl](count_fasta_letters.pl) for a script to count this.\nThe UCSC hg38.fa.gz does not have any non-ACGTN letters.\n\n### rs SNP identifiers occurring in multiple places\n\nDue to how dbSNP is created (based on alignments), an rs ID can occur in\nmultiple places on the genome https://www.biostars.org/p/2323/\n\n### Weird characters in FASTA sequence names\n\nIn response to hg38 including a colon in sequence names, which conflicts with\nthe commonly used representation of a range (chr1:1-100, for example; note:\nSAMv1.pdf contains a regex to help resolve this), people analyzed meta-character\nfrequencies in sequence names https://github.com/samtools/hts-specs/issues/291\n\n```\nENA\n#   16927\n*   1\n,   231\n-   122563947\n.   521540419\n/   236951\n\\   0\n:   30181\n;   72892\n=   186611\n@   3713\n|   949\n\nBroad(?)\n     12 #\n    527 *\n    357 ,\n1451132 -\n1492749 .\n  86114 /\n 233731 :\n   2034 =\n     17 @\n1735713 |\n\nReference sequences\n # 203\n % 203\n * 525\n + 1\n , 496\n - 154226\n . 1826561\n : 1577\n = 26\n _ 4961932\n | 1098333\n```\n\nNote that the comma is being proposed as an illegal character in FASTA names\nbecause the supplementary alignment tag in SAM/BAM uses comma-separated\nvalues\n\n## Humongous chromosomes V1\n\nGenomes such as wheat have large chromosomes averaging 806Mbp, but the BAI/TBI\nfile formats are limited to 2^29-1 (~536Mbp) in length (this is due to the\nbinning strategy, the max bin size is listed as 2^29). 
The CSI index format was created\nto help index BAM and tabix files with large chromosomes.\n\nBonus: I made a web tool to help visualize BAI files to show how the binning\nindex works https://cmdcolin.github.io/bam_index_visualizer/\n\n## Humongous chromosomes V2\n\nThe axolotl genome has individual chromosomes as large as 3.14 Gbp\nhttps://genome.cshlp.org/content/29/2/317.long (2019), which is almost as big as\nthe entire human genome\n\nHowever, the BAM and CRAM formats can only store chromosomes up to 2^31-1\n(~2.14Gbp) in length, so bgzip/tabix SAM is used instead (discussion\nhttps://github.com/samtools/hts-specs/issues/655)\n\n## Largest genomes\n\nJust some honorable mentions for largest genome\n\n- Polychaos dubium/Amoeba dubium/Chaos chaos - ~600-1300Gbp (unsequenced, 1968\n  back of envelope measurement, needs confirmation)\n  https://en.wikipedia.org/wiki/Polychaos_dubium (another ref\n  https://bionumbers.hms.harvard.edu/bionumber.aspx?\u0026id=117342)\n- Dinoflagellates - up to 250Gbp (unsequenced, 1987 book referenced in this\n  paper, needs confirmation, has weird chromosome \"rod-like\" structures)\n  https://www.nature.com/articles/s41588-021-00841-y\n- Tmesipteris oblanceolata (fork fern) - ~160Gbp (unsequenced)\n  https://www.nature.com/articles/d41586-024-01567-7\n- Paris japonica (canopy plant) - ~149Gbp (unsequenced)\n  https://en.wikipedia.org/wiki/Paris_japonica\n- Tmesipteris obliqua (fern) - ~147Gbp (unsequenced)\n  https://en.wikipedia.org/wiki/Tmesipteris_obliqua\n- South American lungfish (Lepidosiren paradoxa) - ~91Gbp (sequenced)\n  https://www.nature.com/articles/s41586-024-07830-1\n- European mistletoe - ~90Gbp (sequenced)\n  https://www.darwintreeoflife.org/news_item/2022-the-year-we-built-the-biggest-genome-in-britain-and-ireland/\n- Antarctic krill - ~48Gbp (sequenced)\n  https://www.cell.com/cell/pdf/S0092-8674(23)00107-1.pdf\n- Neoceratodus forsteri (Australian lungfish) - ~43Gbp (sequenced)\n  
https://www.smithsonianmag.com/smart-news/australian-lungfish-has-biggest-genome-ever-sequenced-180976837/\n  https://www.ncbi.nlm.nih.gov/genome/?term=Neoceratodus+forsteri\n- Ambystoma mexicanum (axolotl) - ~32Gbp (sequenced)\n  https://en.wikipedia.org/wiki/Axolotl\n  https://www.ncbi.nlm.nih.gov/genome/?term=axolotl\n- Allium ursinum (wild garlic) - ~30Gbp https://en.wikipedia.org/wiki/Onion_Test\n- Coastal redwood - ~26Gbp (sequenced)\n  https://www.ucdavis.edu/climate/news/coast-redwood-and-sequoia-genome-sequences-completed\n  https://www.ncbi.nlm.nih.gov/genome/?term=redwood\n- Loblolly pine - ~22Gbp (sequenced)\n  https://blogs.biomedcentral.com/on-biology/wp-content/uploads/sites/5/2014/03/genomelog030.jpg\n  https://www.ncbi.nlm.nih.gov/genome/?term=loblolly+pine\n- Wheat genome - ~17Gbp\n  https://academic.oup.com/gigascience/article/6/11/gix097/4561661\n  https://www.ncbi.nlm.nih.gov/genome/?term=wheat\n\nInspired by this Twitter thread\nhttps://twitter.com/PetrovADmitri/status/1506824610360168455\n\nAlso see http://www.genomesize.com/statistics.php?stats=entire#stats_top\n\nSee also the plant C-value database https://cvalues.science.kew.org/; the\nC-value is a measurement you will sometimes see instead of base pair length\n(\"C-value is the amount, in picograms, of DNA contained within a haploid\nnucleus\")\n\n## Humongous CIGAR strings\n\nThe CG tag was invented in order to store CIGAR strings with more than 65535\noperations, since n_cigar_op is a uint16 in BAM. This limit is relevant\nonly for BAM files; CRAM uses a different storage mechanism for CIGAR type data\n(e.g. 
the reference based compression).\n\n## Interesting gene names\n\n## Update Dec 2023\n\nI extracted all the genes from a number of model organism databases here\nhttps://cmdcolin.github.io/genes/\n\nHere are some random highlights from earlier work\n\n- Tinman - \"In mutant or knockout organisms, the loss of tinman results in the\n  lack of heart formation\" https://en.wikipedia.org/wiki/Tinman_gene\n- Sonic hedgehog (SHH) - SHH mutants have 'spiky' fruit fly embryos\n  https://en.wikipedia.org/wiki/Sonic_hedgehog\n- Robotnikin - antagonist of SHH, villain of the sonic hedgehog franchise -\n  https://pmc.ncbi.nlm.nih.gov/articles/PMC2770933/\n- Heart of glass (heg) - a zebrafish gene with mutant phenotype \"Individual heg\n  myocardial cells are also thinner than wild-type\"\n  https://www.ncbi.nlm.nih.gov/pubmed/14680629\n- Dracula (drc) - \"we isolated a mutation, dracula (drc), which manifested as a\n  light-dependent lysis of red blood cells\"\n  https://www.ncbi.nlm.nih.gov/pubmed/10985389 (now renamed\n  https://zfin.org/ZDB-GENE-000928-1)\n- Sleeping Beauty transposon system -\n  https://en.wikipedia.org/wiki/Sleeping_Beauty_transposon_system\n- Skywalker (sky) -\n  https://www.ncbi.nlm.nih.gov/gene?Db=gene\u0026Cmd=DetailsSearch\u0026Term=35359\n- TIME FOR COFFEE (TIC) - \"We characterize the time for coffee (tic) mutant that\n  disrupts circadian gating, photoperiodism, and multiple circadian rhythms,\n  with differential effects among rhythms\"\n  https://www.ncbi.nlm.nih.gov/gene?Db=gene\u0026Cmd=DetailsSearch\u0026Term=821807\n- WTF - \"Some alleles of the wtf gene family can increase their chances of\n  spreading by using poisons to kill other alleles, and antidotes to save\n  themselves.\" - https://www.ebi.ac.uk/interpro/entry/IPR004982\n  https://www.sciencedaily.com/releases/2017/06/170620093209.htm\n- Mothers against decapentaplegic - \"it was found that a mutation in the gene in\n  the mother repressed the gene decapentaplegic in the embryo. 
The phrase\n  \"Mothers against\" was added as a humorous take-off\"\n  https://en.wikipedia.org/wiki/Mothers_against_decapentaplegic\n- Saxophone (sax) - http://www.sdbonline.org/sites/fly/gene/saxophon.htm\n- Beethoven (btv) - http://www.uniprot.org/uniprot/Q0E8P6\n- Superman+kryptonite - https://en.wikipedia.org/wiki/Superman_(gene)\n- Supervillin (SVIL) - https://www.uniprot.org/uniprot/O95425\n- Wishful thinking (wit) - https://www.wikigenes.org/e/gene/e/44096.html\n- Doublesex (dsx) - \"The gene is expressed in both male and female flies and is\n  subject to alternative splicing, producing the protein isoforms dsx_f in\n  females and the longer dsx_m in males.\"\n  https://en.wikipedia.org/wiki/Doublesex\n- Fruitless (fru) - \"Early work refers to the gene as fruity, an apparent pun on\n  both the common name of D. melanogaster, the fruit fly, as well as a slang\n  word for homosexual. As social attitudes towards homosexuality changed, fruity\n  came to be regarded as offensive, or at best, not politically correct. 
Thus,\n  the gene was re-dubbed fruitless, alluding to the lack of offspring produced\n  by flies with the mutation. However, despite the original name and a\n  continuing history of misleading inferences by the popular media, fruitless\n  mutants primarily show defects in male-female courtship, though certain\n  mutants cause male-male or female-female courtship.\"\n  https://en.wikipedia.org/wiki/Fruitless_(gene)\n- Transformer (tra) - https://en.wikipedia.org/wiki/Transformer_(gene)\n- Gypsy+Flamenco - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1206375/ also\n  described in wiki\n  https://en.wikipedia.org/wiki/Piwi-interacting_RNA#History_and_loci\n- Jockey - http://flybase.org/reports/FBgn0015952.html\n- Tigger - https://www.omim.org/entry/612972\n- Nanog - Celtic legend\n  https://www.sciencedaily.com/releases/2003/06/030602024530.htm (source\n  https://twitter.com/EpgntxEinstein/status/1057359656220348417)\n- Jerky (jrk) - \"A deficit in the Jerky protein in mice causes recurrent\n  seizures\" https://www.genecards.org/cgi-bin/carddisp.pl?gene=JRK\n- Hippo (Hpo) - https://www.wikigenes.org/e/gene/e/37247.html\n- Dishevelled (Dsh) - https://en.wikipedia.org/wiki/Dishevelled\n- Glass bottom boat (gbb) - \"fruit fly larvae with a faulty glass bottom boat\n  gene are transparent\"\n  https://www.thenakedscientists.com/articles/interviews/gene-month-glass-bottom-boat\n  http://www.sdbonline.org/sites/fly/dbzhnsky/60a-1.htm\n- Makes caterpillars floppy (mcf) - https://www.pnas.org/content/99/16/10742\n  (source https://twitter.com/JUNIUS_64/status/1081007886560608256)\n- Eyeless http://flybase.org/reports/FBgn0005558.html\n- Straightjacket (stj) - http://flybase.org/reports/FBgn0261041.html\n- Huluwa http://science.sciencemag.org/content/362/6417/eaat1045 ref\n  https://twitter.com/zhouwanding/status/1065960714978897921\n- frameshifts or pseudogene? 
- check sequence -\n  https://www.ncbi.nlm.nih.gov/gene/?term=24562233%5Buid%5D\n- Bad response to refrigeration (brr)\n  https://twitter.com/hitenmadhani/status/1149471071675924481?s=20\n- Mindbomb (mib1) - https://www.sdbonline.org/sites/fly/hjmuller/mindbomb1.htm\n- β'COP http://flybase.org/reports/FBgn0025724.html\n  (https://twitter.com/DarrenObbard/status/1260613447198412800)\n- King-tubby https://www.uniprot.org/uniprot/B0XFQ9 see also tubby\n  https://www.uniprot.org/uniprot/P50586\n- fucK https://www.uniprot.org/uniprot/?query=fuck\u0026sort=score\n- Halloween genes https://en.wikipedia.org/wiki/Halloween_genes\n- VANDAL21\n  https://www.arabidopsis.org/servlets/TairObject?type=transposon_family\u0026id=139\n- HotDog domain - superfamily of genes/proteins\n  https://www.wikidata.org/wiki/Q24785143\n  https://www.ebi.ac.uk/interpro/entry/IPR029069\n- Flower/fwe - https://flybase.org/reports/FBgn0261722.html\n- Brahma https://www.sdbonline.org/sites/fly/polycomb/brahma.htm\n- Pokemon gene - \"The Pokémon Company threatened MSKCC with legal action in\n  December 2005 for creating an association between cancer and the media\n  franchise, and as a consequence MSKCC is now referring to it by its gene name\n  Zbtb7\" - Pokemon/pikachu/zubat (story\n  https://bsky.app/profile/c0nc0rdance.bsky.social/post/3k6w3gwtell2j)\n- Bring lots of money (blom7α)\n  https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2781463/\n  https://www.uniprot.org/uniprotkb/Q7Z7F0/entry\n- MAGOH - Drosophila flies produce unfit progeny when they have mutations in\n  their mago nashi (Japanese: 孫なし, Hepburn: mago nashi,\n  lit. 'grandchildless') gene. 
The progeny have defects in germplasm assembly\n  and germline development https://www.uniprot.org/uniprotkb/P61326/entry\n- IGL@ - a locus containing many immunoglobulin genes, but why the @ sign?\n  https://en.wikipedia.org/wiki/IGL@\n- Spooky toxin - https://en.wikipedia.org/wiki/Ssm_spooky_toxin\n  (https://twitter.com/depthsofwiki/status/1712555421918245242)\n- Always early (aly) - http://flybase.org/reports/FBgn0004372.html\n- Lonely guy (LOG) - https://onlinelibrary.wiley.com/doi/full/10.1111/pbi.13783\n- PKZILLA (very large gene) -\n  https://www-science-org.libproxy.berkeley.edu/doi/10.1126/science.ado3290\n- Dachshund (dac) \"plays a role in leg development\" (in flies)\n  https://en.wikipedia.org/wiki/Dachshund_(gene)\n- Blanks (\"Loss of Blanks causes complete male sterility\")\n  https://www.pnas.org/doi/10.1073/pnas.1009781108\n- LUMP (and, with a P-element insertion, p-lump)\n  https://pmc.ncbi.nlm.nih.gov/articles/PMC3166160/\n- loquacious\n  https://www.ncbi.nlm.nih.gov/gene?Db=gene\u0026Cmd=DetailsSearch\u0026Term=34751\n- TOPLESS https://pmc.ncbi.nlm.nih.gov/articles/PMC2643930/\n\n### Allele names\n\nSometimes it is not the gene, but the allele that is named\n\n- Bad hair day http://www.informatics.jax.org/allele/MGI:3764934\n- Samba, chacha, bossa nova http://www.informatics.jax.org/allele/MGI:3708457\n- Yoda http://www.informatics.jax.org/allele/MGI:3797584\n\nRef https://twitter.com/hmdc_mgi/status/1242893531779391496\n\n## More reading\n\nGreat illustrations of interesting biology, including information about gene\nnames https://twitter.com/vividbiology\n\nMany of the stories behind fly gene nomenclature are available at\nhttps://web.archive.org/web/20110716201703/http://www.flynome.com/cgi-bin/search?source=browse\nincluding the famous ForRentApartments dot com gene (just kidding but lol\nhttps://web.archive.org/web/20110716202150/http://www.flynome.com/cgi-bin/search?storyID=180)\n\nMusing article: \"What is in a (gene) 
name?\"\nhttps://web.archive.org/web/20180731060319/https://blogs.plos.org/toothandclaw/2012/06/17/whats-in-a-gene-name/\n\n## Send PRs for more things!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdcolin%2Foddgenes","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcmdcolin%2Foddgenes","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcmdcolin%2Foddgenes/lists"}