{"id":18046647,"url":"https://github.com/ltla/gesel-feedstock","last_synced_at":"2025-04-05T04:25:10.733Z","repository":{"id":100163593,"uuid":"608512337","full_name":"LTLA/gesel-feedstock","owner":"LTLA","description":"Generate pre-built gesel indices for client-side gene set search.","archived":false,"fork":false,"pushed_at":"2023-08-29T18:17:19.000Z","size":54,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-10T12:29:38.900Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LTLA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-02T07:01:43.000Z","updated_at":"2024-04-10T13:09:00.000Z","dependencies_parsed_at":"2024-12-18T09:41:00.591Z","dependency_job_id":"ea423b10-3bdf-45d6-bb97-a87516abb51f","html_url":"https://github.com/LTLA/gesel-feedstock","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Fgesel-feedstock","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Fgesel-feedstock/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Fgesel-feedstock/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LTLA%2Fgesel-feedstock/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LTLA","download_url":"https://codeload.github.
com/LTLA/gesel-feedstock/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247288341,"owners_count":20914347,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-30T19:08:22.542Z","updated_at":"2025-04-05T04:25:10.710Z","avatar_url":"https://github.com/LTLA.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Build gene sets to feed gesel\n\nThis repository contains code to build the gene set database for **gesel**, the client-side gene set search interface.\nThe database files themselves are available on the [Releases page](https://github.com/LTLA/gesel-feedstock/releases);\nthese can be fetched by applications directly or via a CORS proxy.\n\n## Overview of database files\n\nEach species has a separate copy of the files described in this section.\nFiles from a particular species will be prefixed with that species' NCBI taxonomy ID, e.g., `9606_ensembl.tsv.gz`.\nFor brevity, we will omit the prefix in the rest of this section.\n\n### Gene mappings\n\nGenes are defined as abstract \"equivalence classes\" that can be associated with zero, one or multiple identifiers or symbols.\nEach equivalence class is defined as a component of the graph constructed from the relationships between Ensembl and Entrez identifiers.\n\n- `ensembl.tsv.gz` is a Gzip-compressed tab-separated file where each line corresponds to a gene equivalence class, and the fields are Ensembl identifiers associated with that gene.\n  An empty line indicates that the equivalence class contains 
no Ensembl IDs.\n- `entrez.tsv.gz` is a Gzip-compressed tab-separated file where each line corresponds to a gene equivalence class, and the fields are Entrez identifiers associated with that gene.\n  An empty line indicates that the equivalence class contains no Entrez IDs.\n- `symbol.tsv.gz` is a Gzip-compressed tab-separated file where each line corresponds to a gene equivalence class, and the fields are symbols associated with that gene.\n  An empty line indicates that the equivalence class contains no symbols.\n\nAll three files have the same number of lines, as they represent different aspects of the same underlying array of equivalence classes.\nEach gene's identity (i.e., the \"gene ID\") is defined as the 0-based index of the corresponding line in any of these files.\n\n### Collection details\n\n`collections.tsv.gz` is a Gzip-compressed tab-separated file where each line corresponds to a gene set collection and contains the following fields:\n\n- `title`: the title of the collection.\n- `description`: the description of the collection.\n- `species`: the species involved in the collection.\n- `maintainer`: the maintainer of the collection.\n- `source`: the source URL for the collection.\n- `number`: the number of gene sets in this collection.\n\nEach collection's identity (i.e., the \"collection ID\") is defined as the 0-based index of the corresponding line in `collections.tsv.gz`.\n\n`collections.tsv` is an uncompressed tab-separated file that has the same number of lines and order of collections as `collections.tsv.gz`.\nIt contains all fields in `collections.tsv.gz` except for `number`.\n\n`collections.tsv.ranges.gz` is a Gzip-compressed file where each line corresponds to a collection in `collections.tsv`.\nEach line contains the following fields:\n\n- `bytes`: the number of bytes taken up by the corresponding line in `collections.tsv` (excluding the newline).\n- `number`: the number of gene sets in this collection.\n\nApplications can either download 
`collections.tsv.gz` to obtain information about all collections,\nor they can download `collections.tsv.ranges.gz` and perform HTTP range requests on `collections.tsv` to obtain information about individual collections.\nThe former pays a higher up-front cost for easier batch processing.\nTo reduce the download size, we do not store `start` in `collections.tsv.gz`, as this can be computed easily on the client.\n\n### Set details\n\n`sets.tsv.gz` is a Gzip-compressed tab-separated file where each line corresponds to a gene set and contains the following fields:\n\n- `name`: the name of the set.\n- `description`: the description of the set.\n- `size`: the number of genes in the set.\n\nEach set's identity (i.e., the \"set ID\") is defined as the 0-based index of the corresponding line in `sets.tsv.gz`.\nSets from the same collection are always stored in consecutive lines, ordered by their position within that collection.\n\n`sets.tsv` is an uncompressed tab-separated file that has the same number of lines and order of sets as `sets.tsv.gz`.\nIt contains all fields in `sets.tsv.gz` except for `size`.\n\n`sets.tsv.ranges.gz` is a Gzip-compressed file where each line corresponds to a set in `sets.tsv`.\nEach line contains two tab-separated fields:\n\n- `bytes`: the number of bytes taken up by the corresponding line in `sets.tsv` (excluding the newline).\n- `size`: the number of genes in the set.\n\nApplications can either download `sets.tsv.gz` to obtain information about all sets,\nor they can download `sets.tsv.ranges.gz` and perform HTTP range requests on `sets.tsv` to obtain information about individual sets.\nThe former pays a higher up-front cost for easier batch processing.\nTo reduce the download size, we do not store `collection` and `position` in `sets.tsv.gz`, as these can be computed easily on the client.
\n\n### Mappings between sets and genes\n\n`set2gene.tsv` is a tab-separated file where each line corresponds to a gene set in the same order as `sets.tsv.gz`.\nOn each line, the first field contains the gene ID of the first gene in the set.\nAll subsequent fields contain increments from the preceding ID, i.e., computing the cumulative sum across all fields yields the array of gene IDs for this set.\n\n`set2gene.tsv.ranges.gz` is a Gzip-compressed file where each line corresponds to a set in `set2gene.tsv`.\nEach line contains an integer specifying the number of bytes taken up by the corresponding line in `set2gene.tsv` (excluding the newline).\nThis can be used for HTTP range requests to obtain the composition of each set.\n\n`gene2set.tsv` is a tab-separated file where each line corresponds to a gene in the same order as `symbol.tsv.gz`.\nOn each line, the first field contains the set ID of the first set containing that gene.\nAll subsequent fields contain increments from the preceding ID, i.e., computing the cumulative sum across all fields yields the array of IDs of sets containing this gene.\n\n`gene2set.tsv.ranges.gz` is a Gzip-compressed file where each line corresponds to a gene in `gene2set.tsv`.\nEach line contains an integer specifying the number of bytes taken up by the corresponding line in `gene2set.tsv` (excluding the newline).\nThis can be used for HTTP range requests to obtain the identities of the sets containing a particular gene.\n\n`set2gene.tsv.gz` is a Gzip-compressed version of `set2gene.tsv`.\nSimilarly, `gene2set.tsv.gz` is a Gzip-compressed version of `gene2set.tsv`.\nApplications can either download these `*.tsv.gz` files to obtain all relationships up-front,\nor they can download `*.ranges.gz` and perform HTTP range requests on the corresponding `*.tsv` to obtain each individual relationship.\n\n### Text search tokens\n\n`tokens-names.tsv` is a tab-separated file where each line corresponds to a token.\nOn each line, the first field 
contains the set ID of the first set where the name contains the corresponding token.\nAll subsequent fields contain increments from the preceding ID, i.e., computing the cumulative sum across all fields yields the array of IDs of sets that contain this token in their names.\n\n`tokens-descriptions.tsv` is a tab-separated file where each line corresponds to a token.\nOn each line, the first field contains the set ID of the first set where the description contains the corresponding token.\nAll subsequent fields contain increments from the preceding ID, i.e., computing the cumulative sum across all fields yields the array of IDs of sets that contain this token in their descriptions.\n\n`tokens-names.tsv.ranges.gz` is a Gzip-compressed file where each line corresponds to a token in `tokens-names.tsv`.\nEach line contains:\n\n- `token`: a token string.\n- `number`: an integer specifying the number of bytes taken up by the corresponding line in `tokens-names.tsv` (excluding the newline).\n\nThe same logic applies to `tokens-descriptions.tsv.ranges.gz` for `tokens-descriptions.tsv`.\nThis can be used for HTTP range requests to obtain the identities of the sets that match tokens in their names or descriptions.\n\nThe tokenization strategy is very simple: every contiguous stretch of ASCII alphanumeric characters or dashes (`-`) is treated as a token.\nQuery strings should be processed in the same manner to generate tokens for matching against `token`.\nHandling of `?` or `*` wildcards is at the discretion of the client implementation.\n\n`tokens-names.tsv.gz` is a Gzip-compressed version of `tokens-names.tsv`.\nSimilarly, `tokens-descriptions.tsv.gz` is a Gzip-compressed version of `tokens-descriptions.tsv`.\nApplications can either download these `*.tsv.gz` files to obtain all relationships up-front,\nor they can download `*.ranges.gz` and perform HTTP range requests on the corresponding `*.tsv` to obtain each individual relationship.\n\n### Embeddings\n\n`tsne.tsv.gz` is a 
Gzip-compressed tab-separated file where each line corresponds to a gene set in the same order as `sets.tsv.gz`.\nEach line contains two tab-separated floating-point values representing the coordinates of the set in a 2-dimensional t-SNE plot.\n\n`tsne.png` is a PNG file containing an image of the embedding.\nThis is only provided for diagnostic purposes.\n\n## Contributing gene sets\n\nMake a [pull request](https://github.com/LTLA/gesel-feedstock/pulls) that adds an entry to [`manifest.json`](manifest.json) pointing to a GMT file of your choice.\nEach entry in the array represents a collection with the following metadata:\n\n- `title`: the title of the collection.\n  This should not contain tabs or newlines.\n- `description`: the description of the collection.\n  This should not contain tabs or newlines.\n- `species`: the NCBI taxonomy ID for the species.\n- `maintainer`: the name of the maintainer of the collection.\n- `source`: the source of the collection.\n  This may reference an article or the code used to generate the collection, and is intended for human readers.\n- `id`: the type of identifier.\n  This should be one of `\"entrez\"`, `\"ensembl\"` or `\"symbol\"`; the first two are more reliable and preferred.\n- `url`: the URL to the collection's GMT file.\n  This should be downloadable.\n  The GMT file should use Ensembl identifiers for all genes.\n\n## Rebuilding the indices\n\nThe [`define_genes.R`](define_genes.R) script will compile the equivalence classes for each species.\nThis should be run first, and its output uploaded to a GitHub Release under a `genes-vX.Y.Z` tag.\n\nThe [`build_index.R`](build_index.R) script will scan the manifest, download the GMT files and compile them into most of the files described above.\nThe output should be uploaded to a GitHub Release under an `indices-vA.B.C` tag.\n\nThe [`create_embedding.R`](create_embedding.R) script will examine the `set2gene.tsv` file and use this to perform neighbor search for t-SNE.\nThe resulting files should be added to the 
`indices-vA.B.C` release.\n\nThe [`versions.json`](versions.json) file specifies the current versions of all the gene- and index-related resources.\nThis is used to coordinate versions across different scripts.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fltla%2Fgesel-feedstock","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fltla%2Fgesel-feedstock","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fltla%2Fgesel-feedstock/lists"}