{"id":18584508,"url":"https://github.com/bertrand31/cagire","last_synced_at":"2025-05-16T05:33:17.777Z","repository":{"id":170071142,"uuid":"236775440","full_name":"Bertrand31/Cagire","owner":"Bertrand31","description":"🔍 An experimental search engine supporting real-time partial-match plaintext search","archived":false,"fork":false,"pushed_at":"2022-05-19T15:46:47.000Z","size":26593,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-02-17T16:52:04.662Z","etag":null,"topics":["data-structures","functional-programming","inverted-index","scala","search-engine","trie"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Bertrand31.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-28T16:00:34.000Z","updated_at":"2022-05-19T12:54:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"0dafe0a6-89c1-4bea-9263-9d78657f3a13","html_url":"https://github.com/Bertrand31/Cagire","commit_stats":null,"previous_names":["bertrand31/cagire"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bertrand31%2FCagire","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bertrand31%2FCagire/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bertrand31%2FCagire/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Bertrand31%2FCagire/manifests","owner
_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Bertrand31","download_url":"https://codeload.github.com/Bertrand31/Cagire/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254473983,"owners_count":22077202,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-structures","functional-programming","inverted-index","scala","search-engine","trie"],"created_at":"2024-11-07T00:27:45.577Z","updated_at":"2025-05-16T05:33:17.752Z","avatar_url":"https://github.com/Bertrand31.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cagire\n\nThis project aims to create a backend for a full-text search service with autocomplete and real-time\nresults.\n\nThrough the use of a custom variation of a trie, it aims to search through thousands of documents\nin a few milliseconds.\n\nIt supports two types of queries: the search of a whole word, which returns the matches for this\nexact word, and the search of a partial word, or \"prefix\" (used to provide results as the user is\ntyping).\n\nThis \"custom trie\" works as follows: each leaf marking the end of a word also contains a\nmap of all the matches across all documents for that given word.\n\nWhen we're searching for a prefix, we descend the trie along the prefix's characters, and then take\nall the maps from all the leaves below that point. 
We then concatenate them.\n\nThe search through that trie is very fast. However, since the API returns the whole lines where the\nmatches were found, we need to pull all those lines from the actual files on the disk (since we\ndon't want to keep all the data in memory). That part is the slowest because of the disk accesses,\nand it gets extremely slow when we're dealing with files of a few million lines.\n\nTo improve this, the ingested files are split into small chunks of 10,000 lines. This way, we\nlater have to load far less data from the disk, since we only open the chunks that contain matches.\n\nOn my laptop (with an _Intel Core i7-1065G7 CPU @ 1.30GHz_), it'll search through 31 million\nwords and return all the partial matches in \u003c30ms.\nIt will search through the same amount of data and return exact word matches in \u003c10ms.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertrand31%2Fcagire","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbertrand31%2Fcagire","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertrand31%2Fcagire/lists"}