{"id":17317923,"url":"https://github.com/marty1885/tlgs","last_synced_at":"2025-04-14T13:32:27.613Z","repository":{"id":45935828,"uuid":"427417287","full_name":"marty1885/tlgs","owner":"marty1885","description":"\"Totally Legit\" Gemini Search - Open source search engine for the Gemini protocol","archived":false,"fork":false,"pushed_at":"2024-10-11T05:59:58.000Z","size":447,"stargazers_count":21,"open_issues_count":1,"forks_count":2,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-01T04:51:42.343Z","etag":null,"topics":["drogon","gemini","gemini-protocol","indexer","search-engine"],"latest_commit_sha":null,"homepage":"https://tlgs.one","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/marty1885.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-12T15:59:31.000Z","updated_at":"2024-10-11T06:00:01.000Z","dependencies_parsed_at":"2024-07-27T05:48:49.561Z","dependency_job_id":null,"html_url":"https://github.com/marty1885/tlgs","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marty1885%2Ftlgs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marty1885%2Ftlgs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marty1885%2Ftlgs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/marty1885%2Ftlgs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/marty1885","down
load_url":"https://codeload.github.com/marty1885/tlgs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223634176,"owners_count":17176879,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["drogon","gemini","gemini-protocol","indexer","search-engine"],"created_at":"2024-10-15T13:18:15.119Z","updated_at":"2024-11-08T05:01:31.130Z","avatar_url":"https://github.com/marty1885.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TLGS - Totally Legit Gemini Search\n\n## Overview\n\nTLGS is a search engine for Gemini. It's slightly overengineered for what it currently is and uses weird tech. And I'm proud of that. The current code base is kinda messy - I promise to clean it up. The main features/characteristics are as follows:\n\n* Uses state-of-the-art C++20\n* Parses and indexes textual content on Geminispace\n* Highly concurrent and asynchronous\n* Stores the index in PostgreSQL\n* Developed for Linux, but should work on Windows, OpenBSD, HaikuOS, macOS, etc.\n* Only fetches headers for files it can't index, to save bandwidth and time\n* Handles all kinds of source encoding\n* Link analysis using the SALSA algorithm\n\nAs of now, indexing of news sites, RFCs, and documentation is mostly disabled. 
But they will likely be enabled once I have the means and resources to scale the setup.\n\n## Using this project\n\n### Requirements\n\n* [drogon](https://github.com/an-tao/drogon)\n* [nlohmann-json](https://github.com/nlohmann/json)\n* [CLI11](https://github.com/CLIUtils/CLI11)\n* [libfmt](https://github.com/fmtlib/fmt)\n* [TBB](https://github.com/oneapi-src/oneTBB)\n* [xxHash](https://github.com/Cyan4973/xxHash)\n* iconv\n* PostgreSQL\n\n### Building and running the project\n\nTo build the project, you'll need a fully C++20 capable compiler. The following compilers should work as of writing this README:\n\n* GCC \u003e= 11.2\n* MSVC \u003e= 16.25\n\nInstall all dependencies, then run the commands:\n\n```bash\nmkdir build\ncd build\ncmake ..\nmake -j\n```\n\n### Creating and maintaining the index\n\nTo create the initial index:\n\n1. Initialize the database `./tlgs/tlgs_ctl/tlgs_ctl ../tlgs/config.json populate_schema`\n2. Place the seed URLs into `seeds.text`\n3. In the build folder, run `./tlgs/crawler/tlgs_crawler -s seeds.text -c 4 ../tlgs/config.json`\n\nNow the crawler will start crawling the geminispace while also updating outdated indices (if any). To update an existing index, run:\n\n```bash\n./tlgs/crawler/tlgs_crawler -c 2 ../tlgs/config.json\n# -c is the maximum concurrent connections the crawler will make\n```\n\n**NOTE:** TLGS's crawler is distributable. You can run multiple instances in parallel. But some instances may drop out early towards the end of crawling. 
Though it does not affect the result of crawling.\n\n### Running the capsule\n\n```bash\nopenssl req -new -subj \"/CN=my.host.name.space\" -x509 -newkey ec -pkeyopt ec_paramgen_curve:prime256v1 -days 36500 -nodes -out cert.pem -keyout key.pem\ncd tlgs/server\n./tlgs_server ../../../tlgs/server_config.json\n```\n\n### Via systemd\n\n```bash\nsudo systemctl start tlgs_server\nsudo systemctl start tlgs_crawler\n```\n\n## Server config\n\nThe `custom_config.tlgs` section in `search_config.json` (installed at `/etc/tlgs/server_config.json`) contains configurations for the TLGS server. Besides the usual [Drogon config options](https://drogon.docsforge.com/master/configuration-file/), `custom_config` changes the properties of TLGS itself. Currently supported options are:\n\n### ranking_algo\nThe ranking algorithm TLGS uses to rank pages in search results. The ranking is then combined with the text match score to produce the final search rank. Currently supported values are `hits` and `salsa`, referring to the [HITS][hits] and [SALSA][salsa] ranking algorithms. It defaults to `salsa` if no value is provided.\n\nSALSA runs slightly faster than HITS for large search results. Both the [literature][najork2007comparing] and empirical experience suggest SALSA provides better ranking. Thus we switched from HITS to SALSA.\n\n```json\n\"ranking_algo\": \"salsa\"\n```\n\n## TODOs\n\n- [ ] Code cleanup\n  - [ ] I really need to centralize the crawling logic\n- [x] Randomize the order of crawling. Avoid bashing a single capsule\n  * Sort of.. by sampling the pages table with low percentage and increase later\n- [ ] Support parsing markdown\n- [ ] Try indexing news sites\n- [ ] Optimize the crawler even more\n  - [x] Checks hash before updating index\n  - [ ] Proper UTF-8 handling in ASCII art detection\n  - [x] Use a trie for blacklist URL match\n- [x] Link analysis using SALSA\n- [ ] BM25 for text scoring\n- [x] Deduplicate search results\n- [x] Implement filters\n- [ ] Proper(?) 
way to migrate schema\n\n[hits]: http://www.cs.cornell.edu/home/kleinber/auth.pdf\n[salsa]: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.5859\n[najork2007comparing]: https://www.ccs.neu.edu/home/vip/teach/IRcourse/4_webgraph/notes/najork05_HITS_vs_salsa.pdf","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarty1885%2Ftlgs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmarty1885%2Ftlgs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmarty1885%2Ftlgs/lists"}