{"id":15320353,"url":"https://github.com/cocainecong/tangseng","last_synced_at":"2025-04-05T02:04:34.567Z","repository":{"id":167829095,"uuid":"643172164","full_name":"CocaineCong/tangseng","owner":"CocaineCong","description":"Tangseng search engine including full text search and vector search base on golang. 基于go语言的搜索引擎，信息检索系统","archived":false,"fork":false,"pushed_at":"2025-01-12T10:38:36.000Z","size":6417,"stargazers_count":122,"open_issues_count":4,"forks_count":35,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-29T01:03:55.679Z","etag":null,"topics":["boltdb","distributed-systems","dockcer-compose","docker","etcd","full-text-search","gin","grpc","inverted-index","kafka","losertree","lsm-tree","mapreduce","search-engine","segment","vector-search"],"latest_commit_sha":null,"homepage":"https://cocainecong.github.io/tangseng/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"Go-SearchEngine/Go-SearchEngine","license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CocaineCong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING_CN.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-20T10:21:05.000Z","updated_at":"2025-03-25T16:40:08.000Z","dependencies_parsed_at":"2023-10-25T02:25:16.564Z","dependency_job_id":"81798ef4-e888-4b9b-9678-35513d0a5678","html_url":"https://github.com/CocaineCong/tangseng","commit_stats":{"total_commits":283,"total_committers":9,"mean_commits":"31.444444444444443","dds":"0.20848056537102477","last_synced_commit":"0bb2ddf8f967455740328c168ea43aaa1f741adf"},"previous_names":["cocainecong/go-searchengine","cocainecong/tangseng"],"tags_count":5,"template":fal
se,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CocaineCong%2Ftangseng","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CocaineCong%2Ftangseng/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CocaineCong%2Ftangseng/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CocaineCong%2Ftangseng/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CocaineCong","download_url":"https://codeload.github.com/CocaineCong/tangseng/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247276163,"owners_count":20912288,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boltdb","distributed-systems","dockcer-compose","docker","etcd","full-text-search","gin","grpc","inverted-index","kafka","losertree","lsm-tree","mapreduce","search-engine","segment","vector-search"],"created_at":"2024-10-01T09:08:17.275Z","updated_at":"2025-04-05T02:04:34.547Z","avatar_url":"https://github.com/CocaineCong.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Tangseng: A Search Engine Written in Go\n\n**[Full project documentation is available here](https://cocainecong.github.io/tangseng/#/)**\n\n## Architecture \u0026 Features\n\n1. gin serves as the HTTP framework, gRPC as the RPC framework, and etcd handles service discovery.\n2. The system is split into a `user module`, `favorites module`, `index platform`, `search engine (text module)`, and `search engine (image module)`.\n3. A distributed crawler fetches data and sends it to a Kafka cluster, where it is consumed and persisted to the database. (The crawler has not been written yet, but that does not stop me from promising it...)\n4. Text search in the search engine module runs as its own service: it stores the index in boltdb, uses MapReduce to speed up index construction, and stores the index as a `roaring bitmap`.\n5. A trie tree provides query suggestions (an algorithmic model to assist suggestions is planned).\n6. 
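Image search vectorizes queries with ResNet50 and looks them up in a Milvus or Faiss vector database (work has started... deep learning is really hard...).\n7. Multi-channel recall is supported: Go performs inverted-index recall, Python performs vector recall, and the two are connected over gRPC and merged.\n8. Ranking supports TF-IDF, BM25, and other algorithms.\n\n![Overall project architecture](docs/images/tangseng.png)\n\n## 🧑🏻‍💻 Frontend\n\nAll in React, still in development.\n\n[react-tangseng](https://github.com/CocaineCong/react-tangseng)\n\n## Roadmap\n\n### Architecture\n\n- [ ] Add degradation and circuit breaking\n- [ ] Add Jaeger for end-to-end tracing (from Go through to Python)\n- [ ] Add SkyWalking or Prometheus for monitoring\n- [ ] Extract the DAO init and fetch the relevant database instance by key\n- [ ] Separate hot and cold data (following the Elasticsearch approach; the key is the criterion for hot vs. cold, which could perhaps live in middleware?)\n- [ ] MySQL is currently enough to store the forward index, but we may later move straight to OLAP; StarRocks can query a single table with hundreds of millions of rows in milliseconds, whereas MySQL at that scale would long since have needed sharding..\n\n### Features\n\n- [x] Index building was too slow; add concurrency to index construction\n- [ ] Index compression: change the inverted index table to store offsets and use mmap\n- [x] Work out relevance scoring (TF-IDF, BM25)\n- [x] Store suggestion data in a prefix tree\n- [ ] Compress the prefix tree with Huffman coding\n- [ ] When building the index, pass a file stream instead of a file path\n- [ ] Add a BERT model in Python to recommend terms for tokenization, exposed through a gRPC interface\n- [ ] Support consistent-hash sharding for inverted index and trie tree storage\n- [x] Word vectors\n- [ ] PageRank\n- [ ] Separate the build and recall phases of the trie tree\n- [x] Add the IK analyzer for tokenization\n- [x] Build the index platform: separate compute from storage, and index construction from recall\n- [ ] Union, intersection, and difference operations (bitwise)\n- [ ] Pagination\n- [x] Sorting\n- [ ] Correct the input query, e.g. \"陆加嘴\" --\u003e \"陆家嘴\"\n- [x] Suggest completions for the input term, e.g. \"东方明\" suggests --\u003e \"东方明珠\"\n- [x] Indexing is currently block-based; explore switching to distributed MapReduce for index construction (6.824 lab1)\n- [ ] Add dynamic indexing on top of the previous item (not yet sure the previous item is achievable...)\n- [x] Rework the inverted index to store doc IDs in a roaring bitmap (so hard)\n- [ ] Implement the TF class\n- [x] Fix: running one search right after another returned the union with the previous results because the cache was not cleared\n- [x] Optimize the ranker\n\n![Text search](docs/images/text2text.jpg)\n\n## Quick Start\n   Bring up the environment:\n\n   ```shell\n   make env-up\n   ```\n\nA small sample dataset is in `source_data/movies_data.csv`.\n\n### Starting the Python services\n\n1. Make sure Python is installed and its version is \u003e= 3.9 (I use 3.10.2).\n\n    ```shell\n    python --version\n    ```\n\n2. Create a venv environment\n\n    ```shell\n    python -m venv venv\n    ```\n\n3. Activate the venv Python environment\n    \n   macOS:\n\n    ```shell\n    source venv/bin/activate\n    ```\n\n    Windows:\n\n    Not supported yet; I will add it once I have cleared out my C: drive... it has not been run on Windows so far...\n\n4. Install the third-party dependencies\n\n   ```shell\n   pip install -r requirements.txt\n   ```\n5. Start the main program\n   ```shell\n   sh python-start.sh\n   ```\n6. 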
Run the script that builds the index\n    ```shell\n    make python-consume\n    ```\n\n### Starting the Go services\n\nGo 1.16 or later is sufficient. I use Go 1.18.6.\n\n1. Download the third-party dependencies\n\n    ```shell\n    go mod tidy\n    ```\n\n2. From the project directory, run\n\n    ```shell\n    # Replace xxx with a service name (user, favorite, ...):\n    make run-xxx\n    # e.g.:\n    # make run-user\n    # make run-favorite\n    # See the Makefile for the full list of targets\n    ```\n\n## Contributing\n\nBefore submitting a PR, please read `CONTRIBUTING_CN.md`.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcocainecong%2Ftangseng","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcocainecong%2Ftangseng","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcocainecong%2Ftangseng/lists"}