{"id":13706034,"url":"https://github.com/OpenKaito/openkaito","last_synced_at":"2025-05-05T19:34:20.293Z","repository":{"id":224911165,"uuid":"757821696","full_name":"OpenKaito/openkaito","owner":"OpenKaito","description":null,"archived":false,"fork":false,"pushed_at":"2024-04-12T06:22:17.000Z","size":8497,"stargazers_count":22,"open_issues_count":2,"forks_count":9,"subscribers_count":4,"default_branch":"main","last_synced_at":"2024-04-12T15:12:58.834Z","etag":null,"topics":["bittensor","decentralized-ai","indexing","search"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenKaito.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2024-02-15T03:47:59.000Z","updated_at":"2024-04-15T09:00:58.197Z","dependencies_parsed_at":"2024-03-11T13:46:45.534Z","dependency_job_id":"5d6650cd-16bd-4a2d-91b2-d8d44216f1b2","html_url":"https://github.com/OpenKaito/openkaito","commit_stats":null,"previous_names":["metasearch-io/decentralized-search","openkaito/subnet-otika","openkaito/openkaito"],"tags_count":0,"template":false,"template_full_name":"opentensor/bittensor-subnet-template","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenKaito%2Fopenkaito","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenKaito%2Fopenkaito/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenKaito%2Fopenkaito/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenKaito%2Fopenkaito/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/OpenKaito","download_url":"https://codeload.github.com/OpenKaito/openkaito/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252563153,"owners_count":21768413,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bittensor","decentralized-ai","indexing","search"],"created_at":"2024-08-02T22:00:51.558Z","updated_at":"2025-05-05T19:34:15.279Z","avatar_url":"https://github.com/OpenKaito.png","language":"Python","funding_links":[],"categories":["Registered Subnets"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n# **OpenKaito - Decentralized Kaito AI** \u003c!-- omit in toc --\u003e\n\n[![Discord Chat](https://img.shields.io/discord/308323056592486420.svg)](https://discord.gg/bittensor)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n---\n\n[Discord](https://discord.gg/bittensor) • [Network](https://taostats.io/) • [Research](https://bittensor.com/whitepaper)\n\u003c/div\u003e\n\n## Installation\n\n### Validator Installation\n\nPlease see [Validator Setup](https://github.com/MetaSearch-IO/decentralized-search/blob/main/quickstart.md#validator-setup) in the [quick start guide](https://github.com/MetaSearch-IO/decentralized-search/blob/main/quickstart.md).\n\n### Miner Installation\n\nPlease see [Miner Setup](https://github.com/MetaSearch-IO/decentralized-search/blob/main/quickstart.md#miner-setup) in the [quick start 
guide](https://github.com/MetaSearch-IO/decentralized-search/blob/main/quickstart.md).\n\n---\n\n\u003e There is a legacy version of the project focusing on decentralized indexing of various data sources; see [here](./docs/openkaito_v0_legacy.md) for more details.\n\n## Abstract\n\nBittensor Subnet 5's primary focus is the development of the world’s best-performing and most generalizable text embedding model.\n\nLeveraging an extensive Large Language Model (LLM)-augmented corpus for evaluation, miners are empowered to develop and deploy text-embedding models that surpass current state-of-the-art (SOTA) performance.\n\n## Objectives \u0026 Contributions\n\nThe primary objective of Subnet 5 is to train and serve the best and most generalizable text-embedding models. Such text-embedding models can power many downstream applications, such as semantic search and natural language understanding.\n\nMiners will be responsible for training models using an extensive corpus of textual data and serving the model in a low-latency and high-throughput way. These models will be utilized to generate high-quality embeddings for diverse text inputs.\n\nValidators will conduct rigorous evaluations of the models using multiple benchmarks. Performance comparisons will be made against existing SOTA text embedding models to ensure continuous improvement and competitiveness.\n\nSubnet users will gain access to cutting-edge text embedding models that are highly general and exceed SOTA performance. 
These models will be made publicly available through the validator API of Bittensor Subnet 5, facilitating widespread adoption and integration into various applications.\n\n## Incentive Mechanism\n\nMiners will receive a batch of texts and embed them.\n\nFor the text embeddings, validators have the pairwise relevance information to evaluate them via the contrastive learning loss:\n\n```math\n\\mathcal{L}_\\text{InfoNCE} = - \\mathbb{E} \\left[\\log \\frac{f(\\mathbf{x}, \\mathbf{c})}{\\sum_{\\mathbf{x}' \\in X} f(\\mathbf{x}', \\mathbf{c})} \\right]\n```\n\nwhere $f(x,c) = \\exp{(x \\cdot c)}$ is an estimate of $\\frac{p(x | c)}{p(x)}$, $c$ is the target embedding, $x$ is the positive sample, and $x'$ are the negative samples.\n\nMinimizing this loss maximizes the mutual information between positive pairs $x$ and $c$:\n\n$I(\\mathbf{x}; \\mathbf{c}) = \\sum_{\\mathbf{x}, \\mathbf{c}} p(\\mathbf{x}, \\mathbf{c}) \\log\\frac{p(\\mathbf{x}, \\mathbf{c})}{p(\\mathbf{x})p(\\mathbf{c})} = \\sum_{\\mathbf{x}, \\mathbf{c}} p(\\mathbf{x}, \\mathbf{c})\\log\\frac{p(\\mathbf{x}|\\mathbf{c})}{p(\\mathbf{x})}$\n\nand minimizes the mutual information between negative pairs $x'$ and $c$:  $I(\\mathbf{x'}; \\mathbf{c})$.\n\nOver time, processing time may also be taken into consideration to encourage faster embedding and lower latency.\n\n## Computing Requirements\n\nThere are no hard requirements for miners’ equipment, as long as they can serve their text-embedding model in a low-latency and high-throughput manner.\n\nTo achieve this, miners typically need the following infrastructure:\n\nModel Training:\n\n- Machines with GPUs for fast model training on large datasets\n\nModel Serving:\n\n- Dedicated model inference server\n\n## Subnet User Interface\n\nEventually, Subnet 5 will serve the text-embedding model via the subnet validator API.\n\nThe developer experience of using the Subnet 5 Embedding API will be similar to the OpenAI text-embedding API 
[https://platform.openai.com/docs/guides/embeddings/embedding-models](https://platform.openai.com/docs/guides/embeddings/embedding-models).\n\n## Development Roadmap\n\nV1:\n\n- The text-embedding model evaluation and incentive mechanism\n- Subnet dashboard with a model performance growth curve and a comparison against the OpenAI text-embedding-3-small and text-embedding-3-large models as baselines\n- Subnet API for serving the miner-trained models to subnet users\n\nV2 and beyond:\n\n- Extending the dataset\n- Extending the evaluation incentive model to tasks like document re-ranking\n- Incorporating the documents’ pairwise distance in the evaluation\n- …\n\n## Appendix - Backgrounds\n\n### Text Embedding Model\n\nText embedding models are fundamental to modern Natural Language Processing (NLP), representing words, phrases, or documents as dense vectors in a continuous space. These models have evolved significantly over time:\n\nClassic Approaches:\n\n- One-hot encoding and count-based methods (e.g., TF-IDF)\n- Limited in capturing semantic relationships\n\nWord Embeddings:\n\n- Based on distributional semantics\n- Key models: Word2Vec, GloVe, FastText\n- Capture word similarities and relationships\n\nSentence and Document Embeddings:\n\n- Extend word-level techniques to larger text units, using dynamic representations based on context\n- Examples: ELMo, BERT, GPT\n- Better at handling polysemy and context-dependent meanings\n\nApplications span various NLP tasks, including semantic similarity, machine translation, and sentiment analysis. Ongoing challenges include addressing bias and improving efficiency.\n\nThis evolution from simple representations to sophisticated contextual models has dramatically enhanced NLP capabilities, enabling a more nuanced understanding of language by machines.\n\n### Vector-based Semantic Search\n\nVector-based semantic search evolved from traditional keyword-based methods to address limitations in understanding context and meaning. 
It leverages advances in natural language processing and machine learning to represent text as dense vectors in a high-dimensional space.\n\nKey components of vector-based semantic search include:\n\n- Text embedding (e.g., Word2Vec, GloVe, BERT, GPT)\n- Efficient nearest-neighbor search algorithms (e.g., indexing vectors using HNSW)\n\nBy indexing documents with their embeddings, it is possible to:\n\n- Capture semantic relationships between words and concepts\n- Improve handling of synonyms and related terms\n- Provide more intuitive and context-aware search experiences\n\nVector-based semantic search has significantly enhanced information retrieval across various applications, offering more relevant results by understanding the intent behind queries rather than relying solely on exact keyword matches.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenKaito%2Fopenkaito","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenKaito%2Fopenkaito","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenKaito%2Fopenkaito/lists"}