{"id":24978363,"url":"https://github.com/lh0x00/embs","last_synced_at":"2026-05-08T13:41:24.009Z","repository":{"id":274426973,"uuid":"922860557","full_name":"lh0x00/embs","owner":"lh0x00","description":"embs is a Python toolkit for retrieving documents (via Docsifer), generating embeddings (via Lightweight Embeddings API), and ranking texts with an optional caching system.","archived":false,"fork":false,"pushed_at":"2025-02-02T09:57:06.000Z","size":115,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-03T23:52:13.860Z","etag":null,"topics":["docsifer","document-retrieval","embeddings","embs","markitdown","openai","rag","ranking"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/embs","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh0x00.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-27T08:05:27.000Z","updated_at":"2025-02-02T09:57:10.000Z","dependencies_parsed_at":"2025-01-27T09:32:59.624Z","dependency_job_id":null,"html_url":"https://github.com/lh0x00/embs","commit_stats":null,"previous_names":["lh0x00/embs"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh0x00%2Fembs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh0x00%2Fembs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh0x00%2Fembs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/lh0x00%2Fembs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh0x00","download_url":"https://codeload.github.com/lh0x00/embs/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246162148,"owners_count":20733357,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["docsifer","document-retrieval","embeddings","embs","markitdown","openai","rag","ranking"],"created_at":"2025-02-03T23:52:20.482Z","updated_at":"2026-05-08T13:41:18.966Z","avatar_url":"https://github.com/lh0x00.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# embs\n\n[![PyPI](https://img.shields.io/pypi/v/embs.svg?style=flat-square)](https://pypi.org/project/embs/)\n[![License](https://img.shields.io/pypi/l/embs.svg?style=flat-square)](https://pypi.org/project/embs/)\n[![Downloads](https://img.shields.io/pypi/dm/embs.svg?style=flat-square)](https://pypi.org/project/embs/)\n\n**embs** is a powerful Python library for **document retrieval, embedding, and ranking**, making it easier to build **Retrieval-Augmented Generation (RAG) systems**, **chatbots**, and **semantic search engines**.\n\n## Why Choose embs?\n\n- **Web \u0026 Local Document Search**:\n\n  - DuckDuckGo-powered **web search** retrieves and ranks relevant documents.\n  - Supports **PDFs, Word, HTML, Markdown**, and more.\n\n- **Optimized for RAG, Chatbots \u0026 Multilingual Search**:\n\n  - **Automatic document chunking (Splitter) for improved retrieval accuracy.**\n  - Rank documents **by relevance to a query**.\n  
- **Strong multilingual model support** for global applications.\n    ✅ Supported multilingual models:\n    - `snowflake-arctic-embed-l-v2.0`\n    - `bge-m3`\n    - `gte-multilingual-base`\n    - `paraphrase-multilingual-MiniLM-L12-v2`\n    - `paraphrase-multilingual-mpnet-base-v2`\n    - `multilingual-e5-small`\n    - `multilingual-e5-base`\n    - `multilingual-e5-large`\n\n- **Fast \u0026 Efficient**:\n\n  - **Cache support (in-memory \u0026 disk)** for faster queries.\n  - **Flexible batch embedding with cache optimization**.\n\n- **Scalable \u0026 Customizable**:\n\n  - Works with **synchronous \u0026 asynchronous processing**.\n  - Supports **custom splitting rules**.\n\n## 🚀 Installation\n\nInstall via pip:\n\n```bash\npip install embs\n```\n\nFor Poetry users:\n\n```toml\n[tool.poetry.dependencies]\nembs = \"^0.1.8\"\n```\n\n## 📖 Quick Start Guide\n\n### 1️⃣ Searching Documents via DuckDuckGo (Recommended!)\n\nRetrieve **relevant web pages**, **convert them to Markdown**, and **rank them using embeddings**.\n\n\u003e **🚀 Always use a splitter!**  \n\u003e Improves ranking, reduces redundancy, and ensures better retrieval.\n\n```python\nimport asyncio\nfrom functools import partial\nfrom embs import Embs\n\n# Configure a Markdown-based splitter\nsplit_config = {\n    \"headers_to_split_on\": [(\"#\", \"h1\"), (\"##\", \"h2\"), (\"###\", \"h3\")],\n    \"return_each_line\": True,\n    \"strip_headers\": True,\n    \"split_on_double_newline\": True,\n}\nmd_splitter = partial(Embs.markdown_splitter, config=split_config)\n\nclient = Embs()\n\nasync def run_search():\n    results = await client.search_documents_async(\n        query=\"Latest AI research\",\n        limit=3,\n        blocklist=[\"youtube.com\"],  # Exclude unwanted domains\n        splitter=md_splitter,  # Enable smart chunking\n    )\n    for item in results:\n        print(f\"File: {item['filename']} | Score: {item['similarity']:.4f}\")\n        print(f\"Snippet: 
{item['markdown'][:80]}...\\n\")\n\nasyncio.run(run_search())\n```\n\nFor **synchronous usage**:\n\n```python\nresults = client.search_documents(\n    query=\"Latest AI research\",\n    limit=3,\n    blocklist=[\"youtube.com\"],\n    splitter=md_splitter,  # Always use a splitter\n    model=\"snowflake-arctic-embed-l-v2.0\",\n)\nfor item in results:\n    print(f\"File: {item['filename']} | Score: {item['similarity']:.4f}\")\n```\n\n### 2️⃣ Multilingual Document Querying (Local \u0026 Online)\n\nRetrieve and **rank multilingual documents from local files or URLs**.\n\n```python\nasync def run_query():\n    docs = await client.query_documents_async(\n        query=\"Explique la mécanique quantique\",  # French query\n        files=[\"/path/to/quantum_theory.pdf\"],\n        urls=[\"https://example.com/quantum.html\"],\n        splitter=md_splitter,  # Chunking for better retrieval\n    )\n    for d in docs:\n        print(f\"{d['filename']} =\u003e Score: {d['similarity']:.4f}\")\n        print(f\"Snippet: {d['markdown'][:80]}...\\n\")\n\nasyncio.run(run_query())\n```\n\nFor **synchronous usage**:\n\n```python\ndocs = client.query_documents(\n    query=\"Explique la mécanique quantique\",\n    files=[\"/path/to/quantum_theory.pdf\"],\n    splitter=md_splitter,\n)\nfor d in docs:\n    print(d[\"filename\"], \"=\u003e Score:\", d[\"similarity\"])\n```\n\n💡 **Perfect for multilingual retrieval!** Whether you're searching documents in English, French, Spanish, German, or other supported languages, `embs` ensures optimal ranking and retrieval.\n\n## ⚡ Caching for Performance\n\nEnable **in-memory** or **disk caching** to speed up repeated queries.\n\n```python\ncache_conf = {\n    \"enabled\": True,\n    \"type\": \"memory\",       # or \"disk\"\n    \"prefix\": \"myapp\",\n    \"dir\": \"cache_folder\",  # Required for disk caching\n    \"max_mem_items\": 128,\n    \"max_ttl_seconds\": 86400\n}\n\nclient = Embs(cache_config=cache_conf)\n```\n\n## 🔍 Key Features \u0026 
API Methods\n\n### 🔹 `search_documents_async()`\n\n**Search for documents via DuckDuckGo, retrieve, and rank them.**\n\n```python\nawait client.search_documents_async(\n    query=\"Recent AI breakthroughs\",\n    limit=3,\n    blocklist=[\"example.com\"],\n    splitter=md_splitter\n)\n```\n\n### 🔹 `query_documents_async()`\n\n**Retrieve, split, and rank local/online documents.**\n\n```python\nawait client.query_documents_async(\n    query=\"Climate change effects\",\n    files=[\"/path/to/report.pdf\"],\n    urls=[\"https://example.com\"],\n    splitter=md_splitter,\n)\n```\n\n### 🔹 `embed_async()`\n\n**Generate embeddings for texts with multilingual support.**  \n\n```python\nembeddings = await client.embed_async(\n    [\"Este es un ejemplo de texto.\", \"Ceci est un exemple de phrase.\"],\n    optimized=True  # Process one at a time for better caching\n)\n```\n\n### 🔹 `rank_async()`\n\n**Rank candidate texts by similarity to a query.**\n\n```python\nranked_results = await client.rank_async(\n    query=\"Machine learning\",\n    candidates=[\"Deep learning is a subset of ML\", \"Quantum computing is unrelated\"]\n)\n```\n\n## 🔬 Testing\n\nRun the test suite with **pytest** and **pytest-asyncio**:\n\n```bash\npytest --asyncio-mode=auto\n```\n\n## 📝 Best Practices: Always Use a Splitter!\n\n### ✅ How to Use the Built-in Markdown Splitter\n\n```python\nfrom functools import partial\nfrom embs import Embs\n\n# Split on Markdown headers and blank lines\nsplit_config = {\n    \"headers_to_split_on\": [(\"#\", \"h1\"), (\"##\", \"h2\"), (\"###\", \"h3\")],\n    \"return_each_line\": True,\n    \"strip_headers\": True,\n    \"split_on_double_newline\": True,\n}\n\nmd_splitter = partial(Embs.markdown_splitter, config=split_config)\n\nclient = Embs()\n\ndocs = client.query_documents(\n    query=\"Machine Learning Basics\",\n    files=[\"/path/to/ml_guide.pdf\"],\n    splitter=md_splitter\n)\n```\n\n## 📜 License\n\nLicensed under the **MIT License**. 
See [LICENSE](./LICENSE) for details.\n\n## 🤝 Contributing\n\nPull requests, issues, and discussions are welcome!\n\n🚀 With enhanced **multilingual support**, `embs` is now even more powerful for global retrieval applications! 🌍\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh0x00%2Fembs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh0x00%2Fembs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh0x00%2Fembs/lists"}