https://github.com/lh0x00/embs
embs is a Python toolkit for retrieving documents (via Docsifer), generating embeddings (via Lightweight Embeddings API), and ranking texts with an optional caching system.
docsifer document-retrieval embeddings embs markitdown openai rag ranking
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/lh0x00/embs
- Owner: lh0x00
- License: mit
- Created: 2025-01-27T08:05:27.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-02T09:57:06.000Z (4 months ago)
- Last Synced: 2025-02-03T23:52:13.860Z (4 months ago)
- Topics: docsifer, document-retrieval, embeddings, embs, markitdown, openai, rag, ranking
- Language: Python
- Homepage: https://pypi.org/project/embs
- Size: 112 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# embs
**embs** is a powerful Python library for **document retrieval, embedding, and ranking**, making it easier to build **Retrieval-Augmented Generation (RAG) systems**, **chatbots**, and **semantic search engines**.

## Why Choose embs?
- **Web & Local Document Search**:
  - DuckDuckGo-powered **web search** retrieves and ranks relevant documents.
  - Supports **PDFs, Word, HTML, Markdown**, and more.
- **Optimized for RAG, Chatbots & Multilingual Search**:
  - **Automatic document chunking (Splitter) for improved retrieval accuracy.**
  - Rank documents **by relevance to a query**.
  - **Strong multilingual model support** for global applications. Supported multilingual models:
    - `snowflake-arctic-embed-l-v2.0`
    - `bge-m3`
    - `gte-multilingual-base`
    - `paraphrase-multilingual-MiniLM-L12-v2`
    - `paraphrase-multilingual-mpnet-base-v2`
    - `multilingual-e5-small`
    - `multilingual-e5-base`
    - `multilingual-e5-large`
- **Fast & Efficient**:
  - **Cache support (in-memory & disk)** for faster queries.
  - **Flexible batch embedding with cache optimization**.
- **Scalable & Customizable**:
  - Works with **synchronous & asynchronous processing**.
  - Supports **custom splitting rules**.

## Installation

Install via pip:
```bash
pip install embs
```

For Poetry users:

```toml
[tool.poetry.dependencies]
embs = "^0.1.8"
```

## Quick Start Guide

### 1. Searching Documents via DuckDuckGo (Recommended!)

Retrieve **relevant web pages**, **convert them to Markdown**, and **rank them using embeddings**.
> **Always use a splitter!**
> Splitting improves ranking, reduces redundancy, and gives better retrieval.

```python
import asyncio
from functools import partial

from embs import Embs

# Configure a Markdown-based splitter
split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": True,
    "strip_headers": True,
    "split_on_double_newline": True,
}
md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()


async def run_search():
    results = await client.search_documents_async(
        query="Latest AI research",
        limit=3,
        blocklist=["youtube.com"],  # Exclude unwanted domains
        splitter=md_splitter,  # Enable smart chunking
    )
    for item in results:
        print(f"File: {item['filename']} | Score: {item['similarity']:.4f}")
        print(f"Snippet: {item['markdown'][:80]}...\n")


asyncio.run(run_search())
```

For **synchronous usage**:

```python
results = client.search_documents(
    query="Latest AI research",
    limit=3,
    blocklist=["youtube.com"],
    splitter=md_splitter,  # Always use a splitter
    model="snowflake-arctic-embed-l-v2.0",
)
for item in results:
    print(f"File: {item['filename']} | Score: {item['similarity']:.4f}")
```

### 2. Multilingual Document Querying (Local & Online)

Retrieve and **rank multilingual documents from local files or URLs**.
```python
async def run_query():
    docs = await client.query_documents_async(
        query="Explique la mécanique quantique",  # French query
        files=["/path/to/quantum_theory.pdf"],
        urls=["https://example.com/quantum.html"],
        splitter=md_splitter,  # Chunking for better retrieval
    )
    for d in docs:
        print(f"{d['filename']} => Score: {d['similarity']:.4f}")
        print(f"Snippet: {d['markdown'][:80]}...\n")


asyncio.run(run_query())
```

For **synchronous usage**:

```python
docs = client.query_documents(
    query="Explique la mécanique quantique",
    files=["/path/to/quantum_theory.pdf"],
    splitter=md_splitter,
)
for d in docs:
    print(d["filename"], "=> Score:", d["similarity"])
```

**Perfect for multilingual retrieval!** Whether you're searching documents in English, French, Spanish, German, or other supported languages, the same `embs` calls handle retrieval and ranking.

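The examples above read three keys from each result: `filename`, `markdown`, and `similarity`. As a minimal sketch of what you might do with such results before handing them to a RAG prompt, the helper below keeps only the best-scoring chunks. It assumes the result shape shown above; the name `select_context`, the threshold, and the sample data are hypothetical and not part of `embs`.

```python
from typing import Dict, List


def select_context(results: List[Dict], min_score: float = 0.5, max_chunks: int = 3) -> str:
    """Join the best-scoring chunks into one context string.

    `results` is assumed to look like the items printed in the examples:
    dicts with "filename", "markdown", and "similarity" keys.
    """
    # Keep only chunks at or above the similarity threshold.
    kept = [r for r in results if r["similarity"] >= min_score]
    # Sort explicitly so the helper does not rely on the input order.
    kept.sort(key=lambda r: r["similarity"], reverse=True)
    return "\n\n".join(f"[{r['filename']}] {r['markdown']}" for r in kept[:max_chunks])


# Hand-written stand-in data; real scores come from the ranking step.
sample = [
    {"filename": "a.pdf", "markdown": "Quantum states", "similarity": 0.82},
    {"filename": "b.html", "markdown": "Cooking tips", "similarity": 0.12},
    {"filename": "c.md", "markdown": "Wave functions", "similarity": 0.67},
]
context = select_context(sample)
print(context)
```

A fixed threshold is only a starting point; a sensible cutoff likely depends on the embedding model in use.
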
## Caching for Performance

Enable **in-memory** or **disk caching** to speed up repeated queries.
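To illustrate what an in-memory cache with an item cap and a time-to-live does, here is a minimal standalone sketch. It is not the `embs` implementation: the class `TTLCache` and its parameters `max_items` and `ttl_seconds` are hypothetical and only mirror the `max_mem_items` and `max_ttl_seconds` settings in the configuration that follows. Evicting the least recently used entry is also an assumption.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache with an item cap and a time-to-live."""

    def __init__(
        self,
        max_items: int = 128,
        ttl_seconds: float = 86400,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._data[key]  # expired: drop it and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def set(self, key: str, value: Any) -> None:
        self._data[key] = (self._clock(), value)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the least recently used entry


cache = TTLCache(max_items=2, ttl_seconds=60)
cache.set("myapp:hello", [0.1, 0.2])
print(cache.get("myapp:hello"))
```

In `embs` itself, caching is configured declaratively rather than built by hand:
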
```python
from embs import Embs

cache_conf = {
    "enabled": True,
    "type": "memory",  # or "disk"
    "prefix": "myapp",
    "dir": "cache_folder",  # Required for disk caching
    "max_mem_items": 128,
    "max_ttl_seconds": 86400,
}

client = Embs(cache_config=cache_conf)
```

## Key Features & API Methods

### `search_documents_async()`

**Search for documents via DuckDuckGo, retrieve, and rank them.**
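The `blocklist` argument excludes unwanted domains from search results. How `embs` matches domains is not documented here, so the sketch below shows one plausible rule as an assumption: a URL is blocked when its host equals a blocked domain or is a subdomain of it. The helpers `is_blocked` and `filter_urls` and the sample URLs are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlparse


def is_blocked(url: str, blocklist: Iterable[str]) -> bool:
    """True if the URL's host equals a blocked domain or is a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in blocklist
    )


def filter_urls(urls: Iterable[str], blocklist: Iterable[str]) -> List[str]:
    """Keep only the URLs whose host is not blocked."""
    blocked = list(blocklist)
    return [u for u in urls if not is_blocked(u, blocked)]


hits = [
    "https://www.youtube.com/watch?v=abc",
    "https://example.org/paper",
    "https://notyoutube.com/post",
]
print(filter_urls(hits, ["youtube.com"]))
```

Matching on the parsed host rather than on a substring avoids blocking unrelated sites whose names merely contain the blocked string. With `embs`, you only pass the list:
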
```python
# Run inside an async function (or wrap the call with asyncio.run)
await client.search_documents_async(
    query="Recent AI breakthroughs",
    limit=3,
    blocklist=["example.com"],
    splitter=md_splitter,
)
```

### `query_documents_async()`

**Retrieve, split, and rank local/online documents.**
```python
# Run inside an async function (or wrap the call with asyncio.run)
await client.query_documents_async(
    query="Climate change effects",
    files=["/path/to/report.pdf"],
    urls=["https://example.com"],
    splitter=md_splitter,
)
```

### `embed_async()`

**Generate embeddings for texts with multilingual support.**
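The feature list mentions batch embedding with cache optimization. As a sketch of the general idea only (not of how `embs` implements its `optimized` flag), the function below caches vectors per text so that a later batch pays only for texts it has not seen. `embed_with_cache` and `fake_embed` are hypothetical names, and the fake backend returns made-up vectors.

```python
from typing import Callable, Dict, List

Vector = List[float]


def embed_with_cache(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[Vector]],
    cache: Dict[str, Vector],
) -> List[Vector]:
    """Return one vector per text, calling `embed_fn` only for cache misses."""
    # Deduplicate the misses while preserving first-seen order.
    misses = [t for t in dict.fromkeys(texts) if t not in cache]
    if misses:
        for text, vector in zip(misses, embed_fn(misses)):
            cache[text] = vector
    return [cache[t] for t in texts]


calls: List[List[str]] = []


def fake_embed(batch: List[str]) -> List[Vector]:
    """Stand-in for a real embedding backend; records what it was asked for."""
    calls.append(list(batch))
    return [[float(len(t)), float(t.count(" "))] for t in batch]


store: Dict[str, Vector] = {}
first = embed_with_cache(["hola mundo", "bonjour"], fake_embed, store)
second = embed_with_cache(["bonjour", "hello"], fake_embed, store)
print(calls)  # the second call only requests "hello"
```

Keying the cache on individual texts rather than on whole batches is what lets overlapping batches share work. The library call looks like this:
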
```python
# Run inside an async function (or wrap the call with asyncio.run)
embeddings = await client.embed_async(
    ["Este es un ejemplo de texto.", "Ceci est un exemple de phrase."],
    optimized=True,  # Process one at a time for better caching
)
```

### `rank_async()`

**Rank candidate texts by similarity to a query.**
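Ranking by similarity usually means comparing the query's embedding with each candidate's embedding. The sketch below uses cosine similarity, a common choice; that `embs` uses this exact metric is an assumption. The function names and the toy three-dimensional vectors are hypothetical stand-ins for real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float], candidates: Dict[str, Sequence[float]]
) -> List[dict]:
    """Score every candidate against the query and sort best-first."""
    scored = [
        {"text": text, "similarity": cosine_similarity(query_vec, vec)}
        for text, vec in candidates.items()
    ]
    scored.sort(key=lambda item: item["similarity"], reverse=True)
    return scored


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
toy = {
    "Deep learning is a subset of ML": [0.9, 0.1, 0.8],
    "Quantum computing is unrelated": [0.0, 1.0, 0.1],
}
ranked = rank_by_similarity(query, toy)
for item in ranked:
    print(f"{item['similarity']:.4f}  {item['text']}")
```

With `embs`, embedding and ranking happen in a single call:
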
```python
# Run inside an async function (or wrap the call with asyncio.run)
ranked_results = await client.rank_async(
    query="Machine learning",
    candidates=["Deep learning is a subset of ML", "Quantum computing is unrelated"],
)
```

## Testing

Run the test suite with **pytest** and the **pytest-asyncio** plugin:
```bash
pytest --asyncio-mode=auto
```

## Best Practices: Always Use a Splitter!

### How to Use the Built-in Markdown Splitter

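To see why chunking matters, it helps to look at what a header-based splitter produces: each chunk is scored against the query on its own, so one relevant paragraph is not diluted by the rest of a long document. The sketch below is a simplified stand-in, not the built-in splitter. The function `split_markdown` is hypothetical, and its `strip_headers` flag only approximates the assumed meaning of the configuration key with the same name.

```python
import re
from typing import List

# Matches h1-h3 Markdown headers such as "# Title" or "### Section".
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_markdown(text: str, strip_headers: bool = True) -> List[str]:
    """Split Markdown into chunks at h1-h3 headers and at blank lines."""
    chunks: List[str] = []
    current: List[str] = []

    def flush() -> None:
        chunk = "\n".join(current).strip()
        if chunk:
            chunks.append(chunk)
        current.clear()

    for line in text.splitlines():
        is_header = HEADER_RE.match(line) is not None
        if is_header or not line.strip():
            flush()  # a header or a blank line ends the current chunk
            if is_header and not strip_headers:
                current.append(line)  # keep the header with the text below it
        else:
            current.append(line)
    flush()
    return chunks


doc = "# Intro\nML learns from data.\n\nIt needs examples.\n## Models\nTrees and nets."
print(split_markdown(doc))
```

The built-in splitter is configured and passed to a query like this:
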
```python
from functools import partial

from embs import Embs

split_config = {
    "headers_to_split_on": [("#", "h1"), ("##", "h2"), ("###", "h3")],
    "return_each_line": True,
    "strip_headers": True,
    "split_on_double_newline": True,
}

md_splitter = partial(Embs.markdown_splitter, config=split_config)

client = Embs()
docs = client.query_documents(
    query="Machine Learning Basics",
    files=["/path/to/ml_guide.pdf"],
    splitter=md_splitter,
)
```

## License

Licensed under the **MIT License**. See [LICENSE](./LICENSE) for details.

## Contributing

Pull requests, issues, and discussions are welcome!

With enhanced **multilingual support**, `embs` is now even more powerful for global retrieval applications!