
[![Follow on X](https://img.shields.io/twitter/follow/LeetTools?logo=X&color=%20%23f5f5f5)](https://twitter.com/intent/follow?screen_name=LeetTools)
[![GitHub license](https://img.shields.io/badge/License-Apache_2.0-blue.svg?labelColor=%20%23155EEF&color=%20%23528bff)](https://github.com/leettools-dev/leettools)

- [AI Search Assistant with Local Knowledge Base](#ai-search-assistant-with-local-knowledge-base)
- [Quick Start](#quick-start)
- [Use Different LLM Endpoints](#use-different-llm-endpoints)
  - [Use local Ollama service for inference and embedding](#use-local-ollama-service-for-inference-and-embedding)
  - [Using DeepSeek API with different embedding services](#using-deepseek-api-with-different-embedding-services)
- [Usage Examples](#usage-examples)
  - [Build a local knowledge base using PDFs from the web](#build-a-local-knowledge-base-using-pdfs-from-the-web)
  - [Generate news list from updates in KB](#generate-news-list-from-updates-in-kb)
- [Main Components](#main-components)
- [Community](#community)

# AI Search Assistant with Local Knowledge Base

LeetTools is an AI search assistant that can perform highly customizable search workflows
and save the search results and generated outputs to local knowledge bases. With an
automated document pipeline that handles data ingestion, indexing, and storage, we can
easily run complex search workflows that query, extract, and generate content from the
web or local knowledge bases.

LeetTools can run with minimal resource requirements on the command line with a
DuckDB backend and configurable LLM settings. It can be easily integrated with other
applications that need AI search and knowledge base support.
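
Because every workflow is a plain command-line invocation, one simple integration path
is to shell out to `leet` from another application. Here is a minimal sketch, assuming
only the `-t`, `-q`, `-k`, `-l`, and `-o` options documented in this README (the
wrapper script and file names are hypothetical):

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run an answer flow and hand the generated
# markdown back to the calling application on stdout.
set -euo pipefail

QUERY="$1"          # the question to answer
KB="${2:-app-kb}"   # knowledge base name; default is hypothetical

# -o writes the generated article to a file instead of the console
leet flow -t answer -q "$QUERY" -k "$KB" -l info -o /tmp/answer.md
cat /tmp/answer.md
```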

Here is an illustration of the LeetTools **digest** flow, which searches the web
(or a local KB) and generates a digest article from the search results:

![LeetTools Digest Flow](docs/assets/process-digest.drawio.svg)

And here is an example output article generated by the **digest** flow for the query
[How does Ollama work?](docs/examples/ollama.md).

Currently LeetTools provides the following workflows:

* answer : Answer the query directly with source references (similar to Perplexity). [📖](https://leettools-dev.github.io/Flow/answer)
* digest : Generate a multi-section digest article from search results (similar to Google Deep Research). [📖](https://leettools-dev.github.io/Flow/digest)
* search : Search for top segments that match the query. [📖](https://leettools-dev.github.io/Flow/search)
* news : Generate a list of news items for the specified topic. [📖](https://leettools-dev.github.io/Flow/news)
* extract : Extract and store structured data for a given schema. [📖](https://leettools-dev.github.io/Flow/extract)
* opinions: Generate sentiment analysis and facts from the search results. [📖](https://leettools-dev.github.io/Flow/opinions)
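
All workflows share the same invocation pattern through `leet flow -t <flow>`; for
example, a digest run that saves its article to a file might look like this (the
query and KB name are placeholders):

```bash
% leet flow -t digest -q "How do vector databases work?" -k vectordb -l info -o digest.md
```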

# Quick Start

We can use any OpenAI-compatible LLM endpoint, such as a local Ollama service or a
public provider such as Gemini or DeepSeek. We can switch services easily by [defining
environment variables or switching .env files](#use-different-llm-endpoints).

**Run with pip**

If you are using an OpenAI-compatible LLM endpoint, you can install and run LeetTools
with pip as follows:

```bash
% conda create -y -n leettools python=3.11
% conda activate leettools
% pip install leettools
% export EDS_LLM_API_KEY=
% leet flow -t answer -q "How does GraphRAG work?" -k graphrag -l info
```

The above `flow -t answer` command will run the `answer` flow with the query "How does
GraphRAG work?" and save the scraped web page to the knowledge base `graphrag`. The
`-l info` option will show the essential log messages.

By default the data is saved under `${HOME}/leettools`; you can set the `LEET_HOME`
environment variable to change the location:

```bash
% export LEET_HOME=
% mkdir -p ${LEET_HOME}
```

The default LLM endpoint is the OpenAI API, which you can change with the
`EDS_DEFAULT_LLM_BASE_URL` environment variable:

```bash
% export EDS_DEFAULT_LLM_BASE_URL=https://api.openai.com/v1
```
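
Any other OpenAI-compatible endpoint can be set the same way. For example, to use
Gemini through Google's OpenAI-compatibility endpoint (the base URL below is Google's
published compatibility endpoint; the model name is an assumption, check your
provider's current model list):

```bash
% export EDS_DEFAULT_LLM_BASE_URL=https://generativelanguage.googleapis.com/v1beta/openai/
% export EDS_LLM_API_KEY=<your Gemini API key>
% export EDS_DEFAULT_INFERENCE_MODEL=gemini-2.0-flash
```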

**Run with source code**

```bash
% git clone https://github.com/leettools-dev/leettools.git
% cd leettools

% conda create -y -n leettools python=3.11
% conda activate leettools
% pip install -r requirements.txt
% pip install -e .
# add the scripts directory to the PATH
% export PATH=`pwd`/scripts:${PATH}

% export EDS_LLM_API_KEY=

% leet flow -t answer -q "How does GraphRAG work?" -k graphrag -l info
```

**Sample Output**

Here is an example output of the `answer` flow:

```markdown
# How Does Graphrag Work?
GraphRAG operates by constructing a knowledge graph from a set of documents, which
involves several key steps. Initially, it ingests textual data and utilizes a large
language model (LLM) to extract entities (such as people, places, and concepts) and
their relationships, mapping these as nodes and edges in a graph structure[1].

The process begins with pre-processing and indexing, where the text is segmented into
manageable units, and entities and relationships are identified. These entities are
then organized into hierarchical "communities," which are clusters of related topics
that allow for a more structured understanding of the data[2][3].

When a query is made, GraphRAG employs two types of searches: Global Search, which
looks across the entire knowledge graph for broad connections, and Local Search, which
focuses on specific subgraphs for detailed information[3]. This dual approach enables
GraphRAG to provide comprehensive answers that consider both high-level themes and
specific details, allowing it to handle complex queries effectively[3][4].

In summary, GraphRAG enhances traditional retrieval-augmented generation (RAG) by
leveraging a structured knowledge graph, enabling it to provide nuanced responses that
reflect the interconnected nature of the information it processes[1][2].
## References
[1] [https://www.falkordb.com/blog/what-is-graphrag/](https://www.falkordb.com/blog/what-is-graphrag/)
[2] [https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1](https://medium.com/@zilliz_learn/graphrag-explained-enhancing-rag-with-knowledge-graphs-3312065f99e1)
[3] [https://medium.com/data-science-in-your-pocket/how-graphrag-works-8d89503b480d](https://medium.com/data-science-in-your-pocket/how-graphrag-works-8d89503b480d)
[4] [https://github.com/microsoft/graphrag/discussions/511](https://github.com/microsoft/graphrag/discussions/511)
```

# Use Different LLM Endpoints

We can run LeetTools with different env files to use different LLM providers and other
related settings.

## Use local Ollama service for inference and embedding

```bash
# you may need to pull the models first
% ollama pull llama3.2
% ollama pull nomic-embed-text
% ollama serve

% cat > .env.ollama <<EOF
EDS_DEFAULT_LLM_BASE_URL=http://localhost:11434/v1
EDS_LLM_API_KEY=dummy-key
EDS_DEFAULT_INFERENCE_MODEL=llama3.2
EDS_DEFAULT_EMBEDDING_MODEL=nomic-embed-text
EDS_EMBEDDING_MODEL_DIMENSION=768
EOF

# Then run the command with the -e option to specify the .env file to use
% leet flow -e .env.ollama -t answer -q "How does GraphRAG work?" -k graphrag -l info
```

## Using DeepSeek API with different embedding services

```bash
% cat > .env.deepseek <<EOF
EDS_DEFAULT_LLM_BASE_URL=https://api.deepseek.com/v1
EDS_LLM_API_KEY=
EDS_DEFAULT_INFERENCE_MODEL=deepseek-chat
EDS_DEFAULT_DENSE_EMBEDDER=dense_embedder_local_mem
EOF

# Then run the command with the -e option to specify the .env file to use
% leet flow -e .env.deepseek -t answer -q "How does GraphRAG work?" -k graphrag -l info
```

If you want to use a different OpenAI-compatible provider for embedding, say a local
Ollama embedder, you can set the embedding endpoint URL and model separately as follows:

```bash
% cat > .env.deepseek <<EOF
EDS_DEFAULT_LLM_BASE_URL=https://api.deepseek.com/v1
EDS_LLM_API_KEY=
EDS_DEFAULT_INFERENCE_MODEL=deepseek-chat

# this specifies to use an OpenAI compatible embedding endpoint
EDS_DEFAULT_DENSE_EMBEDDER=dense_embedder_openai

# the following specifies the embedding endpoint URL and model to use
EDS_DEFAULT_EMBEDDING_BASE_URL=http://localhost:11434/v1
EDS_DEFAULT_EMBEDDING_MODEL=nomic-embed-text
EDS_EMBEDDING_MODEL_DIMENSION=768
EOF
```
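
With this env file in place, the run command stays the same; only the `-e` option
changes:

```bash
% leet flow -e .env.deepseek -t answer -q "How does GraphRAG work?" -k graphrag -l info
```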

# Usage Examples

## Build a local knowledge base using PDFs from the web

We can build a local knowledge base from PDFs on the web. Assuming the local Ollama
service is set up as described [above](#use-local-ollama-service-for-inference-and-embedding),
the following commands download a PDF and index it into a knowledge base:

```bash
# create a KB with a URL
# the book downloaded here is "Foundations of Large Language Models"
# it has 231 pages and takes some time to process
% leet kb add-url -e .env.ollama -k llmbook -r "https://arxiv.org/pdf/2501.09223"

# now you can query the KB with any topic you want to explore
% leet flow -e .env.ollama -t answer -k llmbook -l info \
-q "How does LLM Finetuning process work?"
```
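
The same KB can keep growing: each additional `add-url` goes through the same ingest
and index pipeline, and later queries draw on all indexed documents (the second URL
below is only a placeholder):

```bash
# add another PDF to the same knowledge base
% leet kb add-url -e .env.ollama -k llmbook -r "https://arxiv.org/pdf/<another-paper-id>"
```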

We have a more [detailed example](docs/run_ollama_with_deepseek_r1.md) to show how to
use the local Ollama service with the DeepSeek-r1:1.5B model to build a local knowledge
base.

## Generate news list from updates in KB

We can create a knowledge base with a list of URLs or a search query, and then generate
a list of news items from the KB. Here is an example:

```bash
# create a KB with a google search
# -d 1 means to search for news from the last day
# -m 30 means to scrape the top 30 search results
% leet kb add-search -k genai -q "LLM GenAI Startups" -d 1 -m 30
# you can add a single URL to the KB
% leet kb add-url -k genai -r "https://www.techcrunch.com"
# you can also add a list of URLs; an example file is docs/sample_urls.txt
% leet kb add-url-list -k genai -f docs/sample_urls.txt

# generate a news list from the KB
% leet flow -t news -q "LLM GenAI Startups" -k genai -l info -o llm_genai_news.md

# Next time you want to refresh the KB and generate the news list
# this command will re-ingest all the docsources specified above
% leet kb ingest -k genai

# to see the parameters you can set for the news flow, use --info
% leet flow -t news --info
====================================================================================================
news: Generating a list of news items from the KB.

This flow generates a list of news items from the updated items in the KB:
1. check the KB for recently updated documents and find news items in them.
2. combine all the similar items into one.
3. remove items that have been reported before.
4. rank the items by the number of sources.
5. generate a list of news items with references.

====================================================================================================
Use -p name=value to specify options for news:

article_style    : The style of the output article, such as analytical research
                   reports, humorous news articles, or technical blog posts.
                   [default: analytical research reports] [FLOW: news]
days_limit       : Number of days to limit the search results. 0 or empty means
                   no limit. In local KB, filters by the import time. [FLOW: news]
news_include_old : Include all news items in the result, even if they have been
                   reported before. [default: False] [FLOW: news]
news_source_min  : Number of sources a news item must have to be included in the
                   result. Depends on the nature of the knowledge base.
                   [default: 2] [FLOW: news]
output_language  : Output the result in the specified language. [FLOW: news]
word_count       : The number of words in the output section. Empty means
                   automatic. [FLOW: news]

```
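
The listed options are passed with repeated `-p name=value` flags; for example, a run
asking for a humorous article covering the last two days might look like this (the
values are illustrative):

```bash
% leet flow -t news -q "LLM GenAI Startups" -k genai -l info \
    -p article_style="humorous news articles" \
    -p days_limit=2 \
    -o llm_genai_news.md
```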

Note: scheduler support and UI view are coming soon.

# Main Components

The main components of the backend include:
* 🚀 Automated document pipeline to ingest, convert, chunk, embed, and index documents.
* 🗂️ Knowledge base to manage and serve the indexed documents.
* 🔍 Search and retrieval library to fetch documents from the web or local KB.
* 🤖 Workflow engine to implement search-based AI workflows.
* ⚙️ Configuration system supporting dynamic configurations for every component.
* 📝 Query history system to manage the history and the context of the queries.
* 💻 Scheduler for automatic execution of the pipeline tasks.
* 🧩 Accounting system to track the usage of the LLM APIs.
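
As a rough sketch of how these components fit together in one session, using only
commands shown earlier: `add-url` drives the document pipeline and knowledge base,
`ingest` re-runs the pipeline, and `flow` exercises the retrieval library and workflow
engine (the URL and KB name are placeholders):

```bash
# document pipeline + knowledge base: fetch, convert, chunk, embed, index
% leet kb add-url -k demo -r "https://example.com/article"

# re-run ingestion later to pick up source updates
% leet kb ingest -k demo

# retrieval + workflow engine: answer a query over the KB
% leet flow -t answer -q "Summarize the main points." -k demo -l info
```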

The architecture of the document pipeline is shown below:

![LeetTools Document Pipeline](https://gist.githubusercontent.com/pengfeng/4b2e36bda389e0a3c338b5c42b5d09c1/raw/6bc06db40dadf995212270d914b46281bf7edae9/leettools-eds-arch.svg)

See the [Documentation](docs/documentation.md) for more details.

# Community

**Acknowledgements**

LeetTools currently uses the following open source libraries and tools (among others):

- [DuckDB](https://github.com/duckdb/duckdb)
- [Docling](https://github.com/DS4SD/docling)
- [Chonkie](https://github.com/bhavnicksm/chonkie)
- [Ollama](https://github.com/ollama/ollama)
- [Jinja2](https://jinja.palletsprojects.com/en/3.0.x/)
- [BS4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

We plan to add more plugins for different components to support different workloads.

**Get help and support**

Please feel free to connect with us using the [discussion section](https://github.com/leettools-dev/leettools/discussions).

**Contributing**

Please read [Contributing to LeetTools](CONTRIBUTING.md) for details.

**License**

LeetTools is licensed under the Apache License, Version 2.0. See [LICENSE](LICENSE)
for the full license text.