https://github.com/plageon/HtmlRAG
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems
HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems
- Host: GitHub
- URL: https://github.com/plageon/HtmlRAG
- Owner: plageon
- License: mit
- Created: 2024-08-28T08:59:47.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-01-16T06:24:35.000Z (11 days ago)
- Last Synced: 2025-01-16T07:32:21.380Z (11 days ago)
- Language: Python
- Homepage: https://pypi.org/project/htmlrag
- Size: 31.9 MB
- Stars: 303
- Watchers: 2
- Forks: 25
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesomerag_paper - https://github.com/plageon/HtmlRAG
README
# HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems

## 📖 Table of Contents
- [Introduction](#-introduction)
- [News](#-latest-news)
- [Installation](#-installation)
- [Quick Start](#-quick-start)
- [Reproduce Results](#-dependencies)
- [Citation](#-citation)

## ✨ Latest News
- [01/21/2025]: 🎉Our paper has been accepted by WWW 2025.
- [12/20/2024]: Add Chinese toolkit user guide in [toolkit/README_zh.md](toolkit/README_zh.md).
- [12/12/2024]: The latest version of htmlrag package is v0.0.5, which now supports Chinese HTML documents. You can install it by running `pip install htmlrag==0.0.5`.
- [11/12/2024]: Our data and model are now available on ModelScope. You can access them [here](https://www.modelscope.cn/collections/HtmlRAG-c290f7cf673648) for faster downloading.
- [11/11/2024]: The training and test data are now available in the huggingface dataset [HtmlRAG-train](https://huggingface.co/datasets/zstanjj/HtmlRAG-train) and [HtmlRAG-test](https://huggingface.co/datasets/zstanjj/HtmlRAG-test).
- [11/06/2024]: Our paper is available on arXiv. You can access it [here](https://arxiv.org/abs/2411.02959).
- [11/05/2024]: The open-source toolkit and models are released. You can apply HtmlRAG in your own RAG systems now.

**🔔Important:**
- Parameter `max_node_words` is removed from class `GenHTMLPruner` since `v0.1.0`.
- If you switch from htmlrag v0.0.4 to v0.0.5, please download the latest version of the modeling files for the Generative HTML Pruners, which are available at [modeling_llama.py](https://github.com/plageon/HtmlRAG/blob/main/llm_modeling/Llama32/modeling_llama.py) and [modeling_phi3.py](https://github.com/plageon/HtmlRAG/blob/main/llm_modeling/Phi35/modeling_phi3.py). Alternatively, you can re-download the models from HuggingFace ([HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B)).

## 📝 Introduction
We propose HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To tackle the long context brought by HTML, we propose **Lossless HTML Cleaning** and **Two-Step Block-Tree-Based HTML Pruning**.

- **Lossless HTML Cleaning**: This cleaning process removes only totally irrelevant content and compresses redundant structures, retaining all semantic information in the original HTML. The compressed HTML produced by lossless HTML cleaning is suitable for RAG systems that have long-context LLMs and do not want to lose any information before generation.
- **Two-Step Block-Tree-Based HTML Pruning**: The block-tree-based HTML pruning consists of two steps, both of which are conducted on the block tree structure. The first pruning step uses an embedding model to calculate scores for blocks, while the second step uses a path generative model. The first step processes the result of lossless HTML cleaning, while the second step processes the result of the first pruning step.
![HtmlRAG](./figures/html-pipeline.png)
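
To make the idea of lossless cleaning concrete, here is a minimal sketch that uses only the Python standard library. It is not the implementation shipped in the `htmlrag` package: the class `ToyCleaner`, the function `toy_clean_html`, and the choice of what counts as irrelevant (scripts, styles, comments, tag attributes, whitespace-only text) are assumptions made for illustration.

```python
import re
from html import escape
from html.parser import HTMLParser

# Subtrees assumed to carry no information for the LLM (illustrative choice).
DROP_SUBTREE = {"script", "style"}


class ToyCleaner(HTMLParser):
    """Drops scripts, styles, comments and attributes; keeps tags and text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped subtree

    def handle_starttag(self, tag, attrs):
        if tag in DROP_SUBTREE:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in DROP_SUBTREE:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are kept as a bare opening tag.
        if tag not in DROP_SUBTREE and self.skip_depth == 0:
            self.parts.append(f"<{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return  # inside a dropped subtree, or whitespace-only text
        text = re.sub(r"\s+", " ", data)  # collapse runs of whitespace
        self.parts.append(escape(text, quote=False))

    # handle_comment keeps its default no-op behaviour, so comments vanish.


def toy_clean_html(html: str) -> str:
    cleaner = ToyCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)


if __name__ == "__main__":
    page = '<div class="ad"><script>track()</script><p>Hello   <b>world</b></p></div>'
    print(toy_clean_html(page))
```

The visible text and the tag structure survive, while markup that only matters to a browser is dropped. That is the sense in which the cleaning step shortens the input without discarding semantic content.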
---
## 🔌 Apply HtmlRAG in your own RAG systems
We provide a simple toolkit to apply HtmlRAG in your own RAG systems.
![PyPI - Version](https://img.shields.io/pypi/v/htmlrag) ![PyPI - Downloads](https://img.shields.io/pypi/dw/htmlrag) ![PyPI - Downloads](https://img.shields.io/pypi/dm/htmlrag)
### 📦 Installation
Install the package using pip:
```bash
pip install htmlrag
```
Or install the package from source:
```bash
cd toolkit/
pip install -e .
```

### 🎯 Quick Start
An example of using HtmlRAG in your own RAG systems:
```shell
python run_htmlrag_pipeline.py \
--html_file "./html_data/example/Washington Post.html" \
--question "What are the main policies or bills that Biden touted besides the American Rescue Plan?" \
--lang en \
--embed_model "BAAI/bge-large-en" \
--gen_model "zstanjj/HTML-Pruner-Phi-3.8B" \
--chat_tokenizer_name "meta-llama/Llama-3.1-70B-Instruct" \
--max_node_words_embed 256 \
--max_context_window_embed 4096 \
--max_node_words_gen 128 \
--max_context_window_gen 2048
```
Please refer to the [Documentation](toolkit/README.md) for more details.

A simple example of using HtmlRAG in your own RAG system with a Chinese HTML document:
```shell
python html4rag/run_htmlrag_pipeline.py \
--html_file "./html_data/example/对经济政策的预期.html" \
--question "今年1-10月,专项债券的投资情况怎么样?" \
--lang zh \
--embed_model "BAAI/bge-large-zh" \
--gen_model "zstanjj/HTML-Pruner-Phi-3.8B" \
--chat_tokenizer_name "meta-llama/Llama-3.1-70B-Instruct" \
--max_node_words_embed 256 \
--max_context_window_embed 4096 \
--max_node_words_gen 128 \
--max_context_window_gen 2048
```
Please refer to the [Chinese documentation](toolkit/README_zh.md) for more details.

If you are interested in reproducing the results in the paper, please follow the instructions below.
---
## 🔧 Dependencies
You can create the conda environment directly from the provided yml file.
```bash
conda env create -f environment.yml
conda activate htmlrag
```
Or you can install the dependencies yourself.
```bash
conda create -n htmlrag python=3.9 -y
conda activate htmlrag
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c conda-forge faiss-cpu
pip install scikit-learn transformers transformers[deepspeed] rouge_score evaluate dataset gpustat anytree json5 tensorboardX accelerate bitsandbytes markdownify bs4 sentencepiece loguru tiktoken matplotlib langchain lxml vllm notebook trl spacy rank_bm25 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

---
## 📂 Data Preparation
### Download datasets used in the paper
1. We randomly sample 400 questions from the test set (if any) or validation set in the original datasets for our evaluation. The processed data is stored in the [html_data](./html_data) folder.
2. We apply query rewriting to extract sub-queries and use Bing search to retrieve relevant URLs for each query, and then we scrape static HTML documents through the URLs in the returned search results.
Original webpages are stored in the [html_data](./html_data) folder.

Due to the git file size limitation, we only provide a small subset of the test data in this repository. The full processed data is available in the huggingface dataset [HtmlRAG-test](https://huggingface.co/datasets/zstanjj/HtmlRAG-test).

| Dataset | ASQA | HotpotQA | NQ | TriviaQA | MuSiQue | ELI5 |
|:----------:|:----------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------:|:----------------------------------------------------------------------------:|
| Query Data | [asqa-test.jsonl](html_data/asqa/asqa-test.jsonl) | [hotpot-qa-test.jsonl](html_data/hotpot-qa/hotpot-qa-test.jsonl) | [nq-test.jsonl](html_data/nq/nq-test.jsonl) | [trivia-qa-test.jsonl](html_data/trivia-qa/trivia-qa-test.jsonl) | [musique-test.jsonl](html_data/musique/musique-test.jsonl) | [eli5-test.jsonl](html_data/eli5/eli5-test.jsonl) |
| HTML Data | [html-sample](html_data/asqa/bing/binghtml-slimplmqr-asqa-test-sample.jsonl) | [html-sample](html_data/hotpot-qa/bing/binghtml-slimplmqr-hotpot-qa-test-sample.jsonl) | [html-sample](html_data/nq/bing/binghtml-slimplmqr-nq-test-sample.jsonl) | [html-sample](html_data/trivia-qa/bing/binghtml-slimplmqr-trivia-qa-test-sample.jsonl) | [html-sample](html_data/musique/bing/binghtml-slimplmqr-musique-test-sample.jsonl) | [html-sample](html_data/eli5/bing/binghtml-slimplmqr-eli5-test-sample.jsonl) |
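
The sampling step in item 1 above can be sketched as follows. This is an illustration only, not the script used for the paper: the function name `sample_questions` and the fixed seed are assumptions, and the sketch expects each line of the input to be one JSON object, as in the query files linked in the table.

```python
import json
import random


def sample_questions(lines, k=400, seed=0):
    """Parse JSONL lines and return at most k records chosen uniformly at random.

    The seed is an assumption added for reproducibility; the paper's seed is
    not stated in this README.
    """
    records = [json.loads(line) for line in lines if line.strip()]
    if len(records) <= k:
        return records  # nothing to sample from a small file
    return random.Random(seed).sample(records, k)


if __name__ == "__main__":
    # Hypothetical input: 1000 synthetic questions instead of a real dataset file.
    fake_lines = [json.dumps({"id": str(i), "question": f"question {i}"}) for i in range(1000)]
    subset = sample_questions(fake_lines)
    print(len(subset))
```

Using a dedicated `random.Random(seed)` instance keeps the sample stable across runs without touching the global random state.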
### Use your own data
You can use your own data by following the format of the datasets in the [html_data](./html_data) folder.
1. Prepare your query file in `.jsonl` format. Each line is a json object with the following fields:
```json
{
"id": "unique_id",
"question": "query_text",
"answers": ["answer_text_1", "answer_text_2"]
}
```
2. Conduct an optional pre-retrieval process to get sub-queries from the original user's question. The processed sub-queries should be stored in a `{rewrite_method}_result` key in the json object.
```json
{
"id": "unique_id",
"question": "query_text",
"answers": ["answer_text_1", "answer_text_2"],
"your_method_rewrite": {
"questions": [
{
"question": "sub_query_text_1"
},
{
"question": "sub_query_text_2"
}
]
}
}
```
3. Conduct a web search using Bing:
```bash
./scripts/search_pipeline_apply.sh
```

---
## 🧹 HTML Cleaning
```bash
bash ./scripts/simplify_html.sh
```

## ✂️ Block-Tree-Based HTML Pruning
### Step 1: HTML Pruning with Text Embedding
```bash
bash ./scripts/tree_rerank.sh
bash ./scripts/trim_html_tree_rerank.sh
```

### Step 2: Generative HTML Pruning
```bash
bash ./scripts/tree_rerank_tree_gen.sh
```
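
The first pruning step scores blocks against the question and keeps the most relevant ones. The sketch below illustrates that idea under loud assumptions: a bag-of-words cosine similarity stands in for the embedding model, the budget is counted in words rather than tokens, and blocks that do not fit are skipped greedily. The names `bow_cosine` and `prune_blocks` are hypothetical, and the generative second step is not covered.

```python
import math
from collections import Counter


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def prune_blocks(question, blocks, max_words):
    """Keep the best-scoring blocks within a word budget, in document order.

    blocks: list of (path, text) pairs, one per block of the block tree.
    """
    ranked = sorted(
        range(len(blocks)),
        key=lambda i: bow_cosine(question, blocks[i][1]),
        reverse=True,
    )
    kept, used = set(), 0
    for i in ranked:
        n_words = len(blocks[i][1].split())
        if used + n_words > max_words:
            continue  # greedy choice (assumption): skip blocks that do not fit
        kept.add(i)
        used += n_words
    return [blocks[i] for i in sorted(kept)]


if __name__ == "__main__":
    blocks = [
        ("html/body/nav", "home about contact"),
        ("html/body/p[1]", "Biden touted the infrastructure bill"),
        ("html/body/p[2]", "weather today is sunny"),
    ]
    for path, text in prune_blocks("which bill did Biden tout", blocks, max_words=5):
        print(path, "->", text)
```

Returning the kept blocks in their original order preserves the document structure, which is the point of pruning on a block tree instead of reranking flat text chunks.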
---

## 📊 Evaluation
### Baselines
We provide the following baselines for comparison:
- **BM25**: A widely used sparse rerank model.
```bash
export rerank_model="bm25"
./scripts/chunk_rerank.sh
./scripts/trim_html_fill_chunk.sh
```
- **[BGE](https://huggingface.co/BAAI/bge-large-en)**: An embedding model, BGE-Large-EN, with an encoder-only structure. Our scripts require an embedding model instantiated with [TEI](https://github.com/huggingface/text-embeddings-inference).
```bash
export rerank_model="bge"
./scripts/chunk_rerank.sh
./scripts/trim_html_fill_chunk.sh
```
- **[E5-Mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct)**: An embedding model based on an LLM, Mistral-7B. Our scripts require an embedding model instantiated with [TEI](https://github.com/huggingface/text-embeddings-inference).
```bash
export rerank_model="e5-mistral"
./scripts/chunk_rerank.sh
./scripts/trim_html_fill_chunk.sh
```
- **LongLLMLingua**: An abstractive model using Llama7B to select useful context.
```bash
./scripts/longlongllmlingua.sh
```
- **[JinaAI Reader](https://huggingface.co/jinaai/reader-lm-1.5b)**: An end-to-end light-weight LLM with 1.5B parameters fine-tuned on an HTML to Markdown converting task dataset.
```bash
./scripts/jinaai_reader.sh
```
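
For readers unfamiliar with the sparse baseline, the sketch below shows how BM25 can rank text chunks against a query. It is a from-scratch illustration and not the code behind `chunk_rerank.sh`: the function name `bm25_scores`, whitespace tokenization, the parameter values `k1=1.5` and `b=0.75`, and the smoothed IDF variant are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, chunks, k1=1.5, b=0.75):
    """Score each chunk against the query with Okapi BM25 (whitespace tokens)."""
    docs = [c.lower().split() for c in chunks]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    # Document frequency: number of chunks that contain each term.
    df = Counter(t for d in docs for t in set(d))
    query_terms = set(query.lower().split())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative (assumed variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


if __name__ == "__main__":
    chunks = ["the cat sat on the mat", "dogs chase cats", "stock markets fell"]
    scores = bm25_scores("cat mat", chunks)
    best = max(range(len(chunks)), key=scores.__getitem__)
    print(chunks[best])
```

Because the match is purely lexical, "cats" does not match "cat" here. That limitation is one reason the dense embedding baselines are included alongside BM25.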
### Evaluation Scripts
1. Instantiate an LLM inference model with [vLLM](https://github.com/vllm-project/vllm/). We recommend using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) or [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).
2. Run the chatting inference using the following command:
```bash
./scripts/chat_inference.sh
```
3. Follow the evaluation scripts in [eval_scrips.ipynb](./jupyter/eval_scrips.ipynb).

### Results
- **Results for [HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B) with Llama-3.1-70B-Instruct as chat model**.
| Dataset | ASQA | HotpotQA | NQ | TriviaQA | MuSiQue | ELI5 |
|------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Metrics | EM | EM | EM | EM | EM | ROUGE-L |
| BM25 | 49.50 | 38.25 | 47.00 | 88.00 | 9.50 | 16.15 |
| BGE | 68.00 | 41.75 | 59.50 | 93.00 | 12.50 | 16.20 |
| E5-Mistral | 63.00 | 36.75 | 59.50 | 90.75 | 11.00 | 16.17 |
| LongLLMLingua | 62.50 | 45.00 | 56.75 | 92.50 | 10.25 | 15.84 |
| JinaAI Reader | 55.25 | 34.25 | 48.25 | 90.00 | 9.25 | 16.06 |
| HtmlRAG-Phi-3.8B | **68.50** | **46.25** | 60.50 | **93.50** | **13.25** | **16.33** |
| HtmlRAG-Llama-1B | 66.50 | 45.00 | **60.75** | 93.00 | 10.00 | 16.25 |

---
## 🚀 Training
### 1. Download Pretrained Models
```bash
mkdir ../../huggingface
cd ../../huggingface
huggingface-cli download --resume-download --local-dir-use-symlinks False microsoft/Phi-3.5-mini-instruct --local-dir ../../huggingface/Phi-3.5-mini-instruct/
# alternatively you can download Llama-3.2-1B as the base model
huggingface-cli download --resume-download --local-dir-use-symlinks False meta-llama/Llama-3.2-1B --local-dir ../../huggingface/Llama-3.2-1B/
```

### 2. Configure training data
We release the training data in the huggingface dataset [HtmlRAG-train](https://huggingface.co/datasets/zstanjj/HtmlRAG-train). You can download the dataset by running the following command:
```bash
mkdir html_data/tree_gen
cd html_data/tree_gen
huggingface-cli download --resume-download --local-dir-use-symlinks False --repo-type dataset zstanjj/HtmlRAG-train --local-dir HtmlRAG-train
```

Configure the sample rate in a `.json5` file; we provide our default settings in [sample-train.json5](sft/experiments/sample-train.json5). You can check your training data with the following command:
```bash
cd sft/
python dataset.py
```

### 3. Train the model
You can follow our settings if you are training on A800 clusters.
```bash
bash ./scripts/train_longctx.sh
```

---
## 📜 Citation
```bibtex
@misc{tan2024htmlraghtmlbetterplain,
title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems},
author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},
year={2024},
eprint={2411.02959},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2411.02959},
}
```