https://github.com/plageon/HtmlRAG

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems
Host: GitHub
URL: https://github.com/plageon/HtmlRAG
Owner: plageon
License: mit
Created: 2024-08-28T08:59:47.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-01-16T06:24:35.000Z (6 months ago)
Last Synced: 2025-01-16T07:32:21.380Z (6 months ago)
Language: Python
Homepage: https://pypi.org/project/htmlrag
Size: 31.9 MB
Stars: 303
Watchers: 2
Forks: 25
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems

Quick Start | Chinese Documentation | English Documentation





## 📖 Table of Contents

- [Introduction](#-introduction)

- [News](#-latest-news)

- [Installation](#-installation)

- [Quick Start](#-quick-start)

- [Reproduce Results](#-dependencies)

- [Citation](#-citation)

## ✨ Latest News

- [01/21/2025]: 🎉Our paper has been accepted by WWW 2025.

- [12/20/2024]: Added a Chinese toolkit user guide in [toolkit/README_zh.md](toolkit/README_zh.md).

- [12/12/2024]: The latest version of htmlrag package is v0.0.5, which now supports Chinese HTML documents. You can install it by running `pip install htmlrag==0.0.5`.

- [11/12/2024]: Our data and model are now available on ModelScope. You can access them [here](https://www.modelscope.cn/collections/HtmlRAG-c290f7cf673648) for faster downloading.

- [11/11/2024]: The training and test data are now available as the Hugging Face datasets [HtmlRAG-train](https://huggingface.co/datasets/zstanjj/HtmlRAG-train) and [HtmlRAG-test](https://huggingface.co/datasets/zstanjj/HtmlRAG-test).

- [11/06/2024]: Our paper is available on arXiv. You can access it [here](https://arxiv.org/abs/2411.02959).

- [11/05/2024]: The open-source toolkit and models are released. You can apply HtmlRAG in your own RAG systems now.

**🔔Important:** 

- The parameter `max_node_words` has been removed from class `GenHTMLPruner` since `v0.1.0`.

- If you switch from htmlrag v0.0.4 to v0.0.5, please download the latest version of the modeling files for the Generative HTML Pruners, which are available at [modeling_llama.py](https://github.com/plageon/HtmlRAG/blob/main/llm_modeling/Llama32/modeling_llama.py) and [modeling_phi3.py](https://github.com/plageon/HtmlRAG/blob/main/llm_modeling/Phi35/modeling_phi3.py). Alternatively, you can re-download the models from HuggingFace ([HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B)).

## 📝 Introduction

We propose HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To tackle the long context brought by HTML, we propose **Lossless HTML Cleaning** and **Two-Step Block-Tree-Based HTML Pruning**.

- **Lossless HTML Cleaning**: This cleaning process removes only totally irrelevant content and compresses redundant structures, retaining all semantic information in the original HTML. The compressed HTML produced by lossless HTML cleaning is suitable for RAG systems that have long-context LLMs and do not want to lose any information before generation.

- **Two-Step Block-Tree-Based HTML Pruning**: The block-tree-based HTML pruning consists of two steps, both of which are conducted on the block tree structure. The first pruning step uses an embedding model to calculate scores for blocks, while the second step uses a path generative model. The first step processes the result of lossless HTML cleaning, while the second step processes the result of the first pruning step.

![HtmlRAG](./figures/html-pipeline.png)
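To make the block-tree idea more concrete, here is a minimal sketch of how a DOM-like tree could be grouped into blocks: a subtree becomes one block when its total word count fits a budget, otherwise its children are processed separately. The `Node` class, `build_blocks` function, and the exact meaning of `max_node_words` in this sketch are hypothetical illustrations, not the toolkit's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A toy DOM node (hypothetical; not the toolkit's data structure)."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def word_count(node: Node) -> int:
    """Count whitespace-separated words in a node and all of its descendants."""
    return len(node.text.split()) + sum(word_count(c) for c in node.children)


def build_blocks(node: Node, max_node_words: int, path: Optional[str] = None) -> List[str]:
    """Return paths of blocks: maximal subtrees holding at most max_node_words words.

    Leaf nodes always become blocks, even when they exceed the budget.
    """
    path = path or node.tag
    if not node.children or word_count(node) <= max_node_words:
        return [path]
    blocks: List[str] = []
    for i, child in enumerate(node.children):
        blocks.extend(build_blocks(child, max_node_words, f"{path}/{child.tag}[{i}]"))
    return blocks


page = Node("html", children=[
    Node("div", children=[Node("p", "one two three"), Node("p", "four five")]),
    Node("div", children=[Node("p", "six seven eight nine ten eleven")]),
])
blocks = build_blocks(page, max_node_words=5)
print(blocks)  # ['html/div[0]', 'html/div[1]/p[0]']
```

A larger budget yields fewer, coarser blocks; a smaller one yields finer blocks that can be scored and pruned individually.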

---

## 🔌 Apply HtmlRAG in your own RAG systems

We provide a simple toolkit to apply HtmlRAG in your own RAG systems.

![PyPI - Version](https://img.shields.io/pypi/v/htmlrag) ![PyPI - Downloads](https://img.shields.io/pypi/dw/htmlrag) ![PyPI - Downloads](https://img.shields.io/pypi/dm/htmlrag)

### 📦 Installation

Install the package using pip:

```bash

pip install htmlrag

```

Or install the package from source:

```bash

cd toolkit/

pip install -e .

```

### 🎯 Quick Start

An example of using HtmlRAG in your own RAG systems:

```shell
python run_htmlrag_pipeline.py \
    --html_file "./html_data/example/Washington Post.html" \
    --question "What are the main policies or bills that Biden touted besides the American Rescue Plan?" \
    --lang en \
    --embed_model "BAAI/bge-large-en" \
    --gen_model "zstanjj/HTML-Pruner-Phi-3.8B" \
    --chat_tokenizer_name "meta-llama/Llama-3.1-70B-Instruct" \
    --max_node_words_embed 256 \
    --max_context_window_embed 4096 \
    --max_node_words_gen 128 \
    --max_context_window_gen 2048
```

Please refer to the [Documentation](toolkit/README.md) for more details.

A simple example of using HtmlRAG in your own RAG system with a Chinese HTML document:

```shell
python html4rag/run_htmlrag_pipeline.py \
    --html_file "./html_data/example/对经济政策的预期.html" \
    --question "今年1-10月，专项债券的投资情况怎么样？" \
    --lang zh \
    --embed_model "BAAI/bge-large-zh" \
    --gen_model "zstanjj/HTML-Pruner-Phi-3.8B" \
    --chat_tokenizer_name "meta-llama/Llama-3.1-70B-Instruct" \
    --max_node_words_embed 256 \
    --max_context_window_embed 4096 \
    --max_node_words_gen 128 \
    --max_context_window_gen 2048
```

Please refer to the [Chinese documentation](toolkit/README_zh.md) for more details.

If you are interested in reproducing the results in the paper, please follow the instructions below.

---

## 🔧 Dependencies

You can create the conda environment directly from the provided yml file.

```bash

conda env create -f environment.yml

conda activate htmlrag

```

Or you can install the dependencies yourself.

```bash

conda create -n htmlrag python=3.9 -y

conda activate htmlrag

conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia

conda install -c conda-forge faiss-cpu

pip install scikit-learn transformers transformers[deepspeed] rouge_score evaluate dataset gpustat anytree json5 tensorboardX accelerate bitsandbytes markdownify bs4 sentencepiece loguru tiktoken matplotlib langchain lxml vllm notebook trl spacy rank_bm25 -i https://pypi.tuna.tsinghua.edu.cn/simple

```

---

## 📂 Data Preparation

### Download datasets used in the paper

1. We randomly sample 400 questions from the test set (if any) or validation set in the original datasets for our evaluation. The processed data is stored in the [html_data](./html_data) folder.

2. We apply query rewriting to extract sub-queries, use Bing search to retrieve relevant URLs for each query, and then scrape static HTML documents from the URLs in the returned search results.

Original webpages are stored in the [html_data](./html_data) folder.

Due to the git file size limit, we only provide a small subset of the test data in this repository. The full processed data is available in the Hugging Face dataset [HtmlRAG-test](https://huggingface.co/datasets/zstanjj/HtmlRAG-test).

| Dataset | ASQA | HotpotQA | NQ | TriviaQA | MuSiQue | ELI5 |
|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|:----------:|
| Query Data | [asqa-test.jsonl](html_data/asqa/asqa-test.jsonl) | [hotpot-qa-test.jsonl](html_data/hotpot-qa/hotpot-qa-test.jsonl) | [nq-test.jsonl](html_data/nq/nq-test.jsonl) | [trivia-qa-test.jsonl](html_data/trivia-qa/trivia-qa-test.jsonl) | [musique-test.jsonl](html_data/musique/musique-test.jsonl) | [eli5-test.jsonl](html_data/eli5/eli5-test.jsonl) |
| HTML Data | [html-sample](html_data/asqa/bing/binghtml-slimplmqr-asqa-test-sample.jsonl) | [html-sample](html_data/hotpot-qa/bing/binghtml-slimplmqr-hotpot-qa-test-sample.jsonl) | [html-sample](html_data/nq/bing/binghtml-slimplmqr-nq-test-sample.jsonl) | [html-sample](html_data/trivia-qa/bing/binghtml-slimplmqr-trivia-qa-test-sample.jsonl) | [html-sample](html_data/musique/bing/binghtml-slimplmqr-musique-test-sample.jsonl) | [html-sample](html_data/eli5/bing/binghtml-slimplmqr-eli5-test-sample.jsonl) |

### Use your own data

You can use your own data by following the format of the datasets in the [html_data](./html_data) folder.

1. Prepare your query file in `.jsonl` format. Each line is a JSON object with the following fields:

```json

{

  "id": "unique_id",

  "question": "query_text",

  "answers": ["answer_text_1", "answer_text_2"]

}

```
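As a small illustration of this format, the following sketch writes and validates a query file using only the standard library. The helper names `write_queries` and `load_queries` and the file name `my-test.jsonl` are hypothetical; only the three field names come from the format above.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = ("id", "question", "answers")


def write_queries(path, records):
    """Write one JSON object per line (JSONL), keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_queries(path):
    """Read a JSONL query file and check that every record has the required fields."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [k for k in REQUIRED_FIELDS if k not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            records.append(record)
    return records


sample = [{"id": "q1", "question": "query_text", "answers": ["answer_text_1", "answer_text_2"]}]
with tempfile.TemporaryDirectory() as tmp_dir:
    query_path = os.path.join(tmp_dir, "my-test.jsonl")  # hypothetical file name
    write_queries(query_path, sample)
    queries = load_queries(query_path)
print(queries[0]["question"])
```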

2. Optionally, run a pre-retrieval step to extract sub-queries from the user's original question. The processed sub-queries should be stored in a `{rewrite_method}_result` key in the JSON object.

```json

{

  "id": "unique_id",

  "question": "query_text",

  "answers": ["answer_text_1", "answer_text_2"],

  "your_method_rewrite": {

    "questions": [

      {

        "question": "sub_query_text_1"

      },

      {

        "question": "sub_query_text_2"

      }

    ]

  }

}

```
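The nested structure above can be produced with a few lines of Python. This is a sketch only: the helper `add_sub_queries` is hypothetical, and the default key name is copied from the example record, so adjust it to whatever key your rewriting and search scripts expect.

```python
import json


def add_sub_queries(record, sub_queries, key="your_method_rewrite"):
    """Return a copy of a query record with sub-queries attached under `key`.

    The key name follows the example record above; it is an assumption that
    downstream scripts read the same key.
    """
    updated = dict(record)
    updated[key] = {"questions": [{"question": q} for q in sub_queries]}
    return updated


record = {"id": "unique_id", "question": "query_text", "answers": ["answer_text_1"]}
rewritten = add_sub_queries(record, ["sub_query_text_1", "sub_query_text_2"])
print(json.dumps(rewritten, ensure_ascii=False, indent=2))
```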

3. Conduct a web search using Bing:

```bash

./scripts/search_pipeline_apply.sh

```

---

## 🧹 HTML Cleaning

```bash

bash ./scripts/simplify_html.sh

```
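For intuition about what a cleaning pass of this kind does, here is a minimal standard-library sketch. It is not the script's or the toolkit's implementation: which content counts as irrelevant (here scripts, styles, comments, and tag attributes) is an assumption of the sketch, and `SimpleHTMLCleaner` and `clean_html_sketch` are hypothetical names. It does not attempt the structural compression described in the introduction.

```python
import html
import re
from html.parser import HTMLParser

DROP_TAGS = {"script", "style"}  # assumption: treated as non-content


class SimpleHTMLCleaner(HTMLParser):
    """Re-emit HTML without scripts, styles, comments, or tag attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_startendtag(self, tag, attrs):
        if tag not in DROP_TAGS and self.skip_depth == 0:
            self.parts.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return  # drop script/style bodies and whitespace-only text
        collapsed = re.sub(r"\s+", " ", data)
        self.parts.append(html.escape(collapsed, quote=False))

    # Comments are dropped because HTMLParser.handle_comment does nothing by default.


def clean_html_sketch(raw):
    parser = SimpleHTMLCleaner()
    parser.feed(raw)
    parser.close()
    return "".join(parser.parts)


raw_html = ('<div class="a"><!-- ad --><script>var x = 1;</script>'
            '<p style="color:red">Hello   <b>world</b></p></div>')
cleaned = clean_html_sketch(raw_html)
print(cleaned)  # <div><p>Hello <b>world</b></p></div>
```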

## ✂️ Block-Tree-Based HTML Pruning

### Step 1: HTML Pruning with Text Embedding

```bash

bash ./scripts/tree_rerank.sh

bash ./scripts/trim_html_tree_rerank.sh

```
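The idea behind this step can be sketched as scoring each block against the question and keeping the best-scoring blocks that fit a budget. The sketch below swaps in a bag-of-words cosine similarity for the real embedding model and a word budget for the token-based context window, so it is an illustration under those assumptions; `embed`, `cosine`, and `prune_blocks` are hypothetical names, not the scripts' functions.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in embedding: bag-of-words counts (a real setup would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def prune_blocks(question, blocks, max_words):
    """Keep the highest-scoring blocks within a word budget, in document order."""
    q = embed(question)
    ranked = sorted(range(len(blocks)), key=lambda i: cosine(q, embed(blocks[i])), reverse=True)
    kept, used = set(), 0
    for i in ranked:
        n_words = len(blocks[i].split())
        if used + n_words <= max_words:
            kept.add(i)
            used += n_words
    return [blocks[i] for i in sorted(kept)]


blocks = [
    "Subscribe to our newsletter today",
    "Biden touted the infrastructure bill",
    "The infrastructure bill funds roads and bridges",
    "Cookie settings and privacy policy",
]
pruned = prune_blocks("Which bill did Biden tout", blocks, max_words=12)
print(pruned)
```

Returning the kept blocks in their original order preserves the document's reading order, which matters when the pruned HTML is passed on to a second pruning step or to the chat model.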

### Step 2: Generative HTML Pruning

```bash

bash ./scripts/tree_rerank_tree_gen.sh

```

---

## 📊 Evaluation

### Baselines

We provide the following baselines for comparison:

- **BM25**: A widely used sparse rerank model. 

```bash

export rerank_model="bm25"

./scripts/chunk_rerank.sh

./scripts/trim_html_fill_chunk.sh

```

- **[BGE](https://huggingface.co/BAAI/bge-large-en)**: An embedding model, BGE-Large-EN, with an encoder-only structure. Our scripts require an embedding model instance served with [TEI](https://github.com/huggingface/text-embeddings-inference).

```bash

export rerank_model="bge"

./scripts/chunk_rerank.sh

./scripts/trim_html_fill_chunk.sh

```

- **[E5-Mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct)**: An embedding model based on an LLM, Mistral-7B. Our scripts require an embedding model instance served with [TEI](https://github.com/huggingface/text-embeddings-inference).

```bash

export rerank_model="e5-mistral"

./scripts/chunk_rerank.sh

./scripts/trim_html_fill_chunk.sh

```

- **LongLLMLingua**: An abstractive model using Llama7B to select useful context.

```bash

./scripts/longlongllmlingua.sh

```

- **[JinaAI Reader](https://huggingface.co/jinaai/reader-lm-1.5b)**: An end-to-end lightweight LLM with 1.5B parameters, fine-tuned on an HTML-to-Markdown conversion dataset.

```bash

./scripts/jinaai_reader.sh

```
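To illustrate what the sparse BM25 baseline listed above computes when it reranks chunks, here is a self-contained Okapi BM25 sketch. It is not the repository's script (the dependency list includes `rank_bm25`, which the scripts may use instead); the function name `bm25_scores`, the whitespace tokenization, and the defaults `k1=1.5`, `b=0.75` are assumptions of this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (non-negative idf variant)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avgdl = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(d) / avgdl
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


chunks = [
    "biden signed the infrastructure bill",
    "weather forecast for tomorrow",
    "the bill passed",
]
scores = bm25_scores("infrastructure bill", chunks)
ranking = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1]
```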

### Evaluation Scripts

1. Instantiate an LLM inference model with [vLLM](https://github.com/vllm-project/vllm/). We recommend using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) or [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).

2. Run the chat inference using the following command:

```bash

./scripts/chat_inference.sh

```

3. Follow the evaluation scripts in [eval_scrips.ipynb](./jupyter/eval_scrips.ipynb).

### Results

- **Results for [HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B) with Llama-3.1-70B-Instruct as chat model**.

| Dataset          | ASQA      | HotpotQA  | NQ        | TriviaQA  | MuSiQue   | ELI5      |
|------------------|-----------|-----------|-----------|-----------|-----------|-----------|
| Metrics          | EM        | EM        | EM        | EM        | EM        | ROUGE-L   |
| BM25             | 49.50     | 38.25     | 47.00     | 88.00     | 9.50      | 16.15     |
| BGE              | 68.00     | 41.75     | 59.50     | 93.00     | 12.50     | 16.20     |
| E5-Mistral       | 63.00     | 36.75     | 59.50     | 90.75     | 11.00     | 16.17     |
| LongLLMLingua    | 62.50     | 45.00     | 56.75     | 92.50     | 10.25     | 15.84     |
| JinaAI Reader    | 55.25     | 34.25     | 48.25     | 90.00     | 9.25      | 16.06     |
| HtmlRAG-Phi-3.8B | **68.50** | **46.25** | 60.50     | **93.50** | **13.25** | **16.33** |
| HtmlRAG-Llama-1B | 66.50     | 45.00     | **60.75** | 93.00     | 10.00     | 16.25     |
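
For readers unfamiliar with the two metric names in the table, the sketch below shows one common way to compute them: a normalized exact match (SQuAD-style normalization) and a ROUGE-L F1 based on the longest common subsequence. The normalization and tokenization choices here are assumptions; the evaluation notebook in this repository is the authoritative definition of the reported numbers and may differ.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """1.0 if the normalized prediction equals any normalized gold answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(a) for a in answers))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(prediction, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (no stemming)."""
    p, r = prediction.lower().split(), reference.lower().split()
    lcs = lcs_length(p, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower.", ["eiffel tower"])
rl = rouge_l_f1("the cat sat on the mat", "the cat is on the mat")
print(em, round(rl, 4))
```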

---

## 🚀 Training

### 1. Download Pretrained Models

```bash
mkdir -p ../../huggingface
cd ../../huggingface
# --local-dir is resolved relative to the current directory (the huggingface folder)
huggingface-cli download --resume-download --local-dir-use-symlinks False microsoft/Phi-3.5-mini-instruct --local-dir ./Phi-3.5-mini-instruct/
# alternatively you can download Llama-3.2-1B as the base model
huggingface-cli download --resume-download --local-dir-use-symlinks False meta-llama/Llama-3.2-1B --local-dir ./Llama-3.2-1B/
```

### 2. Configure training data

We release the training data as the Hugging Face dataset [HtmlRAG-train](https://huggingface.co/datasets/zstanjj/HtmlRAG-train). You can download the dataset by running the following commands:

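```bash
mkdir -p html_data/tree_gen
cd html_data/tree_gen
# --repo-type dataset is needed because HtmlRAG-train is a dataset repository
huggingface-cli download --resume-download --local-dir-use-symlinks False --repo-type dataset zstanjj/HtmlRAG-train --local-dir HtmlRAG-train
```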

Configure the sample rate in a `.json5` file; we provide our default settings in [sample-train.json5](sft/experiments/sample-train.json5). You can check your training data with the following commands:

```bash

cd sft/

python dataset.py

```

### 3. Train the model

You can follow our settings if you are training on A800 clusters.

```bash

bash ./scripts/train_longctx.sh

```

---

## 📜 Citation

```bibtex
@misc{tan2024htmlraghtmlbetterplain,
      title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems},
      author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},
      year={2024},
      eprint={2411.02959},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2411.02959},
}
```