{"id":24588646,"url":"https://github.com/plageon/HtmlRAG","last_synced_at":"2025-10-05T13:31:41.792Z","repository":{"id":261201501,"uuid":"848707613","full_name":"plageon/HtmlRAG","owner":"plageon","description":"HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems","archived":false,"fork":false,"pushed_at":"2025-01-16T06:24:35.000Z","size":33435,"stargazers_count":303,"open_issues_count":0,"forks_count":25,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-16T07:32:21.380Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://pypi.org/project/htmlrag","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/plageon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-28T08:59:47.000Z","updated_at":"2025-01-16T06:24:36.000Z","dependencies_parsed_at":"2024-12-20T10:34:30.177Z","dependency_job_id":"ed2f1c11-aaba-4316-a09c-cf2772beb483","html_url":"https://github.com/plageon/HtmlRAG","commit_stats":null,"previous_names":["plageon/htmlrag"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plageon%2FHtmlRAG","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plageon%2FHtmlRAG/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plageon%2FHtmlRAG/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plageon%2FHtmlRAG/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/plageon","download_url":"https://codeload.github.com/plageon/HtmlRAG/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":235398954,"owners_count":18983817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-24T07:16:12.435Z","updated_at":"2025-10-05T13:31:41.786Z","avatar_url":"https://github.com/plageon.png","language":"Python","funding_links":[],"categories":["2024.11","A01_文本生成_文本对话"],"sub_categories":["HtmlRAG【排版师】","大语言对话模型及数据"],"readme":"# \u003cdiv align=\"center\"\u003eHtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003ca href=\"https://arxiv.org/abs/2411.02959\" target=\"_blank\"\u003e\u003cimg src=https://img.shields.io/badge/arXiv-b5212f.svg?logo=arxiv\u003e\u003c/a\u003e\n\u003ca href=\"https://huggingface.co/collections/zstanjj/htmlrag-671f03af5c3da2e7b5371aa4\" target=\"_blank\"\u003e\u003cimg src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Models-27b3b4.svg\u003e\u003c/a\u003e\n\u003ca href=\"https://www.modelscope.cn/collections/HtmlRAG-c290f7cf673648\" target=\"_blank\"\u003e\u003cimg src=https://custom-icon-badges.demolab.com/badge/ModelScope%20Models-624aff?style=flat\u0026logo=modelscope\u0026logoColor=white\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/plageon/HtmlRAG/blob/main/toolkit/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/badge/LICENSE-MIT-green\"\u003e\u003c/a\u003e\n\u003ca\u003e\u003cimg alt=\"Static Badge\" 
src=\"https://img.shields.io/badge/made_with-Python-blue\"\u003e\u003c/a\u003e\n\u003cimg alt=\"PyPI - Version\" src=\"https://img.shields.io/pypi/v/htmlrag\"\u003e\n\u003cp\u003e\n\u003ca href=\"https://github.com/plageon/HtmlRAG#-quick-start\"\u003eQuick Start\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"toolkit/README_zh.md\"\u003e中文文档\u003c/a\u003e\u0026nbsp ｜ \u0026nbsp\u003ca href=\"toolkit/README.md\"\u003eEnglish Documentation\u003c/a\u003e\u0026nbsp\n\u003c/p\u003e\n\u003c/div\u003e\n\n## 📖 Table of Contents\n- [Introduction](#-introduction)\n- [News](#-latest-news)\n- [Installation](#-installation)\n- [Quick Start](#-quick-start)\n- [Reproduce Results](#-dependencies)\n- [Citation](#-citation)\n\n## ✨ Latest News\n- [01/21/2025]: 🎉Our paper has been accepted by WWW 2025.\n- [12/20/2024]: Added a Chinese toolkit user guide in [toolkit/README_zh.md](toolkit/README_zh.md).\n- [12/12/2024]: The latest version of the htmlrag package is v0.0.5, which now supports Chinese HTML documents. You can install it by running `pip install htmlrag==0.0.5`.\n- [11/12/2024]: Our data and models are now available on ModelScope. You can access them [here](https://www.modelscope.cn/collections/HtmlRAG-c290f7cf673648) for faster downloading.\n- [11/11/2024]: The training and test data are now available in the huggingface datasets [HtmlRAG-train](https://huggingface.co/datasets/zstanjj/HtmlRAG-train) and [HtmlRAG-test](https://huggingface.co/datasets/zstanjj/HtmlRAG-test).\n- [11/06/2024]: Our paper is available on arXiv. You can access it [here](https://arxiv.org/abs/2411.02959).\n- [11/05/2024]: The open-source toolkit and models are released. 
You can apply HtmlRAG in your own RAG systems now.\n\n**🔔Important:** \n- The `max_node_words` parameter has been removed from class `GenHTMLPruner` since `v0.1.0`.\n- If you switch from htmlrag v0.0.4 to v0.0.5, please download the latest version of the modeling files for the Generative HTML Pruners, which are available at [modeling_llama.py](https://github.com/plageon/HtmlRAG/blob/main/llm_modeling/Llama32/modeling_llama.py) and [modeling_phi3.py](https://github.com/plageon/HtmlRAG/blob/main/llm_modeling/Phi35/modeling_phi3.py). Alternatively, you can re-download the models from HuggingFace ([HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B)).\n\n\n## 📝 Introduction\nWe propose HtmlRAG, which uses HTML instead of plain text as the format of external knowledge in RAG systems. To handle the longer context that HTML introduces, we propose **Lossless HTML Cleaning** and **Two-Step Block-Tree-Based HTML Pruning**.\n\n- **Lossless HTML Cleaning**: This cleaning process removes only irrelevant content and compresses redundant structures, retaining all semantic information in the original HTML. The resulting compressed HTML suits RAG systems that use long-context LLMs and cannot afford to lose any information before generation.\n\n- **Two-Step Block-Tree-Based HTML Pruning**: The block-tree-based HTML pruning consists of two steps, both of which are conducted on the block tree structure. The first pruning step uses an embedding model to calculate scores for blocks, while the second step uses a path generative model. The first step processes the result of lossless HTML cleaning, while the second step processes the result of the first pruning step.\n\n![HtmlRAG](./figures/html-pipeline.png)\n\n---\n\n## 🔌 Apply HtmlRAG in your own RAG systems\n\nWe provide a simple toolkit to apply HtmlRAG in your own RAG systems. 
\n\n![PyPI - Version](https://img.shields.io/pypi/v/htmlrag) ![PyPI - Downloads](https://img.shields.io/pypi/dw/htmlrag) ![PyPI - Downloads](https://img.shields.io/pypi/dm/htmlrag)\n\n### 📦 Installation\n\nInstall the package using pip:\n\n```bash\npip install htmlrag\n```\nOr install the package from source:\n```bash\ncd toolkit/\npip install -e .\n```\n\n### 🎯 Quick Start\nAn example of using HtmlRAG in your own RAG systems:\n```shell\npython run_htmlrag_pipeline.py \\\n    --html_file \"./html_data/example/Washington Post.html\" \\\n    --question \"What are the main policies or bills that Biden touted besides the American Rescue Plan?\" \\\n    --lang en \\\n    --embed_model \"BAAI/bge-large-en\" \\\n    --gen_model \"zstanjj/HTML-Pruner-Phi-3.8B\" \\\n    --chat_tokenizer_name \"meta-llama/Llama-3.1-70B-Instruct\" \\\n    --max_node_words_embed 256 \\\n    --max_context_window_embed 4096 \\\n    --max_node_words_gen 128 \\\n    --max_context_window_gen 2048\n```\nIf you only want to clean the HTML, keeping the main body and removing irrelevant content such as advertisements, you can set the question to the web page title.\n\nPlease refer to the [Documentation](toolkit/README.md) for more details.\n\nA simple example of using HtmlRAG in your own RAG systems with a Chinese HTML document:\n```shell\npython html4rag/run_htmlrag_pipeline.py \\\n    --html_file \"./html_data/example/对经济政策的预期.html\" \\\n    --question \"今年1-10月，专项债券的投资情况怎么样？\" \\\n    --lang zh \\\n    --embed_model \"BAAI/bge-large-zh\" \\\n    --gen_model \"zstanjj/HTML-Pruner-Phi-3.8B\" \\\n    --chat_tokenizer_name \"meta-llama/Llama-3.1-70B-Instruct\" \\\n    --max_node_words_embed 256 \\\n    --max_context_window_embed 4096 \\\n    --max_node_words_gen 128 \\\n    --max_context_window_gen 2048\n```\nThe same applies here: to only clean the HTML, set the question to the web page title.\n\nPlease refer to the [Chinese documentation](toolkit/README_zh.md) for more details.\n\nIf you are interested in reproducing the results in the paper, please follow the instructions below.\n\n---\n\n## 🔧 Dependencies\n\nYou can 
directly create the conda environment from the provided yml file.\n```bash\nconda env create -f environment.yml\nconda activate htmlrag\n```\nOr you can install the dependencies yourself.\n```bash\nconda create -n htmlrag python=3.9 -y\nconda activate htmlrag\nconda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia\nconda install -c conda-forge faiss-cpu\npip install scikit-learn transformers transformers[deepspeed] rouge_score evaluate dataset gpustat anytree json5 tensorboardX accelerate bitsandbytes markdownify bs4 sentencepiece loguru tiktoken matplotlib langchain lxml vllm notebook trl spacy rank_bm25 -i https://pypi.tuna.tsinghua.edu.cn/simple\n```\n\n---\n\n## 📂 Data Preparation\n\n### Download datasets used in the paper\n\n1. We randomly sample 400 questions from the test set (if available) or the validation set of each original dataset for our evaluation. The processed data is stored in the [html_data](./html_data) folder.\n\n\n2. We apply query rewriting to extract sub-queries and Bing search to retrieve relevant URLs for each query, and then we scrape static HTML documents from the URLs in the returned search results. \nOriginal webpages are stored in the [html_data](./html_data) folder. \nDue to Git file size limits, we only provide a small subset of the test data in this repository. 
The full processed data is available in huggingface dataset [HtmlRAG-test](https://huggingface.co/datasets/zstanjj/HtmlRAG-test).\n\n|  Dataset   |                                     ASQA                                     |                                        HotpotQA                                        |                                    NQ                                    |                                        TriviaQA                                        |                                      MuSiQue                                       |                                     ELI5                                     |\n|:----------:|:----------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:------------------------------------------------------------------------:|:--------------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------:|:----------------------------------------------------------------------------:|\n| Query Data |              [asqa-test.jsonl](html_data/asqa/asqa-test.jsonl)               |            [hotpot-qa-test.jsonl](html_data/hotpot-qa/hotpot-qa-test.jsonl)            |               [nq-test.jsonl](html_data/nq/nq-test.jsonl)                |            [trivia-qa-test.jsonl](html_data/trivia-qa/trivia-qa-test.jsonl)            |             [musique-test.jsonl](html_data/musique/musique-test.jsonl)             |              [eli5-test.jsonl](html_data/eli5/eli5-test.jsonl)               |\n| HTML Data  | [html-sample](html_data/asqa/bing/binghtml-slimplmqr-asqa-test-sample.jsonl) | [html-sample](html_data/hotpot-qa/bing/binghtml-slimplmqr-hotpot-qa-test-sample.jsonl) | [html-sample](html_data/nq/bing/binghtml-slimplmqr-nq-test-sample.jsonl) | 
[html-sample](html_data/trivia-qa/bing/binghtml-slimplmqr-trivia-qa-test-sample.jsonl) | [html-sample](html_data/musique/bing/binghtml-slimplmqr-musique-test-sample.jsonl) | [html-sample](html_data/eli5/bing/binghtml-slimplmqr-eli5-test-sample.jsonl) |\n\n### Use your own data\n\nYou can use your own data by following the format of the datasets in the [html_data](./html_data) folder.\n\n1. Prepare your query file in `.jsonl` format. Each line is a JSON object with the following fields:\n\n```json\n{\n  \"id\": \"unique_id\",\n  \"question\": \"query_text\",\n  \"answers\": [\"answer_text_1\", \"answer_text_2\"]\n}\n```\n2. Optionally run a pre-retrieval step to derive sub-queries from the user's original question. The processed sub-queries should be stored in a `{rewrite_method}_result` key in the JSON object.\n\n```json\n{\n  \"id\": \"unique_id\",\n  \"question\": \"query_text\",\n  \"answers\": [\"answer_text_1\", \"answer_text_2\"],\n  \"your_method_rewrite\": {\n    \"questions\": [\n      {\n        \"question\": \"sub_query_text_1\"\n      },\n      {\n        \"question\": \"sub_query_text_2\"\n      }\n    ]\n  }\n}\n```\n\n3. Run the web search using Bing:\n\n```bash\n./scripts/search_pipeline_apply.sh\n```\n\n---\n\n## 🧹 HTML Cleaning\n\n```bash\nbash ./scripts/simplify_html.sh\n```\n\n## ✂️ Block-Tree-Based HTML Pruning\n\n### Step 1: HTML Pruning with Text Embedding\n\n```bash\nbash ./scripts/tree_rerank.sh\nbash ./scripts/trim_html_tree_rerank.sh\n```\n\n### Step 2: Generative HTML Pruning\n\n```bash\nbash ./scripts/tree_rerank_tree_gen.sh\n```\n---\n\n## 📊 Evaluation\n\n### Baselines\n\nWe provide the following baselines for comparison:\n\n- **BM25**: A widely used sparse reranking model. \n```bash\nexport rerank_model=\"bm25\"\n./scripts/chunk_rerank.sh\n./scripts/trim_html_fill_chunk.sh\n```\n\n- **[BGE](https://huggingface.co/BAAI/bge-large-en)**: An embedding model, BGE-Large-EN, with an encoder-only structure. 
Our scripts require an embedding model served with [TEI](https://github.com/huggingface/text-embeddings-inference).\n```bash\nexport rerank_model=\"bge\"\n./scripts/chunk_rerank.sh\n./scripts/trim_html_fill_chunk.sh\n```\n- **[E5-Mistral](https://huggingface.co/intfloat/e5-mistral-7b-instruct)**: An embedding model based on an LLM, Mistral-7B. Our scripts require an embedding model served with [TEI](https://github.com/huggingface/text-embeddings-inference).\n```bash\nexport rerank_model=\"e5-mistral\"\n./scripts/chunk_rerank.sh\n./scripts/trim_html_fill_chunk.sh\n```\n- **LongLLMLingua**: An abstractive model using Llama7B to select useful context.\n```bash\n./scripts/longlongllmlingua.sh\n```\n- **[JinaAI Reader](https://huggingface.co/jinaai/reader-lm-1.5b)**: An end-to-end lightweight LLM with 1.5B parameters, fine-tuned on an HTML-to-Markdown conversion dataset. \n```bash\n./scripts/jinaai_reader.sh\n```\n### Evaluation Scripts\n\n1. Instantiate an LLM inference model with [vLLM](https://github.com/vllm-project/vllm/). We recommend using [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) or [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct).\n2. Run chat inference using the following command:\n```bash\n./scripts/chat_inference.sh\n```\n3. 
Follow the evaluation scripts in [eval_scrips.ipynb](./jupyter/eval_scrips.ipynb).\n\n### Results\n\n- **Results for [HTML-Pruner-Phi-3.8B](https://huggingface.co/zstanjj/HTML-Pruner-Phi-3.8B) and [HTML-Pruner-Llama-1B](https://huggingface.co/zstanjj/HTML-Pruner-Llama-1B) with Llama-3.1-70B-Instruct as the chat model**.\n\n| Dataset          | ASQA      | HotpotQA  | NQ        | TriviaQA  | MuSiQue   | ELI5      |\n|------------------|-----------|-----------|-----------|-----------|-----------|-----------|\n| Metric           | EM        | EM        | EM        | EM        | EM        | ROUGE-L   |\n| BM25             | 49.50     | 38.25     | 47.00     | 88.00     | 9.50      | 16.15     |\n| BGE              | 68.00     | 41.75     | 59.50     | 93.00     | 12.50     | 16.20     |\n| E5-Mistral       | 63.00     | 36.75     | 59.50     | 90.75     | 11.00     | 16.17     |\n| LongLLMLingua    | 62.50     | 45.00     | 56.75     | 92.50     | 10.25     | 15.84     |\n| JinaAI Reader    | 55.25     | 34.25     | 48.25     | 90.00     | 9.25      | 16.06     |\n| HtmlRAG-Phi-3.8B | **68.50** | **46.25** | 60.50     | **93.50** | **13.25** | **16.33** |\n| HtmlRAG-Llama-1B | 66.50     | 45.00     | **60.75** | 93.00     | 10.00     | 16.25     |\n\n---\n\n## 🚀 Training\n\n### 1. Download Pretrained Models\n\n```bash\nmkdir ../../huggingface\ncd ../../huggingface\n# huggingface-cli takes the target directory via --local-dir\nhuggingface-cli download --resume-download --local-dir-use-symlinks False microsoft/Phi-3.5-mini-instruct --local-dir ../../huggingface/Phi-3.5-mini-instruct/\n\n# alternatively you can download Llama-3.2-1B as the base model\nhuggingface-cli download --resume-download --local-dir-use-symlinks False meta-llama/Llama-3.2-1B --local-dir ../../huggingface/Llama-3.2-1B/\n```\n\n### 2. Configure training data\nWe release the training data in the huggingface dataset [HtmlRAG-train](https://huggingface.co/datasets/zstanjj/HtmlRAG-train). 
You can download the dataset by running the following command:\n```bash\nmkdir html_data/tree_gen\ncd html_data/tree_gen\n# --repo-type dataset is needed because HtmlRAG-train is a dataset repository\nhuggingface-cli download --resume-download --local-dir-use-symlinks False --repo-type dataset zstanjj/HtmlRAG-train --local-dir HtmlRAG-train\n```\n\nConfigure the sample rate in a `.json5` file; our default settings are in [sample-train.json5](sft/experiments/sample-train.json5). You can check your training data with the following command:\n```bash\ncd sft/\npython dataset.py\n```\n\n### 3. Train the model\n\nYou can follow our settings if you are training on A800 clusters.\n\n```bash\nbash ./scripts/train_longctx.sh\n```\n\n---\n\n## 📜 Citation\n\n```bibtex\n@misc{tan2024htmlraghtmlbetterplain,\n      title={HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems}, \n      author={Jiejun Tan and Zhicheng Dou and Wen Wang and Mang Wang and Weipeng Chen and Ji-Rong Wen},\n      year={2024},\n      eprint={2411.02959},\n      archivePrefix={arXiv},\n      primaryClass={cs.IR},\n      url={https://arxiv.org/abs/2411.02959}, \n}\n```\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplageon%2FHtmlRAG","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplageon%2FHtmlRAG","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplageon%2FHtmlRAG/lists"}