https://github.com/Tencent/WebAggregator
- Host: GitHub
- URL: https://github.com/Tencent/WebAggregator
- Owner: Tencent
- License: other
- Created: 2025-09-29T08:55:30.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-10-18T15:37:44.000Z (6 months ago)
- Last Synced: 2026-04-03T10:06:36.444Z (9 days ago)
- Topics: agent
- Language: Python
- Homepage: https://arxiv.org/abs/2510.14438
- Size: 13.7 MB
- Stars: 67
- Watchers: 0
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: readme.md
- License: LICENSE.txt
Awesome Lists containing this project
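- Awesome-GitHub-Repo - WebAggregator - Tencent's open-source web information aggregation framework, including a QA construction engine, query trajectories, and models, with highly customizable data collection and aggregation. (Information Retrieval / Other Information Tools)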
README
# 🌐 *Explore to Evolve*: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents
## 🌟 Introduction
[Paper](https://arxiv.org/abs/2510.14438) [Dataset: WebAggregatorQA](https://huggingface.co/datasets/CognitiveKernel/WebAggregatorQA) [Model: WebAggregator-8B](https://huggingface.co/CognitiveKernel/WebAggregator-8B)
[Model: WebAggregator-32B](https://huggingface.co/CognitiveKernel/WebAggregator-32B)

- ***Explore to Evolve*** aims to generate diverse, high-quality training data for web agent foundation models, enhancing their capabilities in multi-tool usage, **information seeking**, and **information aggregation**.
- WebAggregator, a model fine-tuned on WebAggregatorQA, demonstrates strong performance on GAIA-text and the WebAggregatorQA test set.
---
## ✨ Features

- 🤖 **Fully Automated and Verifiable QA Construction**
- 😄 **Open Source**: Complete codebase including QA construction engine, queries, trajectories, and models.
- 👍 **Highly Customizable**: Collect data tailored to your needs with minimal human effort, and easily customize your own agent!
---
## ⚡ Quick Start
Follow these steps to get started:
### 1️⃣ Clone the Repository
```bash
git clone https://github.com/Tencent/WebAggregator
```
### 2️⃣ Install Dependencies
1. This project builds on smolagents’ “open deep research” example, so install its dependencies first 👉 [smolagents open_deep_research dependencies](https://github.com/huggingface/smolagents/tree/main/examples/open_deep_research). Thanks to the smolagents team for their great work; please cite them!
2. Install this project’s requirements:
```bash
pip install -r requirements.txt
```
3. **Please note**: the implementation must use the bundled `./smolagents`, which includes the functionality we added for trajectory collection. Alternatively, replace `smolagents/agents.py` in your installed smolagents library with our version.
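
For the second option in step 3, the replacement can be scripted. The following is a minimal sketch, not part of the repository: the helper names `find_installed_package_dir` and `replace_agents_file` are hypothetical, and the location of the patched `agents.py` inside your clone is something to verify yourself before running it.

```python
import importlib.util
import shutil
from pathlib import Path
from typing import Optional


def find_installed_package_dir(package: str = "smolagents") -> Optional[Path]:
    """Return the directory of an importable package, or None if it is not found."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


def replace_agents_file(patched_file: Path, package_dir: Path) -> Path:
    """Back up package_dir/agents.py (if present), then overwrite it with patched_file."""
    target = package_dir / "agents.py"
    if target.exists():
        # Keep the original next to the replacement so the change can be undone.
        shutil.copy2(target, target.with_name("agents.py.bak"))
    shutil.copy2(patched_file, target)
    return target
```

A typical call would pass the patched file from your clone and the result of `find_installed_package_dir()`. Run it from outside the repository root; otherwise Python may resolve the local `./smolagents` directory instead of the installed package.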
---
## 🚀 Usage
### ⚙️ Configuration
Set the configuration in the following files:
- `./config.py`: Contains settings for your agent's foundation LLM, the LLMs for specific tools, and dataset paths.
- `./model_list.py`: Implements how your foundation models are called (e.g., via vLLM, LiteLLM, or Azure), using the models configured in `./config.py`. We provide an example implementation. For more details, please refer to the smolagents repository.
Other files and directories:
- `./web_tools.py`: Tools for the agent. Modify them to suit your needs.
- `./run_agent.py`: The implemented agent.
- `./run`: Scripts for running the agent.
- `./data`: Input data for QA construction (URLs), evaluation (benchmarks), and trajectory sampling (QAs).
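
To make the three kinds of settings in `./config.py` concrete, here is a hypothetical sketch of how such a configuration could be organized. It does not reproduce the repository's actual `config.py`: the class names, field names, tool name, model IDs, and endpoint URLs are all placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class LLMEndpoint:
    """Connection settings for one served model (all values are placeholders)."""
    model_id: str
    api_base: str
    api_key: str = "EMPTY"


@dataclass
class AgentConfig:
    """Groups the agent's foundation LLM, per-tool LLMs, and dataset paths."""
    agent_llm: LLMEndpoint
    tool_llms: Dict[str, LLMEndpoint] = field(default_factory=dict)
    dataset_paths: Dict[str, str] = field(default_factory=dict)

    def llm_for_tool(self, tool_name: str) -> LLMEndpoint:
        # Fall back to the agent's own LLM when a tool has no dedicated model.
        return self.tool_llms.get(tool_name, self.agent_llm)


# Hypothetical values; replace them with your own served checkpoints and paths.
CONFIG = AgentConfig(
    agent_llm=LLMEndpoint("my-tuned-checkpoint", "http://localhost:8000/v1"),
    tool_llms={"web_reader": LLMEndpoint("my-tool-model", "http://localhost:8001/v1")},
    dataset_paths={"train_samples": "./data/train-samples"},
)
```

Keeping a fallback from tool LLMs to the agent LLM is one design choice among several; a stricter setup could raise an error for unconfigured tools instead.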
---
### ▶️ Running the Project
> **Note:** Before running any scripts, ensure all paths, model checkpoints, and other necessary parameters are properly set in the source files.
---
#### 1️⃣ Evaluation
To evaluate your agent, serve your tuned checkpoint and update the corresponding settings in `config.py`. Make sure the correct `model_id` is set in the evaluation script `test.sh`, then run:
```bash
bash run/test.sh
```
This command evaluates the specified model on the specified benchmark. After the run, it uses an LLM-as-judge to score the answers and prints the accuracy.
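
The LLM-as-judge step can be pictured with the sketch below. It is an illustration under stated assumptions, not the repository's evaluation code: the prompt wording, the record keys (`question`, `answer`, `prediction`), and the `judge` callable (any function that sends a prompt to an LLM and returns its text reply) are all hypothetical.

```python
from typing import Callable, Iterable, Mapping

# Hypothetical prompt template; the real evaluation prompt may differ.
JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {answer}\n"
    "Model answer: {prediction}\n"
    "Reply with exactly one word: CORRECT or INCORRECT."
)


def judge_accuracy(
    records: Iterable[Mapping[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Ask the judge about each record and return the fraction judged correct."""
    total = 0
    correct = 0
    for record in records:
        prompt = JUDGE_TEMPLATE.format(
            question=record["question"],
            answer=record["answer"],
            prediction=record["prediction"],
        )
        verdict = judge(prompt).strip().upper()
        total += 1
        # "INCORRECT" does not start with "CORRECT", so it is not counted.
        if verdict.startswith("CORRECT"):
            correct += 1
    return correct / total if total else 0.0
```

In practice `judge` would wrap a call to the judging model configured for evaluation; for a dry run, any function that returns `"CORRECT"` or `"INCORRECT"` is enough.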
---
#### 2️⃣ QA Construction
Build web agent data automatically:
1. Download our collected URLs 👉 [URLs](https://huggingface.co/datasets/CognitiveKernel/WebAggregatorQA) **or** gather URLs related to your domains of interest!
2. Then, run the following command to collect the data.
```bash
bash run/QA_building.sh
```
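
If you gather your own URLs, it can help to clean the list before feeding it to QA construction. The sketch below assumes a plain list with one URL per line, which may not match the input format the repository expects; the function name `clean_url_list` is hypothetical, and dropping fragments is a choice that may not suit pages that rely on them.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def clean_url_list(lines: Iterable[str]) -> List[str]:
    """Keep unique http(s) URLs in first-seen order, ignoring blanks and # comments."""
    seen = set()
    cleaned: List[str] = []
    for line in lines:
        raw = line.strip()
        if not raw or raw.startswith("#"):
            continue
        parts = urlsplit(raw)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        # Lowercase the host, default the path to "/", and drop the fragment.
        normalized = urlunsplit(
            (parts.scheme, parts.netloc.lower(), parts.path or "/", parts.query, "")
        )
        if normalized not in seen:
            seen.add(normalized)
            cleaned.append(normalized)
    return cleaned
```

For example, `https://Example.com` and `https://example.com/#top` collapse into a single entry, while non-HTTP schemes are skipped.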
---
#### 3️⃣ Trajectory Sampling
Training trajectories for fine-tuning your agent foundation models are available at 👉 [WebAggregatorQA](https://huggingface.co/datasets/CognitiveKernel/WebAggregatorQA). Sample data for initial testing is in `./data/train-samples`. To sample trajectories yourself, run:
```bash
bash run/traj_sampling.sh
```
---
## Friendly links to other works from Tencent AI Lab
- Deep Research Agent framework: [Cognitive Kernel-Pro](https://github.com/Tencent/CognitiveKernel-Pro)
- Agent Self-Evolving Research, including [WebEvolver, WebCoT](https://github.com/Tencent/SelfEvolvingAgent), [WebVoyager](https://github.com/MinorJerry/WebVoyager), [OpenWebVoyager](https://github.com/MinorJerry/OpenWebVoyager).
## Citation
```bibtex
@misc{wang2025exploreevolvescalingevolved,
title={Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents},
author={Rui Wang and Ce Zhang and Jun-Yu Ma and Jianshu Zhang and Hongru Wang and Yi Chen and Boyang Xue and Tianqing Fang and Zhisong Zhang and Hongming Zhang and Haitao Mi and Dong Yu and Kam-Fai Wong},
year={2025},
eprint={2510.14438},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.14438},
}
@misc{fang2025cognitivekernelpro,
title={Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training},
author={Tianqing Fang and Zhisong Zhang and Xiaoyang Wang and Rui Wang and Can Qin and Yuxuan Wan and Jun-Yu Ma and Ce Zhang and Jiaqi Chen and Xiyun Li and Hongming Zhang and Haitao Mi and Dong Yu},
year={2025},
eprint={2508.00414},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2508.00414},
}
```