https://github.com/dobbersc/fundus-evaluation

[ACL 2024] Evaluation of the Fundus News Scraper

Host: GitHub
URL: https://github.com/dobbersc/fundus-evaluation
Owner: dobbersc
License: mit
Created: 2024-03-06T18:50:09.000Z (about 1 year ago)
Default Branch: master
Last Pushed: 2024-08-14T04:24:45.000Z (9 months ago)
Last Synced: 2025-03-31T12:11:10.759Z (about 2 months ago)
Topics: acl2024, crawling, evaluation, fundus, nlp, python, scraping
Language: Python
Homepage: https://github.com/flairNLP/fundus
Size: 35.8 MB
Stars: 10
Watchers: 2
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Fundus News Scraper Evaluation

This repository contains the evaluation code and dataset to reproduce the results from the [paper](https://aclanthology.org/2024.acl-demos.29/) "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions".

[Fundus](https://github.com/flairNLP/fundus) is a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code.

In the following sections, we provide instructions to reproduce the comparative evaluation of Fundus against prominent scraping libraries.
Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than comparable news scrapers.
For a more in-depth overview of Fundus, the evaluation practices, and the results, consult the [result summary](https://github.com/dobbersc/fundus-evaluation/tree/master?tab=readme-ov-file#results) and our [paper](https://aclanthology.org/2024.acl-demos.29/).

## Prerequisites

Fundus and this evaluation repository require Python 3.8 or later; the Boilerpipe scraper additionally requires Java.
*(Note: The evaluation was tested and performed using Python 3.8 and Java JDK 17.0.10.)*

To install the `fundus-evaluation` Python package, including the reference scraper dependencies, clone this GitHub repository and simply install the package using pip:

```bash
git clone https://github.com/dobbersc/fundus-evaluation.git
pip install ./fundus-evaluation
```

This installation also contains the dataset and evaluation results.
If you are only interested in the Python package (without the dataset and evaluation results), install the `fundus-evaluation` package directly from GitHub using pip:

```bash
pip install git+https://github.com/dobbersc/fundus-evaluation.git@master
```

Verify the installation by running `evaluate --version`, with the expected output of `evaluate <version>`, where `<version>` specifies the current version of the evaluation package.
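
The prerequisites can also be checked programmatically before starting the pipeline. The following is a minimal sketch and not part of the package: the helper name `check_prerequisites` is hypothetical, and it assumes that both the Java runtime and the `evaluate` command are discoverable on `PATH`.

```python
import shutil
import sys
from typing import Dict, Tuple


def check_prerequisites(min_python: Tuple[int, int] = (3, 8)) -> Dict[str, bool]:
    """Report which prerequisites appear to be satisfied (hypothetical helper)."""
    return {
        # Fundus and this repository require Python 3.8 or later.
        "python": sys.version_info >= min_python,
        # Java is only needed for the Boilerpipe scraper.
        "java": shutil.which("java") is not None,
        # The `evaluate` CLI should be available once the package is installed.
        "evaluate": shutil.which("evaluate") is not None,
    }


if __name__ == "__main__":
    for name, found in check_prerequisites().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

A missing `java` entry only matters if Boilerpipe is part of the evaluation run.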

#### Development

For development, install the package, including the development dependencies:

```bash
git clone https://github.com/dobbersc/fundus-evaluation.git
pip install -e ./fundus-evaluation[dev]
```

## Reproducing the Evaluation Results

In the following steps, we assume that the current working directory is the root of the repository.

To fully reproduce the evaluation results, only the dataset is required.
Each step in the evaluation pipeline requires the outputs from the previous step (*dataset -> scrape -> score -> analysis*).
To ease reproducibility, we also provide the artifacts of the intermediate steps in the `dataset` folder.
Therefore, the pipeline may be started from any step.

#### Usage

The evaluation results may be reproduced using the package's command line interface (CLI), representing the evaluation pipeline steps:

```console
$ evaluate --help
usage: evaluate [-h] [--version] {complexity,scrape,score,analysis} ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Fundus News Scraper Evaluation:
  select evaluation pipeline step

  {complexity,scrape,score,analysis}
    complexity          calculate page complexity scores
    scrape              scrape extractions on the evaluation dataset
    score               calculate evaluation scores
    analysis            generate tables and plots
```

Each entry point also provides its help page, e.g. with `evaluate scrape --help`.

As an alternative to the CLI, we provide direct Python entry points in `fundus_evaluation.entry_points`.
In the following steps, we will use the CLI.

### (1) Obtaining the Evaluation Dataset

We selected the 16 English-language publishers Fundus currently supports as the data source, and retrieved five articles for each publisher from the respective RSS feeds/sitemaps.
The selection process yielded an evaluation corpus of 80 news articles.
From it, we manually extracted the plain text from each article and stored it together with information on the original paragraph structure.

The resulting evaluation dataset is included in this repository and consists of the (compressed) HTML [article files](https://github.com/dobbersc/fundus-evaluation/tree/master/dataset/html) and their [ground truth extractions](https://github.com/dobbersc/fundus-evaluation/blob/master/dataset/ground_truth.json) as JSON.
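
Before running the later steps, it can help to confirm that the ground truth file is readable. The sketch below deliberately makes no assumption about the file's schema beyond the top level being a JSON object or array; the helper name `summarize_ground_truth` is hypothetical and not part of the package.

```python
import json
from pathlib import Path
from typing import Tuple, Union


def summarize_ground_truth(path: Union[str, Path]) -> Tuple[str, int]:
    """Return the top-level JSON type name and its number of entries."""
    with Path(path).open(encoding="utf-8") as file:
        data = json.load(file)
    # Assumption: the top level is a container (object or array), so len() applies.
    return type(data).__name__, len(data)


if __name__ == "__main__":
    ground_truth = Path("dataset/ground_truth.json")
    if ground_truth.exists():
        kind, size = summarize_ground_truth(ground_truth)
        print(f"{ground_truth}: top-level {kind} with {size} entries")
```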

### (2) Generating the Scraper Extractions

Execute the following command to let all supported scrapers extract the plain text of the evaluation dataset's articles:

```bash
evaluate scrape \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-directory dataset/extractions/
```

To restrict the scrapers that are part of the evaluation,
- use the `--scrapers` option to explicitly specify a list of evaluation scrapers,
- or use the `--exclude-scrapers` option to exclude scrapers from the evaluation.

For example, to exclude BoilerNet, which is very resource-intensive, add the `--exclude-scrapers boilernet` argument to the command above.
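
When the scrape step is scripted, the argument list can be assembled programmatically. The following sketch only uses the options shown above; the helper name `build_scrape_command` is hypothetical, and passing several scraper names as separate space-separated values is an assumption that should be checked against `evaluate scrape --help`.

```python
from typing import List, Optional, Sequence


def build_scrape_command(
    scrapers: Optional[Sequence[str]] = None,
    exclude_scrapers: Optional[Sequence[str]] = None,
) -> List[str]:
    """Assemble the `evaluate scrape` argument list (hypothetical helper)."""
    command = [
        "evaluate", "scrape",
        "--ground-truth-path", "dataset/ground_truth.json",
        "--html-directory", "dataset/html/",
        "--output-directory", "dataset/extractions/",
    ]
    if scrapers:
        # Assumption: several scraper names are passed as separate values.
        command += ["--scrapers", *scrapers]
    if exclude_scrapers:
        command += ["--exclude-scrapers", *exclude_scrapers]
    return command


if __name__ == "__main__":
    # Print the command that skips the resource-intensive BoilerNet scraper.
    print(" ".join(build_scrape_command(exclude_scrapers=["boilernet"])))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root.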

### (3) Calculating the Evaluation Scores

To evaluate the extraction results with the three supported metrics (paragraph match, ROUGE-LSum and WER), run the following command:

```bash
evaluate score \
  --ground-truth-path dataset/ground_truth.json \
  --extractions-directory dataset/extractions/ \
  --output-directory dataset/scores/
```
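
To build intuition for the precision, recall, and F1 values these metrics produce, the sketch below computes a simplified ROUGE-L on whitespace tokens using the longest common subsequence (LCS). It is an illustration only and not the implementation used in this repository: ROUGE-LSum additionally splits texts into sentences and combines the per-sentence LCS matches, and the function names here are hypothetical.

```python
from typing import List, Tuple


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference: str, extraction: str) -> Tuple[float, float, float]:
    """Simplified ROUGE-L (precision, recall, F1) on whitespace tokens."""
    reference_tokens = reference.split()
    extraction_tokens = extraction.split()
    lcs = lcs_length(reference_tokens, extraction_tokens)
    precision = lcs / len(extraction_tokens) if extraction_tokens else 0.0
    recall = lcs / len(reference_tokens) if reference_tokens else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Two leftover page tokens ("menu", "share") surround the article text.
    print(rouge_l("the cat sat on the mat", "menu the cat sat on the mat share"))
```

In this example the extraction contains the complete reference, so recall is 1.0, while the two extra tokens lower precision to 0.75.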

#### Calculating the Page Complexity (Optional)

This step is not part of the evaluation in our paper and is thus optional.

Execute the following command to calculate the page complexity scores established in ["An Empirical Comparison of Web Content Extraction Algorithms"](https://downloads.webis.de/publications/papers/bevendorff_2023b.pdf) (Bevendorff et al., 2023):

```bash
evaluate complexity \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-path dataset/complexity.tsv
```

### (4) Analyzing the Data

Run the following command to produce the paper's tables and plots for the ROUGE-LSum score:

```bash
evaluate analysis --rouge-lsum-path dataset/scores/rouge_lsum.tsv --output-directory dataset/analysis/
```

To also produce a boxplot of the page complexity, execute:

```bash
evaluate analysis --complexity-path dataset/complexity.tsv --output-directory dataset/analysis/
```
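
Because every step consumes the outputs of the previous one, the main pipeline (*scrape -> score -> analysis*) can be chained from any starting step. The following sketch wraps the CLI commands shown above with `subprocess`; the names `STEPS`, `COMMANDS`, `commands_from`, and `run_pipeline` are hypothetical, and the script is assumed to run from the repository root with `evaluate` on `PATH`.

```python
import subprocess
from typing import Dict, List

# Order of the main pipeline steps (the optional complexity step is omitted).
STEPS = ["scrape", "score", "analysis"]

COMMANDS: Dict[str, List[str]] = {
    "scrape": [
        "evaluate", "scrape",
        "--ground-truth-path", "dataset/ground_truth.json",
        "--html-directory", "dataset/html/",
        "--output-directory", "dataset/extractions/",
    ],
    "score": [
        "evaluate", "score",
        "--ground-truth-path", "dataset/ground_truth.json",
        "--extractions-directory", "dataset/extractions/",
        "--output-directory", "dataset/scores/",
    ],
    "analysis": [
        "evaluate", "analysis",
        "--rouge-lsum-path", "dataset/scores/rouge_lsum.tsv",
        "--output-directory", "dataset/analysis/",
    ],
}


def commands_from(start: str) -> List[List[str]]:
    """Return the commands for `start` and all subsequent steps."""
    index = STEPS.index(start)  # raises ValueError for unknown step names
    return [COMMANDS[step] for step in STEPS[index:]]


def run_pipeline(start: str = "scrape") -> None:
    """Run the pipeline from `start`; stops at the first failing command."""
    for command in commands_from(start):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("score")`, for instance, would reuse the provided extractions in `dataset/extractions/` and only recompute the scores and the analysis.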

## Results

The following table summarizes the overall performance of Fundus and the other evaluated scrapers in terms of averaged ROUGE-LSum precision, recall, and F1-score, each with its standard deviation.
In addition, we provide the scrapers' versions at their evaluation time.
The table is sorted by F1-score in descending order:

#### Fundus-Evaluation v0.2.0

| **Scraper** | **Precision** | **Recall** | **F1-Score** | **Version** |
|:------------|:--------------|:-----------|:-------------|:------------|
| [Fundus](https://github.com/flairNLP/fundus) | **99.89** ±0.57 | 96.75 ±12.75 | **97.69** ±9.75 | 0.4.1 |
| [Trafilatura](https://github.com/adbar/trafilatura) | 93.91 ±12.89 | 96.85 ±15.69 | 93.62 ±16.73 | 1.12.0 |
| [news-please](https://github.com/fhamborg/news-please) | 97.95 ±10.08 | 91.89 ±16.15 | 93.39 ±14.52 | 1.6.13 |
| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py) | 81.09 ±19.41 | **98.23** ±8.61 | 87.14 ±15.48 | / |
| [jusText](https://github.com/miso-belica/jusText) | 86.51 ±18.92 | 90.23 ±20.61 | 86.96 ±19.76 | 3.0.1 |
| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet) | 85.96 ±18.55 | 91.21 ±19.15 | 86.52 ±18.03 | / |
| [Boilerpipe](https://github.com/kohlschutter/boilerpipe) | 82.89 ±20.65 | 82.11 ±29.99 | 79.90 ±25.86 | 1.3.0 |
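
Each table cell aggregates per-article scores into a mean and a standard deviation. The sketch below shows one way to produce such a cell from scores in the range 0 to 1. It is not the repository's analysis code: the function name `summarize` is hypothetical, and the use of the population standard deviation (`pstdev`) is an assumption, since the text above does not state which variant was used.

```python
from statistics import mean, pstdev
from typing import Sequence


def summarize(scores: Sequence[float]) -> str:
    """Format per-article scores (0..1) as 'mean ±std' in percent."""
    percentages = [100 * score for score in scores]
    # Assumption: population standard deviation; a sample estimate would use stdev().
    return f"{mean(percentages):.2f} ±{pstdev(percentages):.2f}"


if __name__ == "__main__":
    # Three perfect extractions and one that recovers half of the article.
    print(summarize([1.0, 1.0, 0.5, 1.0]))  # prints "87.50 ±21.65"
```

With only a handful of articles per publisher, a single poor extraction shifts both the mean and the standard deviation noticeably, as the example shows.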

Previous Results

#### Fundus-Evaluation v0.1.0

| **Scraper** | **Precision** | **Recall** | **F1-Score** | **Version** |
|:------------|:--------------|:-----------|:-------------|:------------|
| [Fundus](https://github.com/flairNLP/fundus) | **99.89** ±0.57 | 96.75 ±12.75 | **97.69** ±9.75 | 0.2.2 |
| [Trafilatura](https://github.com/adbar/trafilatura) | 90.54 ±18.86 | 93.23 ±23.81 | 89.81 ±23.69 | 1.7.0 |
| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py) | 81.09 ±19.41 | **98.23** ±8.61 | 87.14 ±15.48 | / |
| [jusText](https://github.com/miso-belica/jusText) | 86.51 ±18.92 | 90.23 ±20.61 | 86.96 ±19.76 | 3.0.0 |
| [news-please](https://github.com/fhamborg/news-please) | 92.26 ±12.40 | 86.38 ±27.59 | 85.81 ±23.29 | 1.5.44 |
| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet) | 84.73 ±20.82 | 90.66 ±21.05 | 85.77 ±20.28 | / |
| [Boilerpipe](https://github.com/kohlschutter/boilerpipe) | 82.89 ±20.65 | 82.11 ±29.99 | 79.90 ±25.86 | 1.3.0 |

## Contributing

We encourage contributions, particularly those involving competitive news scrapers. For example, you can contribute by:

- **Submitting a New Scraper:** Open an issue or submit a pull request to incorporate your scraper into our evaluation pipeline. We will review and integrate new submissions as appropriate.
- **Updating an Existing Scraper:** Please inform us if a supported scraper has undergone significant updates. We are open to re-evaluating our results accordingly. (Previous evaluation results are available on our [Release Page](https://github.com/dobbersc/fundus-evaluation/releases).)

*Note: We also appreciate contributions to the [Fundus](https://github.com/flairNLP/fundus) library!*

## Questions and Support

Please open an issue for unresolved questions about our paper or the evaluation in this repository.
For questions about general functionality or bug reports regarding [Fundus](https://github.com/flairNLP/fundus), please refer to our main repository and submit an issue there.

## Cite

Please cite the following [paper](https://aclanthology.org/2024.acl-demos.29/) when using Fundus or building upon our work:

```bibtex
@inproceedings{dallabetta-etal-2024-fundus,
title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
author = "Dallabetta, Max and
Dobberstein, Conrad and
Breiding, Adrian and
Akbik, Alan",
editor = "Cao, Yixin and
Feng, Yang and
Xiong, Deyi",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-demos.29",
pages = "305--314",
abstract = "This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.",
}
```

## Acknowledgements
- This repository's architecture has been inspired by the [web content extraction benchmark](https://github.com/chatnoir-eu/web-content-extraction-benchmark) (Bevendorff et al., 2023).
- Since BoilerNet has no Python package on PyPI, we adopted a [stripped-down version](https://github.com/chatnoir-eu/web-content-extraction-benchmark/tree/main/src/extraction_benchmark/extractors/boilernet) of the upstream BoilerNet provided by Bevendorff et al. from their web content extraction benchmark.
- Similarly, BTE has no Python package on PyPI. Here, we used the implementation by Jan Pomikalek, taken from [this](https://github.com/chatnoir-eu/web-content-extraction-benchmark/blob/221b6503d66bf4faa378e6ae3c3f63ee01d584c6/src/extraction_benchmark/extractors/bte.py) and [this](https://github.com/dalab/web2text/blob/0f9c7b787ff125ce5190784e741c5b453ddf0560/other_frameworks/bte/bte.py) source.
