https://github.com/eriknovak/wac

The Wasserstein distance-based news Article Clustering algorithm

news-clustering online-algorithm transformers wasserstein-distance

Host: GitHub
URL: https://github.com/eriknovak/wac
Owner: eriknovak
License: bsd-3-clause
Created: 2023-12-05T19:09:02.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-21T07:41:34.000Z (over 2 years ago)
Last Synced: 2024-06-20T09:56:30.005Z (about 2 years ago)
Topics: news-clustering, online-algorithm, transformers, wasserstein-distance
Language: Jupyter Notebook
Homepage:
Size: 14.4 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# WAC: **W**asserstein distance-based news **A**rticle **C**lustering

This project contains the implementation of the **W**asserstein distance-based news **A**rticle **C**lustering algorithm.
WAC is an unsupervised, two-step online clustering algorithm that uses the Wasserstein distance (and distances
similar to it). The two steps are (1) monolingual clustering of news articles into events and (2) multilingual clustering of those events.

The articles and events are represented using SBERT language models, which are fine-tuned for clustering tasks.

The remainder of this README contains the instructions for running the experiments.
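As a rough illustration of the distance the project is named after, the sketch below computes an entropy-regularized Wasserstein (Sinkhorn) distance between two small sets of embedding vectors in plain Python. It is not the repository's implementation: the function names, the uniform weights, the cosine ground cost, and the reading of `reg` and `n_iter` as counterparts of the `--w_reg` and `--w_nit` options used later in this README are all assumptions made for the example.

```python
import math


def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def sinkhorn_distance(xs, ys, reg=0.1, n_iter=10):
    """Entropy-regularized transport cost between two sets of vectors.

    Assumptions for this sketch: every vector carries the same weight,
    and the ground cost is the cosine distance.
    """
    n, m = len(xs), len(ys)
    a = [1.0 / n] * n  # uniform weights of the first set
    b = [1.0 / m] * m  # uniform weights of the second set
    cost = [[cosine_distance(x, y) for y in ys] for x in xs]
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]

    # Sinkhorn iterations: alternately rescale rows and columns so that the
    # transport plan matches the two weight vectors.
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]

    # Transport plan P[i][j] = u[i] * K[i][j] * v[j]; the distance is <P, cost>.
    return sum(
        u[i] * kernel[i][j] * v[j] * cost[i][j]
        for i in range(n)
        for j in range(m)
    )


# Toy 2-dimensional "embeddings" (real SBERT vectors have hundreds of dimensions).
event_a = [[1.0, 0.0], [0.0, 1.0]]
event_b = [[1.0, 0.0], [0.0, 1.0]]
event_c = [[-1.0, 0.0], [0.0, -1.0]]
print(sinkhorn_distance(event_a, event_b))
print(sinkhorn_distance(event_a, event_c))
```

A smaller `reg` brings the result closer to the exact Wasserstein distance but makes the kernel values smaller and the iterations less stable numerically, which is the usual trade-off with this approximation.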

## 📚 Papers

In case you use any of the components for your research, please refer to (and cite) the papers:

**TODO**

## ☑️ Requirements

Before starting the project make sure these requirements are available:

- [python]. For setting up your research environment and python dependencies (version 3.8 or higher).
- [git]. For versioning your code.

## 🛠️ Setup

### Create a python environment

First create the virtual environment where all the modules will be stored.

#### Using venv

Using the `venv` command, run the following commands:

```bash
# create a new virtual environment
python -m venv venv

# activate the environment (UNIX)
source ./venv/bin/activate

# activate the environment (WINDOWS)
./venv/Scripts/activate

# deactivate the environment (UNIX & WINDOWS)
deactivate
```

### Install

To install the requirements run:

```bash
pip install -e .
```

## 🗃️ Data

The data used in the experiments is a curated set of news articles retrieved from the Event Registry and prepared for the scientific paper[^1].

To download the data run:

```bash
bash scripts/00_download_data.sh
```

This will download the data files and store them in the `data/raw` folder.

## ⚗️ Experiments

To run the experiments, run the following command:

```bash
# run the experiments
bash scripts/run_exp_pipeline.sh
```

The command above will perform a series of experiments by executing the following steps (the names of the files are listed in the `scripts/run_exp_pipeline.sh` file):

```bash
# prepare the data examples for the experiment
python scripts/01_prepare_data.py \
--input_file ./data/raw/dataset.test.json \
--output_file ./data/processed/dataset.test.csv

# cluster articles into events
python scripts/02_article_clustering.py \
--input_file ./data/processed/dataset.test.csv \
--output_file ./data/processed/article_clusters/dataset.test.csv \
--rank_th 0.5 \
--time_std 3 \
--multilingual \
--ents_th 0.0 \
-gpu

# cluster events based on their similarity
python scripts/03_event_clustering.py \
--input_file ./data/processed/article_clusters/dataset.test.csv \
--output_file ./data/processed/event_clusters/dataset.test.csv \
--rank_th 0.7 \
--time_std 3 \
--w_reg 0.1 \
--w_nit 10 \
-gpu

# evaluate the clusters
python scripts/04_evaluate.py \
--label_file_path ./data/processed/dataset.test.csv \
--pred_file_dir ./data/processed/event_clusters \
--output_file ./results/dataset.test.csv

```

The results will be stored in the `results` folder.
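To give an intuition for what an online article clustering step does, here is a minimal sketch in plain Python. It is not the code in `scripts/02_article_clustering.py`. It assumes that `rank_th` acts as a similarity threshold and that `time_std` is the standard deviation (in days) of a Gaussian time decay; the centroid representation, the function names, and the scoring rule are hypothetical choices made for the example.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def time_weight(day_article, day_cluster, time_std):
    """Gaussian decay over the time gap in days (assumed form)."""
    gap = day_article - day_cluster
    return math.exp(-(gap ** 2) / (2 * time_std ** 2))


def cluster_online(articles, rank_th=0.5, time_std=3.0):
    """Assign each (embedding, day) pair to a cluster, in arrival order.

    An article joins the best-scoring existing cluster if the score reaches
    rank_th; otherwise it starts a new cluster. Returns one label per article.
    """
    clusters = []  # each cluster: {"centroid": [...], "day": float, "size": int}
    labels = []
    for emb, day in articles:
        best_id, best_score = None, rank_th
        for cid, c in enumerate(clusters):
            score = cosine_similarity(emb, c["centroid"]) * time_weight(
                day, c["day"], time_std
            )
            if score >= best_score:
                best_id, best_score = cid, score
        if best_id is None:
            clusters.append({"centroid": list(emb), "day": day, "size": 1})
            best_id = len(clusters) - 1
        else:
            c = clusters[best_id]
            c["size"] += 1
            k = c["size"]
            # Running mean of the embeddings and of the publication day.
            c["centroid"] = [m + (x - m) / k for m, x in zip(c["centroid"], emb)]
            c["day"] += (day - c["day"]) / k
        labels.append(best_id)
    return labels


# Toy stream: two similar articles, one unrelated, one similar but much later.
stream = [
    ([1.0, 0.0], 0),
    ([0.9, 0.1], 1),
    ([0.0, 1.0], 1),
    ([1.0, 0.0], 30),
]
print(cluster_online(stream, rank_th=0.5, time_std=3.0))
```

In this toy stream the last article matches the first cluster in content but arrives a month later, so the time decay pushes its score below the threshold and it opens a new cluster. This is the kind of behaviour a time-aware threshold is meant to produce for recurring topics.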

### Results

The hyper-parameters were selected with a grid search, evaluating the performance of the clustering algorithm on the dev set for each combination of hyper-parameter values.
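A grid search of this kind can be sketched as follows. The grid values and the `toy_dev_score` function are placeholders invented for the example; in the real setting the score would come from running the clustering pipeline on the dev set and evaluating its output.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one.

    param_grid maps a parameter name to the list of values to try;
    evaluate(**params) must return a score where higher is better.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_dev_score(rank_th, time_std):
    """Stand-in for a dev-set evaluation (hypothetical, peaks at 0.5 and 3)."""
    return -(abs(rank_th - 0.5) + abs(time_std - 3))


# Hypothetical grid; the values actually searched are not listed here.
grid = {"rank_th": [0.4, 0.5, 0.6], "time_std": [1, 3, 5]}
best_params, best_score = grid_search(grid, toy_dev_score)
print(best_params, best_score)
```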

#### Performance results

The best performance is obtained with the following parameters:

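| Variant name | Article clustering: rank_th | Article clustering: ents_th | Article clustering: time_std | Cluster merging: rank_th | Cluster merging: time_std | Standard F1 | Standard P | Standard R | BCubed F1 | BCubed P | BCubed R | clusters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WAC_MONO | 0.5 | - | 3 | 0.7 | 3 | 87.00 | 98.45 | 77.95 | 85.42 | 93.04 | 78.95 | 1066 |
| WAC_MONO | 0.6 | - | 3 | 0.7 | 3 | 69.50 | 98.71 | 53.63 | 81.08 | 94.14 | 71.20 | 1108 |
| WAC_MONO+NER | 0.5 | 0.2 | 3 | 0.7 | 3 | 85.02 | 98.52 | 74.77 | 84.78 | 93.51 | 77.54 | 1089 |
| WAC_MONO+NER | 0.6 | 0.2 | 3 | 0.7 | 3 | 67.23 | 98.12 | 51.14 | 79.72 | 93.80 | 69.32 | 1109 |
| WAC_MULTI | 0.5 | - | 3 | 0.7 | 3 | 92.20 | 98.55 | 86.62 | 86.67 | 92.94 | 81.20 | 1074 |
| WAC_MULTI | 0.6 | - | 3 | 0.7 | 3 | 74.43 | 98.81 | 59.70 | 81.98 | 94.00 | 72.68 | 1112 |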
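For readers unfamiliar with the BCubed columns, the sketch below shows the standard BCubed precision, recall, and F1 computation in plain Python. It is an illustration of the metric's usual definition, not the code in `scripts/04_evaluate.py`; the function name and input format are assumptions. The table reports these values multiplied by 100.

```python
from collections import Counter


def bcubed(gold, pred):
    """BCubed precision, recall and F1 for two aligned label lists.

    gold[i] is the true cluster of item i and pred[i] its predicted cluster.
    For each item, precision is the share of its predicted cluster that has
    the same gold label, and recall is the share of its gold cluster that
    ended up in the same predicted cluster; both are averaged over items.
    """
    n = len(gold)
    gold_sizes = Counter(gold)
    pred_sizes = Counter(pred)
    overlap = Counter(zip(gold, pred))  # items sharing a gold AND a predicted label

    pairs = list(zip(gold, pred))
    precision = sum(overlap[(g, p)] / pred_sizes[p] for g, p in pairs) / n
    recall = sum(overlap[(g, p)] / gold_sizes[g] for g, p in pairs) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: one article of event "b" is wrongly placed in cluster 0.
gold_labels = ["a", "a", "b", "b"]
pred_labels = [0, 0, 0, 1]
print(bcubed(gold_labels, pred_labels))
```

Splitting an event across many small clusters keeps BCubed precision high while lowering recall, which is one way a configuration can produce many clusters together with high precision and low recall.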

#### Cluster merging assessment analysis

To evaluate the impact of the cluster merging phase on the algorithm’s performance, we compare the WAC algorithm variants to those where the cluster merging phase was not performed (the `WAC_MULTI/MERGE` rows). Note that we compare only the WAC_MULTI variant, as it already generates multilingual clusters during the article clustering phase.

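| Variant name | Article clustering: rank_th | Article clustering: ents_th | Article clustering: time_std | Cluster merging: rank_th | Cluster merging: time_std | Standard F1 | Standard P | Standard R | BCubed F1 | BCubed P | BCubed R | clusters |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WAC_MULTI | 0.5 | - | 3 | 0.7 | 3 | 92.20 | 98.55 | 86.62 | 86.67 | 92.94 | 81.20 | 1074 |
| WAC_MULTI/MERGE | 0.5 | - | 3 | - | - | 56.04 | 98.71 | 39.12 | 71.14 | 96.98 | 56.17 | 2339 |
| WAC_MULTI | 0.6 | - | 3 | 0.7 | 3 | 74.43 | 98.81 | 59.70 | 81.98 | 94.00 | 72.68 | 1112 |
| WAC_MULTI/MERGE | 0.6 | - | 3 | - | - | 24.28 | 99.40 | 13.83 | 47.10 | 99.04 | 31.59 | 4675 |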

## 📣 Acknowledgments

This work was developed by the [Department of Artificial Intelligence][ailab] at the [Jozef Stefan Institute][ijs].

This work was supported by the Slovenian Research Agency and the European Union's Horizon 2020 project Humane AI Net [[H2020-ICT-952026]].

[python]: https://www.python.org/
[git]: https://git-scm.com/
[ailab]: http://ailab.ijs.si/
[ijs]: https://www.ijs.si/
[H2020-ICT-952026]: https://cordis.europa.eu/project/id/952026

[^1]: S. Miranda, A. Znotiņš, S. B. Cohen, and G. Barzdins, “Multilingual clustering of streaming news,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 2018, pp. 4535–4544.
