https://github.com/tschechlovdev/effens
https://github.com/tschechlovdev/effens
Last synced: about 1 year ago
JSON representation
- Host: GitHub
- URL: https://github.com/tschechlovdev/effens
- Owner: tschechlovdev
- Created: 2023-09-19T09:34:39.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-13T11:46:05.000Z (about 2 years ago)
- Last Synced: 2025-02-15T01:38:21.922Z (over 1 year ago)
- Language: Python
- Size: 1.19 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: Readme.md
Awesome Lists containing this project
README
# EffEns: Efficient Ensemble Clustering based on Meta-Learning and Hyperparameter Optimization
Prototypical Implementation in Python of the submitted Paper "Ensemble Clustering based on Meta-Learning and Hyperparameter Optimization" at VLDB 2024.
In the following, we provide an overview of the code structure, an installation instruction, and an example on how to use EffEns.
## Overview
The main code is in the "src" folder. It contains the following modules:
- ``automlclustering``: Contains the adapted code from [AutoML4Clust](https://github.com/tschechlovdev/Automl4Clust) and [ML2DAC](https://github.com/tschechlovdev/ml2dac/tree/main), which provide
implementations of AutoML for Clustering Systems and for different meta-feature sets.
- ``consensus_functions``: Contains implementations of the five consensus functions "ABV", "ACV", "MCLA", "MM", and "QMI", which we used in our paper.
- ``ConsensusCS``: Provides the consensus functions and hyperparameters as configuration space for the optimizer.
- ``EffEnsMKR``: Contains a script that stores the path to the MKR and the filenames of the "evaluated ensembles" and the meta-features.
- ``EnsMetaLearning``: All functionality for our meta-learning procedure.
In particular, for the learning phase to evaluate different ensemble subsets and extract the meta-features.
It also contains ``EffEns`` that can be applied on new datasets.
- ``EnsOptimizer``: Contains the optimizer that we use for hyperparameter optimization of the consensus functions.
We use SMAC as optimizer and provide a wrapper class as well as the black box function for optimization.
- ``Experiments``: Contains the code for the experiments for the synthetic and real-world datasets of our paper (cf. Section 7).
- ``Utils``: Contains some utility code such as functions to process the optimizer results or to clean up temporary directories.
Note that the directory ``real_world_datasets`` contains the datasets that we used in our experiments.
Further, ``evaluation_results`` contains all results from our experimental evaluatioin.
## Installation
Our implementation is based on Python and we require Python 3.9.
Furthermore, as SMAC only runs on Linux, we also require a Linux system.
We have tested on Ubuntu 20.04.
Before installing EffEns, you first have to install the following that are required for some of the libraries:
- ``sudo apt-get install build-essential``
- ``sudo apt-get install gcc``
The easiest way of installing EffEns is to use Anaconda. Follow https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
to install Anaconda.
We will then create a prepared Python 3.9 environment:
- ``conda env create -f environment.yml``
This should create a conda environment with the name "automated_ensemble_clustering".
Then you have to install ib_base as it is not available as package:
```git clone https://collaborating.tuhh.de/cip3725/ib_base.git
cd ib_base
python setup.py install
cd ..
```
After finishing this, you have to add the "src" folder of EffEns and the path to "ib_base" to your PYTHONPATH
You may also have to add them to your conda path
``gedit ~/anaconda3/envs/automated_ensemble_clustering/lib/python3.9/site-packages/conda.pth``
Now everything should be setup and you can try to run ``python src/Experiments/SyntehticData/EffEns_Experiment_synthetic.py``.
This should run without any errors.
## Examples
In the following, we provide a simple example on how to use EffEns on new unseen datasets with the provided Meta-Knowledge Repository:
```Python
from sklearn.datasets import make_blobs
from EnsMetaLearning.EffEns import EffEns
from automlclustering.ClusterValidityIndices import CVIHandler
from Utils.Utils import process_result_to_dataframe
# Generate simple synthetic data
X, y = make_blobs()
# Instantiate EffEnse. Use provided path to MKR.
effens = EffEns(path_to_mkr="./EffEnsMKR/")
# Choose CVI to evaluate results
cvi = CVIHandler.CVICollection.CALINSKI_HARABASZ
# Apply EffEns on Data X
result, _ = effens.apply_ensemble_clustering(X, cvi=cvi, n_loops=5)
# Parse Result
result = process_result_to_dataframe(result, {"cvi": cvi.get_abbrev()},
# compare against ground-truth clustering
ground_truth_clustering=y
)
print(result[["iteration", "config", "CVI score", "Best NMI"]])
```