https://github.com/firefly-cpp/arm-preprocessing
Implementation of several preprocessing techniques for Association Rule Mining (ARM)
- Host: GitHub
- URL: https://github.com/firefly-cpp/arm-preprocessing
- Owner: firefly-cpp
- License: MIT
- Created: 2022-09-27T11:47:48.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-19T15:35:40.000Z (3 months ago)
- Last Synced: 2025-03-27T00:54:57.018Z (3 months ago)
- Language: Python
- Homepage:
- Size: 1.1 MB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
arm-preprocessing
💡 Why arm-preprocessing? •
✨ Key features •
📦 Installation •
🚀 Usage •
🔗 Related frameworks •
📚 References •
🔑 License

arm-preprocessing is a lightweight Python library supporting several key steps involving data preparation, manipulation, and discretisation for Association Rule Mining (ARM). 🧠 Embrace its minimalistic design that prioritises simplicity. 💡 The framework is intended to be fully extensible and offers seamless integration with related ARM libraries (e.g., [NiaARM](https://github.com/firefly-cpp/NiaARM)). 🔗
* **Free software:** MIT license
* **Documentation**: [http://arm-preprocessing.readthedocs.io](http://arm-preprocessing.readthedocs.io)
* **Python**: 3.9.x, 3.10.x, 3.11.x, 3.12.x
* **Tested OS:** Windows, Ubuntu, Fedora, Alpine, Arch, macOS. **However, the library may well work on other systems too.**

## 💡 Why arm-preprocessing?
While numerous libraries facilitate data mining preprocessing tasks, this library is designed to integrate seamlessly with association rule mining. It harmonises well with the NiaARM library, a robust numerical association rule mining framework. The primary aim is to bridge the gap between preprocessing and rule mining, simplifying the workflow/pipeline. Additionally, its design allows for the effortless incorporation of new preprocessing methods and fast benchmarking.
## ✨ Key features
- Loading various formats of datasets (CSV, JSON, TXT, TCX) 📊
- Converting datasets to different formats 🔄
- Loading different types of datasets (numerical dataset, discrete dataset, time-series data, text, etc.) 📉
- Dataset identification (which type of dataset) 🔍
- Dataset statistics 📈
- Discretisation methods 📏
- Data squashing methods 🤏
- Feature scaling methods ⚖️
- Feature selection methods 🎯

## 📦 Installation
### pip
To install ``arm-preprocessing`` with pip, use:
```bash
pip install arm-preprocessing
```

To install ``arm-preprocessing`` on Alpine Linux, please use:
```sh
$ apk add py3-arm-preprocessing
```

To install ``arm-preprocessing`` on Arch Linux, please use an [AUR helper](https://wiki.archlinux.org/title/AUR_helpers):
```sh
$ yay -Syyu python-arm-preprocessing
```

## 🚀 Usage
### Data loading
The following example demonstrates how to load a dataset from a file (CSV, JSON, TXT). More examples can be found in the [examples/data_loading](./examples/data_loading/) directory:
- [Loading a dataset from a CSV file](./examples/data_loading/load_dataset_csv.py)
- [Loading a dataset from a JSON file](./examples/data_loading/load_dataset_json.py)
- [Loading a dataset from a TCX file](./examples/data_loading/load_dataset_tcx.py)
- [Loading a time-series dataset](./examples/data_loading/load_dataset_timeseries.py)

```python
from arm_preprocessing.dataset import Dataset

# Initialise dataset with filename (without format) and format (csv, json, txt)
dataset = Dataset('path/to/datasets', format='csv')

# Load dataset
dataset.load_data()
df = dataset.data
```

### Missing values
The following example demonstrates how to handle missing values in a dataset using imputation. More examples can be found in the [examples/missing_values](./examples/missing_values) directory:
- [Handling missing values in a dataset using row deletion](./examples/missing_values/missing_values_rows.py)
- [Handling missing values in a dataset using column deletion](./examples/missing_values/missing_values_columns.py)
- [Handling missing values in a dataset using imputation](./examples/missing_values/missing_values_impute.py)

```python
from arm_preprocessing.dataset import Dataset

# Initialise dataset with filename and format
dataset = Dataset('examples/missing_values/data', format='csv')
dataset.load()

# Impute missing data
dataset.missing_values(method='impute')
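
# For intuition only: mean imputation on a single column reduces to this
# plain-Python sketch (hypothetical values, not the library's internals)
column = [1.0, None, 3.0, None, 5.0]
present = [x for x in column if x is not None]
mean = sum(present) / len(present)
imputed = [mean if x is None else x for x in column]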
```

### Data discretisation
The following example demonstrates how to discretise a dataset using the equal width method. More examples can be found in the [examples/discretisation](./examples/discretisation) directory:
- [Discretising a dataset using the equal width method](./examples/discretisation/equal_width_discretisation.py)
- [Discretising a dataset using the equal frequency method](./examples/discretisation/equal_frequency_discretisation.py)
- [Discretising a dataset using k-means clustering](./examples/discretisation/kmeans_discretisation.py)

```python
from arm_preprocessing.dataset import Dataset

# Initialise dataset with filename (without format) and format (csv, json, txt)
dataset = Dataset('datasets/sportydatagen', format='csv')
dataset.load_data()

# Discretise dataset using equal width discretisation
dataset.discretise(method='equal_width', num_bins=5, columns=['calories'])
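
# For intuition only: equal-width binning by itself reduces to this
# plain-Python sketch (hypothetical values, not the library's internals)
values = [90.0, 120.0, 150.0, 300.0, 480.0]
num_bins = 5
lo, hi = min(values), max(values)
width = (hi - lo) / num_bins
# Clamp the maximum value into the last bin
bin_indices = [min(int((v - lo) / width), num_bins - 1) for v in values]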
```

### Data squashing
The following example demonstrates how to squash a dataset using Euclidean similarity. More examples can be found in the [examples/squashing](./examples/squashing) directory:
- [Squashing a dataset using the euclidean similarity](./examples/squashing/squash_euclidean.py)
- [Squashing a dataset using the cosine similarity](./examples/squashing/squash_cosine.py)

```python
from arm_preprocessing.dataset import Dataset

# Initialise dataset with filename and format
dataset = Dataset('datasets/breast', format='csv')
dataset.load()

# Squash dataset
dataset.squash(threshold=0.75, similarity='euclidean')
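
# For intuition only: the Euclidean distance underlying the similarity check
# reduces to this plain-Python sketch (hypothetical rows, not the library's
# merging logic)
import math
row_a = [1.0, 2.0, 3.0]
row_b = [1.0, 2.5, 3.5]
distance = math.dist(row_a, row_b)
# Rows judged similar enough (above the threshold) are merged into one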
```

### Feature scaling
The following example demonstrates how to scale the dataset's features. More examples can be found in the [examples/scaling](./examples/scaling) directory:
- [Scale features using normalisation](./examples/scaling/normalisation.py)
- [Scale features using standardisation](./examples/scaling/standardisation.py)

```python
from arm_preprocessing.dataset import Dataset

# Initialise dataset with filename and format
dataset = Dataset('datasets/Abalone', format='csv')
dataset.load()

# Scale dataset using normalisation
dataset.scale(method='normalisation')
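
# For intuition only: min-max normalisation reduces to this plain-Python
# sketch (hypothetical values, not the library's internals)
column = [2.0, 4.0, 6.0, 10.0]
lo, hi = min(column), max(column)
normalised = [(x - lo) / (hi - lo) for x in column]
# Every value now lies in the [0, 1] range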
```

### Feature selection
The following example demonstrates how to select features from a dataset. More examples can be found in the [examples/feature_selection](./examples/feature_selection) directory:
- [Select features using the Kendall Tau correlation coefficient](./examples/feature_selection/feature_selection.py)

```python
from arm_preprocessing.dataset import Dataset

# Initialise dataset with filename and format
dataset = Dataset('datasets/sportydatagen', format='csv')
dataset.load()

# Feature selection
dataset.feature_selection(
    method='kendall', threshold=0.15, class_column='calories')
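
# For intuition only: Kendall's tau counts concordant vs discordant pairs;
# a plain-Python sketch on two hypothetical columns (ignoring ties)
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
pairs = [(i, j) for i in range(len(x)) for j in range(i + 1, len(x))]
concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
tau = (concordant - discordant) / len(pairs)
# Features whose correlation with the class column falls below the
# threshold would be dropped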
```

## 🔗 Related frameworks
[1] [NiaARM: A minimalistic framework for Numerical Association Rule Mining](https://github.com/firefly-cpp/NiaARM)
[2] [uARMSolver: universal Association Rule Mining Solver](https://github.com/firefly-cpp/uARMSolver)
## 📚 References
[1] I. Fister, I. Fister Jr., D. Novak and D. Verber, [Data squashing as preprocessing in association rule mining](https://iztok-jr-fister.eu/static/publications/300.pdf), 2022 IEEE Symposium Series on Computational Intelligence (SSCI), Singapore, Singapore, 2022, pp. 1720-1725, doi: 10.1109/SSCI51031.2022.10022240.
[2] I. Fister Jr. and I. Fister, [A brief overview of swarm intelligence-based algorithms for numerical association rule mining](https://arxiv.org/abs/2010.15524), arXiv preprint arXiv:2010.15524 (2020).
## 🔑 License
This package is distributed under the MIT License. This license can be found online at .

## Disclaimer
This framework is provided as-is, and there are no guarantees that it fits your purposes or that it is bug-free. Use it at your own risk!