https://github.com/rainman226/holte-1r
An implementation of Holte's 1R discretizer
https://github.com/rainman226/holte-1r
classification data-mining data-mining-algorithms data-mining-python data-preprocessing discretization python scikit-learn
Last synced: about 2 months ago
JSON representation
An implementation of Holte's 1R discretizer
- Host: GitHub
- URL: https://github.com/rainman226/holte-1r
- Owner: rainman226
- Created: 2025-06-11T11:35:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-13T20:09:45.000Z (about 1 year ago)
- Last Synced: 2025-06-22T10:02:46.948Z (12 months ago)
- Topics: classification, data-mining, data-mining-algorithms, data-mining-python, data-preprocessing, discretization, python, scikit-learn
- Language: Python
- Homepage:
- Size: 8.79 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Implementation of Holte's 1R Algorithm for Data Discretization
This project implements Holte's 1R algorithm for discretizing continuous attributes, developed as part of my Data Mining course. The goal was to transform continuous data into discrete bins for classification tasks, testing the algorithm on some classic datasets.
---
## How This Works
The implementation is a supervised discretization algorithm that creates bins with high class purity, using class labels to guide bin boundaries. It’s written in Python, taking advantage of NumPy, Pandas, Scikit-learn, and SciPy.
### Details of Implementation
- **Function**: `holte_1r_discretize(feature, labels, min_size=6)`
- **Input**:
- `feature`: 1D NumPy array of continuous values.
- `labels`: 1D NumPy array of class labels.
- `min_size`: Minimum instances per bin (default: 6, or 3 for smaller datasets like Iris).
- **Output**:
- `bins`: Array of bin indices for each data point.
- `bin_edges`: List of bin boundaries.
- **Steps**:
1. Sort feature values with class labels.
2. Create bins when class changes, ensuring at least `min_size` instances.
3. Set bin boundaries at midpoints between consecutive values.
4. Merge bins with fewer than `min_size` instances, choosing the merge (left or right) that minimizes entropy (using `scipy.stats.entropy`).
- **Key Feature**: Entropy-based merging optimizes bin purity, improving over V1 (excessive bins) and V2 (simple merging).
The main script is `holte_1r.py`, with test code in `main.py`.
---
## Tests and Results
The algorithm was tested on multiple datasets from Scikit-learn using a decision tree classifier with 10-fold cross-validation. Results compare Holte’s 1R against raw data, equal-width, and equal-frequency binning. Additional datasets (e.g., Glass, Diabetes) are planned for future tests™.
### Test Results
| Dataset | Raw | Holte_1R | Equal-Width | Equal-Frequency |
|---------------|-----------|-----------|-------------|-----------------|
| Breast Cancer | 0.9280 | 0.9385 | 0.9279 | **0.9507** |
| Iris | **0.9533** | 0.9400 | 0.9200 | 0.9267 |
| Wine | 0.8650 | **0.9333** | 0.9049 | 0.8938 |
| Digits | 0.8208 | 0.8152 | 0.8046 | **0.8335** |
### Iris Bin Statistics
| Feature | Entropy | Number of Bins |
|---------------|---------|----------------|
| Feature 1 | 0.693 | 19 |
| Feature 2 | 0.925 | 22 |
| Feature 3 | 0.134 | 7 |
| Feature 4 | 0.123 | 7 |
---
## Bibliography
- [1] J. Dougherty, R. Kohavi, and M. Sahami, “Supervised and unsupervised discretization of continuous features,” in *Proc. 12th Int. Conf. Mach. Learn. (ICML)*, Tahoe City, CA, USA, 1995, pp. 194–202.
- [2] H. Liu, F. Hussain, C. L. Tan, and M. Dash, “Discretization: An enabling technique,” *Data Mining Knowl. Disc.*, vol. 6, no. 4, pp. 393–423, Oct. 2002, doi: 10.1023/A:1016304305535.