https://github.com/headless-start/bias-aware-kmeans

This repository includes Greedy Preprocessing and K-Means Integration for Large-Scale Biased Data.

bias distance-calculation greedy-algorithm kmeans-clustering multiprocessing python3 silhouette-score subsampling

Host: GitHub
URL: https://github.com/headless-start/bias-aware-kmeans
Owner: headless-start
License: mit
Created: 2025-02-01T03:06:30.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-02-01T07:53:35.000Z (10 months ago)
Last Synced: 2025-06-02T14:18:17.124Z (6 months ago)
Topics: bias, distance-calculation, greedy-algorithm, kmeans-clustering, multiprocessing, python3, silhouette-score, subsampling
Language: Jupyter Notebook
Homepage:
Size: 5.15 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# Biased Dataset Clustering: Greedy Preprocessing & k-Means Integration on Large Scale Data

## 📌 Project Overview
This project addresses challenges in applying **k-means clustering** to biased datasets by implementing a **parallelized greedy clustering algorithm** for preprocessing. The algorithm reduces dataset size by selecting representatives based on a tunable distance threshold (τ), enabling efficient clustering. A comparative analysis with **random subsampling** evaluates computational efficiency and clustering quality.

**Dataset**: 118,821 data points with inherent biases (e.g., age, wealth).
**Goal**: Mitigate bias effects by preprocessing the data into representative subsets (1%, 10%, and 25% of the original size) and compare the greedy method against random subsampling.
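
The threshold-based selection described above can be sketched as follows. This is a minimal, serial illustration and not the repository's parallelized implementation; the function name `greedy_representatives` and the use of Euclidean distance are assumptions, since the README does not name the metric.

```python
import numpy as np


def greedy_representatives(points: np.ndarray, tau: float) -> np.ndarray:
    """Return indices of the points kept as representatives.

    Hypothetical helper: a point becomes a new representative only if it
    lies farther than `tau` from every representative chosen so far.
    """
    if len(points) == 0:
        return np.empty(0, dtype=int)

    rep_idx = [0]  # the first point always seeds the representative set
    for i in range(1, len(points)):
        # Euclidean distance (assumed metric) from point i to all current representatives
        dists = np.linalg.norm(points[rep_idx] - points[i], axis=1)
        if dists.min() > tau:
            rep_idx.append(i)
    return np.asarray(rep_idx)
```

A larger `tau` lets each representative absorb more neighbors, so the reduced dataset tends to be smaller; a smaller `tau` keeps more points.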

---

## 🚀 Key Features
1. **Parallelized Greedy Clustering**:
   - Reduces the dataset to target sizes (1%, 10%, 25% of the original) via adaptive τ tuning.
   - Optimized for cache behavior and minimal memory usage.
2. **k-Means Integration**:
   - Clusters the representatives produced by the preprocessing step.
   - Post-processing assigns the original data points to the resulting clusters.
3. **Random Subsampling Baseline**:
   - Generates a comparison dataset by randomly sampling equivalent proportions.
4. **Performance Analysis**:
   - Metrics: runtime, memory usage, Silhouette Score, intra/inter-cluster distances.
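
The k-means integration, post-processing, and random baseline listed above might look roughly like this with scikit-learn. The helper names `cluster_via_sample` and `random_subsample`, the synthetic data, and the choice of two clusters are assumptions for illustration; the repository's notebook may structure these steps differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def random_subsample(n_points: int, fraction: float, seed: int = 0) -> np.ndarray:
    """Hypothetical baseline: pick a random fraction of row indices without replacement."""
    rng = np.random.default_rng(seed)
    size = max(1, int(round(n_points * fraction)))
    return rng.choice(n_points, size=size, replace=False)


def cluster_via_sample(data: np.ndarray, sample_idx: np.ndarray,
                       n_clusters: int, seed: int = 0) -> np.ndarray:
    """Fit k-means on the selected rows only, then label every original row."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(data[sample_idx])
    # Post-processing: each original point is assigned to its nearest centroid
    return km.predict(data)


if __name__ == "__main__":
    # Synthetic stand-in for the real dataset (assumption: two well-separated groups)
    rng = np.random.default_rng(42)
    data = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
                      rng.normal(8.0, 1.0, (200, 2))])

    idx = random_subsample(len(data), fraction=0.10)
    labels = cluster_via_sample(data, idx, n_clusters=2)
    print("Silhouette Score:", silhouette_score(data, labels))
```

Passing the indices returned by the greedy preprocessing step instead of `random_subsample` gives the other arm of the comparison, so both methods are scored on the same full dataset.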

---

## 🔍 Findings
1. **Greedy Clustering**:
   - Achieved target cluster sizes (1%, 10%, 25%) with τ=100.
   - **Runtime**: 43.19s | **Memory**: 1.03MB.
   - **Clustering Quality**: Silhouette Score (-0.0029), Intra-cluster distance (15,289.29).
2. **Random Subsampling**:
   - **Runtime**: 37.14s | **Memory**: Negligible.
   - **Clustering Quality**: Silhouette Score (0.1127), Intra-cluster distance (12,397.45).
3. **Conclusion**:
   - On this dataset, random subsampling was faster (37.14s vs. 43.19s) and produced better clusters (Silhouette Score 0.1127 vs. -0.0029; intra-cluster distance 12,397.45 vs. 15,289.29).
   - Greedy clustering offers structured preprocessing for bias mitigation but requires tuning of τ.
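
One way to approach the τ tuning mentioned above is a bisection search over the threshold. The sketch below is an assumption about how such tuning could work, not the repository's method; `tune_tau` and its `count_for_tau` callback are hypothetical names. It also assumes the number of representatives shrinks as τ grows, which usually holds but is not guaranteed for a greedy pass.

```python
from typing import Callable


def tune_tau(count_for_tau: Callable[[float], int], target: int,
             lo: float, hi: float, tol: int = 0, max_iter: int = 50) -> float:
    """Bisect tau in [lo, hi] until count_for_tau(tau) is within `tol` of `target`.

    `count_for_tau` is a caller-supplied function (hypothetical) that runs the
    greedy preprocessing for a given tau and returns the number of representatives.
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        count = count_for_tau(mid)
        if abs(count - target) <= tol:
            return mid
        if count > target:
            lo = mid  # too many representatives: raise the threshold
        else:
            hi = mid  # too few representatives: lower the threshold
    # No exact hit within max_iter: return the midpoint of the final bracket
    return (lo + hi) / 2
```

For the 118,821-point dataset, the 1%, 10%, and 25% targets correspond to roughly 1,188, 11,882, and 29,705 representatives, so a nonzero `tol` is a practical choice.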

---

## 🛠 System Requirements
### Dependencies
- Python 3.8+
- Libraries: `numpy`, `pandas`, `scikit-learn`
- `multiprocessing` (part of the Python standard library; no separate install needed)

---

## 📄 License
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
