An open API service indexing awesome lists of open source software.

https://github.com/coby-sonn/kmeans-python-c

An efficient K-Means clustering implementation combining Python for preprocessing and a C extension for optimized computations, featuring K-Means++ initialization and linked-list memory management.
https://github.com/coby-sonn/kmeans-python-c

kmeans machine-learning memory-management python-c-extension

Last synced: 6 months ago
JSON representation

An efficient K-Means clustering implementation combining Python for preprocessing and a C extension for optimized computations, featuring K-Means++ initialization and linked-list memory management.

Awesome Lists containing this project

README

          

# K-Means Clustering Implementation in Python and C

This repository contains an implementation of the K-Means clustering algorithm, leveraging both Python and C for optimized performance. This was built during university studies in a C & Python data analysis course.

## Overview
- **Python Implementation**: Handles data processing and initialization using K-Means++.
- **C Extension**: Optimized clustering computation using linked lists for efficient memory management.

## Files
- `kmeans_pp.py` - Python implementation, including K-Means++ initialization.
- `kmeansmodule.c` - C extension implementing the core clustering logic.
- `setup.py` - Build script for compiling the C extension into a Python module.

## Installation
### Prerequisites
- Python 3.x installed.
- A C compiler such as `gcc`.

### Building the C Extension
Run the following command to compile the C module:
```sh
python setup.py build_ext --inplace
```
This will generate a shared library (`mykmeanssp.*.so` or `.pyd` on Windows) that can be imported into Python.

## How to Run
### Running the Python Implementation
#### Command Syntax:
```sh
python kmeans_pp.py []
```
- ``: Number of clusters (integer > 1 and < N, where N is the number of points).
- ``: (Optional) Maximum number of iterations (default: 300, max: 1000).
- ``: Convergence threshold (float >= 0).
- ``, ``: CSV files containing the input data.

#### Example:
```sh
python kmeans_pp.py 3 300 0.001 data1.csv data2.csv
```

## Output Format
- The program prints the initial centroids' indices.
- The final cluster centroids are printed, with each centroid on a new line formatted to 4 decimal places.

## Notes
- Input files must be in CSV format with numerical values.
- The implementation uses the K-Means++ initialization method for better convergence.
- The Python script interfaces with the optimized C extension for better performance.

## License
This project is released under the MIT License.