An open API service indexing awesome lists of open source software.

https://github.com/adityakumarda/kmeans-web-analytics

Built with Python, Pandas, and Scikit-learn, this machine learning project uses K-Means to cluster website users by behavior. It reveals patterns in engagement and bounce, helping drive data-informed decisions.
https://github.com/adityakumarda/kmeans-web-analytics

cluster-analysis elbow-curves elbow-method elbow-plot jupyter-notebook kmeans-clustering machine-learning matplotlib numpy pandas python python3 relationship scikit-learn seaborn sklearn

Last synced: about 2 months ago
JSON representation

Built with Python, Pandas, and Scikit-learn, this machine learning project uses K-Means to cluster website users by behavior. It reveals patterns in engagement and bounce, helping drive data-informed decisions.

Awesome Lists containing this project

README

          

# kmeans-web-analytics

Built with Python, Pandas, and Scikit-learn, this machine learning project uses K-Means to cluster website users by behavior. It reveals patterns in engagement and bounce, helping drive data-informed decisions.

---

## Table of Contents

- [Key Features and Benefits](#key-features-and-benefits)
- [Prerequisites and Dependencies](#prerequisites-and-dependencies)
- [Installation and Setup Instructions](#installation-and-setup-instructions)
- [Usage Examples and API Documentation](#usage-examples-and-api-documentation)
- [Configuration Options](#configuration-options)
- [Contributing Guidelines](#contributing-guidelines)
- [License Information](#license-information)
- [Acknowledgments](#acknowledgments)
- [Project Structure](#project-structure)
- [Visual Output Snapshots](#Visual-Output-Snapshots)
- [Future Enhancements](#future-enhancements)
- [About Me](#about-me)

---

## Key Features and Benefits

- **User Segmentation:** Divides website users into distinct clusters based on their behavior patterns.
- **Behavioral Insights:** Identifies common engagement and bounce patterns within each cluster.
- **Data-Driven Decisions:** Enables data-informed decisions regarding website optimization, marketing strategies, and user experience improvements.
- **K-Means Clustering:** Employs the K-Means algorithm to effectively group users with similar behaviors.
- **Python-Based:** Leverages the power and flexibility of Python for data analysis and machine learning.

---

## Prerequisites and Dependencies

Before running this project, ensure you have the following installed:

- Python (3.6 or higher)
- Pandas: `pip install pandas`
- Scikit-learn: `pip install scikit-learn`
- Jupyter Notebook (Optional): `pip install notebook`

---

## Installation and Setup Instructions
1. **Clone the Repository:**

```bash
git clone https://github.com/AdityakumarDA/kmeans-web-analytics.git
cd kmeans-web-analytics
```

2. **Install Dependencies:**

It is recommended to create a virtual environment for this project.

```bash
# Create a virtual environment (optional)
python3 -m venv venv
source venv/bin/activate # On Linux/macOS
# venv\Scripts\activate # On Windows

# Install the required packages
pip install pandas scikit-learn notebook
```

3. **Download the data**
Ensure that the `website_traffic_data.csv` is downloaded and placed in the project directory

---

## Usage Examples and API Documentation

This project primarily consists of a Jupyter Notebook (`ML_project.ipynb`) that demonstrates the usage of K-Means clustering.

1. **Run the Notebook:**

```bash
jupyter notebook ML_project.ipynb
```

2. **Follow the steps within the notebook:** The notebook guides you through data loading, preprocessing, K-Means model training, and cluster analysis. It leverages Pandas and Scikit-learn functions directly.

Example snippet (from notebook concept):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Load the data
data = pd.read_csv("website_traffic_data.csv")

# Select features (e.g., 'engagement', 'bounce_rate')
features = ['engagement', 'bounce_rate']
X = data[features]

# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42) # Example: 3 clusters
data['cluster'] = kmeans.fit_predict(X_scaled)

# Analyze the clusters
print(data.groupby('cluster')[features].mean())
```

---

## Configuration Options

The primary configurable option is the number of clusters (`n_clusters`) in the K-Means algorithm. This can be adjusted within the `ML_project.ipynb` notebook. Experiment with different values to find the optimal number of clusters for your dataset. Also the features that are used for clustering are configurable.

```python
kmeans = KMeans(n_clusters=3, random_state=42) # change n_clusters
```

---

## Contributing Guidelines

Contributions are welcome! To contribute to this project:

1. Fork the repository.
2. Create a new branch for your feature or bug fix.
3. Make your changes and commit them with clear, descriptive messages.
4. Submit a pull request.

Please ensure your code adheres to Python coding standards and includes appropriate documentation.

---

## License Information

No license specified. All rights reserved by **AdityakumarDA**.

---

## Acknowledgments

* [Scikit-learn](https://scikit-learn.org/stable/) - For the K-Means implementation.
* [Pandas](https://pandas.pydata.org/) - For data manipulation and analysis. [Pandas](https://pandas.pydata.org/)

---

## Project Structure

```
kmeans-web-analytics/
├── ML_project.ipynb
├── website_traffic_data.csv
├── README.md
└── images/
├── trafficcost_vs_Search_volume.png
├── elbow_plot.png
└── cluster_scatter.png
```

---

## Visual Output Snapshots

### 1. 🔍 Search Volume vs Traffic Cost

![Search Volume vs Traffic Cost](ML_Images/trafficcost_vs_Search_volume.png)

This scatter plot visualizes how **Search Volume** impacts the **Traffic Cost** for various website keywords or landing pages. It helps identify outliers — e.g., terms with exceptionally high traffic costs or volume. This can assist in budget optimization for paid campaigns or SEO strategy.

---

### 2. 💡 Elbow Method to Determine Optimal Clusters

![Elbow Plot](ML_Images/elbow_plot.png)

The **Elbow Method** helps us decide the optimal number of clusters (`n_clusters`) for K-Means. It plots the number of clusters vs the clustering inertia (error). The 'elbow point' (highlighted with a red star) indicates the most efficient number of clusters — beyond which performance gain diminishes. In this project, 2 clusters were optimal.

---

### 3. 📊 K-Means Cluster Scatter Plot (Search Volume vs Traffic)

![Cluster Scatter Plot](ML_Images/cluster_scatter.png)

This plot displays final **K-Means clustering results**, where:
- Each point is a data sample (a keyword or page).
- Different colors indicate different user segments (clusters).
- **Red stars** mark the centroids (mean position of each cluster).

It provides intuitive insights into user groupings like high-volume, low-cost vs low-volume, high-cost clusters. This is essential for personalized targeting and marketing strategies.

---

## Future Enhancements

- Add features like session duration, pages per session
- Use silhouette score for better cluster selection
- Deploy via Streamlit/Flask for interactivity
- Add time-series or location-based segmentation

---

## About Me

I'm **Aditya Rajput**, a data analyst passionate about storytelling with data, unsupervised learning, and real-world analytics.

- [LinkedIn](https://www.linkedin.com/in/adityakumarda/)
- [GitHub](https://github.com/AdityakumarDA)
- [Tableau Public](https://public.tableau.com/app/profile/adityakumarda)

If you liked this project, please ⭐ the repo!