https://github.com/chouaib-629/customersegmentation

Hadoop-based Customer Segmentation project using the Online Retail Dataset. Implements MapReduce for processing and Python for preprocessing to uncover customer purchasing patterns for targeted marketing.
big-data customer-segmentation data-analysis data-science distributed-computing hadoop hadoop-mapreduce java mapreduce marketing-analytics python

Host: GitHub
URL: https://github.com/chouaib-629/customersegmentation
Owner: chouaib-629
Created: 2025-01-03T21:38:04.000Z (9 months ago)
Default Branch: main
Last Pushed: 2025-01-27T22:39:35.000Z (8 months ago)
Last Synced: 2025-01-27T23:29:00.365Z (8 months ago)
Topics: big-data, customer-segmentation, data-analysis, data-science, distributed-computing, hadoop, hadoop-mapreduce, java, mapreduce, marketing-analytics, python
Language: Jupyter Notebook
Homepage:
Size: 257 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Customer Segmentation Using Hadoop

This project uses Hadoop MapReduce to segment customers in the [Online Retail dataset](https://archive.ics.uci.edu/dataset/352/online+retail). The objective is to calculate three key metrics for each customer:

- Total Spend
- Frequency (number of purchases)
- Recency (time in days since the last purchase)
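
As a rough illustration of what these three metrics mean for a single customer, here is a minimal Python sketch. It is not the project's implementation: the function name `rfm_metrics`, the invoice number `540000`, the reference date, and the choice to count one purchase per distinct invoice are all assumptions made for the example.

```python
from datetime import date

def rfm_metrics(purchases, reference_date):
    """Compute (total spend, frequency, recency) for ONE customer.

    purchases: list of (invoice_no, purchase_date, amount) tuples.
    """
    total_spend = sum(amount for _, _, amount in purchases)
    # Assumption: one "purchase" is one distinct invoice, not one line item.
    frequency = len({invoice for invoice, _, _ in purchases})
    last_purchase = max(day for _, day, _ in purchases)
    recency = (reference_date - last_purchase).days
    return total_spend, frequency, recency

# Hypothetical purchases: two lines on one invoice, one line on another.
purchases = [
    ("536365", date(2010, 12, 1), 15.30),
    ("536365", date(2010, 12, 1), 20.34),
    ("540000", date(2011, 1, 5), 10.00),
]
total, freq, rec = rfm_metrics(purchases, reference_date=date(2011, 12, 10))
print(round(total, 2), freq, rec)
```

The reference date matters: recency is only comparable across customers when every customer is measured against the same date.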

## Table of Contents

- [Features](#features)
- [Technologies Used](#technologies-used)
- [Dataset Structure](#dataset-structure)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Contributing](#contributing)
- [Contact Information](#contact-information)

## Features

- Hadoop-based scalable processing for large datasets.
- Key customer segmentation metrics: TotalSpend, Frequency, and Recency.
- Preprocessing script for dataset preparation.
- MapReduce implementation for distributed data processing.
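
To make the MapReduce flow concrete, the following Python sketch simulates the map, shuffle, and reduce phases in memory. It is only a model of the idea, not the Java code in this repository: the names `map_record`, `merge`, and `run_job` are made up, customer `13047` is a hypothetical record, and the count here is of transaction lines rather than invoices.

```python
from collections import defaultdict
from datetime import date
from functools import reduce

def map_record(record):
    """Map phase: emit (CustomerID, partial aggregate) for one transaction line."""
    customer_id, day, quantity, unit_price = record
    # Partial aggregate: (spend, line count, most recent purchase date).
    return customer_id, (quantity * unit_price, 1, day)

def merge(a, b):
    """Merge two partial aggregates: sum spend, sum counts, keep the latest date."""
    return a[0] + b[0], a[1] + b[1], max(a[2], b[2])

def run_job(records):
    grouped = defaultdict(list)
    # Shuffle phase: group mapped values by customer.
    for key, value in map(map_record, records):
        grouped[key].append(value)
    # Reduce phase: fold each customer's values into one aggregate.
    return {key: reduce(merge, values) for key, values in grouped.items()}

records = [
    ("17850", date(2010, 12, 1), 6, 2.55),
    ("17850", date(2010, 12, 1), 6, 3.39),
    ("13047", date(2010, 12, 2), 2, 5.00),  # hypothetical customer
]
for customer_id, (spend, lines, last_day) in sorted(run_job(records).items()):
    print(customer_id, round(spend, 2), lines, last_day)
```

Because `merge` only sums and takes a maximum, partial results from different nodes can be combined in any grouping, which is what lets this kind of aggregation scale across a cluster.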

## Technologies Used

- **Hadoop**: A framework for distributed storage and processing of big data.
- **Java**: The primary programming language used for writing the MapReduce logic.
- **HDFS**: The Hadoop Distributed File System for storing large datasets.
- **MapReduce**: The processing model used to process data.
- **Python**: For dataset preprocessing and analysis.
- **Libraries**: pandas, matplotlib, seaborn, scipy.stats
- **Linux (Ubuntu)**: Operating System.

## Dataset Structure

The dataset used in this project is the [Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail). It contains all transactions that occurred between 1 December 2010 and 9 December 2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts, and many of its customers are wholesalers.

**Sample of the dataset:**

| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice (Sterling) | CustomerID | Country |
| --------- | --------- | ---------------------------------- | -------- | ------------------- | --------- | ---------- | -------------- |
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01/12/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
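
The sketch below shows one way rows with this structure could be parsed in Python. It assumes the CSV header names match the table columns (with `UnitPrice` unadorned) and that `InvoiceDate` is in day/month/year order; `parse_row` and its output keys are illustrative names, not part of the project.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the sample table, laid out as CSV.
SAMPLE = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010 8:26,2.55,17850.0,United Kingdom
536365,71053,WHITE METAL LANTERN,6,01/12/2010 8:26,3.39,17850.0,United Kingdom
"""

def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed fields."""
    return {
        "invoice_no": row["InvoiceNo"],
        # CustomerID appears as a float-like string ("17850.0"); normalise it.
        "customer_id": str(int(float(row["CustomerID"]))),
        # Assumption: dates are day/month/year, as in the sample above.
        "invoice_date": datetime.strptime(row["InvoiceDate"], "%d/%m/%Y %H:%M"),
        "line_total": int(row["Quantity"]) * float(row["UnitPrice"]),
    }

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r["customer_id"], r["invoice_date"].date(), round(r["line_total"], 2))
```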

## Getting Started

To get started with this project, follow these steps:

### Prerequisites

1. Install Hadoop on your local machine or use a cloud-based Hadoop cluster.
2. Ensure that Java (JDK 8 or later) is installed on your system.
3. Install Python and required libraries:

```bash
pip install pandas matplotlib seaborn scipy
```

4. Download the [Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail).

### Setup Instructions

1. Clone this repository:

```bash
git clone https://github.com/chouaib-629/CustomerSegmentation.git
```

2. Navigate to the project directory:

```bash
cd CustomerSegmentation
```

3. The downloaded dataset is an Excel workbook named `Online Retail.xlsx`. Open it in any spreadsheet tool and export it as a CSV file named `online_retail.csv`.

4. Preprocess the dataset using the provided Python script:

- `.py` format:

```bash
python preprocessing/main.py
```

- `.ipynb` format:

Open and run `preprocessing/main.ipynb` in Jupyter Notebook.

5. Compile the Java classes:

```bash
mkdir -p compiled_classes
javac -classpath "$(hadoop classpath)" -d compiled_classes src/*.java
```

6. Package the classes into a JAR file:

```bash
jar cf CustomerSegmentation.jar -C compiled_classes/ .
```
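
The preprocessing in step 4 is handled by `preprocessing/main.py`, whose exact rules are not described in this README. As a hedged illustration of the kind of cleaning such a step commonly performs on this dataset, the sketch below drops rows with no `CustomerID`, cancelled invoices (the dataset marks these with an `InvoiceNo` starting with `C`), and non-positive quantities or prices. The function `clean_rows` and the second and third sample rows are hypothetical.

```python
import csv
import io

SAMPLE = """InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/2010 8:26,2.55,17850.0,United Kingdom
536414,22139,,56,01/12/2010 11:52,0.0,,United Kingdom
C536379,D,Discount,-1,01/12/2010 9:41,27.5,14527.0,United Kingdom
"""

def clean_rows(rows):
    """Yield only rows usable for per-customer metrics (illustrative rules)."""
    for row in rows:
        if not row["CustomerID"].strip():
            continue  # no customer to attribute the purchase to
        if row["InvoiceNo"].startswith("C"):
            continue  # cancelled invoice
        if int(row["Quantity"]) <= 0 or float(row["UnitPrice"]) <= 0:
            continue  # returns, adjustments, or free items
        yield row

cleaned = list(clean_rows(csv.DictReader(io.StringIO(SAMPLE))))
print(len(cleaned), cleaned[0]["InvoiceNo"])
```

Check the actual script before relying on any of these rules; it may keep or drop different rows.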

## Usage

### Step 1: Upload Dataset to HDFS

1. Create a directory in HDFS to store the dataset:

```bash
hdfs dfs -mkdir -p /CustomerSegmentation
```

2. Upload the preprocessed dataset to HDFS:

```bash
hdfs dfs -put processed_online_retail.csv /CustomerSegmentation/
```

### Step 2: Run the MapReduce Job

Run the Hadoop job using the following command. The output directory must not already exist, because Hadoop refuses to overwrite an existing one:

```bash
hadoop jar CustomerSegmentation.jar Driver /CustomerSegmentation/processed_online_retail.csv /CustomerSegmentation/output/
```

### Step 3: View the Output

To view the results of the MapReduce job, use the following command:

```bash
hdfs dfs -cat /CustomerSegmentation/output/part-r-00000
```

### **Optional:** Save the Output Locally

Copy the output file from HDFS to your local storage for further analysis:

```bash
mkdir -p output
hdfs dfs -get /CustomerSegmentation/output/part-r-00000 output/result.csv
```
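
For the further analysis mentioned above, the saved file could be loaded with a few lines of Python. Hadoop's default text output separates key and value with a tab, but the layout of the value written by this job is not documented here, so the sketch assumes a hypothetical `TotalSpend,Frequency,Recency` value. The sample lines are made up; adjust `parse_output_line` to match the real output.

```python
def parse_output_line(line):
    """Parse one 'CustomerID<TAB>TotalSpend,Frequency,Recency' line (assumed layout)."""
    customer_id, value = line.rstrip("\n").split("\t", 1)
    total_spend, frequency, recency = value.split(",")
    return customer_id, float(total_spend), int(frequency), int(recency)

# Hypothetical lines standing in for the contents of output/result.csv.
sample = "13047\t3079.10,9,31\n17850\t5288.63,34,301\n"
parsed = [parse_output_line(line) for line in sample.splitlines()]

# Example analysis: rank customers by total spend, highest first.
for customer_id, spend, freq, rec in sorted(parsed, key=lambda r: r[1], reverse=True):
    print(customer_id, spend, freq, rec)
```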

## Contributing

Contributions are welcome! To contribute:

1. Fork the repository.
2. Create a new branch:

```bash
git checkout -b feature/feature-name
```

3. Commit your changes:

```bash
git commit -m "Add feature description"
```

4. Push to the branch:

```bash
git push origin feature/feature-name
```

5. Open a pull request.

## Contact Information

For questions or support, please contact [me](mailto:chouaiba629@gmail.com).
