https://github.com/chouaib-629/customersegmentation
Hadoop-based Customer Segmentation project using the Online Retail Dataset. Implements MapReduce for processing and Python for preprocessing to uncover customer purchasing patterns for targeted marketing.
https://github.com/chouaib-629/customersegmentation
big-data customer-segmentation data-analysis data-science distributed-computing hadoop hadoop-mapreduce java mapreduce marketing-analytics python
Last synced: 8 months ago
JSON representation
Hadoop-based Customer Segmentation project using the Online Retail Dataset. Implements MapReduce for processing and Python for preprocessing to uncover customer purchasing patterns for targeted marketing.
- Host: GitHub
- URL: https://github.com/chouaib-629/customersegmentation
- Owner: chouaib-629
- Created: 2025-01-03T21:38:04.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-01-27T22:39:35.000Z (8 months ago)
- Last Synced: 2025-01-27T23:29:00.365Z (8 months ago)
- Topics: big-data, customer-segmentation, data-analysis, data-science, distributed-computing, hadoop, hadoop-mapreduce, java, mapreduce, marketing-analytics, python
- Language: Jupyter Notebook
- Homepage:
- Size: 257 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Customer Segmentation Using Hadoop
This project uses Hadoop MapReduce to perform customer segmentation based on the dataset from an [Online Retail store](https://archive.ics.uci.edu/dataset/352/online+retail). The objective is to calculate three key metrics for each customer:
- Total Spend
- Frequency (number of purchases)
- Recency (time in days since the last purchase)## Table of Contents
- [Features](#features)
- [Technologies Used](#technologies-used)
- [Getting Started](#getting-started)
- [Dataset Structure](#dataset-structure)
- [Usage](#usage)
- [Contributing](#contributing)
- [Contact Information](#contact-information)## Features
- Hadoop-based scalable processing for large datasets.
- Key customer segmentation metrics: TotalSpend, Frequency, and Recency.
- Preprocessing script for dataset preparation.
- MapReduce implementation for distributed data processing.## Technologies Used
- **Hadoop**: A framework for distributed storage and processing of big data.
- **Java**: The primary programming language used for writing the MapReduce logic.
- **HDFS**: The Hadoop Distributed File System for storing large datasets.
- **MapReduce**: The processing model used to process data.
- **Python**: For dataset preprocessing and analysis.
- **Libraries**: pandas, matplotlib, seaborn, scipy.stats
- **Linux (Ubuntu)**: Operating System.## Dataset Structure
The dataset used in this project is the [Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail). It contains all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts, with many of its customers being wholesalers.
**Sample of the dataset:**
| InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice (Sterling) | CustomerID | Country |
| --------- | --------- | ---------------------------------- | -------- | ------------------- | --------- | ---------- | -------------- |
| 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 01/12/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 536365 | 71053 | WHITE METAL LANTERN | 6 | 01/12/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 01/12/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |## Getting Started
To get started with this project, follow these steps:
### Prerequisites
1. Install Hadoop on your local machine or use a cloud-based Hadoop cluster.
2. Ensure that Java (JDK 8 or later) is installed on your system.
3. Install Python and required libraries:```bash
pip install pandas matplotlib seaborn scipy
```4. Download the [Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail).
### Setup Instructions
1. Clone this repository:
```bash
git clone https://github.com/chouaib-629/CustomerSegmentation.git
```2. Navigate to the project directory:
```bash
cd CustomerSegmentation
```3. The downloaded dataset is in `Online Retail.xlsx` format. Save it as `online_retail.csv` using any spreadsheet tool.
4. Preprocess the dataset using the provided Python script:
- `.py` format:
```bash
python preprocessing/main.py
```- `.ipynb` format:
Open and run `preprocessing/main.ipynb` in Jupyter Notebook.
5. Compile the Java classes:
```bash
javac -classpath `hadoop classpath` -d compiled_classes src/*.java
```6. Package the classes into a JAR file:
```bash
jar cf CustomerSegmentation.jar -C compiled_classes/ .
```## Usage
### Step 1: Upload Dataset to HDFS
1. Create a directory in HDFS to store the dataset:
```bash
hdfs dfs -mkdir /CustomerSegmentation
```2. Upload the preprocessed dataset to HDFS:
```bash
hdfs dfs -put processed_online_retail.csv /CustomerSegmentation/
```### Step 2: Run the MapReduce Job
Run the Hadoop job using the following command:
```bash
hadoop jar CustomerSegmentation.jar Driver /CustomerSegmentation/processed_online_retail.csv /CustomerSegmentation/output/
```### Step 3: View the Output
To view the results of the MapReduce job, use the following command:
```bash
hdfs dfs -cat /CustomerSegmentation/output/part-r-00000
```### **Optional:** Save the Output Locally
Copy the output file from HDFS to your local storage for further analysis:
```bash
hdfs dfs -get /CustomerSegmentation/output/part-r-00000 output/result.csv
```## Contributing
Contributions are welcome! To contribute:
1. Fork the repository.
2. Create a new branch:```bash
git checkout -b feature/feature-name
```3. Commit your changes:
```bash
git commit -m "Add feature description"
```4. Push to the branch:
```bash
git push origin feature/feature-name
```5. Open a pull request.
## Contact Information
For questions or support, please contact [Me](mailto:chouaiba629@gmail.com).