{"id":23913288,"url":"https://github.com/chouaib-629/customersegmentation","last_synced_at":"2026-05-04T01:31:55.040Z","repository":{"id":270903752,"uuid":"911793794","full_name":"chouaib-629/CustomerSegmentation","owner":"chouaib-629","description":"Hadoop-based Customer Segmentation project using the Online Retail Dataset. Implements MapReduce for processing and Python for preprocessing to uncover customer purchasing patterns for targeted marketing.","archived":false,"fork":false,"pushed_at":"2025-01-27T22:39:35.000Z","size":263,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-27T23:29:00.365Z","etag":null,"topics":["big-data","customer-segmentation","data-analysis","data-science","distributed-computing","hadoop","hadoop-mapreduce","java","mapreduce","marketing-analytics","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chouaib-629.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-03T21:38:04.000Z","updated_at":"2025-01-27T22:39:38.000Z","dependencies_parsed_at":"2025-01-03T23:35:21.790Z","dependency_job_id":null,"html_url":"https://github.com/chouaib-629/CustomerSegmentation","commit_stats":null,"previous_names":["chouaib-629/customersegmentation"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chouaib-629%2FCustomerSegmentation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/chouaib-629%2FCustomerSegmentation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chouaib-629%2FCustomerSegmentation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chouaib-629%2FCustomerSegmentation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chouaib-629","download_url":"https://codeload.github.com/chouaib-629/CustomerSegmentation/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240356403,"owners_count":19788548,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["big-data","customer-segmentation","data-analysis","data-science","distributed-computing","hadoop","hadoop-mapreduce","java","mapreduce","marketing-analytics","python"],"created_at":"2025-01-05T09:35:39.701Z","updated_at":"2026-05-04T01:31:50.019Z","avatar_url":"https://github.com/chouaib-629.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Customer Segmentation Using Hadoop\n\nThis project uses Hadoop MapReduce to perform customer segmentation based on the dataset from an [Online Retail store](https://archive.ics.uci.edu/dataset/352/online+retail). 
The objective is to calculate three key metrics for each customer:\n\n- Total Spend\n- Frequency (number of purchases)\n- Recency (time in days since the last purchase)\n\n## Table of Contents\n\n- [Features](#features)\n- [Technologies Used](#technologies-used)\n- [Dataset Structure](#dataset-structure)\n- [Getting Started](#getting-started)\n- [Usage](#usage)\n- [Contributing](#contributing)\n- [Contact Information](#contact-information)\n\n## Features\n\n- Hadoop-based scalable processing for large datasets.\n- Key customer segmentation metrics: TotalSpend, Frequency, and Recency.\n- Preprocessing script for dataset preparation.\n- MapReduce implementation for distributed data processing.\n\n## Technologies Used\n\n- **Hadoop**: A framework for distributed storage and processing of big data.\n- **Java**: The primary programming language used for writing the MapReduce logic.\n- **HDFS**: The Hadoop Distributed File System for storing large datasets.\n- **MapReduce**: The programming model used to process the data in parallel.\n- **Python**: Used for dataset preprocessing and analysis.\n- **Python libraries**: pandas, matplotlib, seaborn, scipy.stats\n- **Linux (Ubuntu)**: Operating system.\n\n## Dataset Structure\n\nThe dataset used in this project is the [Online Retail Dataset](https://archive.ics.uci.edu/dataset/352/online+retail). It contains all transactions occurring between 01/12/2010 and 09/12/2011 (DD/MM/YYYY) for a UK-based and registered non-store online retailer. 
The company mainly sells unique all-occasion gifts, with many of its customers being wholesalers.\n\n**Sample of the dataset:**\n\n| InvoiceNo | StockCode | Description                         | Quantity | InvoiceDate     | UnitPrice (Sterling) | CustomerID | Country        |\n| --------- | --------- | ----------------------------------- | -------- | --------------- | -------------------- | ---------- | -------------- |\n| 536365    | 85123A    | WHITE HANGING HEART T-LIGHT HOLDER  | 6        | 01/12/2010 8:26 | 2.55                 | 17850.0    | United Kingdom |\n| 536365    | 71053     | WHITE METAL LANTERN                 | 6        | 01/12/2010 8:26 | 3.39                 | 17850.0    | United Kingdom |\n| 536365    | 84406B    | CREAM CUPID HEARTS COAT HANGER      | 8        | 01/12/2010 8:26 | 2.75                 | 17850.0    | United Kingdom |\n| 536365    | 84029G    | KNITTED UNION FLAG HOT WATER BOTTLE | 6        | 01/12/2010 8:26 | 3.39                 | 17850.0    | United Kingdom |\n| 536365    | 84029E    | RED WOOLLY HOTTIE WHITE HEART.      | 6        | 01/12/2010 8:26 | 3.39                 | 17850.0    | United Kingdom |\n\n## Getting Started\n\nTo get started with this project, follow these steps:\n\n### Prerequisites\n\n1. Install Hadoop on your local machine or use a cloud-based Hadoop cluster.\n2. Ensure that Java (JDK 8 or later) is installed on your system.\n3. Install Python and the required libraries:\n\n   ```bash\n   pip install pandas matplotlib seaborn scipy\n   ```\n\n4. Download the [Online Retail Dataset](https://archive.ics.uci.edu/ml/datasets/Online+Retail).\n\n### Setup Instructions\n\n1. Clone this repository:\n\n   ```bash\n   git clone https://github.com/chouaib-629/CustomerSegmentation.git\n   ```\n\n2. Navigate to the project directory:\n\n   ```bash\n   cd CustomerSegmentation\n   ```\n\n3. The dataset downloads as an Excel workbook, `Online Retail.xlsx`. Export it as `online_retail.csv` using any spreadsheet tool.\n\n4. 
Preprocess the dataset using either the provided Python script or the equivalent notebook:\n\n   - `.py` format:\n\n     ```bash\n     python preprocessing/main.py\n     ```\n\n   - `.ipynb` format:\n\n     Open and run `preprocessing/main.ipynb` in Jupyter Notebook.\n\n5. Compile the Java classes. Create the output directory first, because `javac` on JDK 8 does not create it automatically:\n\n   ```bash\n   mkdir -p compiled_classes\n   javac -classpath `hadoop classpath` -d compiled_classes src/*.java\n   ```\n\n6. Package the classes into a JAR file:\n\n   ```bash\n   jar cf CustomerSegmentation.jar -C compiled_classes/ .\n   ```\n\n## Usage\n\n### Step 1: Upload Dataset to HDFS\n\n1. Create a directory in HDFS to store the dataset:\n\n   ```bash\n   hdfs dfs -mkdir /CustomerSegmentation\n   ```\n\n2. Upload the preprocessed dataset to HDFS:\n\n   ```bash\n   hdfs dfs -put processed_online_retail.csv /CustomerSegmentation/\n   ```\n\n### Step 2: Run the MapReduce Job\n\nRun the Hadoop job using the following command. The output directory must not already exist in HDFS; Hadoop refuses to overwrite an existing output path:\n\n```bash\nhadoop jar CustomerSegmentation.jar Driver /CustomerSegmentation/processed_online_retail.csv /CustomerSegmentation/output/\n```\n\n### Step 3: View the Output\n\nTo view the results of the MapReduce job, use the following command:\n\n```bash\nhdfs dfs -cat /CustomerSegmentation/output/part-r-00000\n```\n\n### **Optional:** Save the Output Locally\n\nCopy the output file from HDFS to your local storage for further analysis:\n\n```bash\nhdfs dfs -get /CustomerSegmentation/output/part-r-00000 output/result.csv\n```\n\n## Contributing\n\nContributions are welcome! To contribute:\n\n1. Fork the repository.\n2. Create a new branch:\n\n   ```bash\n   git checkout -b feature/feature-name\n   ```\n\n3. Commit your changes:\n\n   ```bash\n   git commit -m \"Add feature description\"\n   ```\n\n4. Push to the branch:\n\n   ```bash\n   git push origin feature/feature-name\n   ```\n\n5. 
Open a pull request.\n\n## Contact Information\n\nFor questions or support, please contact [Me](mailto:chouaiba629@gmail.com).\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchouaib-629%2Fcustomersegmentation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchouaib-629%2Fcustomersegmentation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchouaib-629%2Fcustomersegmentation/lists"}