{"id":28798486,"url":"https://github.com/jim-by/personalized-recommendation-system","last_synced_at":"2025-07-11T14:38:42.183Z","repository":{"id":298504553,"uuid":"1000191741","full_name":"Jim-by/Personalized-Recommendation-System","owner":"Jim-by","description":"End-to-end personalised recommender system for e-commerce: synthetic data, PySpark, Delta Lake, model training, evaluation, monitoring, A/B test.","archived":false,"fork":false,"pushed_at":"2025-06-12T11:15:50.000Z","size":89,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-19T06:54:10.149Z","etag":null,"topics":["delta-lake","hadoop","jupyter-notebook","matplotlib","pandas","pyspark","python","scipy","seaborn-plots","sparksession"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Jim-by.png","metadata":{"files":{"readme":"Readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-11T12:02:20.000Z","updated_at":"2025-06-12T11:21:44.000Z","dependencies_parsed_at":"2025-06-11T13:24:49.618Z","dependency_job_id":"021b243e-a0f7-47b7-b5d1-23b9b8cbe875","html_url":"https://github.com/Jim-by/Personalized-Recommendation-System","commit_stats":null,"previous_names":["jim-by/personalized-recommendation-system"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Jim-by/Personalized-Recommendation-System","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jim-by%2FPersonalized-Recommendation-System","tags_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jim-by%2FPersonalized-Recommendation-System/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jim-by%2FPersonalized-Recommendation-System/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jim-by%2FPersonalized-Recommendation-System/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Jim-by","download_url":"https://codeload.github.com/Jim-by/Personalized-Recommendation-System/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Jim-by%2FPersonalized-Recommendation-System/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264833293,"owners_count":23670617,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["delta-lake","hadoop","jupyter-notebook","matplotlib","pandas","pyspark","python","scipy","seaborn-plots","sparksession"],"created_at":"2025-06-18T05:39:26.964Z","updated_at":"2025-07-11T14:38:42.178Z","avatar_url":"https://github.com/Jim-by.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# E-commerce Personalization Engine: A Full-Cycle Recommendation System\r\n\r\n\r\n**1. Overview**\r\n\r\nThis project is an end-to-end demonstration of a personalized recommendation system for an e-commerce platform. 
It showcases the complete MLOps lifecycle, from synthetic data generation and processing to model training, offline evaluation, feature storage, monitoring, and A/B testing.\r\n\r\nThe primary goal is to simulate a real-world environment and demonstrate the skills required to build, evaluate, and maintain a robust recommendation engine using modern data engineering and machine learning tools.\r\n\r\n**2. Key Features**\r\n\r\nSynthetic Data Generation: Creates realistic user, product, and interaction data (views, purchases, searches).\r\n\r\nETL with Delta Lake: Processes raw data into a structured and reliable user-item interaction data mart using PySpark and Delta Lake.\r\n\r\nAdvanced Feature Engineering:\r\n\r\nCalculates temporal features (e.g., time since last interaction, day of the week).\r\n\r\nGenerates user embeddings using Word2Vec on interaction sequences.\r\n\r\n\r\nModel Training \u0026 Recommendation: Implements and trains multiple recommendation models (e.g., ALS-based implicit feedback).\r\n\r\nComprehensive Offline Evaluation:\r\n\r\nCompares models using metrics like Precision@K, Recall@K, MAP@K, and nDCG@K.\r\n\r\nAnalyzes recommendation diversity and catalog coverage.\r\n\r\nPerforms replay validation to test model performance over time.\r\n\r\n\r\n\r\nFeature Store Implementation: A Delta Lake-based feature store with versioning and support for both offline (training) and online (inference) scenarios.\r\n\r\nMonitoring: Uses Evidently AI to generate reports on data drift and quality.\r\n\r\nA/B Testing Simulation: Simulates an A/B test to compare the performance of different models in a pseudo-production environment.\r\n\r\n\r\n**3. 
Tech Stack**\r\n\r\nData Processing: Apache Spark, Delta Lake, Pandas\r\n\r\nMachine Learning: Spark MLlib (ALS, Word2Vec)\r\n\r\nData Quality \u0026 Monitoring: Evidently AI\r\n\r\nOrchestration \u0026 Workflow: Python scripts (can be orchestrated with tools like Airflow or Mage)\r\n\r\nData Visualization: Matplotlib, Seaborn\r\n\r\n\r\n**4. Project Architecture \u0026 Pipeline**\r\n\r\nThe project is structured as a sequential pipeline. Each step is represented by a script that performs a specific task.\r\n\r\n\r\n\r\nData Generation (src/1_data_generation/)\r\n\r\nSimulates user activity and generates raw CSV files for users, products, events, orders, and search logs.\r\n\r\n\r\n\r\n\r\nData Mart Construction (src/2_data_processing/)\r\n\r\nIngests raw data, cleans it, and builds a central user_item_interactions Delta table, serving as the single source of truth.\r\n\r\n\r\n\r\n\r\nExploratory Data Analysis (EDA) (notebooks/)\r\n\r\nAnalyzes the data mart to understand distributions, user behavior patterns, and data quality.\r\n\r\n\r\n\r\n\r\nFeature Engineering (src/3_feature_engineering/)\r\n\r\n\r\n\r\nTemporal Features: Enriches the data mart with time-based features.\r\n\r\nUser Embeddings: Creates vector representations of users based on their interaction history.\r\n\r\nFeature Store: Calculates and stores user and item features for model training and serving.\r\n\r\n\r\n\r\n\r\nModel Training (src/4_training/)\r\n\r\nTrains recommendation models (e.g., ALS for implicit feedback) on the processed data and saves the model artifacts and pre-computed recommendations.\r\n\r\n\r\n\r\n\r\nOffline Evaluation (src/5_evaluation/)\r\n\r\n\r\n\r\nMetric Calculation: Compares the trained models using various ranking and classification metrics.\r\n\r\nReplay Validation: Simulates model performance over historical time windows to ensure stability.\r\n\r\n\r\n\r\n\r\nMonitoring (src/6_monitoring/)\r\n\r\nGenerates reports on data drift in features and tracks the 
quality of model recommendations over time.\r\n\r\n\r\n\r\n\r\nA/B Testing (src/7_ab_testing/)\r\n\r\nRuns a simulated A/B test to compare a new model against a baseline and a control group, calculating key business metrics like CTR and Conversion Rate.\r\n\r\n\r\n\r\n**5. Setup and Installation**\r\n\r\nPrerequisites\r\n\r\nPython 3.9+\r\n\r\nJava 8 or 11\r\n\r\nApache Spark (correctly installed and configured)\r\n\r\nJAVA_HOME and HADOOP_HOME environment variables set. HADOOP_HOME should point to a directory containing the winutils.exe binary on Windows.\r\n\r\nInstallation\r\n\r\n\r\nClone the repository:\r\n\r\n\r\ngit clone https://github.com/your-username/ecommerce-personalization-engine.git\r\ncd ecommerce-personalization-engine\r\n\r\n\r\n\r\n\r\nCreate and activate a virtual environment (recommended):\r\n\r\n\r\npython -m venv venv\r\n\r\n*On Windows*\r\n\r\n```\r\n.\\venv\\Scripts\\activate\r\n```\r\n\r\n*On macOS/Linux*\r\n\r\n```\r\nsource venv/bin/activate\r\n```\r\n\r\n\r\n\r\n\r\nInstall the required dependencies:\r\n\r\n\r\n```\r\npip install -r requirements.txt\r\n```\r\n\r\n\r\n**6. How to Run the Pipeline**\r\n\r\nThe scripts are designed to be run in a specific order. 
Execute them from the root directory of the project.\r\n\r\n* Step 1: Generate synthetic data*\r\npython src/1_data_generation/data_generation.py\r\n\r\n* Step 2: Build the main data mart*\r\npython src/2_data_processing/build_interactions_mart.py\r\n\r\n* Step 3: Enrich data and create features*\r\npython src/3_feature_engineering/add_temporal_features.py\r\npython src/3_feature_engineering/generate_user_embeddings.py\r\npython src/3_feature_engineering/create_feature_store.py\r\n\r\n* Step 4: Train models and generate recommendations\r\npython src/4_training/train_models.py\r\n\r\n* Step 5: Evaluate the models\r\npython src/5_evaluation/evaluate_offline_metrics.py\r\npython src/5_evaluation/analyze_recommendations.py\r\npython src/5_evaluation/replay_validation.py\r\n\r\n* Step 6: Monitor data drift\r\npython src/6_monitoring/check_data_drift.py\r\n\r\n* Step 7: Run A/B test simulation\r\npython src/7_ab_testing/simulate_ab_test.py\r\n\r\n**7. Example Results**\r\n\r\nOffline Model Comparison\r\nThe offline evaluation script compares different models based on ranking metrics. The BPR_v1 model shows a slight improvement in MAP@10 and nDCG@10.\r\n\r\nOffline Metrics Comparison\r\n\r\nData Drift Report\r\nThe monitoring script generates a detailed HTML report using Evidently AI, highlighting any statistical drift between the reference and current data batches.\r\n\r\n**8. 
Future Work**\r\nThis project provides a solid foundation that can be extended in several ways:\r\n\r\n\r\nReal-time Pipeline: Migrate from batch processing to a real-time stream processing architecture using Kafka and Spark Streaming.\r\n\r\nAdvanced Models: Implement and evaluate more sophisticated models like LightFM, NCF (Neural Collaborative Filtering), or transformer-based sequential models.\r\n\r\nContainerization: Package the entire application using Docker and orchestrate the pipeline with Kubernetes or Airflow.\r\n\r\nCI/CD: Implement a CI/CD pipeline using GitHub Actions to automate testing and deployment.\r\n\r\nAPI for Serving: Develop a REST API (e.g., with FastAPI) to serve recommendations online using the feature store.\r\n\r\n**9. License**\r\nThis project is licensed under the MIT License. See the LICENSE file for details.\r\n\r\n\r\n\r\n**Contacts**\r\n\r\nAuthor: Uladzimir Manulenka\r\n\r\nEmail: vlma@tut.by\r\n\r\nThis project is for demonstration purposes only. All data is synthetic.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjim-by%2Fpersonalized-recommendation-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjim-by%2Fpersonalized-recommendation-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjim-by%2Fpersonalized-recommendation-system/lists"}