{"id":30766929,"url":"https://github.com/mo-elshamy/machine-learning-practice","last_synced_at":"2026-02-14T14:30:54.025Z","repository":{"id":310026704,"uuid":"1038419314","full_name":"Mo-Elshamy/machine-learning-practice","owner":"Mo-Elshamy","description":"This repository serves as a collection of my work and learning in machine learning while my internship in Cellual-Technologies, including algorithm explanations, data preprocessing workflows, and two projects.","archived":false,"fork":false,"pushed_at":"2025-08-15T14:11:11.000Z","size":25958,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-04T19:59:21.157Z","etag":null,"topics":["data-analysis","data-science","dbscan","decision-trees","eda","gradient-boosting","gxboost","hierarchical-clustering","kmeans-clustering","knn-classification","linear-regression","logistic-regression","machine-learning","model","pca","polynomial-regression","preprocessing","random-forest","support-vector-machines","training"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mo-Elshamy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-08-15T06:55:12.000Z","updated_at":"2025-08-15T14:11:14.000Z","dependencies_parsed_at":"2025-08-15T09:17:27.873Z","dependency_job_id":"8dd8071e-4a9d-4974-901a-b432820cac41","html_url":"https://github.com/Mo-Elshamy/machine-learning-practice","commit_stats":null,"previous_names":["mo-elshamy/machine-learning-practice"],"tags_cou
nt":0,"template":false,"template_full_name":null,"purl":"pkg:github/Mo-Elshamy/machine-learning-practice","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mo-Elshamy%2Fmachine-learning-practice","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mo-Elshamy%2Fmachine-learning-practice/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mo-Elshamy%2Fmachine-learning-practice/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mo-Elshamy%2Fmachine-learning-practice/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Mo-Elshamy","download_url":"https://codeload.github.com/Mo-Elshamy/machine-learning-practice/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mo-Elshamy%2Fmachine-learning-practice/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29447184,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T14:10:32.461Z","status":"ssl_error","status_checked_at":"2026-02-14T14:09:49.945Z","response_time":53,"last_error":"SSL_read: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-science","dbscan","decision-trees","eda","gradient-boosting","gxboost","hierarchical-clustering","kmeans-clustering","knn-classification","linear-regression","logistic-regression","machine-learning","model","pca","polynomial-regression","preprocessing","random-forest","support-vector-machines","training"],"created_at":"2025-09-04T19:59:16.691Z","updated_at":"2026-02-14T14:30:54.019Z","avatar_url":"https://github.com/Mo-Elshamy.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Machine Learning Practice\n\nThis repository collects my machine learning work and study notes from my internship at Cellual-Technologies, \nincluding algorithm explanations, data preprocessing workflows, and two projects.\n\n---\n## Table of Contents\n- [🛠 General Workflow](#-general-workflow)\n  - [1️⃣ Exploratory Data Analysis (EDA)](#️-exploratory-data-analysis-eda)\n  - [2️⃣ Data Preprocessing](#️-data-preprocessing)\n- [📚 Machine Learning Algorithms](#-machine-learning-algorithms)\n  - [I. Supervised Learning](#i-supervised-learning)\n  - [II. Unsupervised Learning](#ii-unsupervised-learning)\n  - [III. Dimensionality Reduction](#iii-dimensionality-reduction)\n- [📊 Model Evaluation Metrics](#-model-evaluation-metrics)\n  - [1. Regression Metrics](#1-regression-metrics)\n  - [2. Classification Metrics](#2-classification-metrics)\n  - [3. 
Clustering Metrics](#3-clustering-metrics)\n- [Projects](#projects)\n  - [Project 1: Hotel Booking Cancellation Prediction](#project-1-hotel-booking-cancellation-prediction)\n  - [Project 2: NYC Taxi Fare Prediction](#project-2-nyc-taxi-fare-prediction)\n- [📌 Conclusion](#-conclusion)\n\n## 🛠 General Workflow\n\nBefore training any machine learning model, we go through **Exploratory Data Analysis (EDA)** and **Data Preprocessing**.  \nThese steps ensure that the dataset is clean, consistent, and ready for modeling.\n\n### 1️⃣ Exploratory Data Analysis (EDA)\nEDA helps understand the dataset’s structure, patterns, and potential issues.\n- **Understanding the data** – Checking data types, dimensions, and sample values.\n- **Statistical summary** – Using `describe()` to find mean, median, min, max, etc.\n- **Missing values** – Identifying and deciding how to handle NaN values.\n- **Data distribution** – Plotting histograms, KDE plots, and boxplots.\n- **Outlier detection** – Using visualization and statistical methods like IQR.\n- **Correlation analysis** – Finding relationships between variables with heatmaps.\n\n### 2️⃣ Data Preprocessing\nOnce we understand the data, preprocessing ensures it’s ready for algorithms.\n- **Handling missing values** – Imputation with mean/median/mode or removal.\n- **Encoding categorical variables** – One-hot encoding or label encoding.\n- **Feature scaling** – Normalization or Standardization for numerical features.\n- **Feature engineering** – Creating new features from existing ones.\n- **Splitting data** – Training/testing (and validation) sets to evaluate models.\n\n---\n```mermaid\nflowchart TD\n    A[Data Collection] --\u003e B[Exploratory Data Analysis]\n    B --\u003e C[Data Preprocessing]\n    C --\u003e D[Model Selection]\n    D --\u003e E[Training]\n    E --\u003e F[Evaluation]\n    F --\u003e 
G[Deployment]\n```\n\n---\n\n\n## 📚 Machine Learning Algorithms\n\nBelow is a categorized explanation of various algorithms.\n\n---\n### **I. Supervised Learning**\nAlgorithms that learn patterns from labeled data.\n\n#### 1. Linear Regression\n- **Type**: Regression\n- **Use case**: Predicting continuous values.\n- **Concept**: Fits a straight line that minimizes the difference between predicted and actual values (using least squares method).\n- **Key points**:\n  - Assumes a linear relationship between variables.\n  - Sensitive to outliers.\n  - Example: Predicting house prices.\n\n#### 2. Polynomial Regression\n- **Type**: Regression\n- **Use case**: Modeling non-linear relationships.\n- **Concept**: Extends linear regression by adding polynomial terms (x², x³, …).\n- **Key points**:\n  - Fits a curve instead of a straight line.\n  - Risk of overfitting with high polynomial degree.\n\n#### 3. Logistic Regression\n- **Type**: Classification\n- **Use case**: Predicting binary or multi-class categories.\n- **Concept**: Uses a sigmoid function (softmax in the multi-class case) to output probabilities for class membership.\n- **Key points**:\n  - Despite its name, it’s a classification algorithm.\n  - Example: Spam detection.\n\n#### 4. K-Nearest Neighbors (KNN)\n- **Type**: Classification/Regression\n- **Use case**: Classifying data points based on their closest neighbors.\n- **Concept**: Looks at the “k” nearest data points and assigns the majority class (classification) or average (regression).\n- **Key points**:\n  - Simple, non-parametric method.\n  - Computationally expensive for large datasets.\n\n#### 5. Support Vector Machine (SVM)\n- **Type**: Classification/Regression\n- **Use case**: Separating data into distinct classes with the widest possible margin.\n- **Concept**: Finds an optimal hyperplane that maximizes the margin between classes.\n- **Key points**:\n  - Works well with high-dimensional data.\n  - Can use kernels for non-linear separation.\n\n#### 6. 
Naïve Bayes\n- **Type**: Classification\n- **Use case**: Text classification, spam filtering.\n- **Concept**: Based on Bayes’ theorem with the assumption that features are conditionally independent given the class.\n- **Key points**:\n  - Fast and efficient.\n  - Works well with high-dimensional data (e.g., text).\n\n#### 7. Decision Tree\n- **Type**: Classification/Regression\n- **Use case**: Predicting classes or values by splitting data into branches.\n- **Concept**: Divides data based on feature values until reaching a decision.\n- **Key points**:\n  - Easy to interpret.\n  - Can overfit without pruning.\n\n#### 8. Random Forest\n- **Type**: Classification/Regression\n- **Use case**: A more robust alternative to a single Decision Tree.\n- **Concept**: Trains multiple decision trees on bootstrap samples (an ensemble) and aggregates their predictions: majority vote for classification, averaging for regression.\n- **Key points**:\n  - Reduces overfitting.\n  - Works well for a wide range of problems.\n\n---\n\n### **II. Unsupervised Learning**\nAlgorithms that find patterns without labeled data.\n\n#### 9. K-Means Clustering\n- **Type**: Clustering\n- **Use case**: Grouping similar data points.\n- **Concept**: Partitions data into “k” clusters by minimizing the sum of squared distances between each point and its cluster centroid.\n- **Key points**:\n  - Requires specifying “k” in advance.\n  - Sensitive to initial centroids.\n\n#### 10. Hierarchical Clustering\n- **Type**: Clustering\n- **Use case**: Creating a hierarchy of clusters.\n- **Concept**: Builds nested clusters using a tree-like diagram (dendrogram).\n- **Key points**:\n  - No need to predefine number of clusters.\n  - Computationally intensive for large datasets.\n\n#### 11. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)\n- **Type**: Clustering\n- **Use case**: Identifying clusters of arbitrary shape.\n- **Concept**: Groups together points that are closely packed and marks outliers.\n- **Key points**:\n  - No need to specify number of clusters.\n  - Handles noise well.\n\n---\n\n### **III. Dimensionality Reduction**\n\n#### 12. 
Principal Component Analysis (PCA)\n- **Type**: Feature reduction\n- **Use case**: Reducing the number of dimensions while retaining as much variance as possible.\n- **Concept**: Transforms features into new uncorrelated variables (principal components).\n- **Key points**:\n  - Speeds up computation.\n  - Useful for visualization of complex datasets.\n\n---\n## 📊 Model Evaluation Metrics\n\nEvaluating machine learning models is essential to understand how well they generalize to unseen data.  \nBelow are common evaluation metrics grouped by problem type.\n\n---\n\n### **1. Regression Metrics**\n\n#### **Mean Absolute Error (MAE)**\nMeasures the average magnitude of errors without considering their direction.\n\n`MAE = (1/n) * Σ |yᵢ - ŷᵢ|`\n\n- **Pros**: Easy to interpret, less sensitive to outliers than MSE.  \n- **Cons**: Does not penalize large errors as strongly.\n\n---\n\n#### **Mean Squared Error (MSE)**\nMeasures the average of squared differences between actual and predicted values.\n\n`MSE = (1/n) * Σ (yᵢ - ŷᵢ)²`\n\n- **Pros**: Penalizes large errors more than MAE.  \n- **Cons**: Sensitive to outliers.\n\n---\n\n#### **Root Mean Squared Error (RMSE)**\nSquare root of MSE, bringing it back to the same units as the target variable.\n\n`RMSE = √( (1/n) * Σ (yᵢ - ŷᵢ)² )`\n\n- **Pros**: More interpretable than MSE.  \n- **Cons**: Same sensitivity to outliers as MSE.\n\n---\n\n#### **R² Score (Coefficient of Determination)**\nMeasures the proportion of variance in the dependent variable explained by the model.\n\n`R² = 1 - [ Σ (yᵢ - ŷᵢ)² / Σ (yᵢ - ȳ)² ]`\n\n- **Range**: At most 1 (higher is better). A value of 0 matches always predicting the mean `ȳ`; negative values mean the model does worse than that baseline.  \n- **Pros**: Gives a percentage interpretation.  \n- **Cons**: Can be misleading for non-linear models.\n\n---\n\n### **2. 
Classification Metrics**\n\n#### **Accuracy**\nThe proportion of correct predictions out of total predictions.\n\n`Accuracy = (TP + TN) / (TP + TN + FP + FN)`\n\n---\n\n#### **Precision**\nMeasures the percentage of positive predictions that are actually correct.\n\n`Precision = TP / (TP + FP)`\n\n---\n\n#### **Recall (Sensitivity or True Positive Rate)**\nMeasures the percentage of actual positives correctly identified.\n\n`Recall = TP / (TP + FN)`\n\n---\n\n#### **F1 Score**\nHarmonic mean of Precision and Recall.\n\n`F1 = 2 * (Precision * Recall) / (Precision + Recall)`\n\n---\n\n#### **ROC-AUC (Area Under the Receiver Operating Characteristic Curve)**\n- **ROC Curve**: Plots True Positive Rate vs. False Positive Rate at different thresholds.  \n- **AUC**: Measures the overall ability of the model to discriminate between classes.  \n- **Range**: 0 to 1 (higher is better; 0.5 corresponds to random guessing).\n\n---\n\n### **3. Clustering Metrics**\n\n#### **Silhouette Score**\nMeasures how similar an object is to its own cluster compared to other clusters.\n\n`s = (b - a) / max(a, b)`  \n\nWhere:  \n- `a` = mean intra-cluster distance  \n- `b` = mean nearest-cluster distance  \n\n---\n\n#### **Davies–Bouldin Index**\nMeasures the average similarity between each cluster and its most similar cluster.\n\n`DB = (1/n) * Σ maxⱼ≠ᵢ [ (σᵢ + σⱼ) / d(cᵢ, cⱼ) ]`  \n\nWhere:  \n- `n` = number of clusters  \n- `σ` = average distance between points in a cluster and the cluster centroid  \n- `d(cᵢ, cⱼ)` = distance between cluster centroids  \n\nLower DB index means better clustering.\n\n---\n\n✅ **Tip**: Always choose the metric based on the problem type and business goal. 
For example:  \n- Regression → MAE, RMSE, R²  \n- Classification → F1, Precision-Recall, ROC-AUC  \n- Clustering → Silhouette, Davies–Bouldin\n\n\n## Projects\n\n### [Project 1: Hotel Booking Cancellation Prediction](/Project_1/README.ipynb)\n\n  \u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/Project_1.gif\"\u003e\n  \u003c/p\u003e\n\n### [Project 2: NYC Taxi Fare Prediction](/Project_2/README.ipynb)\n\n  \u003cp align=\"center\"\u003e\n  \u003cimg src=\"images/Project_2.gif\"\u003e\n  \u003c/p\u003e\n\n## 📌 Conclusion\n\nThis repository combines theory and practice, providing algorithm explanations and real project implementations.\nIt can be used as a reference for machine learning studies and practical applications.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmo-elshamy%2Fmachine-learning-practice","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmo-elshamy%2Fmachine-learning-practice","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmo-elshamy%2Fmachine-learning-practice/lists"}