{"id":26522409,"url":"https://github.com/edisedis777/pyspark-ml-features","last_synced_at":"2026-04-13T20:31:44.894Z","repository":{"id":283497495,"uuid":"951966142","full_name":"edisedis777/PySpark-ML-Features","owner":"edisedis777","description":"A PySpark implementation of 6 lesser-known Scikit-Learn features optimized for Azure Databricks. This project translates powerful machine learning techniques from Scikit-Learn into PySpark's distributed computing framework.","archived":false,"fork":false,"pushed_at":"2025-03-20T14:21:31.000Z","size":9,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-20T15:31:26.894Z","etag":null,"topics":["azure","databricks","databricks-notebooks","large-scale","machine-learning","pyspark","python","scikit-learn","scikitlearn-machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/edisedis777.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-20T14:20:00.000Z","updated_at":"2025-03-20T14:23:21.000Z","dependencies_parsed_at":"2025-03-20T15:31:46.936Z","dependency_job_id":"794f166d-5f4e-4a3b-8bd5-35c25bc0fb80","html_url":"https://github.com/edisedis777/PySpark-ML-Features","commit_stats":null,"previous_names":["edisedis777/pyspark-ml-features"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FPySpark-ML-Features","tags_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/edisedis777%2FPySpark-ML-Features/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FPySpark-ML-Features/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/edisedis777%2FPySpark-ML-Features/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/edisedis777","download_url":"https://codeload.github.com/edisedis777/PySpark-ML-Features/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244805166,"owners_count":20513228,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","databricks","databricks-notebooks","large-scale","machine-learning","pyspark","python","scikit-learn","scikitlearn-machine-learning"],"created_at":"2025-03-21T13:26:54.111Z","updated_at":"2026-04-13T20:31:44.855Z","avatar_url":"https://github.com/edisedis777.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PySpark ML Features\n[![Visual Studio Code](https://custom-icon-badges.demolab.com/badge/Visual%20Studio%20Code-0078d7.svg?logo=vsc\u0026logoColor=white)](#)\n[![Python](https://img.shields.io/badge/Python-3.8+-blue.svg)](https://www.python.org/)\n[![Markdown](https://img.shields.io/badge/Markdown-%23000000.svg?logo=markdown\u0026logoColor=white)](#)\n\nA PySpark implementation of 6 lesser-known Scikit-Learn features optimized for Azure Databricks. 
This project translates powerful machine learning techniques from Scikit-Learn into PySpark's distributed computing framework, allowing you to apply these techniques to large-scale datasets in a cloud environment.\n\n\u003cimg width=\"451\" alt=\"Screenshot 2025-03-21 at 08 44 33\" src=\"https://github.com/user-attachments/assets/6442989a-30fe-4aae-857a-11aa418f5ef9\" /\u003e\n\n\n## 🚀 Features\n\nThis library implements PySpark equivalents of six powerful Scikit-Learn features:\n\n1. **Validation Curves**: Visualize model performance across hyperparameter values\n2. **Probability Prediction**: Alternative to calibration for classifier probability outputs\n3. **Robust Scaling**: Scale features using median and IQR to handle outliers\n4. **Feature Union**: Combine multiple feature transformations into one feature set\n5. **Feature Dimensionality Reduction**: Alternative to feature agglomeration using clustering\n6. **Predefined Split**: Use custom train-test splits for validation\n\n## 📋 Requirements\n\n- PySpark 3.0+\n- Azure Databricks Runtime 7.0+ (DBR with ML)\n- Python 3.6+\n- Matplotlib (for visualization)\n\n## 🔧 Installation\n\n1. Upload the `spark_ml_features.py` file to your Databricks workspace or DBFS\n2. Import it in your notebook:\n\n```python\n# Import the entire module\nfrom spark_ml_features import *\n\n# Or import specific functions\nfrom spark_ml_features import validation_curves, robust_scaling\n```\n\n## 📖 Usage\n\n### Demo All Features\n\nRun the demo function to see all features in action:\n\n```python\ndemo_all_features(spark)\n```\n\n### Individual Features\n\n#### 1. 
Validation Curves\n\n```python\nimport numpy as np\n\ndf = load_sample_data(spark)\ndf_features = prepare_features(df)\n\n# Generate validation curves\nparam_range, metrics = validation_curves(\n    df_features, \n    param_name=\"regParam\", \n    param_range=np.logspace(-6, -1, 5),\n    num_folds=3\n)\n\n# Plot the results\nplot = plot_validation_curves(param_range, metrics)\ndisplay(plot)\n```\n![validation_curve](https://github.com/user-attachments/assets/dd5690fa-8ad4-470a-bbff-9ffea3157c96)\n\n\n#### 2. Probability Prediction\n\n```python\n# Get probability predictions\npredictions = probability_prediction(df_features)\ndisplay(predictions)\n```\n\u003cimg width=\"285\" alt=\"Screenshot 2025-03-21 at 08 45 26\" src=\"https://github.com/user-attachments/assets/22cf05eb-7465-4b7b-8d48-62d011974d01\" /\u003e\n\n#### 3. Robust Scaling\n\n```python\n# Scale features using median and IQR\ndf_scaled = robust_scaling(df, columns=[\"sepal_length\", \"sepal_width\"])\ndisplay(df_scaled)\n```\n\u003cimg width=\"476\" alt=\"Screenshot 2025-03-21 at 08 45 53\" src=\"https://github.com/user-attachments/assets/e7e250fc-ab3f-4ddf-9956-37a8eb82414d\" /\u003e\n\n#### 4. Feature Union\n\n```python\n# Combine multiple feature transformations\ndf_combined = feature_union(df_features)\ndisplay(df_combined)\n```\n\u003cimg width=\"443\" alt=\"Screenshot 2025-03-21 at 08 46 17\" src=\"https://github.com/user-attachments/assets/3fbf630b-7cff-4b51-ad29-9203695d88d9\" /\u003e\n\n#### 5. Feature Dimensionality Reduction\n\n```python\n# Reduce dimensions using KMeans\ndf_clustered = feature_dimensionality_reduction(df_features, method=\"kmeans\", k=2)\ndisplay(df_clustered)\n\n# Or use PCA\ndf_pca = feature_dimensionality_reduction(df_features, method=\"pca\", k=2)\ndisplay(df_pca)\n```\n\u003cimg width=\"313\" alt=\"Screenshot 2025-03-21 at 08 46 54\" src=\"https://github.com/user-attachments/assets/e5eb67e2-65af-4ed2-806a-5546aa152aaa\" /\u003e\n\n#### 6. 
Predefined Split\n\n```python\n# Add a split column\ndf_with_split = add_split_column(df_features, split_condition=\"custom\")\n\n# Use predefined split for validation\nmodel, train_df, test_df = predefined_split(df_with_split, split_col=\"is_train\")\n```\n\u003cimg width=\"286\" alt=\"Screenshot 2025-03-21 at 08 47 37\" src=\"https://github.com/user-attachments/assets/b39e241d-3a05-4087-8cd8-2240b0686f5b\" /\u003e\n\n## 🧰 Helper Functions\n\nThe library includes several helper functions to simplify common tasks:\n\n- **`load_sample_data(spark)`**: Load the Iris dataset from Databricks sample data\n- **`prepare_features(df, feature_cols, label_col)`**: Prepare feature vectors for modeling\n- **`plot_validation_curves(param_range, metrics)`**: Visualize validation curves\n- **`add_split_column(df_features, split_condition)`**: Add a column for predefined splits\n\n## 🔍 Detailed Examples\n\n### Validation Curves Example\n\n```python\n# Import necessary functions\nfrom spark_ml_features import load_sample_data, prepare_features, validation_curves, plot_validation_curves\n\n# Load data\ndf = load_sample_data(spark)\ndf_features = prepare_features(df)\n\n# Define hyperparameter range\nimport numpy as np\nparam_range = np.logspace(-6, -1, 5)\n\n# Generate validation curves\nparam_range, metrics = validation_curves(\n    df_features, \n    param_name=\"regParam\", \n    param_range=param_range,\n    label_col=\"species\", \n    num_folds=3\n)\n\n# Plot the results\nplot = plot_validation_curves(param_range, metrics, param_name=\"Regularization Parameter\")\ndisplay(plot)\n```\n\n### Robust Scaling Example\n\n```python\n# Import necessary functions\nfrom spark_ml_features import load_sample_data, robust_scaling\n\n# Load data\ndf = load_sample_data(spark)\n\n# Apply robust scaling to specific columns\ndf_scaled = robust_scaling(df, columns=[\"sepal_length\", \"sepal_width\"], quantile_error=0.01)\n\n# Display the results\ndisplay(df_scaled.select(\"sepal_length\", 
\"sepal_length_scaled\", \"sepal_width\", \"sepal_width_scaled\"))\n```\n\n## 🚢 Real-world Usage\n\nFor real-world data in Azure Databricks:\n\n```python\n# Load data from Azure Blob Storage\ndf = spark.read.format(\"csv\") \\\n    .option(\"header\", \"true\") \\\n    .option(\"inferSchema\", \"true\") \\\n    .load(\"wasbs://\u003ccontainer\u003e@\u003caccount\u003e.blob.core.windows.net/\u003cpath\u003e\")\n\n# Prepare features\ndf_features = prepare_features(df, feature_cols=[\"col1\", \"col2\", \"col3\"], label_col=\"target\")\n\n# Apply the techniques\ndf_scaled = robust_scaling(df, columns=[\"col1\", \"col2\"])\ndf_clustered = feature_dimensionality_reduction(df_features, method=\"kmeans\", k=3)\n```\n\n## 📝 Notes\n\n- This implementation is designed specifically for Azure Databricks environments\n- Some features (like validation curves) might require adaptation for extremely large datasets\n- The implementation prioritizes scalability and distributed computing over exact equivalence to Scikit-Learn\n\n## 🤝 Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## 📜 Credits\n\nThis project is inspired by Jason Brownlee's article: [6 Lesser-Known Scikit-Learn Features That Will Save You Time](https://machinelearningmastery.com/6-lesser-known-scikit-learn-features-that-will-save-you-time/)\n\n## 📄 License\n\nDistributed under the MIT License. See `LICENSE` for more information.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedisedis777%2Fpyspark-ml-features","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fedisedis777%2Fpyspark-ml-features","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fedisedis777%2Fpyspark-ml-features/lists"}