{"id":19056626,"url":"https://github.com/aishwaryahastak/ipl_analysis","last_synced_at":"2025-10-16T09:27:21.214Z","repository":{"id":250901364,"uuid":"833404522","full_name":"AishwaryaHastak/IPL_Analysis","owner":"AishwaryaHastak","description":"Analysis of IPL dataset using PySpark","archived":false,"fork":false,"pushed_at":"2024-08-21T02:56:57.000Z","size":2910,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-21T10:51:04.068Z","etag":null,"topics":["data-analysis","mllib","pyspark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AishwaryaHastak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-25T01:42:56.000Z","updated_at":"2024-08-21T02:57:00.000Z","dependencies_parsed_at":"2024-07-30T18:56:34.811Z","dependency_job_id":"b369df6b-23f7-4e17-adf3-57cfa90ab2ce","html_url":"https://github.com/AishwaryaHastak/IPL_Analysis","commit_stats":null,"previous_names":["aishwaryahastak/ipl_analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AishwaryaHastak/IPL_Analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AishwaryaHastak%2FIPL_Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AishwaryaHastak%2FIPL_Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AishwaryaHastak%2FIPL_Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/AishwaryaHastak%2FIPL_Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AishwaryaHastak","download_url":"https://codeload.github.com/AishwaryaHastak/IPL_Analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AishwaryaHastak%2FIPL_Analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279172815,"owners_count":26118982,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-16T02:00:06.019Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","mllib","pyspark"],"created_at":"2024-11-08T23:50:41.800Z","updated_at":"2025-10-16T09:27:21.198Z","avatar_url":"https://github.com/AishwaryaHastak.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Analysis of IPL Data using PySpark\n\nAnalyzing **IPL (Indian Premier League)** data and building a predictive model using **PySpark** and **Python**, cleaning and preprocessing the data, and performing feature engineering. 
This project aims to uncover performance patterns and provide valuable insights for team management and player selection.\n\n## 📊🔍📝👉 For a detailed walkthrough of the PySpark EDA process and results, check out this [article](https://aishwaryahastak.medium.com/ipl-analysis-using-pyspark-478a53ce9c98).\n\n# Introduction\n\nThe **IPL (Indian Premier League)** cricket data analysis project aims to uncover performance patterns and insights at both the player and team levels. By refining **data types**, resolving **inconsistencies**, and performing **feature engineering**, this project seeks to deepen the understanding of the factors that influence match outcomes and player performance.\n\nUsing **PySpark** in **Databricks**, the project transforms and enriches the dataset by creating new fields such as **partnership runs**. Visualizations are created with the **Python** libraries **matplotlib** and **seaborn**.\n\nThe insights gained from this analysis are expected to be valuable for developing strategies in **team management**, **player selection**, and **game planning**, contributing to a data-driven approach in the IPL. The analysis could also help cricket enthusiasts build their ideal teams.\n\n# 💻 Technical Tools Used:\n\n- **ETL Processes**: Extract, Transform, Load methodologies\n- **Transformation Functions**: `select`, `filter`, `groupBy`, `withColumn`, `selectExpr`\n- **Aggregation Functions**: `count`, `sum`, `avg`, and more\n- **Pivot Tables**\n- **Window Functions**: `rank()`, `dense_rank()`, and `lag()`\n- **Visualization Libraries**: `matplotlib`, `seaborn`\n\n# 🎯 Key Objectives:\n\n- Analyze **IPL cricket data** to uncover performance patterns and insights at both player and team levels.\n- Refine and enrich the dataset through advanced **feature engineering** and **transformation functions** to improve prediction accuracy. 
\n- Provide actionable insights for **team management**, **player selection**, and strategic **game planning**.\n\n# 🔍 Key Insights:\n\n- **Top IPL Players with Most 'Player of the Match' Awards**: AB de Villiers leads with the highest number of 'Player of the Match' awards, showcasing his exceptional performance.\n  \n- **Performance Analysis of Top Batsmen Across IPL Overs**: CH Gayle displays solid performance throughout but particularly shines in the middle overs, indicating a steady scoring rate.\n\n- **Bowlers vs. Batsmen: Uncovering the Toughest Matchups in IPL**: Sunil Narine has dismissed Rohit Sharma the most times (9 wickets).\n\n- **IPL Teams' Performance Based on Toss Decisions and First Innings Batting**: Mumbai Indians, Kolkata Knight Riders, and Rajasthan Royals have high win percentages when batting first, indicating strong performance when leading.\n\n- **Top IPL Bowlers: Total Wickets and Average Wickets Per Match**: Lasith Malinga and Jasprit Bumrah are standout bowlers with the highest average wickets per match, each taking approximately 2 wickets per game.\n\n- **Patterns in Dismissal Types and Match Situations**:\n\n    - **\"Caught\"** is the most common dismissal type, especially in league matches.\n    \n    - **\"Run out\"** dismissals occur later in high-stakes matches like finals and eliminators, reflecting increased risk-taking in crucial moments.\n    \n    - Early wickets in qualifiers and finals are often due to **\"bowled\"** and **\"LBW\"** dismissals, highlighting effective bowling strategies in key games.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faishwaryahastak%2Fipl_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faishwaryahastak%2Fipl_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faishwaryahastak%2Fipl_analysis/lists"}