{"id":15203198,"url":"https://github.com/amiraflak/olympics-data-analysis","last_synced_at":"2026-01-05T12:19:08.116Z","repository":{"id":256789823,"uuid":"837930423","full_name":"AmirAflak/olympics-data-analysis","owner":"AmirAflak","description":"Spark-Driven Olympic Data Exploration","archived":false,"fork":false,"pushed_at":"2024-09-12T15:28:43.000Z","size":19430,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-10-11T21:40:53.560Z","etag":null,"topics":["apache-spark","etl-pipeline","olympics","postgres","scala","sql"],"latest_commit_sha":null,"homepage":"","language":"Scala","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AmirAflak.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-04T13:23:07.000Z","updated_at":"2024-09-12T15:28:45.000Z","dependencies_parsed_at":"2024-09-13T04:10:48.492Z","dependency_job_id":null,"html_url":"https://github.com/AmirAflak/olympics-data-analysis","commit_stats":null,"previous_names":["amiraflak/olympics-data-analysis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmirAflak%2Folympics-data-analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmirAflak%2Folympics-data-analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmirAflak%2Folympics-data-analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AmirAflak%2Folympics-data-anal
ysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AmirAflak","download_url":"https://codeload.github.com/AmirAflak/olympics-data-analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":219858508,"owners_count":16556043,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-spark","etl-pipeline","olympics","postgres","scala","sql"],"created_at":"2024-09-28T04:42:01.415Z","updated_at":"2025-10-29T03:31:00.976Z","avatar_url":"https://github.com/AmirAflak.png","language":"Scala","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Olympics Data Analysis\n\n## Project Overview\n\nThis project implements an ETL pipeline using **Apache Spark** to process Olympic athlete data. 
After the data is loaded into a PostgreSQL database, a series of SQL queries generates insights into athlete performance, per-country statistics, and other key metrics.\n\n\n## Analysis Section\n\n### Total Medal Count by Country (NOC)\n```sql\nWITH medals AS (\n    SELECT *\n    FROM db.public.results\n    WHERE medal IS NOT NULL\n      AND event NOT LIKE '%(YOG)'\n),\nmedals_filtered AS (\n    SELECT DISTINCT year, type, discipline, noc, event, medal\n    FROM medals\n)\nSELECT noc, COUNT(medal) AS total_medals\nFROM medals_filtered\nGROUP BY noc\nORDER BY total_medals DESC\nLIMIT 5;\n```\n\n\n\n\n    ┌─────────┬──────────────┐\n    │   noc   │ total_medals │\n    │ varchar │    int64     │\n    ├─────────┼──────────────┤\n    │ USA     │         2997 │\n    │ URS     │         1197 │\n    │ GER     │         1078 │\n    │ GBR     │          989 │\n    │ FRA     │          939 │\n    └─────────┴──────────────┘\n\n\n\n### Count of Athletes by Country\n\n\n```sql\nSELECT born_country, COUNT(*) AS athlete_count\nFROM db.public.bios\nGROUP BY born_country\nORDER BY athlete_count DESC\nLIMIT 5;\n```\n\n\n\n\n    ┌──────────────┬───────────────┐\n    │ born_country │ athlete_count │\n    │   varchar    │     int64     │\n    ├──────────────┼───────────────┤\n    │              │         32864 │\n    │ USA          │          9641 │\n    │ GER          │          6891 │\n    │ GBR          │          5792 │\n    │ FRA          │          5143 │\n    └──────────────┴───────────────┘\n\n\n\n\n### Top Performing Athletes by Medal Count\n\n\n```sql\nSELECT b.name, COUNT(r.medal) AS medal_count\nFROM db.public.results r\nJOIN db.public.bios b ON r.athlete_id = b.athlete_id\nWHERE r.medal IS NOT NULL\nGROUP BY r.athlete_id, b.name\nORDER BY medal_count DESC\nLIMIT 10;\n```\n\n\n\n\n    ┌──────────────────────┬─────────────┐\n    │         name         │ medal_count │\n    │       varchar        │    int64    │\n    
├──────────────────────┼─────────────┤\n    │ Michael Phelps       │          28 │\n    │ Larisa Latynina      │          18 │\n    │ Emma McKeon          │          17 │\n    │ Marit Bjørgen        │          15 │\n    │ Nikolay Andrianov    │          15 │\n    │ Takashi Ono          │          13 │\n    │ Ireen Wüst           │          13 │\n    │ Ole Einar Bjørndalen │          13 │\n    │ Edoardo Mangiarotti  │          13 │\n    │ Boris Shakhlin       │          13 │\n    ├──────────────────────┴─────────────┤\n    │ 10 rows                  2 columns │\n    └────────────────────────────────────┘\n\n\n\n### Average Height and Weight of Athletes by Country:\n\n\n```sql\nWITH unique_participation AS (\n    SELECT DISTINCT year, type, discipline, noc, event, athlete_id\n    FROM db.public.results\n    WHERE event NOT LIKE '%(YOG)'\n)\nSELECT born_country, \n       AVG(height_cm) AS avg_height, \n       AVG(weight_kg) AS avg_weight\nFROM db.public.bios b\nJOIN unique_participation r ON b.athlete_id = r.athlete_id\nWHERE height_cm IS NOT NULL \n  AND weight_kg IS NOT NULL\nGROUP BY born_country\nORDER BY avg_height DESC\nLIMIT 5;\n```\n\n\n\n\n    ┌──────────────┬────────────────────┬────────────┐\n    │ born_country │     avg_height     │ avg_weight │\n    │   varchar    │       double       │   double   │\n    ├──────────────┼────────────────────┼────────────┤\n    │ Milde        │ 195.66666666666666 │      107.0 │\n    │ Prignitz     │              194.0 │       90.0 │\n    │ GIB          │              191.0 │       89.0 │\n    │ ANT          │              188.0 │       80.0 │\n    │ AGU          │              186.0 │       78.0 │\n    └──────────────┴────────────────────┴────────────┘\n\n### Medal Distribution by Discipline\n\n```sql\nWITH medals AS (\n    SELECT *\n    FROM db.public.results\n    WHERE medal IS NOT NULL\n      AND event NOT LIKE '%(YOG)'\n),\nmedals_filtered AS (\n    SELECT DISTINCT year, type, discipline, noc, event, medal\n    FROM 
medals\n)\nSELECT discipline, COUNT(medal) AS total_medals\nFROM medals_filtered\nGROUP BY discipline\nORDER BY total_medals DESC;\n```\n\n\n\n\n    ┌──────────────────────────────────┬──────────────┐\n    │            discipline            │ total_medals │\n    │             varchar              │    int64     │\n    ├──────────────────────────────────┼──────────────┤\n    │ Athletics                        │         3177 │\n    │ Swimming (Aquatics)              │         1785 │\n    │ Wrestling                        │         1359 │\n    │ Artistic Gymnastics (Gymnastics) │         1009 │\n    │ Boxing                           │          997 │\n    │ Shooting                         │          895 │\n    │ Rowing                           │          825 │\n    │ Fencing                          │          709 │\n    │ Weightlifting                    │          666 │\n    │ Speed Skating (Skating)          │          607 │\n    │            ·                     │            · │\n    │            ·                     │            · │\n    │            ·                     │            · │\n    │ Cycling BMX Freestyle (Cycling)  │            6 │\n    │ Racquets                         │            5 │\n    │ Lacrosse                         │            5 │\n    │ Roque                            │            3 │\n    │ Equestrian Driving (Equestrian)  │            3 │\n    │ Motorboating                     │            3 │\n    │ Military Ski Patrol (Skiing)     │            3 │\n    │ Art Competitions                 │            2 │\n    │ Cricket                          │            2 │\n    │ Basque pelota                    │            1 │\n    ├──────────────────────────────────┴──────────────┤\n    │ 80 rows (20 shown)                    2 columns │\n    └─────────────────────────────────────────────────┘\n\n\n### Countries with the Most Olympic Participation\n\n\n```sql\nWITH unique_participation AS (\n    SELECT DISTINCT year, type, discipline, 
noc, event, athlete_id\n    FROM db.public.results\n    WHERE event NOT LIKE '%(YOG)'\n)\nSELECT b.noc, COUNT(DISTINCT b.athlete_id) AS athlete_count\nFROM db.public.bios b\nJOIN unique_participation r ON b.athlete_id = r.athlete_id\nGROUP BY b.noc\nORDER BY athlete_count DESC;\n```\n\n\n\n\n    ┌────────────────────────────┬───────────────┐\n    │            NOC             │ athlete_count │\n    │          varchar           │     int64     │\n    ├────────────────────────────┼───────────────┤\n    │ United States              │          9921 │\n    │ Great Britain              │          6395 │\n    │ France                     │          6291 │\n    │ Canada                     │          5203 │\n    │ Italy                      │          5131 │\n    │ Germany                    │          4640 │\n    │ Japan                      │          4548 │\n    │ Australia                  │          4101 │\n    │ Sweden                     │          3875 │\n    │ People's Republic of China │          3081 │\n    │        ·                   │             · │\n    │        ·                   │             · │\n    │        ·                   │             · │\n    │ Argentina Spain            │             1 │\n    │ France Hungary             │             1 │\n    │ Great Britain South Africa │             1 │\n    │ Turkmenistan Türkiye       │             1 │\n    │ Cuba Mexico                │             1 │\n    │ Andorra Spain              │             1 │\n    │ Mozambique South Africa    │             1 │\n    │ Canada Russian Federation  │             1 │\n    │ Australia Croatia          │             1 │\n    │ Australia Belgium          │             1 │\n    ├────────────────────────────┴───────────────┤\n    │ 696 rows (20 shown)              2 columns │\n    └────────────────────────────────────────────┘\n\n\n\n### Athlete Participation Over the Years\n\n\n\n```sql\nWITH unique_participation AS (\n    SELECT DISTINCT year, type, discipline, noc, 
event, athlete_id\n    FROM db.public.results\n    WHERE event NOT LIKE '%(YOG)'\n)\nSELECT year, COUNT(DISTINCT athlete_id) AS athlete_count\nFROM unique_participation\nGROUP BY year\nORDER BY year ASC;\n```\n\n\n\n\n    ┌───────┬───────────────┐\n    │ year  │ athlete_count │\n    │ int32 │     int64     │\n    ├───────┼───────────────┤\n    │  1896 │           182 │\n    │  1900 │          1234 │\n    │  1904 │           667 │\n    │  1908 │          2112 │\n    │  1912 │          2447 │\n    │  1920 │          2689 │\n    │  1924 │          3494 │\n    │  1928 │          3427 │\n    │  1932 │          1627 │\n    │  1936 │          4661 │\n    │    ·  │            ·  │\n    │    ·  │            ·  │\n    │    ·  │            ·  │\n    │  2006 │          2505 │\n    │  2008 │         10956 │\n    │  2010 │          2537 │\n    │  2012 │         10572 │\n    │  2014 │          2765 │\n    │  2016 │         11210 │\n    │  2018 │          2799 │\n    │  2020 │         11125 │\n    │  2022 │          2470 │\n    │  NULL │          1026 │\n    ├───────┴───────────────┤\n    │  38 rows (20 shown)   │\n    └───────────────────────┘\n\n\n\n\n\n## Installation Guide\n\nTo run this project, follow the steps below:\n\n1. **Clone the repository**:\n   ```bash\n   git clone https://github.com/amiraflak/olympics-data-analysis\n   cd olympics-data-analysis\n   ```\n\n2. **Run Docker Compose**: This step sets up the PostgreSQL database where the data will be loaded.\n   ```bash\n   docker-compose up -d\n   ```\n\n3. **Load Data with Spark Scala**:\n   - Navigate to the `src/main/scala` directory.\n   - Run the `Main.scala` file to execute the ETL pipeline:\n     ```bash\n     scala Main.scala\n     ```\n\n4. **Run SQL Queries**:\n   - Once the data is loaded into PostgreSQL, you can connect to the database and run the analysis queries provided in the analysis section.\n\n\n\n## Contribution\n\nContributions are welcome! 
Feel free to open an issue or submit a pull request to improve the project.\n\n## References\n\n- Dataset: [Olympic Athlete Data](https://github.com/KeithGalli/Olympics-Dataset)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famiraflak%2Folympics-data-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Famiraflak%2Folympics-data-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Famiraflak%2Folympics-data-analysis/lists"}