{"id":19176534,"url":"https://github.com/leftcoastnerdgirl/big_data","last_synced_at":"2026-06-17T08:31:34.823Z","repository":{"id":230472662,"uuid":"779409911","full_name":"LeftCoastNerdGirl/Big_Data","owner":"LeftCoastNerdGirl","description":"This project uses PySpark and SQL to analyze Big Data.","archived":false,"fork":false,"pushed_at":"2024-08-25T20:48:19.000Z","size":46,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-23T01:13:59.816Z","etag":null,"topics":["jupyter-notebook","pandas-python","pyspark","spark-sql","sparksession","sql","structured-query-language"],"latest_commit_sha":null,"homepage":"https://extension.berkeley.edu/search/publicCourseSearchDetails.do?method=load\u0026courseId=35106003","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LeftCoastNerdGirl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-03-29T19:19:04.000Z","updated_at":"2024-08-25T20:48:22.000Z","dependencies_parsed_at":"2024-03-30T00:27:27.932Z","dependency_job_id":"cdd0406c-d635-4bda-a948-c181af0d5c06","html_url":"https://github.com/LeftCoastNerdGirl/Big_Data","commit_stats":null,"previous_names":["leftcoastnerdgirl/home_sales","leftcoastnerdgirl/big_data"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/LeftCoastNerdGirl/Big_Data","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeftCoastNerdGirl%2FBig_Data","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Lef
tCoastNerdGirl%2FBig_Data/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeftCoastNerdGirl%2FBig_Data/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeftCoastNerdGirl%2FBig_Data/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LeftCoastNerdGirl","download_url":"https://codeload.github.com/LeftCoastNerdGirl/Big_Data/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeftCoastNerdGirl%2FBig_Data/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34441282,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-17T02:00:05.408Z","response_time":127,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jupyter-notebook","pandas-python","pyspark","spark-sql","sparksession","sql","structured-query-language"],"created_at":"2024-11-09T10:28:57.021Z","updated_at":"2026-06-17T08:31:34.806Z","avatar_url":"https://github.com/LeftCoastNerdGirl.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Using PySpark to analyze large data sets\n\n# Data prep\n\n- Imported the tools needed for the analysis.  \n- Created a Spark session to enable the work.  \n- Read the AWS data file and formatted in a dataframe.  
\n\n# Using a temp view\n\n- Created a temp view so the large data set could be queried with SQL.  \n- Created 4 SQL queries to answer the following questions:  \n    - What is the average price for a four-bedroom house sold for each year? Round off your answer to two decimal places.  \n    - What is the average price of a home for each year the home was built, that has three bedrooms and three bathrooms? Round off your answer to two decimal places.  \n    - What is the average price of a home for each year the home was built, that has three bedrooms, three bathrooms, two floors, and is greater than or equal to 2,000 square feet? Round off your answer to two decimal places.  \n    - What is the average price of a home per \"view\" rating having an average home price greater than or equal to $350,000? Determine the run time for this query, and round off your answer to two decimal places.  \n\n![image](https://github.com/user-attachments/assets/d3226573-d135-4d66-accf-235a2992a87e)\n\n![image](https://github.com/user-attachments/assets/142fec24-1ac2-4640-b5a5-baaf02667679)\n\n\n# Compare options to decrease run time\n\nRan the same query 3 different ways to compare run times.  \n\n- Added a run time calculation for the 4th query above.  \n    - Time: 1.2482097148895264 seconds  \n- Cached the temporary table and verified that it was cached.  \n    - Time: 0.529306173324585 seconds  \n- Partitioned the data.  \n    - Time: 0.9826006889343262 seconds  \n\n![image](https://github.com/user-attachments/assets/54084d54-5db6-4f6b-933e-e9678986d266)\n\n# Conclusion\n\nCaching the temporary table significantly decreased the run time: the cached query ran in less than half the time of the uncached query.  \nThe expectation was that partitioning the data would speed up the query even further, but that wasn't true in this case: the partitioned run (0.98 seconds) was slower than the cached run (0.53 seconds). 
\nPossible reasons:  \n\n- Small data set (33,000 rows).  \n- Data was partitioned on 'date built', but the query groups by view and filters by price.  \n\nNote that each time I've run the 3 test queries, the run times have been slightly different. My analysis is based on the output at the time I saved the notebook and downloaded it from Colab. If the queries are run again, the times listed above will differ.  \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleftcoastnerdgirl%2Fbig_data","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleftcoastnerdgirl%2Fbig_data","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleftcoastnerdgirl%2Fbig_data/lists"}