Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kizman-23/queries
Providing Insights, understanding and processing Big Data using SQL and PySpark
databricks pyspark structured-query-language
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/kizman-23/queries
- Owner: KizMan-23
- Created: 2024-11-08T00:15:38.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-12-13T18:23:57.000Z (about 2 months ago)
- Last Synced: 2024-12-13T19:27:52.100Z (about 2 months ago)
- Topics: databricks, pyspark, structured-query-language
- Language: Jupyter Notebook
- Homepage:
- Size: 19.8 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Queries is a repository of SQL and PySpark projects that query and process big data under different conditions. SQL is a widely used domain-specific language for processing data stored in relational databases.
PySpark is the Python API for Apache Spark, a powerful distributed computing framework designed for processing large-scale data. PySpark is widely used for data distributed across multiple storage systems and is notable for its integration with SQL, machine learning, and pandas through Spark SQL, MLlib, Structured Streaming, and the Pandas API on Spark.

[Employees SQL](employees_query.sql) is a project that uses SQL on company employee data to analyze problems and answer the questions that matter for understanding the scope of the data. SQL is used here primarily to answer questions that arise in business settings.
![employ sql 1](https://github.com/user-attachments/assets/7223501b-8821-4beb-a5aa-754800c0f1b5)
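
As a rough illustration of the kind of business question the employee analysis above addresses, the sketch below runs an aggregate SQL query through Python's built-in `sqlite3` module. The `employees` table, its columns, and its rows are hypothetical and invented for this example; they are not taken from `employees_query.sql`.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    department TEXT NOT NULL,
    salary REAL NOT NULL
);
INSERT INTO employees (name, department, salary) VALUES
    ('Ada', 'Engineering', 120000),
    ('Ben', 'Engineering', 100000),
    ('Cleo', 'Sales', 70000),
    ('Dev', 'Sales', 90000),
    ('Eve', 'Finance', 85000);
""")

# Business question: what is the headcount and average salary per department?
query = """
SELECT department,
       COUNT(*) AS headcount,
       AVG(salary) AS avg_salary
FROM employees
GROUP BY department
ORDER BY avg_salary DESC;
"""
department_summary = conn.execute(query).fetchall()

for department, headcount, avg_salary in department_summary:
    print(f"{department}: {headcount} employees, average salary {avg_salary:.0f}")

conn.close()
```

The same `GROUP BY` pattern extends to other questions of this type, such as the highest salary per department, by swapping the aggregate function.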
[Music Store Analysis](music_store_analysis.sql) showcases the use of SQL to understand artists, albums, tracks, and related problems, in the spirit of music platforms such as Spotify. The project follows a question-and-answer format and also gives insight into the complexities of writing complex SQL queries for business solutions.
![music store sql](https://github.com/user-attachments/assets/672d7f80-ff2a-4e8c-99e7-292f1f8107b1)
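
A question-and-answer style query over artists, albums, and tracks might look like the sketch below, again run through Python's built-in `sqlite3` module. The three tables, their columns, and the sample rows are assumptions made for this example and do not reflect the actual schema used in `music_store_analysis.sql`.

```python
import sqlite3

# Hypothetical three-table schema (artists -> albums -> tracks), for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE artists (
    artist_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE albums (
    album_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    artist_id INTEGER NOT NULL REFERENCES artists(artist_id)
);
CREATE TABLE tracks (
    track_id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    album_id INTEGER NOT NULL REFERENCES albums(album_id)
);
INSERT INTO artists VALUES (1, 'Artist A'), (2, 'Artist B');
INSERT INTO albums VALUES (1, 'First Album', 1), (2, 'Second Album', 1), (3, 'Debut', 2);
INSERT INTO tracks VALUES
    (1, 'Intro', 1), (2, 'Main Theme', 1), (3, 'Reprise', 2), (4, 'Opening', 3);
""")

# Question: how many albums and tracks does each artist have?
query = """
SELECT ar.name,
       COUNT(DISTINCT al.album_id) AS album_count,
       COUNT(t.track_id) AS track_count
FROM artists AS ar
JOIN albums AS al ON al.artist_id = ar.artist_id
JOIN tracks AS t ON t.album_id = al.album_id
GROUP BY ar.artist_id, ar.name
ORDER BY track_count DESC, ar.name;
"""
artist_summary = conn.execute(query).fetchall()

for name, album_count, track_count in artist_summary:
    print(f"{name}: {album_count} albums, {track_count} tracks")

conn.close()
```

`COUNT(DISTINCT al.album_id)` is needed because the join to `tracks` repeats each album once per track; a plain `COUNT` would overstate the number of albums.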
[sql-practice 1,2,3](sql-practice-1.com.json) are JSON files of SQL solutions I wrote for problems from the sql_practice website. The website offers business-related problems and expects an SQL solution for each one, which helps in understanding a business better and in finding growth insights for it.

PySpark, as an Apache Spark API, is accessible through the data analytics platform Databricks. All PySpark projects were carried out in Databricks workspace notebooks.
[Employee On PySpark](employees_in_pyspark.ipynb) is a replication and repurposing of the SQL version of the employee problems, in which the company_employees problems were solved using PySpark. The project shows the similarities and difficulties of using SQL and PySpark to provide business solutions.
![emp on spark](https://github.com/user-attachments/assets/fd866a58-0180-4bf2-8d7e-6537c0869e90)
[spotify streams on pyspark](pyspark_on_spotify_streams.ipynb) is a typical analysis of track, artist, and album data across streaming platforms such as Spotify, YouTube, and TikTok. The project showcases PySpark as an analytical tool for understanding and measuring the stream numbers behind track and artist performance.
![basic ml spar](https://github.com/user-attachments/assets/e402d84e-93fd-4534-902e-10bb5f25d86b)
[Basic ML on Pyspark](basic_ml_on_pyspark.ipynb) continues the exploration of PySpark's capabilities. Using Spark's MLlib functions, classical regression and classification models can be trained on Resilient Distributed Datasets (RDDs), a core data structure of the Spark framework.
![basic ml spark 2](https://github.com/user-attachments/assets/d5a3123b-da8b-404c-8df8-b524acfddc20)