An open API service indexing awesome lists of open source software.

Projects in Awesome Lists tagged with pyspark-python

A curated list of projects in awesome lists tagged with pyspark-python.

https://github.com/asuiu/sparkorm

ORM for Apache Spark and schema manager for DataFrames

orm pyspark pyspark-python python python3 spark spark-orm spark-sql sparkql sqlalchemy sqlalchemy-orm

Last synced: 07 May 2025

https://github.com/sarthak-1408/pyspark-tutorial

In this repo, I build a PySpark tutorial to better understand how to read and manage big data.

machine-learning pyspark pyspark-mllib pyspark-python pyspark-tutorial python3

Last synced: 14 Apr 2025

https://github.com/vigneshss-07/pyspark-acompleteguide

This repo explains PySpark modules in Python, taking a practical, hands-on approach to working with big data.

pyspark pyspark-mllib pyspark-notebook pyspark-python pyspark-tutorial

Last synced: 13 Apr 2025

https://github.com/arturogonzalezm/convert_json_to_parquet

ETL (Extract, Transform, Load) job using PySpark - submodule

apache-spark etl etl-job etl-pipeline pyspark-python python python312

Last synced: 05 Mar 2026

https://github.com/camilajaviera91/pyspark-first-approach

This code demonstrates how to integrate PySpark with datasets and perform simple data transformations. It loads a sample dataset using PySpark's built-in functionalities or reads data from external sources and converts it into a PySpark DataFrame for distributed processing and manipulation.

curses fpdf google-oauth2 gspread kaggle kaggle-api matplotlib os pandas path pathlib pyspark-python pyspark-sql shutil sparksession

Last synced: 26 Jun 2025

https://github.com/abdelmajidlh/ml_diabet_predict_pyspark

Diabetes prediction using logistic regression with Python and PySpark

data-science logistic-regression machine-learning pyspark pyspark-mllib pyspark-python

Last synced: 22 Mar 2025

https://github.com/soumyadipta2020/pyspark-sample

Sample code and functions for PySpark

pyspark pyspark-python python

Last synced: 28 Jul 2025

https://github.com/mananabbasi/data-science-complete-project-using-big-data-tools-techniques-

This repository contains Databricks projects utilizing RDDs, DataFrames, and SQL to process and analyze various real-world datasets. Data cleaning and analysis have been performed using PySpark functions to handle challenges such as inconsistent formats, missing values, and complex data structures. The project ensures efficient data transformation

azure databricks databricks-industry-solutions databricks-notebooks dataframe pyspark-mllib pyspark-notebook pyspark-python python-script rdd

Last synced: 23 Jan 2026

https://github.com/vladkozhuhov/mindbox_test

Test assignments for Mindbox

csharp-library pyspark-python

Last synced: 07 May 2025

https://github.com/coderjolly/pyspark-yelp-data-analysis

A comparative study of the computing efficiency of PySpark architectures versus Python-based distributed programming methodologies such as MPI, multithreading, or multiprocessing, using the Yelp dataset from Kaggle.

distributed-system-design distributed-systems-challenges mpi multiprocessing multithreading pyspark pyspark-python

Last synced: 27 Mar 2025

https://github.com/mohammadreza-mohammadi94/pyspark-analytics-hub

A PySpark repository for data analysis, machine learning projects, and hands-on exercises. Explore scalable data processing and advanced ML workflows with Spark.

large-scale-pretraining machine-learning pyspark pyspark-mllib pyspark-python python

Last synced: 22 Feb 2025

https://github.com/pixelbyaj/apache-spark

Getting started with Apache Spark in Python (PySpark)

apache-spark pyspark-python python spark winutils

Last synced: 13 Oct 2025