https://github.com/rizkipragustono/data_analysis_spark

Exploration: Data Analysis using Spark

apache-spark data-analysis pyspark python spark-sql sql

# Data Analysis using Spark
## Scenario
The HR department of a company has asked you to build a data pipeline that ingests employee data in CSV format. Your responsibilities include analyzing the data, applying any required transformations, and making it easy to extract useful insights from the processed data.

As the data engineer on the project, you are expected to use Apache Spark components to accomplish these tasks.
## Project Overview
Create a DataFrame by loading data from a CSV file, then apply transformations and actions to it using Spark SQL. The work is divided into the following tasks:

- Task 1: Generate DataFrame from CSV data.
- Task 2: Define a schema for the data.
- Task 3: Display schema of DataFrame.
- Task 4: Create a temporary view.
- Task 5: Execute an SQL query.
- Task 6: Calculate Average Salary by Department.
- Task 7: Filter and Display IT Department Employees.
- Task 8: Add 10% Bonus to Salaries.
- Task 9: Find Maximum Salary by Age.
- Task 10: Self-Join on Employee Data.
- Task 11: Calculate Average Employee Age.
- Task 12: Calculate Total Salary by Department.
- Task 13: Sort Data by Age and Salary.
- Task 14: Count Employees in Each Department.
- Task 15: Filter Employees Whose Names Contain the Letter `o`.