Data Analysis using Spark
- Host: GitHub
- URL: https://github.com/pregismond/data-analysis-using-spark
- Owner: pregismond
- License: apache-2.0
- Created: 2024-04-27T21:08:37.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-02T11:48:53.000Z (12 months ago)
- Last Synced: 2025-02-02T09:11:15.237Z (3 months ago)
- Topics: coursera, ibm-skills-network, pyspark, python, spark-sql
- Language: Jupyter Notebook
- Size: 20.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Data Analysis using Spark

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

## Disclaimer
This repository contains my submission for the ***Final Project: Data Analysis using Spark***. The original files were provided by the IBM Skills Network as part of the **[Introduction to Big Data with Spark and Hadoop](https://www.coursera.org/learn/introduction-to-big-data-with-spark-hadoop)** course on Coursera. I have made modifications to fulfill the project requirements.
### Usage
* You are welcome to use this repository as a reference or starting point for your own project.
* If you choose to fork this repository, please ensure that you comply with the terms of the Apache License and give proper credit to the original authors.
## Project Scenario
As a data engineer, I’ve been assigned by our HR department to design a robust data pipeline capable of ingesting employee data in CSV format. My responsibilities include analyzing the data, implementing necessary transformations, and enabling the extraction of valuable insights from the processed data.
## Objectives
* Create a DataFrame from a CSV file
* Define a schema for the data
* Perform transformations and actions using Spark SQL

## Setup
Install the required libraries using the provided `requirements.txt` file. The command syntax is:
```bash
python3 -m pip install -r requirements.txt
```

Download the required `employees.csv` file using the terminal command:
```bash
wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-BD0225EN-SkillsNetwork/data/employees.csv
```

## Learner
[Pravin Regismond](https://www.linkedin.com/in/pregismond)
## Acknowledgments
* IBM Skills Network © IBM Corporation 2023. All rights reserved.