https://github.com/essiencodecraft/blossom

Host: GitHub
URL: https://github.com/essiencodecraft/blossom
Owner: EssienCodeCraft
Created: 2024-12-10T16:39:52.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-03T03:16:40.000Z (about 1 year ago)
Last Synced: 2025-02-13T09:36:31.739Z (about 1 year ago)
Language: Jupyter Notebook
Size: 18.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

## This profile contains some of the projects implemented during the Blossom Academy Data Engineering Cohort for Fall 2019.

### Project 1 : ETL Pipeline Using Pandas.

**Purpose:**

The aim of this project was to build a basic **ETL pipeline** that reads data from **Amazon Simple Storage Service (S3)**, transforms it, and loads the output into an S3 bucket.

**Tools Used:**

- Jupyter Notebook

- Pandas

- Amazon Simple Storage Service (S3)

**Task:**

1. Write a Python script with the following features:

2. Download the 7+ million company dataset from S3 *[bucket: blossom-data-engs key:-project1/free-7-million-company-dataset.zip]*.

3. Read the file with pandas.

4. Filter out companies without a domain name using pandas.

5. Write the output in the following formats: Parquet, JSON.

6. Upload the 3 resulting files to your S3 bucket **blossom-data-eng-dennis**.
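
The filtering step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column name `domain`, the helper `drop_companies_without_domain`, and the sample rows are assumptions made up for the example.

```python
import pandas as pd


def drop_companies_without_domain(companies: pd.DataFrame,
                                  domain_col: str = "domain") -> pd.DataFrame:
    """Keep only rows whose domain value is present and not blank.

    `domain_col` is an assumed column name; adjust it to the real dataset.
    """
    domains = companies[domain_col]
    has_domain = domains.notna() & (domains.astype(str).str.strip() != "")
    return companies[has_domain].reset_index(drop=True)


# Tiny made-up sample standing in for the real 7+ million row file.
sample = pd.DataFrame({
    "name": ["Acme", "NoSite Ltd", "Blank Co"],
    "domain": ["acme.com", None, "  "],
})
filtered = drop_companies_without_domain(sample)
print(filtered)

# With the real data, the surrounding steps might look like this
# (left as comments because they need S3 credentials and extra packages):
# companies = pd.read_csv("free-7-million-company-dataset.zip")  # zip with one CSV
# filtered.to_parquet("companies.parquet")  # requires pyarrow or fastparquet
# filtered.to_json("companies.json", orient="records", lines=True)
```

Keeping the filter in its own function makes it easy to check on a small frame before running it against the full dataset.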

### Project 2 : Batch Processing for Data Mining

**Purpose:**

The aim of this project is to identify the top keywords that companies in various US cities expect data science candidates to have on their resumes.

**Tools Used:**

- Jupyter Notebook

- Pandas

- PySpark

- Amazon Simple Storage Service (S3)

**Task:**

1. Load the data scientist job market dataset and the US stocks datasets from the S3 bucket *‘s3://blossom-data-engs’* onto your computer.

2. Read the data with PySpark.

3. Read alldata.csv from the data scientist datasets.

4. Join the 2 datasets.

5. Write a function to generate n-grams (unigrams and bigrams) from a given text/description.

6. Write another function that uses the function from (5) to create 2 Spark data frames, each with 3 columns and ordered by frequency:

{Ngram, City, Frequency}

{Ngram, Industry, Frequency}
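
As a rough illustration of steps 5 and 6, the sketch below generates unigrams and bigrams and counts them per group in plain Python. It is not the PySpark implementation the task asks for. The function names, the tokenisation rule, the `description` and `city` keys, and the sample rows are all assumptions, and "ordered by frequency" is read here as most frequent first.

```python
import re
from collections import Counter


def generate_ngrams(text, n):
    """Return the list of n-grams (as space-joined strings) found in `text`."""
    # Assumed tokenisation: lowercase runs of letters, digits, '+' and '#'.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_frequencies(rows, group_key, n_values=(1, 2)):
    """Count (ngram, group) pairs and return them most frequent first."""
    counts = Counter()
    for row in rows:
        for n in n_values:
            for gram in generate_ngrams(row["description"], n):
                counts[(gram, row[group_key])] += 1
    return counts.most_common()


# Made-up rows standing in for the joined dataset.
rows = [
    {"description": "Python and SQL", "city": "Boston"},
    {"description": "Python", "city": "Boston"},
]
for (ngram, city), frequency in ngram_frequencies(rows, "city"):
    print(ngram, city, frequency)
```

In PySpark the same idea would typically be expressed by exploding the n-gram list into rows and grouping by the n-gram and the city or industry column.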

### Project 3 : Basic End to End ETL Pipeline using PySpark

**Purpose:**

The aim of this project is to extract data (CSV), transform it, and load it into a Postgres database using PySpark.

**Tools Used:**

- PySpark

- Pandas

- SQL

- Postgres

- DBeaver

**Task:**

***1. Data Extraction***

- Load the datasets (questions, answers and users) using PySpark.

***2. Data Transformation***

- Select users from only one country of your choosing.

- Extract the country and city into new columns.

- Join this with the questions and only pick questions with at least 20 view_counts.

- Join the answers to the results of step 3.

- Use this to return the minimum updated_at time.

***3. Data Loading***

- Create a new schema called *stackoverflow_filtered*.

- Create one table called results. 

- Use Spark to write the results into this table with the snippet below.

- Create a btree index on the reputation column within the results table.

- Create a hash index on the display_name column within the results table.

- From the results table, create a view with the column names display_name, city, questions_id where the accepted_answer_id is not null. Make sure this view goes into the right schema.

- Create a materialized view similar to the view in the previous step. They should have different names.

- In your Jupyter notebook, state the difference between views and materialized views.
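
The schema, index and view steps above could be expressed with Postgres statements along these lines. This is only a sketch: the index names and the view names `answered_questions` and `answered_questions_mat` are hypothetical, and the `results` table itself is assumed to exist already (for example, created by the Spark write) before the index and view statements run.

```python
SCHEMA = "stackoverflow_filtered"


def build_ddl(schema=SCHEMA, table="results"):
    """Return the Postgres statements for the loading steps, in run order."""
    qualified = f"{schema}.{table}"
    view_query = (
        f"SELECT display_name, city, questions_id FROM {qualified} "
        "WHERE accepted_answer_id IS NOT NULL"
    )
    return [
        # Run this first; the remaining statements need the results table.
        f"CREATE SCHEMA IF NOT EXISTS {schema};",
        f"CREATE INDEX idx_results_reputation ON {qualified} USING btree (reputation);",
        f"CREATE INDEX idx_results_display_name ON {qualified} USING hash (display_name);",
        # Schema-qualified names keep both views out of the default schema.
        f"CREATE VIEW {schema}.answered_questions AS {view_query};",
        f"CREATE MATERIALIZED VIEW {schema}.answered_questions_mat AS {view_query};",
    ]


for statement in build_ddl():
    print(statement)
```

On the last bullet: a regular view stores only its query and re-runs it on every read, whereas a materialized view stores the query result and stays unchanged until `REFRESH MATERIALIZED VIEW` is run.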

***4. Data Manipulation***

- How many cities appeared more than twice in your results table?

- How many unique created_at dates (not datetimes) are in the results table?

- If you were to give an award to one user, who would it be, and why?
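
The first two questions map onto simple aggregate queries. The sketch below runs them against an in-memory SQLite table as a stand-in for Postgres, so it can be tried without a database server. The rows are made up for illustration and the table holds only the columns these two queries need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (display_name TEXT, city TEXT, created_at TEXT)")
# Made-up rows; the real table comes from the transformation step.
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("a", "Accra", "2019-01-01 10:00:00"),
        ("b", "Accra", "2019-01-01 12:30:00"),
        ("c", "Accra", "2019-01-02 09:00:00"),
        ("d", "Kumasi", "2019-01-02 11:00:00"),
        ("e", "Kumasi", "2019-01-03 08:00:00"),
    ],
)

# Cities that appear more than twice: group, filter with HAVING, then count.
cities_more_than_twice = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT city FROM results GROUP BY city HAVING COUNT(*) > 2)"
).fetchone()[0]

# Unique calendar dates: strip the time part before counting distinct values.
# In Postgres the cast could be written as created_at::date.
unique_dates = conn.execute(
    "SELECT COUNT(DISTINCT DATE(created_at)) FROM results"
).fetchone()[0]

print(cities_more_than_twice, unique_dates)
```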

### Project 4 : Web Scraping Using Beautiful Soup

**Purpose:**

The aim of this project is to scrape information from [Meqasa](https://www.meqasa.com).

**Tools Used:**

- Python

- Jupyter Notebook

**Task:**

1. Write a Python script that scrapes information from Meqasa and outputs a CSV file with the following columns:

- property 

- beds

- showers

- garages

- area

- description

- price

- currency

- rent_period

- url

- address
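
The output side of the task can be sketched with the standard library alone. The example below covers only writing the CSV with the columns listed above. The scraping itself is omitted because it depends on the site's page structure, and the helper name `write_listings` and the sample listing are made up for illustration.

```python
import csv
import io

# Column order taken from the task description.
COLUMNS = [
    "property", "beds", "showers", "garages", "area", "description",
    "price", "currency", "rent_period", "url", "address",
]


def write_listings(listings, out_file):
    """Write scraped listings (dicts) as CSV; missing fields become empty cells."""
    writer = csv.DictWriter(
        out_file, fieldnames=COLUMNS, restval="", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(listings)


# Made-up listing standing in for one scraped result.
buffer = io.StringIO()
write_listings(
    [{"property": "Example apartment", "beds": 2, "price": 1500, "currency": "GHS"}],
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("listings.csv", "w", newline="", encoding="utf-8")` in place of the `StringIO` buffer.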
