## Blossom Academy Data Engineering Cohort (Fall 2019)
This repository contains some of the projects implemented during the cohort.
### Project 1: ETL Pipeline Using Pandas
**Purpose:**
The aim of this project was to build a basic **ETL pipeline** that reads data from **Amazon Simple Storage Service (S3)**, transforms it, and loads the output back into an S3 bucket.
**Tools Used:**
- Jupyter Notebook
- Pandas
- Amazon Simple Storage Service (S3)
**Task:**
Write a Python script that:
1. Downloads the 7+ million companies dataset from S3 *(bucket: blossom-data-engs, key: project1/free-7-million-company-dataset.zip)*.
2. Reads the file with pandas.
3. Filters out companies without a domain name.
4. Writes the output in Parquet and JSON formats.
5. Uploads the resulting files to the S3 bucket **blossom-data-eng-dennis** (see the sketch after this list).
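A minimal sketch of what such a script might look like. It assumes `boto3` credentials are already configured, that the archive contains a single CSV, and that the relevant column is named `domain` (the last two are assumptions about the dataset, not facts from the task):

```python
# Minimal ETL sketch for Project 1. Assumes boto3 credentials are configured,
# the archive contains a single CSV, and the column is named "domain".
import io
import zipfile

import boto3
import pandas as pd

s3 = boto3.client("s3")

# Extract: download the zipped dataset from the course bucket.
obj = s3.get_object(Bucket="blossom-data-engs",
                    Key="project1/free-7-million-company-dataset.zip")
with zipfile.ZipFile(io.BytesIO(obj["Body"].read())) as zf:
    df = pd.read_csv(zf.open(zf.namelist()[0]))

# Transform: keep only companies that have a domain name.
df = df[df["domain"].notna()]

# Load: write Parquet and JSON locally, then upload both to the target bucket.
df.to_parquet("companies.parquet")
df.to_json("companies.json", orient="records", lines=True)
for fname in ("companies.parquet", "companies.json"):
    s3.upload_file(fname, "blossom-data-eng-dennis", fname)
```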
### Project 2: Batch Processing for Data Mining
**Purpose:**
The aim of this project is to identify the top keywords that companies in various US cities require data science candidates to have on their resumes.
**Tools Used:**
- Jupyter Notebook
- Pandas
- PySpark
- Amazon Simple Storage Service (S3)
**Task:**
1. Load the data scientist job market dataset and the US stocks dataset from the S3 bucket *s3://blossom-data-engs* onto your computer.
2. Read the data with PySpark.
3. Read `alldata.csv` from the data scientist dataset.
4. Join the two datasets.
5. Write a function that generates n-grams (unigrams and bigrams) from a given text/description.
6. Write another function that uses the function from (5) to create two Spark DataFrames, each with three columns and ordered by frequency (see the sketch after this list):
   - {Ngram, City, Frequency}
   - {Ngram, Industry, Frequency}
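A hedged sketch of steps 5 and 6 using `pyspark.ml.feature.Tokenizer` and `NGram`. The column names (`description`, `city`) and the input path are assumptions about the dataset's schema; the `{Ngram, Industry, Frequency}` table would call the same function with the industry column brought in by the join with the stocks dataset:

```python
# Sketch of the n-gram frequency tables. Column names ("description", "city")
# and the input path are assumptions about the dataset's schema.
from pyspark.ml.feature import NGram, Tokenizer
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode


def ngram_frequency(df, text_col, group_col, n):
    """Return a DataFrame of {ngram, <group_col>, frequency}, ordered by frequency."""
    df = df.na.drop(subset=[text_col])  # Tokenizer cannot handle null text
    tokens = Tokenizer(inputCol=text_col, outputCol="tokens").transform(df)
    ngrams = NGram(n=n, inputCol="tokens", outputCol="ngrams").transform(tokens)
    return (ngrams
            .select(explode(col("ngrams")).alias("ngram"), group_col)
            .groupBy("ngram", group_col)
            .count()
            .withColumnRenamed("count", "frequency")
            .orderBy(col("frequency").desc()))


spark = SparkSession.builder.appName("batch-mining").getOrCreate()
jobs = spark.read.csv("alldata.csv", header=True, inferSchema=True)

unigrams_by_city = ngram_frequency(jobs, "description", "city", n=1)
bigrams_by_city = ngram_frequency(jobs, "description", "city", n=2)
unigrams_by_city.show(10)
```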
### Project 3: Basic End-to-End ETL Pipeline Using PySpark
**Purpose:**
The aim of this project is to extract data (CSV), transform it, and load it into a Postgres database using PySpark.
**Tools Used:**
- PySpark
- Pandas
- SQL
- Postgres
- DBeaver
**Task:**
***1. Data Extraction***
- Load the datasets (questions, answers, and users) using PySpark.
***2. Data Transformation***
- Select users from only one country of your choosing.
- Extract the country and city into new columns.
- Join this with the questions and only pick questions with at least 20 view_counts.
- Join the answers to the result of the previous step.
- Use this to return the minimum updated_at time (see the sketch after this list).
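A hedged sketch of the extraction and transformation steps. The file paths, the column names (`location`, `view_count`, `updated_at`, and the join keys), and the choice of country are all assumptions about the dataset:

```python
# Sketch of Project 3 extraction + transformation. Paths, column names,
# join keys, and the country choice are assumptions about the dataset.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, min as spark_min, split, trim

spark = SparkSession.builder.appName("stackoverflow-etl").getOrCreate()

users = spark.read.csv("users.csv", header=True, inferSchema=True)
questions = spark.read.csv("questions.csv", header=True, inferSchema=True)
answers = spark.read.csv("answers.csv", header=True, inferSchema=True)

# Keep users from one country, splitting "city, country" out of location.
loc = split(col("location"), ",")
users = (users
         .withColumn("city", trim(loc.getItem(0)))
         .withColumn("country", trim(loc.getItem(1)))
         .filter(col("country") == "Ghana"))          # arbitrary country choice

# Questions with at least 20 views, joined to the filtered users, then answers.
popular = questions.filter(col("view_count") >= 20)
results = (users
           .join(popular, users["id"] == popular["user_id"])
           .join(answers, popular["id"] == answers["question_id"]))

# Minimum updated_at time (taken from the answers here; an assumption).
results.select(spark_min(answers["updated_at"]).alias("min_updated_at")).show()
```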
***3. Data Loading***
- Create a new schema called *stackoverflow_filtered*.
- Create one table called results.
- Use Spark to write the results into this table with the snippet below.
- Create a btree index on the reputation column within the results table.
- Create a hash index on the display_name column within the results table.
- From the results table, create a view with the columns display_name, city, and questions_id where the accepted_answer_id is not null. Make sure this view goes into the right schema.
- Create a materialized view similar to the view above; the two should have different names.
- In your Jupyter notebook, state the difference between views and materialized views.
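The original assignment supplied its own write snippet, which isn't reproduced in this repo, so the following is a stand-in sketch: the JDBC URL, credentials, database name, and view names are all assumptions. Note that the schema must exist before Spark can write the table into it:

```python
# Hedged loading sketch: JDBC URL, credentials, and view names are
# assumptions; `results` is the DataFrame from the transformation sketch.
import psycopg2

conn = psycopg2.connect("dbname=stackoverflow user=postgres password=postgres")
cur = conn.cursor()

# The schema must exist before Spark writes the table into it.
cur.execute("CREATE SCHEMA IF NOT EXISTS stackoverflow_filtered;")
conn.commit()

# Write the results DataFrame into stackoverflow_filtered.results over JDBC.
results.write.jdbc(
    url="jdbc:postgresql://localhost:5432/stackoverflow",
    table="stackoverflow_filtered.results",
    mode="overwrite",
    properties={"user": "postgres", "password": "postgres",
                "driver": "org.postgresql.Driver"},
)

# Indexes plus the view and materialized view described above.
cur.execute("""
    CREATE INDEX results_reputation_btree
        ON stackoverflow_filtered.results USING btree (reputation);
    CREATE INDEX results_display_name_hash
        ON stackoverflow_filtered.results USING hash (display_name);

    CREATE VIEW stackoverflow_filtered.accepted_answers AS
        SELECT display_name, city, questions_id
        FROM stackoverflow_filtered.results
        WHERE accepted_answer_id IS NOT NULL;

    CREATE MATERIALIZED VIEW stackoverflow_filtered.accepted_answers_mv AS
        SELECT display_name, city, questions_id
        FROM stackoverflow_filtered.results
        WHERE accepted_answer_id IS NOT NULL;
""")
conn.commit()
```

On the last item: a plain view re-runs its query on every read, while a materialized view stores the result set physically and only changes when it is refreshed.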
***4. Data Manipulation***
- How many cities appeared more than twice in your results table?
- How many unique created_at dates (not datetimes) are in the results table?
- If you were to give an award to one user, who would it be, and why? (Possible queries for the first two questions follow this list.)
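Possible SQL for the first two questions, reusing the `psycopg2` cursor from the loading sketch above (the award question is a judgment call, not a query):

```python
# Cities that appear more than twice in the results table.
cur.execute("""
    SELECT COUNT(*)
    FROM (SELECT city
          FROM stackoverflow_filtered.results
          GROUP BY city
          HAVING COUNT(*) > 2) AS repeated_cities;
""")
print("cities appearing more than twice:", cur.fetchone()[0])

# Unique created_at dates, with the timestamp truncated to a date.
cur.execute("""
    SELECT COUNT(DISTINCT created_at::date)
    FROM stackoverflow_filtered.results;
""")
print("unique created_at dates:", cur.fetchone()[0])
```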
### Project 4: Web Scraping Using Beautiful Soup
**Purpose:**
The aim of this project is to scrape information from [Meqasa](https://www.meqasa.com).
**Tools Used:**
- Python
- Beautiful Soup
- Jupyter Notebook
**Task:**
1. Write a Python script that scrapes listing information from Meqasa and outputs a CSV file with the following columns (a hedged sketch follows this list):
- property
- beds
- showers
- garages
- area
- description
- price
- currency
- rent_period
- url
- address
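A hedged sketch of such a scraper. The listing URL and every selector and class name below are placeholders, since Meqasa's actual markup has to be inspected in the browser first:

```python
# Scraper sketch: the URL and all selectors/class names are placeholders,
# not Meqasa's real markup.
import csv

import requests
from bs4 import BeautifulSoup

FIELDS = ["property", "beds", "showers", "garages", "area", "description",
          "price", "currency", "rent_period", "url", "address"]


def text_or_blank(node):
    """Stripped text of a tag, or '' when the tag was not found."""
    return node.get_text(strip=True) if node else ""


resp = requests.get("https://www.meqasa.com",
                    headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

rows = []
for card in soup.select(".property-listing"):         # placeholder selector
    link = card.find("a")
    rows.append({
        "property": text_or_blank(card.find("h2")),
        "beds": text_or_blank(card.find(class_="bed")),
        "showers": text_or_blank(card.find(class_="shower")),
        "garages": text_or_blank(card.find(class_="garage")),
        "area": text_or_blank(card.find(class_="area")),
        "description": text_or_blank(card.find("p")),
        "price": text_or_blank(card.find(class_="price")),
        "currency": "",            # in practice parsed out of the price text
        "rent_period": "",         # likewise parsed from the listing text
        "url": link["href"] if link else "",
        "address": text_or_blank(card.find(class_="address")),
    })

with open("meqasa_listings.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```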