{"id":23318708,"url":"https://github.com/essiencodecraft/blossom","last_synced_at":"2025-04-07T05:19:32.000Z","repository":{"id":267506555,"uuid":"901431381","full_name":"EssienCodeCraft/Blossom","owner":"EssienCodeCraft","description":null,"archived":false,"fork":false,"pushed_at":"2025-02-03T03:16:40.000Z","size":19,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-13T09:36:31.739Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/EssienCodeCraft.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-10T16:39:52.000Z","updated_at":"2025-02-03T03:16:43.000Z","dependencies_parsed_at":"2024-12-10T19:22:41.243Z","dependency_job_id":"bde6cf94-685a-4ebd-9579-02d42c5594f2","html_url":"https://github.com/EssienCodeCraft/Blossom","commit_stats":null,"previous_names":["essiencodecraft/blossom"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EssienCodeCraft%2FBlossom","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EssienCodeCraft%2FBlossom/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EssienCodeCraft%2FBlossom/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/EssienCodeCraft%2FBlossom/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/EssienCodeCraft","download_url":"h
ttps://codeload.github.com/EssienCodeCraft/Blossom/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247595334,"owners_count":20963943,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-20T17:18:09.128Z","updated_at":"2025-04-07T05:19:31.977Z","avatar_url":"https://github.com/EssienCodeCraft.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"## This profile contains some of the projects implemented during the Blossom Academy Data Engineering Cohort for Fall 2019.\r\n\r\n### Project 1 : ETL Pipeline Using Pandas.\r\n\r\n**Purpose:**\r\nThe aim of this project was to build a basic **ETL pipeline** to read data from **Amazon Simple Storage Service (S3)**, transform this data, then load the output into an S3 bucket.\r\n\r\n**Tools Used:**\r\n- Jupyter Notebook\r\n- Pandas\r\n- Amazon Simple Storage Service (S3)\r\n\r\n**Task:**\r\n1. Write a Python script with the following features:\r\n2. Download the 7+ Million Dataset from S3 *[bucket: blossom-data-engs key:-project1/free-7-million-company-dataset.zip].*\r\n3. Read the file with pandas.\r\n4. Filter out companies without a domain name using pandas.\r\n5. Write the output in the following formats (Parquet, JSON).\r\n6. 
Upload the resulting 3 files to your S3 bucket **blossom-data-eng-dennis**.\r\n\r\n\r\n\r\n### Project 2 : Batch Processing for Data Mining\r\n\r\n**Purpose:**\r\n\r\nThe aim of this project is to investigate the top keywords that companies in various US cities require data science candidates to have in their resumes.\r\n\r\n**Tools Used:**\r\n- Jupyter Notebook\r\n- Pandas\r\n- PySpark\r\n- Amazon Simple Storage Service (S3)\r\n \r\n**Task:**\r\n\r\n1. Load the data scientist job market dataset and US stocks datasets from the S3 bucket *‘s3://blossom-data-engs’* onto your computer.\r\n2. Read the data with PySpark.\r\n3. Read the alldata.csv from the data scientist datasets.\r\n4. Join the 2 datasets.\r\n5. Write a function to generate n-grams (unigram \u0026 bigram) from a given text/description. \r\n6. Write another function which uses the function from (5) to create 2 Spark DataFrames which have 3 columns in the order of frequency: \r\n{Ngram, City, Frequency}\r\n{Ngram, Industry, Frequency}\r\n\r\n\r\n\r\n### Project 3 : Basic End-to-End ETL Pipeline Using PySpark\r\n\r\n**Purpose:**\r\nThe aim of this project is to extract data (CSV), transform it, and load it into a Postgres database using PySpark.\r\n\r\n**Tools Used:**\r\n- PySpark\r\n- Pandas\r\n- SQL\r\n- Postgres\r\n- DBeaver\r\n\r\n**Task:**\r\n***1. Data Extraction***\r\n- Load the datasets (questions, answers and users) using PySpark.\r\n\r\n***2. Data Transformation***\r\n- Select users from only one country of your choosing.\r\n- Extract the country and city into new columns.\r\n- Join this with the questions and only pick questions with at least 20 view_counts.\r\n- Join the answers to the results of step 3.\r\n- Use this to return the minimum updated_at time.\r\n\r\n***3. Data Loading***\r\n- Create a new schema called *stackoverflow_filtered*.\r\n- Create one table called results. 
\r\n- Use Spark to write the results into this table with the snippet below.\r\n- Create a btree index on the reputation column within the results table.\r\n- Create a hash index on the display_name column within the results table.\r\n- From the results table, create a view with the column names display_name, city, questions_id where the accepted_answer_id is not null. Make sure this view goes into the right schema.\r\n- Create a materialized view similar to (6). They should have different names.\r\n- In your Jupyter notebook, state the difference between views and materialized views.\r\n\r\n***4. Data Manipulation***\r\n- How many cities appeared more than twice in your results table?\r\n- How many unique created_at dates (not datetimes) are in the results table?\r\n- If you were to give an award to one user, who would it be, and why?\r\n\r\n\r\n\r\n\r\n### Project 4 : Web Scraping Using Beautiful Soup\r\n\r\n**Purpose: The aim of this project is to scrape information from [Meqasa](https://www.meqasa.com)**\r\n\r\n**Tools Used:**\r\n- Python\r\n- Jupyter Notebook\r\n\r\n**Task:**\r\n1. Write a Python script that scrapes information from Meqasa and outputs a CSV file with the following structure:\r\n- property \r\n- beds\r\n- showers\r\n- garages\r\n- area\r\n- description\r\n- price\r\n- currency\r\n- rent_period\r\n- url\r\n- address\r\n\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fessiencodecraft%2Fblossom","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fessiencodecraft%2Fblossom","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fessiencodecraft%2Fblossom/lists"}