https://github.com/labrijisaad/chefclub-data-internship
Repository showcasing my Data Engineer / Scientist internship at Chefclub, contributing to data infrastructure enhancement and fostering data-driven insights.
airflow chefclub data-engineering data-science gcp scikit-learn
- Host: GitHub
- URL: https://github.com/labrijisaad/chefclub-data-internship
- Owner: labrijisaad
- Created: 2023-08-12T11:46:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2023-08-19T19:32:07.000Z (about 1 year ago)
- Last Synced: 2024-01-27T18:42:13.142Z (9 months ago)
- Topics: airflow, chefclub, data-engineering, data-science, gcp, scikit-learn
- Homepage:
- Size: 12.5 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Chefclub Data Engineer / Data Scientist Internship
📢 Take a look at the End of Study Report for a detailed view of this internship!
---
**Welcome** to the repository that captures my journey as a **Data Engineer / Data Scientist** intern at **Chefclub**, a respected digital cooking brand 🍽️ based in Paris.
## Overview
- **Company**: Chefclub
- **Duration**: January 2023 - July 2023 (6 months)
- **Location**: Paris, Île-de-France, France
- **Type**: Data Engineer / Data Scientist Internship
## Objective
My main focus during this internship was to **strengthen Chefclub's data groundwork** by refining how we gather, store, and study information. This mission aimed to help the company effectively use data from social media. I got hands-on with a toolkit that included **Data Engineering, Cloud Computing, Data Science, and Data Analytics**.
## Accomplishments
Throughout the internship, I achieved the following milestones:
1. **`YouTube Data Retrieval System`**: I conceptualized and implemented a YouTube data retrieval system, utilizing a range of technologies such as Airflow, Kubernetes, Docker, Python, GitHub, the YouTube Analytics API, and SQL. This system automates the collection and storage of performance data from Chefclub's YouTube channels, feeding it into BigQuery and Cloud Storage. Moreover, it generates dynamic reports in Looker Studio and financial reports in Google Sheets. The architecture of the system is visually depicted below:
Fig 1 : YouTube Analytics Data Retrieval Solution
2. **`Facebook Post Performance Analysis`**: I conducted an in-depth analysis of Chefclub's Facebook data to uncover valuable trends and insights. This analysis facilitated the identification of top-performing videos. The workflow utilized for this analysis is illustrated here:
Fig 2 : Data Analysis Workflow for Facebook Posts
3. **`Forecasting Facebook Posts Performance in Slack`**: I developed and integrated an automated machine learning model using Airflow, BigQuery, Jupyter Notebook, Scikit-Learn, and Docker. This model, updated daily, delivers forecasts of Facebook post performance based on historical page health. It is integrated with a Google Cloud Function for Slack communication. The workflow and integration process are presented below:
Fig 3 : Model Training Deployment Solution
Additionally, an overview of the model inference process via Slack is explained in the sequence diagram below:
Fig 4 : Sequence Diagram Model Inference via Slack
These solutions collectively enhanced Chefclub's decision-making capabilities and deepened their understanding of social media performance.
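To make the first step of the YouTube retrieval system more concrete, here is a minimal sketch of how a report could be reshaped before being loaded into BigQuery. It assumes the response follows the YouTube Analytics API report layout (`columnHeaders` plus `rows`). The function name `rows_to_records` and the sample values are hypothetical and are not taken from the Chefclub codebase.

```python
from typing import Any


def rows_to_records(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Turn a column-oriented report into one dict per row."""
    # Column names come from the "columnHeaders" entries, in order.
    names = [header["name"] for header in response.get("columnHeaders", [])]
    # "rows" may be missing or None when the report is empty.
    return [dict(zip(names, row)) for row in response.get("rows") or []]


# Hypothetical sample shaped like a YouTube Analytics report (values invented).
sample_response = {
    "columnHeaders": [
        {"name": "day", "columnType": "DIMENSION", "dataType": "STRING"},
        {"name": "views", "columnType": "METRIC", "dataType": "INTEGER"},
        {"name": "estimatedMinutesWatched", "columnType": "METRIC", "dataType": "INTEGER"},
    ],
    "rows": [
        ["2023-03-01", 1200, 3400],
        ["2023-03-02", 1500, 4100],
    ],
}

records = rows_to_records(sample_response)
for record in records:
    print(record)
```

One dict per row is a convenient shape because it maps directly onto newline-delimited JSON, a format that BigQuery load jobs accept.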
## Methodology
For this internship, we followed an agile project management approach, specifically the **Scrum** framework, which helped us work flexibly and efficiently.
## Tools and Technologies 🛠️
During the internship, I utilized a variety of tools and technologies, with the main ones being:
- **Google Cloud Platform (GCP)**: Used as the primary cloud provider.
- **Airflow**: Employed for scheduling and orchestrating jobs.
- **Docker**: Utilized to encapsulate custom Python code into containers.
- **Kubernetes**: Employed for efficient management of Docker containers, orchestrated by Airflow jobs.
- **Python**: The primary programming language used for developing our solutions.
- **Plotly**: Empowered us to create visually interactive graphs.
- **Scikit-learn**: A robust library used for implementing various Machine Learning algorithms.
- **GitHub**: The chosen platform for collaborative development and version control of our code.
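As a rough illustration of what "scheduling and orchestrating jobs" means in practice, the sketch below orders a set of dependent tasks so that each one runs only after its prerequisites. It uses Python's standard `graphlib` module rather than Airflow itself. The task names and the dependencies between them are assumptions made for the example, not the actual Chefclub DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
pipeline = {
    "extract_youtube_metrics": set(),
    "store_raw_in_cloud_storage": {"extract_youtube_metrics"},
    "load_into_bigquery": {"store_raw_in_cloud_storage"},
    "refresh_looker_studio_report": {"load_into_bigquery"},
    "update_google_sheets_report": {"load_into_bigquery"},
}


def execution_order(tasks: dict[str, set[str]]) -> list[str]:
    """Return one valid run order in which every task follows its dependencies."""
    return list(TopologicalSorter(tasks).static_order())


order = execution_order(pipeline)
for step, task in enumerate(order, start=1):
    print(step, task)
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this ordering idea. In the setup described above, each task would run as a Docker container managed by Kubernetes.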
## Author 👤
- Feel free to connect with me on LinkedIn