Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/labrijisaad/chefclub-data-internship

Repository showcasing my Data Engineer / Scientist internship at Chefclub, contributing to data infrastructure enhancement and fostering data-driven insights.
https://github.com/labrijisaad/chefclub-data-internship

airflow chefclub data-engineering data-science gcp scikit-learn

Last synced: about 6 hours ago
JSON representation

Repository showcasing my Data Engineer / Scientist internship at Chefclub, contributing to data infrastructure enhancement and fostering data-driven insights.

Awesome Lists containing this project

README

        


Chefclub Logo


Chefclub Data Engineer / Data Scientist Internship


πŸ“’ Take a look at the End of Study Report for a detailed view of this internship!


---

**Welcome** to the repository that captures my journey as a **Data Engineer / Data Scientist** intern at **Chefclub**, a respected digital cooking brand 🍽️ based in Paris.

## Overview 🌟

- **Company**: Chefclub
- **Duration**: January 2023 - July 2023 (6 months)
- **Location**: Paris, Île-de-France, France
- **Type**: Data Engineer / Data Scientist Internship

## Objective πŸš€

My main focus during this internship was to **strengthen Chefclub's data groundwork** by refining how we gather, store, and study information. This mission aimed to help the company effectively use data from social media. I got hands-on with a toolkit that included **Data Engineering, Cloud Computing, Data Science, and Data Analytics**.

## Accomplishments πŸ†

Throughout the internship, I achieved the following milestones:

1. **`YouTube Data Retrieval System`**: I conceptualized and implemented a YouTube data retrieval system, utilizing a range of technologies such as Airflow, Kubernetes, Docker, Python, GitHub, the YouTube Analytics API, and SQL. This system automates the collection and storage of performance data from Chefclub's YouTube channels, feeding it into BigQuery and Cloud Storage. Moreover, it generates dynamic reports in Looker Studio and financial reports in Google Sheets. The architecture of the system is visually depicted below:



Fig 1 : YouTube Analytics Data Retrieval Solution


2. **`Facebook Post Performance Analysis`**: I conducted an in-depth analysis of Chefclub's Facebook data to uncover valuable trends and insights. This analysis facilitated the identification of top-performing videos. The workflow utilized for this analysis is illustrated here:



Fig 2 : Data Analysis Workflow for Facebook Posts


3. **`Forecasting Facebook Posts Performance in Slack`**: I developed and integrated an automated machine learning model using Airflow, BigQuery, Jupyter Notebook, Scikit-Learn, and Docker. This model, updated daily, delivers forecasts of Facebook post performance based on historical page health. It is integrated with a Google Cloud Function for Slack communication. The workflow and integration process are presented below:



Fig 3 : Model Training Deployment Solution


Additionally, an overview of the model inference process via Slack is explained in the sequence diagram below:



Fig 4 : Sequence Diagram Model Inference via Slack


These solutions collectively enhanced Chefclub's decision-making capabilities and deepened their understanding of social media performance.

## Methodology πŸ“ˆ

For this internship, we followed an agile project management approach, specifically using the **SCRUM methodology**. This helped us work flexibly and efficiently, ensuring successful results.

## Tools and Technologies πŸ› οΈ

During the internship, I utilized a variety of tools and technologies, with the main ones being:


Technologies


- **Google Cloud Platform (GCP)**: Used as the primary cloud provider.
- **Airflow**: Employed for scheduling and orchestrating jobs.
- **Docker**: Utilized to encapsulate custom Python code into containers.
- **Kubernetes**: Employed for efficient management of Docker containers, orchestrated by Airflow jobs.
- **Python**: The primary programming language used for developing our solutions.
- **Plotly**: Empowered us to create visually interactive graphs.
- **Scikit-learn**: A robust library used for implementing various Machine Learning algorithms.
- **GitHub**: The chosen platform for collaborative development and version control of our code.
## Author πŸ‘€

- πŸ”— Feel free to connect with me on LinkedIn