Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/faisal-fida/ecommerce-etl-pipeline
Ecommerce data into mobile search index (Data pipeline) using Python, Algolia, and Google Cloud for scalability and efficiency
algolia data-pipeline etl google-cloud python web-scraping
- Host: GitHub
- URL: https://github.com/faisal-fida/ecommerce-etl-pipeline
- Owner: faisal-fida
- Created: 2023-03-15T14:23:58.000Z (almost 2 years ago)
- Default Branch: master
- Last Pushed: 2024-08-18T06:26:40.000Z (5 months ago)
- Last Synced: 2024-11-10T21:16:07.696Z (about 2 months ago)
- Topics: algolia, data-pipeline, etl, google-cloud, python, web-scraping
- Language: Jupyter Notebook
- Homepage:
- Size: 605 KB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: readme.md
README
# Ecommerce Mobile Search Pipeline
In this project, I developed a robust system to scrape data from the ecommerce website [Thredup](https://www.thredup.com/), process it for quality, and integrate it into a mobile application for efficient searching. It uses Python for web scraping, Pandas for data preprocessing, Algolia for indexing and search, Google Cloud Firestore for data storage, and Google Cloud Run with Docker for deployment.

## Architecture 🚀
### Part 1: Web Scraping Architecture
![1](https://github.com/faisal-fida/Ecommerce-ETL-Pipeline/assets/69955157/911443a0-ed7f-4dbf-8853-7abe35366674)
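As a rough sketch of this stage (the listing URL, request headers, and CSS selectors below are assumptions for illustration, not the targets the repository actually uses in `scraping_list_product_modules.py`), one listing page can be fetched and parsed with `requests` and BeautifulSoup:

```
import requests
from bs4 import BeautifulSoup

# Hypothetical listing URL and selectors -- the real targets differ and live in the scraper modules.
LISTING_URL = "https://www.thredup.com/products/women"

def scrape_listing_page(url: str) -> list[dict]:
    """Fetch one listing page and extract basic product fields from each product card."""
    response = requests.get(url, timeout=30, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    products = []
    for card in soup.select("div.product-card"):  # selector is an assumption
        link = card.select_one("a.title")
        price = card.select_one("span.price")
        if link is None or price is None:
            continue  # skip cards that do not match the expected layout
        products.append({
            "title": link.get_text(strip=True),
            "price": price.get_text(strip=True),
            "url": link["href"],
        })
    return products

if __name__ == "__main__":
    for product in scrape_listing_page(LISTING_URL):
        print(product)
```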
### Part 2: Data Processing and Validation Workflow
![2](https://github.com/faisal-fida/Ecommerce-ETL-Pipeline/assets/69955157/fc4477e8-8dfc-4f10-b18f-d056ec942e93)
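A minimal sketch of the kind of cleaning this stage performs with Pandas (the column names are assumptions, not the repository's actual schema): rows missing required fields are dropped, prices are normalised to numbers, and duplicates are removed.

```
import pandas as pd

def clean_products(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop broken rows, normalise types, and deduplicate before indexing."""
    df = raw.copy()

    # Require the fields a search record cannot do without.
    df = df.dropna(subset=["title", "price", "url"])

    # Normalise the price column ("$24.99" -> 24.99) and drop rows that fail to parse.
    df["price"] = pd.to_numeric(
        df["price"].astype(str).str.replace(r"[^0-9.]", "", regex=True),
        errors="coerce",
    )
    df = df.dropna(subset=["price"])

    # Keep one record per product URL.
    df = df.drop_duplicates(subset=["url"])
    return df.reset_index(drop=True)

if __name__ == "__main__":
    sample = pd.DataFrame([
        {"title": "Denim Jacket", "price": "$24.99", "url": "https://example.com/p/1"},
        {"title": "Denim Jacket", "price": "$24.99", "url": "https://example.com/p/1"},  # duplicate
        {"title": None, "price": "$9.99", "url": "https://example.com/p/2"},             # missing title
    ])
    print(clean_products(sample))
```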
### Part 3: Data Pipeline Workflow
![3](https://github.com/faisal-fida/Ecommerce-ETL-Pipeline/assets/69955157/38a8b3c2-538c-4afd-8775-b402d38c3d40)
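The last stage pushes cleaned records into Algolia and mirrors them into Firestore. Below is a hedged sketch of that step: the index name, collection name, and environment-variable names are assumptions, and the calls shown follow the standard `algoliasearch` and `google-cloud-firestore` client APIs rather than the repository's own code.

```
import os

from algoliasearch.search_client import SearchClient
from google.cloud import firestore

def publish(records: list[dict]) -> None:
    """Index cleaned product records in Algolia and persist them in Firestore."""
    # Credentials come from the environment; the variable names here are assumptions.
    algolia = SearchClient.create(os.environ["ALGOLIA_APP_ID"], os.environ["ALGOLIA_API_KEY"])
    index = algolia.init_index("products")  # index name is an assumption

    # Algolia requires an objectID per record; the product URL works as a stable key.
    for record in records:
        record["objectID"] = record["url"]
    index.save_objects(records)

    # Mirror the same records into Firestore for the application backend.
    db = firestore.Client()
    for record in records:
        doc_id = record["objectID"].rstrip("/").rsplit("/", 1)[-1]
        db.collection("products").document(doc_id).set(record)

if __name__ == "__main__":
    publish([{"title": "Denim Jacket", "price": 24.99, "url": "https://example.com/p/1"}])
```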
## Project Structure
The project repository contains the following directories and files:
- `data_processing/`: Contains scripts related to data processing and cleaning.
- `handle_database/`: Includes code for handling the product database and storage.
- `output/`: Stores output files generated during the data processing pipeline.
- `Dockerfile`: Defines the instructions to build a Docker image for this project.
- `Initial_Products_Scraper.ipynb`: Jupyter Notebook file containing the initial product scraping code.
- `run_image.py`: Script to run the Docker image on GCP.
- `scraping_list_product_modules.py`: Contains modules for scraping product listings.

## Installation
To run this project locally, follow these steps:
1. Clone the repository:
```
git clone https://github.com/faisal-fida/Ecommerce-ETL-Pipeline
```
2. Install the required Python dependencies using Pipenv:
```
pipenv install
```
3. Set up any necessary configurations and environment variables.
4. Run the main script:
```
python main.py
```
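Step 3 above is deliberately open-ended; for a pipeline like this it usually means providing Algolia and Google Cloud credentials before the main script runs. The snippet below is a minimal sketch of reading such configuration at startup: the `ALGOLIA_*` variable names are assumptions, not taken from the repository, while `GOOGLE_APPLICATION_CREDENTIALS` is the standard variable the Google Cloud client libraries resolve on their own.

```
import os

# Hypothetical variable names -- check the repository's own configuration for the real ones.
ALGOLIA_APP_ID = os.environ["ALGOLIA_APP_ID"]    # Algolia application ID
ALGOLIA_API_KEY = os.environ["ALGOLIA_API_KEY"]  # Algolia write/admin API key

# Firestore (and Cloud Run deploys) authenticate via a service-account key file referenced by
# GOOGLE_APPLICATION_CREDENTIALS, e.g. export GOOGLE_APPLICATION_CREDENTIALS=/path/to/key.json
```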