{"id":26489110,"url":"https://github.com/tedoaba/kaim-w7","last_synced_at":"2026-05-20T10:41:38.523Z","repository":{"id":257903946,"uuid":"870001110","full_name":"tedoaba/KAIM-W7","owner":"tedoaba","description":"Building a Data Warehouse to Store Data on Ethiopian Medical Business Data Scraped from Telegram Channels","archived":false,"fork":false,"pushed_at":"2024-11-25T05:30:02.000Z","size":24434,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-11-25T06:25:16.301Z","etag":null,"topics":["data-scraping","data-warehousing","dbt","kaim","medical-business-data","postgresql","sqlalchemy","store-data","telegram","telegram-api","telethon","yolov5"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tedoaba.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-09T09:15:44.000Z","updated_at":"2024-11-25T05:32:19.000Z","dependencies_parsed_at":null,"dependency_job_id":"23fc5cff-26fe-4fdd-849a-a8a1a16f48c4","html_url":"https://github.com/tedoaba/KAIM-W7","commit_stats":null,"previous_names":["tedoaba/kaim-w7"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tedoaba%2FKAIM-W7","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tedoaba%2FKAIM-W7/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tedoaba%2FKAIM-W7/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/tedoaba%2FKAIM-W7/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tedoaba","download_url":"https://codeload.github.com/tedoaba/KAIM-W7/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244566946,"owners_count":20473451,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-scraping","data-warehousing","dbt","kaim","medical-business-data","postgresql","sqlalchemy","store-data","telegram","telegram-api","telethon","yolov5"],"created_at":"2025-03-20T07:19:57.201Z","updated_at":"2026-05-20T10:41:33.482Z","avatar_url":"https://github.com/tedoaba.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Building a Data Warehouse to Store Data on Ethiopian Medical Business Data Scraped from Telegram Channels\n\n## KAIM Week 7 Challenges\n\n## Project Overview\n\nThis project aims to create a **data warehouse** for Ethiopian medical businesses by scraping relevant data from public **Telegram channels** and analyzing images through **object detection** using the **YOLO (You Only Look Once)** algorithm. The system includes processes for **data scraping**, **data cleaning**, **data transformation**, and **data storage**, as well as providing **API access** to the processed data.\n\n### Key Objectives:\n1. **Scraping Images from Telegram Channels**: Scrape images and metadata from specified channels using the Telegram API.\n2. **Data Warehousing**: Store scraped images and their metadata in a relational database.\n3. 
**Object Detection Preparation**: Set up data for object detection, ensuring proper storage and accessibility.\n4. **Data Transformation**: Use DBT (Data Build Tool) to transform the stored data for object detection and further processing.\n5. **API Development**: Develop an API to expose processed data for real-time insights and analysis.\n\n\n## Table of Contents\n\n- [Project Overview](#project-overview)\n- [Requirements](#requirements)\n- [Setup Instructions](#setup-instructions)\n- [Task Breakdown](#task-breakdown)\n  - [Task 1: Telegram Scraping](#task-1-telegram-scraping)\n  - [Task 2: Data Warehousing](#task-2-data-warehousing)\n  - [Task 3: Object Detection](#task-3-object-detection)\n  - [Task 4: Data Transformation with DBT](#task-4-data-transformation-with-dbt)\n  - [Task 5: API Development](#task-5-api-development)\n- [Project Structure](#project-structure)\n- [Challenges and Solutions](#challenges-and-solutions)\n\n## Requirements\n\n- **Python 3.x**\n- **Telethon** for Telegram API access\n- **SQLAlchemy** for database management\n- **PostgreSQL** or **SQLite** for data warehousing\n- **Pillow (PIL)** for image processing\n- **DBT (Data Build Tool)** for data transformation\n- **YOLOv5** (for object detection in future tasks)\n\n\n## Task Breakdown\n\n### Task 1: Telegram Scraping\n\n#### Overview:\nThis task focuses on scraping images from Telegram channels using the **Telethon** library. Images are downloaded into a local folder, and metadata is collected for each image, including:\n- File path\n- Source channel\n- Timestamp\n\n#### Key Files:\n- `scrape_telegram.py`: Handles the Telegram scraping and metadata extraction.\n\n### Task 2: Data Warehousing\n\n#### Overview:\nThis task stores the image metadata (from Task 1) into a relational database. 
The database helps manage image metadata, ensuring future scalability and accessibility for object detection.\n\n#### Database Schema:\n- **Table: images**\n    - `id`: Primary key (auto-increment).\n    - `file_path`: Path to the saved image.\n    - `source_channel`: The channel from where the image was scraped.\n    - `timestamp`: Time when the image was downloaded.\n\n#### Key Files:\n- `database.py`: Manages database operations, including storing image metadata.\n\n### Task 3: Object Detection\n\nIn the next phase, object detection will be performed on the scraped images using models like **YOLOv5**. This will involve:\n- Loading images from the database.\n- Running detection models on the images.\n- Storing results in the database.\n\n### Task 4: Data Transformation with DBT\n\n**DBT** will be used for transforming the data in the warehouse, ensuring it’s structured properly for object detection models. The transformations will include:\n- Cleaning and organizing metadata.\n- Generating datasets optimized for model input.\n\n### Task 5: API Development\n\nAn API will be developed to expose the processed data and object detection results for real-time insights. The API will be built using **Flask** or **FastAPI** and will provide endpoints for querying detection results and metadata.\n\n### Python Libraries\n\nThe main Python libraries required are listed below. 
Install them using `pip`:\n\n```bash\npip install telethon dbt-core dbt-postgres opencv-python torch torchvision fastapi uvicorn pydantic sqlalchemy\n```\n\n## Project Structure\n\nHere’s a high-level overview of the project’s structure:\n\n```bash\n├── app/\n│   ├── templates/\n│   ├── crud.py\n│   ├── database.py\n│   ├── main.py\n│   ├── models.py\n│   ├── schemas.py\n│   ├── telegram_scraper.py\n│   └── yolo_object_detection.py\n├── data/\n├── images/\n├── logs/\n├── dbt_medical_data/\n│   ├── analyses/\n│   ├── macros/\n│   ├── models/\n│   ├── seeds/\n│   ├── snapshots/\n│   └── tests/\n├── notebooks/\n│   ├── telegram_scraper.py\n│   ├── utils.py\n│   └── raw_data/\n├── scripts/\n│   ├── __init__.py\n│   ├── main.py\n│   └── dbt_setup.py\n├── src/\n│   ├── telegram_scraper.py\n│   ├── utils.py\n│   └── raw_data/\n├── tests/\n│   ├── __init__.py\n│   └── test_data_loader.py\n├── yolov5/\n│   ├── models/\n│   ├── runs/\n│   ├── utils/\n│   ├── detect.py\n│   ├── export.py\n│   └── yolov5.pt\n├── .gitignore\n├── requirements.txt\n└── README.md\n```\n\n## Setup Instructions\n\n### 1. Data Scraping\n\n#### Description\nThe first step involves scraping textual and image data from public Telegram channels that focus on Ethiopian medical businesses. 
The data is collected using Python scripts and the **Telethon** library, which interfaces with Telegram's API.\n\n#### Telegram Channels Scraped\n- [DoctorsET](https://t.me/DoctorsET)\n- [Chemed Telegram Channel](https://t.me/lobelia4cosmetics)\n- [Yetenaweg](https://t.me/yetenaweg)\n- [EAHCI](https://t.me/EAHCI)\n- Additional channels from [https://et.tgstat.com/medicine](https://et.tgstat.com/medicine)\n\n#### Setup and Execution\n\n1. **Install Dependencies**:\n   ```bash\n   pip install telethon\n   ```\n\n2. **Run the Scraper**:\n   Before running, make sure to create a `.env` file with your Telegram API credentials (API ID, API hash, and phone number).\n   \n   Example `.env` file:\n   ```plaintext\n   API_ID=your_api_id\n   API_HASH=your_api_hash\n   PHONE=your_phone_number\n   ```\n\n   Execute the script:\n   ```bash\n   python src/message_scraper.py\n   ```\n\n3. **Output**:\n   - Text data and metadata will be saved in a local database.\n   - Image files will be stored in the `images/` directory.\n\n### 2. Data Cleaning and Transformation\n\n#### Description\nAfter scraping, the raw data is cleaned and transformed using **DBT** (Data Build Tool). This process involves removing duplicates, handling missing values, and standardizing formats for easy querying and analysis.\n\n#### Setup and Execution\n\n1. **Install DBT**:\n   Install dbt Core with the PostgreSQL adapter and initialize a new DBT project:\n   ```bash\n   pip install dbt-core dbt-postgres\n   dbt init dbt_medical_data\n   ```\n\n2. **Define DBT Models**:\n   - Define SQL models in the `dbt_medical_data/models/` directory for cleaning and transforming data.\n   - Sample DBT model file:\n     ```sql\n     -- models/cleaned_telegram_data.sql\n     select\n         distinct message_id,\n         message_text,\n         timestamp::timestamp as message_time,\n         channel_name\n     from raw_data\n     where message_text is not null\n     ```\n\n3. 
**Run DBT Models**:\n   Apply the transformations by running the DBT models:\n   ```bash\n   dbt run\n   ```\n\n4. **Testing**:\n   Test data quality using DBT's built-in test features:\n   ```bash\n   dbt test\n   ```\n\n### 3. Object Detection using YOLO\n\n#### Description\nIn this task, we perform **object detection** on the scraped images using **YOLOv5** to detect medical equipment, promotional materials, and other objects related to Ethiopian medical businesses.\n\n#### Setup and Execution\n\n1. **Install YOLO Dependencies**:\n   Install PyTorch and YOLOv5:\n   ```bash\n   pip install torch torchvision\n   git clone https://github.com/ultralytics/yolov5.git\n   cd yolov5\n   pip install -r requirements.txt\n   ```\n\n2. **Prepare Images**:\n   Make sure the scraped images are in the `images/` directory, which the detection script reads as its input.\n\n3. **Run YOLO**:\n   Run the YOLOv5 detection script from the `yolov5/` directory, pointing `--source` at the scraped images:\n   ```bash\n   cd yolov5\n   python detect.py --source ../images\n   ```\n\n4. **Store Detection Results**:\n   The detection results (bounding boxes, class labels, and confidence scores) will be saved in a structured format, which will later be loaded into the data warehouse.\n\n### 4. Data Warehouse Design and Implementation\n\n#### Description\nThe data warehouse stores all the cleaned, transformed, and enriched data, enabling efficient querying and analysis. The data includes textual Telegram posts, image metadata, and YOLO object detection results.\n\n#### Setup and Execution\n\n1. **Install PostgreSQL**:\n   Install and configure PostgreSQL, or alternatively, use SQLite for local testing.\n\n2. 
**Database Models**:\n   Define your database schema in `app/models.py` using SQLAlchemy:\n   ```python\n   from sqlalchemy import Column, Float, ForeignKey, Integer, String\n   from sqlalchemy.orm import declarative_base, relationship\n\n   # Shared declarative base for all warehouse tables\n   Base = declarative_base()\n\n   class ImageMetadata(Base):\n       __tablename__ = 'image_metadata'\n       id = Column(Integer, primary_key=True)\n       image_path = Column(String, nullable=False)\n       channel_name = Column(String, nullable=False)\n       timestamp = Column(String, nullable=False)\n\n       # Reverse side of ObjectDetection.image\n       detections = relationship(\"ObjectDetection\", back_populates=\"image\")\n\n   class ObjectDetection(Base):\n       __tablename__ = 'object_detection'\n       id = Column(Integer, primary_key=True)\n       image_id = Column(Integer, ForeignKey('image_metadata.id'))\n       bounding_box = Column(String, nullable=False)\n       confidence = Column(Float, nullable=False)\n       class_label = Column(String, nullable=False)\n\n       image = relationship(\"ImageMetadata\", back_populates=\"detections\")\n   ```\n\n3. **Migrate Database**:\n   Initialize and migrate the database to create the tables:\n   ```bash\n   python app/database.py\n   ```\n\n### 5. FastAPI for Data Access\n\n#### Description\nTo expose the processed data via an API, **FastAPI** is used to create RESTful endpoints. These endpoints allow users to query the data warehouse for images, detections, and associated metadata.\n\n#### Setup and Execution\n\n1. **Install FastAPI**:\n   ```bash\n   pip install fastapi uvicorn\n   ```\n\n2. **Create FastAPI Application**:\n   - Define routes in `app/main.py`:\n     \n```python\n     from fastapi import FastAPI, Depends\n     from sqlalchemy.orm import Session\n     from .crud import get_detections\n     from .database import SessionLocal\n\n     app = FastAPI()\n\n     def get_db():\n         # Open one session per request and always close it afterwards\n         db = SessionLocal()\n         try:\n             yield db\n         finally:\n             db.close()\n\n     @app.get(\"/detections/{image_id}\")\n     def read_detections(image_id: int, db: Session = Depends(get_db)):\n         # Return every detection stored for the given image\n         return get_detections(db, image_id=image_id)\n```\n\n3. 
**Run FastAPI**:\n   Start the FastAPI server:\n   ```bash\n   uvicorn app.main:app --reload\n   ```\n\n4. **Access the API**:\n   Visit `http://127.0.0.1:8000/docs` to explore the automatically generated API documentation.\n\n## Future Improvements\n\n1. **Data Enrichment**: Add more sources of data, such as public medical directories or customer reviews, to provide a richer dataset.\n2. **Machine Learning Models**: Build predictive models to analyze trends in medical products or promotional effectiveness.\n3. **Fine-tune YOLO**: Train the YOLO model on specific Ethiopian medical products and packaging to improve detection accuracy.\n\n\nBy following these steps, you can set up a fully operational data pipeline for scraping, cleaning, transforming, analyzing, and querying data on Ethiopian medical businesses.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftedoaba%2Fkaim-w7","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftedoaba%2Fkaim-w7","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftedoaba%2Fkaim-w7/lists"}