https://github.com/huggon1/multimodal-tagging-system
A public-safe multimodal tagging system combining OCR, captioning, and retrieval.
fastapi gradio information-retrieval multimodal ocr
- Host: GitHub
- URL: https://github.com/huggon1/multimodal-tagging-system
- Owner: huggon1
- Created: 2026-03-14T13:49:54.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-03-14T14:00:41.000Z (4 months ago)
- Last Synced: 2026-03-15T09:13:25.704Z (4 months ago)
- Topics: fastapi, gradio, information-retrieval, multimodal, ocr
- Language: Python
- Size: 12.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# multimodal-tagging-system
A cleaned-up public version of a graduation-project codebase for multimodal tag generation.
## Highlights
- Combines OCR, captioning, and embedding recall in one tagging flow
- Preserves a compact research-style implementation with a runnable demo UI
- Uses environment-based configuration instead of local hardcoded paths
- Ships with a reduced sample tag set suitable for public release
## What It Does
Given a post title, post content, and one or more images, the system combines:
- OCR over the images
- image captioning
- embedding-based tag recall
to produce a shortlist of recommended tags.
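The flow above can be sketched in a few lines. This is only an illustration of embedding-based recall over merged text signals, not the repository's implementation: `embed` is a toy character-bigram stand-in for the real embedding model, and `recommend_tags` and its parameters are hypothetical names. OCR and captioning are assumed to have already produced plain strings.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: character-bigram counts (assumption)."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_tags(title, content, ocr_texts, captions, tag_vocab, top_k=3):
    """Merge all text signals into one query and return the top_k closest tags."""
    query = " ".join([title, content, *ocr_texts, *captions])
    query_vec = embed(query)
    ranked = sorted(tag_vocab, key=lambda tag: cosine(query_vec, embed(tag)), reverse=True)
    return ranked[:top_k]


# Hypothetical inputs: OCR output and captions are plain strings here.
shortlist = recommend_tags(
    title="Weekend hiking trip",
    content="Photos from the mountain trail",
    ocr_texts=["Trail map"],
    captions=["a person hiking on a mountain path"],
    tag_vocab=["hiking", "cooking", "mountain", "programming"],
    top_k=2,
)
print(shortlist)
```

In the real system the embedding model would replace `embed`, and the tag vocabulary would presumably come from `data/tags_new.txt`.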
## Repository Layout
```text
multimodal-tagging-system/
  app/
    config.py
    ocr_service.py
    prompt_templates.py
    schemas.py
    tag_service.py
  ui/
    gradio_app.py
  training/
    train_triplet_embedding.py
  data/
    tags_new.txt
  assets/
    template.html
```
## Requirements
Install the base dependencies with:
```bash
pip install -r requirements.txt
```
The runtime still expects local caption and embedding model weights, which are intentionally not committed.
## Public Cleanup
This repository intentionally excludes:
- local model weights
- private database settings
- generated HTML reports
- local experiment caches
- old prototype folders
- the original full tag vocabulary; `data/tags_new.txt` now contains a compact sample tag set for demo use
## Services
### OCR service
```bash
uvicorn app.ocr_service:app --host 0.0.0.0 --port 8001
```
### Tag service
Before starting, set the environment variables to point at your local caption and embedding models if needed (Windows Command Prompt syntax shown):
```bat
set MMTAG_CAPTION_MODEL_PATH=E:\path\to\caption-model
set MMTAG_EMBED_MODEL_PATH=E:\path\to\embedding-model
set MMTAG_DEVICE=cuda
```
Then run:
```bash
uvicorn app.tag_service:app --host 0.0.0.0 --port 8002
```
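As a rough illustration of environment-based configuration, the three `MMTAG_*` variables could be read as below. This is a sketch, not the contents of `app/config.py`: the `Settings` class, the `load_settings` helper, and all default values are assumptions made for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    caption_model_path: str
    embed_model_path: str
    device: str


def load_settings(env=None) -> Settings:
    """Read MMTAG_* variables, falling back to placeholder defaults (assumptions)."""
    env = os.environ if env is None else env
    return Settings(
        caption_model_path=env.get("MMTAG_CAPTION_MODEL_PATH", "models/caption"),
        embed_model_path=env.get("MMTAG_EMBED_MODEL_PATH", "models/embedding"),
        device=env.get("MMTAG_DEVICE", "cpu"),
    )


# Passing a dict instead of os.environ keeps the example deterministic.
settings = load_settings({"MMTAG_DEVICE": "cuda"})
print(settings)
```

Keeping paths out of the code this way means the same checkout can run on machines with different model locations.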
### Demo UI
```bash
python ui/gradio_app.py
```
The Gradio demo expects the OCR and tag services to already be running, so start those two service processes first.
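Since the demo depends on both services, a quick pre-flight check can save a confusing startup error. The sketch below only tests whether something is listening on the ports used in the commands above (8001 and 8002). It is not part of the repository; `service_is_up` and `missing_services` are hypothetical helper names.

```python
import socket


def service_is_up(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_services(services: dict, host: str = "127.0.0.1") -> list:
    """Return the names of services whose port is not accepting connections."""
    return [name for name, port in services.items() if not service_is_up(host, port)]


# Example usage, with ports taken from the uvicorn commands in this README:
# down = missing_services({"OCR service": 8001, "tag service": 8002})
# if down:
#     print("Start these first:", ", ".join(down))
```

A successful TCP connection shows only that the port is open, not that the service is healthy. A real check would call an actual endpoint of each service.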
## Training
`training/train_triplet_embedding.py` preserves the original triplet-loss tuning script for the embedding model. It expects a local `output.json` training file and local base model weights.
Example (Windows Command Prompt, where `^` continues the line):
```bat
python training/train_triplet_embedding.py ^
--data training\output.json ^
--base-model E:\path\to\embedding-model
```
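For readers unfamiliar with triplet loss, the standard formulation is `max(0, d(anchor, positive) - d(anchor, negative) + margin)`. The sketch below computes it with Euclidean distance on plain lists. The distance function and margin used by the actual training script are not documented here, so both are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss; the margin value is an illustrative assumption."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# The positive is already closer than the negative by more than the margin: loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
# The negative is closer than the positive: the loss is positive.
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
print(easy, hard)
```

Minimizing this loss pulls each anchor toward its positive example and pushes it away from its negative, which is what makes the tuned embeddings useful for tag recall.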
## Notes
- The code is preserved as a compact research-style implementation rather than a production package.
- Some original Chinese prompt text had encoding noise in the source folders, so this public version uses cleaned prompt templates for readability.
- The Gradio demo calls the OCR and tag services over HTTP, so start those services first.