https://github.com/huggon1/multimodal-tagging-system

A public-safe multimodal tagging system combining OCR, captioning, and retrieval.

# multimodal-tagging-system

A cleaned-up public version of a graduation-project codebase for multimodal tag generation.

## Highlights

- Combines OCR, captioning, and embedding recall in one tagging flow
- Preserves a compact research-style implementation with a runnable demo UI
- Uses environment-based configuration instead of local hardcoded paths
- Ships with a reduced sample tag set suitable for public release

## What It Does

Given a post title, post content, and one or more images, the system combines:

- OCR over the images
- image captioning
- embedding-based tag recall

to produce a shortlist of recommended tags.
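As a rough illustration of how these three signals can be merged, the sketch below joins the title, content, OCR text, and captions into one query and ranks candidate tags by similarity. It is not the repository's implementation: `embed`, `cosine`, and `recommend_tags` are hypothetical names, and a toy bag-of-words vector stands in for the real embedding model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for the embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_tags(title, content, ocr_texts, captions, tags, top_k=3):
    """Merge all text signals into one query and rank tags by similarity."""
    query = embed(" ".join([title, content, *ocr_texts, *captions]))
    scored = [(cosine(query, embed(tag)), tag) for tag in tags]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only tags that share at least some signal with the query.
    return [tag for score, tag in scored[:top_k] if score > 0]


# Hypothetical inputs: OCR text and captions would come from the real services.
shortlist = recommend_tags(
    title="Weekend hiking trip",
    content="We climbed the mountain at sunrise.",
    ocr_texts=["Summit trail 2 km"],
    captions=["two people hiking on a mountain path"],
    tags=["street food", "mountain hiking", "cat photos"],
    top_k=2,
)
print(shortlist)
```

In the real system the query and tag vectors would come from the local embedding model rather than word counts, but the recall step (embed, score, keep the top few) has the same shape.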

## Repository Layout

```text
multimodal-tagging-system/
  app/
    config.py
    ocr_service.py
    prompt_templates.py
    schemas.py
    tag_service.py
  ui/
    gradio_app.py
  training/
    train_triplet_embedding.py
  data/
    tags_new.txt
  assets/
    template.html
```

## Requirements

Install the base dependencies with:

```bash
pip install -r requirements.txt
```

The runtime still expects local caption and embedding model weights, which are intentionally not committed.

## Public Cleanup

This repository intentionally excludes:

- local model weights
- private database settings
- generated HTML reports
- local experiment caches
- old prototype folders
- the original full tag vocabulary; `data/tags_new.txt` now contains a compact sample tag set for demo use

## Services

### OCR service

```bash
uvicorn app.ocr_service:app --host 0.0.0.0 --port 8001
```

### Tag service

Before starting, point the environment variables at your local caption and embedding models if needed (Windows `cmd` syntax shown):

```bat
set MMTAG_CAPTION_MODEL_PATH=E:\path\to\caption-model
set MMTAG_EMBED_MODEL_PATH=E:\path\to\embedding-model
set MMTAG_DEVICE=cuda
```

Then run:

```bash
uvicorn app.tag_service:app --host 0.0.0.0 --port 8002
```
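For orientation, environment-based configuration of this kind could look roughly like the sketch below. The three variable names come from the commands above; the `Settings` class, the `load_settings` helper, and all default values are assumptions for illustration and may not match `app/config.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    caption_model_path: str
    embed_model_path: str
    device: str


def load_settings(env=None) -> Settings:
    """Read the MMTAG_* variables, falling back to placeholder defaults."""
    env = os.environ if env is None else env
    return Settings(
        # All three fallback values below are assumed, not taken from the repo.
        caption_model_path=env.get("MMTAG_CAPTION_MODEL_PATH", "models/caption"),
        embed_model_path=env.get("MMTAG_EMBED_MODEL_PATH", "models/embedding"),
        device=env.get("MMTAG_DEVICE", "cpu"),
    )
```

Passing a plain dictionary as `env` makes the loader easy to exercise without touching the real process environment, for example `load_settings({"MMTAG_DEVICE": "cuda"})`.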

### Demo UI

```bash
python ui/gradio_app.py
```

The Gradio demo expects the OCR and tag services to be running already, so start both services before launching it.
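A small pre-flight check can confirm that both services are accepting connections before the UI is launched. The sketch below is not part of the repository: `wait_for_port` and `check_services` are hypothetical helpers, the ports are the ones used in the `uvicorn` commands above, and the host is assumed to be the local machine.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Service not reachable yet; retry shortly.
            time.sleep(0.5)
    return False


def check_services(host: str = "127.0.0.1", timeout: float = 5.0) -> dict:
    """Check the OCR (8001) and tag (8002) ports; host is an assumption."""
    ports = {"ocr": 8001, "tag": 8002}
    return {name: wait_for_port(host, port, timeout) for name, port in ports.items()}
```

Calling `check_services()` returns a dictionary such as `{"ocr": True, "tag": False}`, which makes it clear which process still needs to be started. It only tests that the port is open, not that the service responds correctly.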

## Training

`training/train_triplet_embedding.py` preserves the original triplet-loss tuning script for the embedding model. It expects a local `output.json` training file and local base model weights.

Example:

```bat
python training/train_triplet_embedding.py ^
  --data training\output.json ^
  --base-model E:\path\to\embedding-model
```
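For readers unfamiliar with triplet loss, the standard margin form is `max(0, d(anchor, positive) - d(anchor, negative) + margin)`. The sketch below shows that formula in plain Python with Euclidean distance. The distance function, margin value, and function names are assumptions; the training script may use a different distance or margin.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Negative is already far from the anchor: 1 - 5 + 1 < 0, so no penalty.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Negative is only slightly farther than the positive: 1 - 1.5 + 1 = 0.5.
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])

print(easy, hard)
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, so training effort concentrates on triplets that are still too close together.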

## Notes

- The code is preserved as a compact research-style implementation rather than a production package.
- Some original Chinese prompt text had encoding noise in the source folders, so this public version uses cleaned prompt templates for readability.
- The Gradio demo calls the OCR and tag services over HTTP, so start those services first.