https://github.com/huggon1/multimodal-tagging-system

A public-safe multimodal tagging system combining OCR, captioning, and retrieval.

fastapi gradio information-retrieval multimodal ocr


Host: GitHub
URL: https://github.com/huggon1/multimodal-tagging-system
Owner: huggon1
Created: 2026-03-14T13:49:54.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-14T14:00:41.000Z (4 months ago)
Last Synced: 2026-03-15T09:13:25.704Z (4 months ago)
Topics: fastapi, gradio, information-retrieval, multimodal, ocr
Language: Python
Size: 12.7 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# multimodal-tagging-system

A cleaned-up public version of a graduation-project codebase for multimodal tag generation.

## Highlights

- Combines OCR, captioning, and embedding recall in one tagging flow

- Preserves a compact research-style implementation with a runnable demo UI

- Uses environment-based configuration instead of local hardcoded paths

- Ships with a reduced sample tag set suitable for public release

## What It Does

Given a post title, post content, and one or more images, the system combines:

- OCR over the images

- image captioning

- embedding-based tag recall

to produce a shortlist of recommended tags.
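The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the repository's code: `embed`, `cosine`, and `recommend_tags` are hypothetical names, the bag-of-words vector is a toy stand-in for the real embedding model, and the sample tags are invented.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend_tags(title, content, ocr_texts, captions, tags, top_k=3):
    """Merge all text evidence into one query and rank candidate tags against it."""
    query_vec = embed(" ".join([title, content, *ocr_texts, *captions]))
    scored = [(cosine(query_vec, embed(tag)), tag) for tag in tags]
    # Highest similarity first; tags with no overlap at all are dropped.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tag for score, tag in scored[:top_k] if score > 0]


shortlist = recommend_tags(
    title="Weekend hiking trip",
    content="Photos from the mountain trail",
    ocr_texts=["Trail map 5 km"],                     # hypothetical OCR output
    captions=["a person hiking on a mountain path"],  # hypothetical caption
    tags=["hiking", "mountain", "cooking", "trail map"],
)
print(shortlist)
```

In the real system the OCR text and captions would come from the OCR service and the caption model, and the similarity would be computed over dense embeddings rather than word counts.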

## Repository Layout

```text
multimodal-tagging-system/
  app/
    config.py
    ocr_service.py
    prompt_templates.py
    schemas.py
    tag_service.py
  ui/
    gradio_app.py
  training/
    train_triplet_embedding.py
  data/
    tags_new.txt
  assets/
    template.html
```

## Requirements

Install the base dependencies with:

```bash
pip install -r requirements.txt
```

The runtime still expects local caption and embedding model weights, which are intentionally not committed.

## Public Cleanup

This repository intentionally excludes:

- local model weights

- private database settings

- generated HTML reports

- local experiment caches

- old prototype folders

- the original full tag vocabulary; `data/tags_new.txt` now contains a compact sample tag set for demo use

## Services

### OCR service

```bash
uvicorn app.ocr_service:app --host 0.0.0.0 --port 8001
```

### Tag service

Before starting, set the environment variables below to point at your local caption and embedding models and to select the device, if needed. The commands use Windows `cmd` syntax; in a POSIX shell, use `export NAME=value` instead:

```bat
set MMTAG_CAPTION_MODEL_PATH=E:\path\to\caption-model
set MMTAG_EMBED_MODEL_PATH=E:\path\to\embedding-model
set MMTAG_DEVICE=cuda
```

Then run:

```bash
uvicorn app.tag_service:app --host 0.0.0.0 --port 8002
```
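
For readers wondering how environment-based configuration of this kind typically looks on the Python side, here is a minimal sketch. Only the `MMTAG_*` variable names come from this README; the `Settings` class, the `load_settings` helper, and the default values are assumptions for illustration and may not match `app/config.py`.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    caption_model_path: str
    embed_model_path: str
    device: str


def load_settings(env=os.environ) -> Settings:
    """Build settings from MMTAG_* variables; the fallback values are placeholders."""
    return Settings(
        caption_model_path=env.get("MMTAG_CAPTION_MODEL_PATH", "models/caption"),
        embed_model_path=env.get("MMTAG_EMBED_MODEL_PATH", "models/embedding"),
        device=env.get("MMTAG_DEVICE", "cpu"),
    )


# Passing a plain dict instead of os.environ makes the behaviour easy to check.
settings = load_settings({"MMTAG_DEVICE": "cuda"})
print(settings)
```

Reading every path from the environment keeps machine-specific locations out of the committed code, which is the point of the cleanup described above.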

### Demo UI

```bash
python ui/gradio_app.py
```

The Gradio demo expects the OCR and tag services to already be running, so start those two service processes first.
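
A quick way to confirm that both services are up before launching the UI is to probe their ports. The sketch below is not part of the repository; the helper names are hypothetical, and it assumes the services run on the local machine on the ports used in the `uvicorn` commands above (8001 and 8002).

```python
import socket


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_services(services: dict, host: str = "127.0.0.1") -> list:
    """Return the names of services whose port is not accepting connections."""
    return [name for name, port in services.items() if not is_listening(host, port)]


if __name__ == "__main__":
    # Ports taken from the uvicorn commands above; the host is an assumption.
    down = missing_services({"ocr": 8001, "tag": 8002})
    print("not reachable:", down or "none")
```

This only checks that something accepts TCP connections on each port; it does not verify that the service behind it is healthy.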

## Training

`training/train_triplet_embedding.py` preserves the original triplet-loss tuning script for the embedding model. It expects a local `output.json` training file and local base model weights.

Example (Windows `cmd` syntax; `^` continues the command on the next line):

```bat
python training/train_triplet_embedding.py ^
  --data training\output.json ^
  --base-model E:\path\to\embedding-model
```
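
For context, the standard triplet margin loss that such a script typically optimizes can be written in a few lines. This is a generic sketch of the technique, not the repository's implementation: the distance metric, the margin value, and the format of `output.json` used by the actual script are not documented here, so Euclidean distance and a margin of 1.0 are assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]  # distance 1 from the anchor
negative = [3.0, 4.0]  # distance 5 from the anchor

# The negative is already more than `margin` farther away than the positive.
print(triplet_loss(anchor, positive, negative))  # 0.0
# Swapping the roles violates the margin: 5 - 1 + 1 = 5.
print(triplet_loss(anchor, negative, positive))  # 5.0
```

Training pushes the embedding of a post closer to its matching tag (the positive) than to a non-matching tag (the negative) by at least the margin.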

## Notes

- The code is preserved as a compact research-style implementation rather than a production package.

- Some original Chinese prompt text had encoding noise in the source folders, so this public version uses cleaned prompt templates for readability.

- The Gradio demo calls the OCR and tag services over HTTP, so start those services first.
