{"id":50563442,"url":"https://github.com/Damkohler/CaptionForge","last_synced_at":"2026-06-21T08:00:42.913Z","repository":{"id":359807327,"uuid":"1247572469","full_name":"Damkohler/CaptionForge","owner":"Damkohler","description":"CaptionForge creates stronger local dataset captions by combining multiple image-caption witnesses, distilling their agreements and contradictions, validating the result with a VLM, and exporting auditable LoRA-ready captions.","archived":false,"fork":false,"pushed_at":"2026-06-20T22:08:59.000Z","size":2329,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-20T23:17:16.561Z","etag":null,"topics":["audit-trail","captioning","captioning-images","comfyui","comfyui-custom-nodes","dataset-captions","dataset-preparation","image-captioning","joy-caption","joycaption","jsonl","local-ai","lora","lora-training","multimodal-ai","ollama","qwen","qwen2-5","vision-language-models","vlm-validation"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Damkohler.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-23T13:50:43.000Z","updated_at":"2026-06-20T22:09:03.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Damkohler/CaptionForge","commit_stats":null,"previous_names":["damkohler/captionforge"],"tags_count":0,"template":false,"template_full_n
ame":null,"purl":"pkg:github/Damkohler/CaptionForge","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Damkohler%2FCaptionForge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Damkohler%2FCaptionForge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Damkohler%2FCaptionForge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Damkohler%2FCaptionForge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Damkohler","download_url":"https://codeload.github.com/Damkohler/CaptionForge/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Damkohler%2FCaptionForge/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34601662,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-21T02:00:05.568Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audit-trail","captioning","captioning-images","comfyui","comfyui-custom-nodes","dataset-captions","dataset-preparation","image-captioning","joy-caption","joycaption","jsonl","local-ai","lora","lora-training","multimodal-ai","ollama","qwen","qwen2-5","vision-language-models","vlm-validation"],"created_at":"2026-06-04T13:00:25.178Z","updated_at":"2026-06-21T08:00:42.908Z","avatar_url":"https://g
ithub.com/Damkohler.png","language":"Python","funding_links":[],"categories":["Workflows pushed in 7 days"],"sub_categories":[],"readme":"# CaptionForge\n\n**Accurate, auditable image captions for LoRA dataset preparation in ComfyUI.**\n\nCaptionForge is built around a simple idea: one captioner can be useful, but one captioner is also easy to fool. Instead of asking a single model to describe an image and hoping it gets everything right, CaptionForge can ask multiple independent captioning engines to produce separate “witness accounts” of the same image. Those accounts are then merged by a text-LLM distillation pass that looks for agreement, preserves useful details, and separates likely contradictions or unsupported claims. The resulting draft is checked against the image by a final vision-language model, which acts as the image-aware judge before the final captions are exported.\n\nThe goal is not magic, and it is not perfection. CaptionForge is meant for automated captioning of large image archives and LoRA training sets where hand-captioning would be too slow, but where the usual hallucinations, omissions, and inconsistencies from a single captioning model are still a problem. 
The pipeline is intentionally heavier than a normal caption node, so it is best used when caption quality, auditability, and consistency matter enough to justify the extra computation.\n\nThe current v0.1.x workflow is tuned primarily for character, fashion, portrait, doll/render, cosplay, pageant, glamour, and style-LoRA datasets, where visible details such as face, hair, eyes, expression, pose, body shape, clothing construction, accessories, colors, materials, lighting, background, framing, and visual style matter.\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/icons/jlc-comfyui-nodes_Logo-0512.png\" width=\"120\"\u003e\n  \u0026nbsp;\u0026nbsp;\u0026nbsp;\n  \u003cimg src=\"assets/icons/jlc-comfyui-nodes_Logo-Dark-0512.png\" width=\"120\"\u003e\n\u003c/p\u003e\n\n[![ComfyUI](https://img.shields.io/badge/ComfyUI-Custom%20Nodes-blue)]()\n[![License](https://img.shields.io/badge/license-MIT-green)]()\n![Status](https://img.shields.io/badge/status-v0.1.x%20preview-orange)\n![Version](https://img.shields.io/badge/version-0.1.0-orange)\n\n## Starter workflow\n\nA full workflow sample is included as a PNG with embedded ComfyUI workflow metadata:\n\n```text\nassets/workflows/CaptionForge_FullWorkflow.png\n```\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"assets/workflows/CaptionForge_FullWorkflow.png\"\u003eDownload workflow PNG\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"assets/workflows/CaptionForge_FullWorkflow.png\" alt=\"CaptionForge full starter workflow\" width=\"900\"\u003e\n\u003c/p\u003e\n\nA separate JSON export of the same workflow is also included:\n\n```text\nassets/workflows/CaptionForge_FullWorkflow.json\n```\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"assets/workflows/CaptionForge_FullWorkflow.json\"\u003eDownload workflow JSON\u003c/a\u003e\n\u003c/p\u003e\n\nIn ComfyUI, load the workflow by dragging either `CaptionForge_FullWorkflow.png` or `CaptionForge_FullWorkflow.json` onto 
the canvas.\n\n## Install\n\nClone CaptionForge into your ComfyUI custom nodes folder:\n\n```bash\ngit clone https://github.com/Damkohler/CaptionForge.git ComfyUI/custom_nodes/CaptionForge\n```\n\nOr copy the repository manually so the folder layout is:\n\n```text\nComfyUI/custom_nodes/CaptionForge/\n```\n\nThen restart ComfyUI.\n\nIf your ComfyUI environment does not already include the needed Python packages, install CaptionForge dependencies from inside your ComfyUI Python environment. The exact command depends on how your ComfyUI install is managed, but typical options are:\n\n```bash\ncd ComfyUI/custom_nodes/CaptionForge\npip install -e .\n```\n\nor, if you maintain dependencies manually:\n\n```bash\npip install torch transformers accelerate huggingface-hub pillow numpy safetensors qwen-vl-utils\n```\n\nOptional 8-bit loading may require:\n\n```bash\npip install bitsandbytes\n```\n\nOllama-backed stages require a working local Ollama installation and installed Ollama model tags.\n\nExample:\n\n```bash\nollama pull mistral-small:24b\nollama pull gemma4:26b\n```\n\nCaptionForge does **not** ship model weights. 
Joy, Qwen, and Ollama model downloads remain user-controlled.\n\n## What the workflow does\n\nCaptionForge's main pipeline is:\n\n```text\nPass A — raw witness captions\n  Joy Caption xN\n  Qwen Caption xN\n  optional Ollama VLM Caption xN\n\nPass B — text-LLM distillation\n  combine witness captions\n  preserve repeated and useful details\n  separate contradictions and weak claims\n  build a rich draft caption\n\nPass C — image-aware VLM validation\n  inspect the actual image\n  keep image-supported details\n  remove unsupported hallucinations\n  correct visible errors\n  produce the authoritative long caption\n\nPass D — deterministic export formatting\n  write the validated long caption\n  derive a shorter LoRA-length caption\n  derive a compact taggy caption\n  write TXT and JSONL audit records\n```\n\nThe important distinction is that the expensive semantic work should mostly end at the VLM-validated long caption. The short and taggy outputs are intentionally lighter recipe-style formatting steps derived from that validated caption, not new attempts to reinterpret the image.\n\n## Current status\n\nCaptionForge v0.1.0 is a working experimental preview for ComfyUI users and node developers who want to test a multi-pass captioning pipeline.\n\nIt is not presented as a universal replacement for a strong standalone captioner. If JoyCaption, Qwen, Florence, BLIP, WD14, or another captioning tool already gives you exactly what your dataset needs, you may not need CaptionForge. This project is aimed at cases where a single captioner is not accurate, complete, consistent, or auditable enough.\n\nExpected v0.1.x realities:\n\n- the workflow is computationally heavy\n- large models may be slow\n- model choices matter a lot\n- output schemas may still evolve\n- prompts and defaults may continue to be refined\n- not every dataset will benefit equally\n- comparison feedback is welcome\n\nThis is a heavy tool. 
Use it for large automated jobs where the extra caption quality and audit trail are worth the runtime cost.\n\n## Why use this instead of a standalone captioner?\n\nYou may want CaptionForge when:\n\n- one captioner notices the face but misses clothing details\n- another captioner notices clothing but misreads the pose\n- a third captioner catches style or material details the others miss\n- you want an LLM to consolidate agreement instead of merely accepting one model's wording\n- you want a final VLM to check the draft against the actual image\n- you want intermediate JSONL records for debugging and audit\n- you want final captions written as sidecars beside the source images\n- you need both long natural captions and compact LoRA-style derivatives\n\nThe project question is practical:\n\n\u003e Can independent caption witnesses plus text distillation plus image-aware validation produce better dataset captions than a single captioning model alone?\n\nFor some datasets, the answer may be yes. For others, a simpler captioner may be enough. 
CaptionForge is designed to make that comparison visible.\n\n## What CaptionForge tries to optimize\n\nCaptionForge currently favors captions that are:\n\n- rich enough for LoRA training\n- visually grounded\n- less hallucinated than unvalidated text-only synthesis\n- explicit about visible, trainable details\n- auditable through JSONL records\n- locally runnable\n- model-agnostic enough to improve as better captioners, distillers, and validators become available\n\nUseful caption details often include:\n\n- subject type and visible style\n- face shape and facial traits\n- hair color and hairstyle\n- eye color and makeup as separate details\n- expression and pose\n- hands and body position\n- body shape and visible proportions when relevant\n- clothing construction, layers, fit, and materials\n- accessories, jewelry, nails, props, and distinctive details\n- colors, textures, lighting, background, framing, and crop\n\nVisible glamour, swimwear, lingerie, revealing clothing, cleavage, side openings, exposed midriff, or similar styling may be described neutrally when it is actually visible and relevant to the dataset. 
CaptionForge prompts should not invent hidden anatomy, unseen clothing, explicit acts, or contradicted details.\n\n## Active node families\n\nNode categories are being normalized under:\n\n```text\nCaptioning/CaptionForge\n```\n\nwith active caption nodes under:\n\n```text\nCaptioning/CaptionForge/Caption Nodes\n```\n\n### JLC CaptionForge Pipeline Planner\n\nThe central planning node for normal runs.\n\nIt coordinates:\n\n- input image path or direct image passthrough\n- recursive folder traversal\n- filename glob filtering\n- output directory\n- run name\n- overwrite behavior\n- Pass A witness run counts\n- seed schedules\n- sampling schedules\n- max image size\n- max token budget\n- LoRA trigger word\n- user caption anchor\n- distiller settings\n- validator settings\n- final export settings\n- derived JSONL/TXT/config paths\n\n### JLC CaptionForge\n\nThe main capstone/orchestration node.\n\nIt consumes Pass A raw caption records, runs the distillation and validation stages, and exports final captions. The VLM-validated natural paragraph is the authoritative long caption. Formatting stages should not blindly rewrite that natural caption.\n\n### JLC CaptionForge Joy Caption\n\nPython/Hugging Face JoyCaption/LLaVA-family Pass A witness.\n\nJoy is treated as a first-class CaptionForge caption source and is often one of the strongest raw caption witnesses.\n\n### JLC CaptionForge Qwen Caption\n\nPython/Hugging Face Qwen-family Pass A witness.\n\nQwen is useful as a second independent captioning voice, especially when its behavior complements Joy. Optional 8-bit loading may be available where supported.\n\n### JLC CaptionForge Ollama Caption\n\nOllama-backed VLM Pass A witness.\n\nThis node delegates image-caption generation to a local Ollama server rather than loading Hugging Face/PyTorch weights inside ComfyUI. 
It can use configured Ollama VLM tags such as:\n\n```text\ngemma4:26b\nqwen3.6:35B-A3B\nhuihui_ai/gemma-4-abliterated:26b\n```\n\nIts purpose is to provide access to other raw-caption witness alternatives. Its function is parallel to the Joy Caption and Qwen Caption nodes, and it should not be confused with the later VLM validator/capstone role.\n\n### JLC CaptionForge Template Options\n\nShared prompt-option sidecar for caption nodes.\n\nTemplate Options let one sidecar node feed consistent LoRA-relevant prompt modifiers into Joy, Qwen, Ollama, and later caption witnesses without duplicating the same option widgets on every caption node.\n\n## Model and memory behavior\n\nCaptionForge uses two model ecosystems:\n\n1. **Python / Hugging Face model folders** for Joy and Qwen witness engines.\n2. **Ollama models** for text-LLM distillation, image-aware VLM validation, optional formatting, and Ollama-backed caption witnesses.\n\nJoy and Qwen use Python/Hugging Face engines that integrate with the CaptionForge process-local model cache. Those engines manage Python model residency, reuse, and eviction before loading heavyweight caption models.\n\nOllama-facing stages are different. Ollama models live in the Ollama daemon, not inside the CaptionForge Python model cache. Before handing work to Ollama, the Ollama Caption node and the CaptionForge capstone clear any resident CaptionForge Python/HF caption models if needed. 
After that handoff, Ollama owns Ollama model residency.\n\nIn short:\n\n```text\nJoy/Qwen engines:\n  manage Python-hosted caption models through captionforge_model_cache\n\nOllama Caption and CaptionForge capstone:\n  clear Python-hosted models before calling the Ollama daemon\n\nOllama daemon:\n  owns Ollama model loading and residency\n```\n\n## Model locations\n\nLarge model weights are intentionally not stored in this repository.\n\nPython-based witness models are expected under ComfyUI model folders, for example:\n\n```text\nComfyUI/models/LLM/JLC_JoyCaption/\nComfyUI/models/LLM/JLC_QwenCaption/\n```\n\nOllama models must be installed and runnable through Ollama outside this repository.\n\nCaptionForge does not require every supported backend to be installed for every workflow. Users can test smaller subsets first.\n\n## Ollama model dropdown configuration\n\nThe file:\n\n```text\nconfig/captionforge_ollama_models.json\n```\n\ndefines user-editable Ollama model tags for dropdowns used by distiller, validator, formatter, and Ollama caption-witness nodes.\n\nExample:\n\n```json\n{\n  \"distiller_models\": [\n    \"mistral-small:24b\",\n    \"VladimirGav/gemma4-26b-16GB-VRAM-Uncensored\",\n    \"deepseek-r1:32b\",\n    \"tarruda/neuraldaredevil-8b-abliterated:fp16\",\n    \"gpt-oss:20b\"\n  ],\n  \"validator_models\": [\n    \"gemma4:26b\",\n    \"qwen3.6:35B-A3B\",\n    \"huihui_ai/gemma-4-abliterated:26b\"\n  ],\n  \"format_models\": [\n    \"mistral-small:24b\",\n    \"VladimirGav/gemma4-26b-16GB-VRAM-Uncensored\",\n    \"gpt-oss:20b\",\n    \"deepseek-r1:32b\"\n  ],\n  \"caption_models\": [\n    \"gemma4:26b\",\n    \"qwen3.6:35B-A3B\",\n    \"huihui_ai/gemma-4-abliterated:26b\"\n  ],\n  \"defaults\": {\n    \"distiller_model\": \"mistral-small:24b\",\n    \"validator_model\": \"gemma4:26b\",\n    \"format_model\": \"mistral-small:24b\",\n    \"caption_model\": \"gemma4:26b\"\n  },\n  \"include_custom\": 
true\n}\n```\n\nTerminology:\n\n```text\ndistiller_model   text-only LLM for Pass B distillation\nvalidator_model   image-aware VLM for Pass C validation\nformat_model      text-only LLM for formatting/taggy conversion when used\ncaption_model     Ollama-backed Pass A image-caption witness model\n```\n\nValues should be concrete Ollama model tags used exactly as written.\n\n## Output layout\n\nCaptionForge writes auditable run artifacts and final sidecars during planned runs.\n\nA typical planned run uses this structure:\n\n```text\n\u003coutput_root\u003e/\n  opt_images/\n    comfy_image_0000.png\n    comfy_image_0000_long.txt\n    comfy_image_0000_short.txt\n    comfy_image_0000_taggy.txt\n    comfy_image_0001.png\n    comfy_image_0001_long.txt\n    comfy_image_0001_short.txt\n    comfy_image_0001_taggy.txt\n\n  \u003crun_name\u003e__working/\n    \u003crun_name\u003e__A_RAW_CAPTIONS.jsonl\n    \u003crun_name\u003e__B_DISTILL.jsonl\n    \u003crun_name\u003e__B_DISTILL_readable.jsonl\n    \u003crun_name\u003e__B_DISTILL_readable.json\n    \u003crun_name\u003e__B_DISTILL_prompts.jsonl\n    \u003crun_name\u003e__C_VLM_VALIDATED.jsonl\n    \u003crun_name\u003e__C_VLM_VALIDATED_readable/\n    \u003crun_name\u003e__C_VLM_VALIDATOR_prompts.jsonl\n    \u003crun_name\u003e__D_FINAL_EXPORT.jsonl\n    \u003crun_name\u003e__output_paths.json\n    \u003crun_name\u003e__run_config.json\n```\n\nFolder-input images keep their source locations, and final TXT sidecars are written beside those original images.\n\nOptional direct `IMAGE` inputs are copied into a visible output-root folder:\n\n```text\n\u003coutput_root\u003e/opt_images/\n```\n\nFinal caption sidecars are written beside the resolved source image. For folder-input images, that means beside the original image. 
For optional direct images, that means beside the saved optional image inside `opt_images/`.\n\nFinal sidecars currently include:\n\n```text\n\u003cimage_stem\u003e_long.txt\n\u003cimage_stem\u003e_short.txt\n\u003cimage_stem\u003e_taggy.txt\n```\n\nMeaning:\n\n```text\n_long.txt    the authoritative VLM-validated natural caption\n_short.txt   a shorter LoRA-length caption derived from the long caption\n_taggy.txt   a compact comma-separated taggy caption derived from the long caption\n```\n\nLong captions are intentional in v0.1.x. The current release-candidate strategy favors preserving visible, trainable detail in the validated long caption, then deriving shorter and taggy outputs from that result.\n\nExact JSONL schemas may evolve during the preview phase.\n\n## Dependencies\n\nPython dependencies are declared in `pyproject.toml` where applicable.\n\nTypical local use may involve:\n\n```text\ntorch\ntransformers\naccelerate\nhuggingface-hub\npillow\nnumpy\nsafetensors\nqwen-vl-utils\n```\n\nOptional quantization support may involve:\n\n```text\nbitsandbytes\n```\n\nOllama-backed stages require a working local Ollama installation and installed Ollama model tags.\n\n## Hardware notes\n\nCaptionForge is designed for local workflows, but strong results may require large local models.\n\nPractical performance depends on:\n\n- GPU VRAM\n- system RAM\n- model size\n- quantization mode\n- Ollama version\n- context length\n- image size\n- number of Pass A witness runs\n- whether models are kept loaded or unloaded between runs\n\nThe author's active development environment includes an RTX 4090 Laptop GPU with 16 GB VRAM. 
Larger models may be slow, may require careful quantization, or may need more capable hardware.\n\n## Experimental branches\n\nSome experimental or unsupported code may exist in the repository for future A/B testing or research.\n\nExperimental branches should be:\n\n- clearly labeled\n- kept out of the normal ComfyUI registration path\n- not imported by `__init__.py`\n- not shown as mainline nodes unless deliberately enabled\n- treated as unsupported starting points rather than stable user features\n\nThe active public workflow should be the main Planner → Pass A witnesses → Distiller → VLM Validator → Export path.\n\n## Development principles\n\nCaptionForge currently prioritizes:\n\n- local execution\n- auditable intermediate records\n- JSONL sidecars\n- reusable engines separated from ComfyUI node wrappers\n- planner-driven workflows\n- model cache and VRAM hygiene\n- strong defaults for LoRA captioning\n- explicit prompt roles\n- model-agnostic backends\n- visible, trainable detail over generic caption prose\n- practical feedback from real datasets\n\n## Feedback wanted\n\nUseful feedback includes:\n\n- comparisons against standalone JoyCaption, Qwen, or other captioners\n- examples where CaptionForge improves caption quality\n- examples where CaptionForge makes captions worse\n- hallucination reports\n- missed-detail reports\n- model recommendations\n- prompt improvements\n- broken node reports\n- workflow usability feedback\n- VRAM/performance observations\n- JSONL/audit trail suggestions\n\nPlease include enough context to reproduce the issue or evaluate the result: selected nodes, model tags, relevant settings, whether the run used direct IMAGE input or a folder path, and a small sample of generated captions when possible.\n\n## Attribution \u0026 License\n\nConcept and implementation by **J. L. 
Córdova**, with development assistance from **ChatGPT (OpenAI)**.\n\nCaptionForge's Joy/template-option workflow is locally adapted and was inspired in part by the practical template interface pattern used by the public JoyCaption Beta One Hugging Face Space:\n\n```text\nhttps://huggingface.co/spaces/fffiloni/JoyCaption-Beta-One\n```\n\nCopyright (c) 2026 J. L. Córdova\n\nReleased under the **MIT License**. See [`LICENSE`](./LICENSE) for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDamkohler%2FCaptionForge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FDamkohler%2FCaptionForge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FDamkohler%2FCaptionForge/lists"}