https://github.com/fblissjr/ComfyUI-DatasetHelper
For working more easily with input datasets and passing them into downstream nodes in ComfyUI
- Host: GitHub
- URL: https://github.com/fblissjr/ComfyUI-DatasetHelper
- Owner: fblissjr
- License: apache-2.0
- Created: 2025-01-26T20:10:11.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-01-26T20:41:06.000Z (3 months ago)
- Last Synced: 2025-01-26T21:25:52.371Z (3 months ago)
- Language: Python
- Size: 19.5 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-comfyui - **ComfyUI Dataset Helper & Batch Node**
README
# ComfyUI Dataset Helper & Batch Node
This custom node set for ComfyUI provides a `DatasetBatchNode` for automated, sequential processing of datasets, particularly useful for iterative training or batched image/video generation workflows.
## Usage
**Key Features:**
* Loads datasets from local files (JSONL, CSV) or Hugging Face `datasets`.
* Processes individual rows sequentially, triggering a new job for each.
* `magic_number` input automatically tracks the row index across multiple jobs, ensuring seamless dataset iteration even across workflow restarts.
* Offers flexible prompt construction using `mixed_fields_config_json` to combine dataset fields and apply filters.
* *Data utilities*: currently only includes a `process_miradata.py` script that preprocesses the MiraData dataset for this node, although the node itself can work with any dataset. These utilities live in `./data_utils/`, with a separate `README.md` explaining how to use them.

**Node Configuration (`DatasetBatchNode`)**
* **`dataset_path`:** (Required) Path to your dataset file (JSONL, CSV) or a Hugging Face dataset identifier (e.g., `TencentARC/MiraData`). A sample JSONL file is sketched after this list.
* **`prompt_field`:** (Required) Name of the dataset field containing the base text prompt (e.g., `combined_caption`).
* **`num_rows`:** (Required) Total rows to process. Use `-1` to process the entire dataset.
* **`start_row`:** (Required) Row index to begin processing from (0-indexed).
* **`random_seed`:** (Required) Seed for dataset shuffling, if enabled.
* **`shuffle`:** (Required) Enable dataset shuffling before processing.
* **`magic_number`:** (Optional, but **connect a Constant node set to `0`**.) This input is managed automatically; connect a Constant Number node set to `0` and leave it connected (it may also work when left disconnected, but that needs more testing).
* **`delimiter`:** (Optional) Separator used to join combined caption fields (default: newline `\n`).
* **`text_input`:** (Optional) Prepended text added to every generated prompt.
* **`mixed_fields_config_json`:** (Optional) JSON configuration for advanced prompt construction. See "Advanced Prompt Construction" below.

**Important:** Connect an `INT Constant` node set to `0` to the `magic_number` input. This value is managed automatically by the extension and is crucial for sequential processing.
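For reference, a local dataset passed via `dataset_path` could be a JSONL file along these lines, with `prompt_field` set to `combined_caption` (the rows and field names below are made up for illustration, not taken from any real dataset):

```jsonl
{"combined_caption": "A red fox trots across a snowy field at dawn.", "clip_id": 1.0}
{"combined_caption": "Waves crash against a rocky shoreline at sunset.", "clip_id": 2.0}
{"combined_caption": "A cyclist weaves through a crowded market street.", "clip_id": 3.0}
```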
**Advanced Prompt Construction (`mixed_fields_config_json`)**
Use this optional JSON input for sophisticated prompt creation by combining and filtering different fields from your dataset.
**JSON Configuration Format:**
```json
[
{
"field": "field_name_1",
"filter": "filter_expression_1" // Optional filter
},
{
"field": "field_name_2",
"filter": "filter_expression_2" // Optional filter
},
// etc etc
]
```

* **`field` (Required):** The name of a field in your dataset (e.g., `"dense_caption"`, `"short_caption"`).
* **`filter` (Optional):** A Python expression string used to conditionally include data from this field. The expression is evaluated for each row, with `example` representing the current dataset row (a dictionary). If the expression evaluates to `True`, the field's value is included in the combined prompt.

## JSON Examples
Assuming you are using the "TencentARC/MiraData" dataset, here are some practical examples:
### Example 1: Combine `dense_caption` and `style_caption`
```json
[
{
"field": "dense_caption",
"filter": ""
},
{
"field": "style_caption",
"filter": ""
}
]
```

This configuration will create prompts by concatenating the `dense_caption` and `style_caption` fields from each row, separated by the `delimiter` you specify in the node (default is newline).
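To make the role of `delimiter` concrete, here is a minimal sketch of the concatenation, assuming a made-up row; this is illustrative only, not the node's actual code:

```python
# Hypothetical row; field values are invented for illustration.
row = {
    "dense_caption": "A red fox trots across a snowy field at dawn.",
    "style_caption": "Cinematic lighting, shallow depth of field.",
}

delimiter = "\n"  # the node's default delimiter
prompt = delimiter.join(row[field] for field in ("dense_caption", "style_caption"))

print(prompt)
# A red fox trots across a snowy field at dawn.
# Cinematic lighting, shallow depth of field.
```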
### Example 2: Use `short_caption` for clip IDs >= 7.0, otherwise use `camera_caption`
```json
[
{
"field": "short_caption",
"filter": "example['clip_id'] >= 7.0"
},
{
"field": "camera_caption",
"filter": "example['clip_id'] < 7.0"
}
]
```

This example demonstrates conditional prompt creation. For dataset rows where `clip_id` is 7.0 or greater, the `short_caption` will be used. For rows with `clip_id` less than 7.0, the `camera_caption` will be used instead.
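As a rough mental model of how such filters behave, the sketch below evaluates each filter expression against the current row exposed as `example`, keeping a field only when its filter is empty or evaluates to `True`. This mirrors the documented behavior but is not necessarily the node's exact implementation:

```python
# Sketch of per-row filter evaluation; not the extension's actual code.
config = [
    {"field": "short_caption", "filter": "example['clip_id'] >= 7.0"},
    {"field": "camera_caption", "filter": "example['clip_id'] < 7.0"},
]

# Hypothetical dataset row.
example = {"clip_id": 7.5, "short_caption": "A dog runs.", "camera_caption": "Static wide shot."}

parts = []
for entry in config:
    expr = entry.get("filter", "")
    # An empty filter means "always include"; otherwise evaluate the expression
    # with the current row bound to the name `example`.
    if not expr or eval(expr, {"example": example}):
        parts.append(example[entry["field"]])

print("\n".join(parts))  # clip_id is 7.5, so only short_caption ("A dog runs.") is kept
```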
### Example 3: Combine all caption fields
```json
[
{ "field": "dense_caption" },
{ "field": "main_object_caption" },
{ "field": "background_caption" },
{ "field": "camera_caption" },
{ "field": "style_caption" }
]
```

This configuration combines all five available caption fields from the MiraData dataset into a single, detailed prompt.

**Important Notes:**
* The `DatasetBatchNode` is designed to work in conjunction with the provided `dataset_batch_automation.js` JavaScript extension. Ensure both the Python node and the JavaScript extension are installed correctly in your `ComfyUI/custom_nodes` directory.
* The `magic_number` input of the `DatasetBatchNode` should be connected to a Constant Number node, but should **not** be manually modified. The automation script manages this value.