{"id":27261508,"url":"https://github.com/cahlen/conversation-dataset-generator","last_synced_at":"2025-10-27T18:42:33.469Z","repository":{"id":287198535,"uuid":"963703790","full_name":"cahlen/conversation-dataset-generator","owner":"cahlen","description":"Craft conversational datasets (JSONL format with rich metadata) using LLMs. Specify parameters manually or use a creative brief for LLM-generated arguments with automatic topic/scenario variation. Optional web search improves persona grounding. Ideal for LoRA tuning, persona training, and creative writing. Includes Hugging Face Hub upload.","archived":false,"fork":false,"pushed_at":"2025-04-10T23:14:13.000Z","size":128,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-11T05:39:54.199Z","etag":null,"topics":["dataset-generation","dialogue-generation","fine-tuning","huggingface","jsonl","llm","lora","nlp","peft","persona","python","synthentic-data","transformers"],"latest_commit_sha":null,"homepage":"https://cahlen.github.io/conversation-dataset-generator/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cahlen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-10T04:56:09.000Z","updated_at":"2025-04-10T23:14:16.000Z","dependencies_parsed_at":"2025-04-12T09:15:32.302Z","dependency_job_id":null,"html_url":"https://github.com/cahlen/conversation-dataset-generator","commit_stats":null,"previous_names":["cahlen/conversation-dataset-generator"],"tags_count":0,"template"
:false,"template_full_name":null,"purl":"pkg:github/cahlen/conversation-dataset-generator","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahlen%2Fconversation-dataset-generator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahlen%2Fconversation-dataset-generator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahlen%2Fconversation-dataset-generator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahlen%2Fconversation-dataset-generator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cahlen","download_url":"https://codeload.github.com/cahlen/conversation-dataset-generator/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cahlen%2Fconversation-dataset-generator/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":281323267,"owners_count":26481554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-27T02:00:05.855Z","response_time":61,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dataset-generation","dialogue-generation","fine-tuning","huggingface","jsonl","llm","lora","nlp","peft","persona","python","synthentic-data","transformers"],"created_at":"2025-04-11T05:33:29.967Z","updated_at":"2025-10-27T18:42:33.449Z","avatar_url":"https://github.com/c
ahlen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca name=\"readme-top\"\u003e\u003c/a\u003e\n\n\u003c!-- PROJECT SHIELDS --\u003e\n[![MIT License][license-shield]][license-url]\n\u003c!-- Add other shields here if desired --\u003e\n\n\u003c!-- PROJECT LOGO --\u003e\n\u003cbr /\u003e\n\u003cdiv align=\"center\"\u003e\n  \u003ca href=\"https://github.com/cahlen/conversation-dataset-generator\"\u003e\n    \u003cimg src=\"https://www.svgrepo.com/show/28673/speech-bubble.svg\" alt=\"Conversation Icon\" width=\"80\" height=\"80\"\u003e\n  \u003c/a\u003e\n\n  \u003ch1 align=\"center\"\u003eConversation Dataset Generator ✨\u003c/h1\u003e\n\n  \u003cp align=\"center\"\u003e\n    Craft High-Quality Dialogue Data for Your LLMs.\n    \u003cbr /\u003e\n    \u003ca href=\"https://cahlen.github.io/conversation-dataset-generator/\"\u003e\u003cstrong\u003eView Project Page »\u003c/strong\u003e\u003c/a\u003e\n    \u003cbr /\u003e\n    \u003cbr /\u003e\n    \u003ca href=\"https://github.com/cahlen/conversation-dataset-generator/issues\"\u003eReport Bug\u003c/a\u003e\n    ·\n    \u003ca href=\"https://github.com/cahlen/conversation-dataset-generator/issues\"\u003eRequest Feature\u003c/a\u003e\n  \u003c/p\u003e\n\u003c/div\u003e\n\n\u003c!-- TABLE OF CONTENTS --\u003e\n\u003cdetails\u003e\n  \u003csummary\u003eTable of Contents\u003c/summary\u003e\n  \u003col\u003e\n    \u003cli\u003e\n      \u003ca href=\"#about-the-project\"\u003eAbout The Project\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#built-with\"\u003eBuilt With\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\n      \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e\n      \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#prerequisites\"\u003ePrerequisites\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca 
href=\"#installation\"\u003eInstallation\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#usage\"\u003eUsage\u003c/a\u003e\u003c/li\u003e\n     \u003cul\u003e\n        \u003cli\u003e\u003ca href=\"#modes-of-operation\"\u003eModes of Operation\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#argument-reference\"\u003eArgument Reference\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#examples\"\u003eExamples\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#output-format\"\u003eOutput Format\u003c/a\u003e\u003c/li\u003e\n        \u003cli\u003e\u003ca href=\"#model--fine-tuning-notes\"\u003eModel \u0026 Fine-Tuning Notes\u003c/a\u003e\u003c/li\u003e\n      \u003c/ul\u003e\n    \u003cli\u003e\u003ca href=\"#roadmap\"\u003eRoadmap\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#contributing\"\u003eContributing\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#license\"\u003eLicense\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#contact\"\u003eContact\u003c/a\u003e\u003c/li\u003e\n    \u003cli\u003e\u003ca href=\"#acknowledgments\"\u003eAcknowledgments\u003c/a\u003e\u003c/li\u003e\n  \u003c/ol\u003e\n\u003c/details\u003e\n\n\u003c!-- ABOUT THE PROJECT --\u003e\n## About The Project\n\nEver wish you could generate *just* the right kind of conversational data? Whether you're fine-tuning a Large Language Model (LLM) for a specific **style** or **persona**, need dialogue for a creative project, or want to explore complex **topics** in a natural flow, the Conversation Dataset Generator is here to help!\n\nThis powerful and flexible Python script leverages Hugging Face's `transformers` library to put you in control. You can operate in two main modes:\n\n1.  **Manual Mode:** Specify everything – the exact `topic`, `personas` (with descriptions!), `scenario`, `style`, and even specific `keywords` to include.\n2.  
**Creative Brief Mode:** Provide a high-level `creative-brief` (like *\"Sherlock Holmes explains TikTok trends to a confused Dr. Watson\"*) and let the script use an LLM to brainstorm the detailed parameters *for you*. This mode automatically generates **topic/scenario variations** for each example while keeping the core personas consistent, enhancing dataset diversity. Furthermore, you can optionally provide specific **web search terms** (`--persona1-search-term`, `--persona2-search-term`) to fetch real-time context about the personas, allowing the LLM to generate more accurate descriptions and dialogue even for individuals or characters not well-represented in its training data.\n\nEither way, the output is a clean **JSON Lines (`.jsonl`)** file, perfect for downstream tasks. Each line represents a single turn with a rich set of keys readily compatible with popular LLM training frameworks and NLP pipelines.\n\n**Why Use This Generator?**\n\nUnlock the potential of your LLMs or accelerate your creative process! 
This script empowers you to generate targeted datasets for various goals:\n\n*   **Style Specialization:** Train models to master specific conversational nuances (e.g., pirate speak, formal anchor).\n*   **Persona Embodiment:** Build believable characters, even niche ones using web search context.\n*   **Topic/Scenario Fluency:** Enhance a model's ability to discuss particular subjects naturally.\n*   **Instruction Adherence:** Train models to better follow constraints like including specific keywords.\n*   **Creative Content Generation:** Break writer's block and draft dialogue for scripts, stories, etc.\n*   **Dialogue Flow Analysis:** Study conversation progression using the structured output.\n\nBest of all, the code is fully open source under the MIT license, giving you the freedom to use, modify, and extend it however you see fit!\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n### Built With\n\nThis project relies on several key libraries:\n\n*   [![Python][Python.org]][Python-url]\n*   [![PyTorch][PyTorch.org]][PyTorch-url]\n*   [![Transformers][Transformers.co]][Transformers-url]\n*   [![Accelerate][Accelerate.co]][Accelerate-url]\n*   [![Datasets][Datasets.co]][Datasets-url]\n*   [![Huggingface Hub][Huggingface.co]][Huggingface-url]\n*   [![Pandas][Pandas.pydata]][Pandas-url]\n*   [![DuckDuckGo Search][DuckDuckGo-Search-pypi]][DuckDuckGo-Search-url] (Optional, for Brief Mode web search)\n*   [![BitsAndBytes][BitsAndBytes-pypi]][BitsAndBytes-url] (Optional, for LoRA examples)\n*   [![TQDM][TQDM-pypi]][TQDM-url] (For progress bars)\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- GETTING STARTED --\u003e\n## Getting Started\n\nTo get a local copy up and running follow these simple steps.\n\n### Prerequisites\n\n*   **Python:** Version 3.8+ is required.\n*   **GPU:** A powerful GPU with sufficient VRAM and CUDA support is 
*highly recommended*, especially for Creative Brief mode, which involves multiple LLM calls per example. Manual mode is less demanding but still benefits from GPU acceleration.\n*   **CPU/Memory:** A capable CPU and adequate RAM are needed.\n*   **Internet Connection:** Required if using the `--personaX-search-term` arguments in Creative Brief mode for DuckDuckGo searches or for image searches.\n*   **Dependencies:** Install necessary Python packages as described below.\n\n### Installation\n\n1.  **Clone the repository (Optional):**\n    ```bash\n    git clone https://github.com/cahlen/conversation-dataset-generator.git\n    cd conversation-dataset-generator\n    ```\n2.  **Create \u0026 Activate Virtual Environment (Recommended):**\n    ```bash\n    python3 -m venv venv\n    source venv/bin/activate # On Windows use `venv\\Scripts\\activate`\n    ```\n3.  **Install Base Dependencies:**\n    ```bash\n    pip install -r requirements.txt\n    ```\n    *Note: If `torch` is included in `requirements.txt`, pip might install a CPU or older CUDA version. For optimal GPU usage, consider installing PyTorch separately to match your CUDA version (see step 4). The `requirements.txt` file also includes `tqdm` for progress bars.*\n4.  **Install Specific PyTorch Version (Optional but Recommended for GPU):**\n    Install PyTorch *after* other dependencies, matching your CUDA setup. Find the correct command for your system on the [official PyTorch website](https://pytorch.org/get-started/locally/).\n    ```bash\n    # Example for CUDA 12.8\n    pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128\n    ```\n    *Ensure your NVIDIA driver version supports your chosen CUDA version!*\n5.  
**Install Optional Dependencies:**\n    *   For Brief Mode web search (`--personaX-search-term`):\n        ```bash\n        pip install duckduckgo-search\n        ```\n    *   For LoRA training examples/notes:\n        ```bash\n        pip install -U peft trl bitsandbytes\n        ```\n    *   For progress bars (likely already installed via `requirements.txt`):\n        ```bash\n        pip install tqdm\n        ```\n6.  **Log in to Hugging Face Hub (Optional, for uploading):**\n    To use the `--upload-to-hub` feature, you need to log in:\n    ```bash\n    huggingface-cli login\n    # Follow prompts to enter your HF API token (write permission is needed for uploading)\n    ```\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- USAGE EXAMPLES --\u003e\n## Usage\n\nThis project provides two main scripts for generating conversational data:\n\n1.  `generate.py`: Generates a single dataset based on command-line arguments or a creative brief. Shows progress using `tqdm`.\n2.  `batch_generate.py`: Runs multiple `generate.py` processes based on a YAML configuration file, allowing for large-scale generation across different scenarios and modes.\n\n### Single Generation (`generate.py`)\n\nThe `generate.py` script can be run in several ways:\n\n**1. 
Manual Mode (Detailed Arguments)**\n\nProvide all conversation parameters explicitly on the command line.\n\n```bash\n# Activate your virtual environment first!\nsource venv/bin/activate \n\npython generate.py --persona1 \"Wizard\" --persona1-desc \"Grumpy, old, prone to muttering spells\" \\\n                   --persona2 \"Knight\" --persona2-desc \"Overly cheerful, oblivious to Wizard's mood\" \\\n                   --topic \"The best way to polish armor without magic\" \\\n                   --scenario \"Stuck in a dungeon waiting room with bad Muzak\" \\\n                   --style \"Comedic, bickering, contrasting personalities\" \\\n                   --num-examples 10 \\\n                   --output-file manual_wizard_knight.jsonl \\\n                   --model-id meta-llama/Meta-Llama-3-8B-Instruct \n```\n\n**2. Creative Brief Mode (Automatic Argument \u0026 Topic Variation)**\n\nProvide a high-level brief. The script generates detailed parameters (personas, topic, etc.) using the LLM, optionally incorporating **web search context** (via `--personaX-search-term`) and **image search** for the personas. It then creates topic/scenario variations for each example while keeping the personas constant.\n\n```bash\n# Without web search\npython generate.py --creative-brief \"A pirate captain trying to order coffee at a modern minimalist cafe\" \\\n                   --num-examples 15 \\\n                   --output-file brief_pirate.jsonl\n\n# With web search for specific personas\npython generate.py --creative-brief \"Conversation between Tech Lead Tina and Junior Dev Joe about effective code reviews\" \\\n                   --num-examples 10 \\\n                   --persona1-search-term \"Typical Tech Lead responsibilities personality traits communication\" \\\n                   --persona2-search-term \"Junior Developer challenges learning curve receiving feedback\" \\\n                   --output-file brief_tech_review.jsonl\n```\n\n**3. 
Fixed Persona + Variation Mode**\n\nDefine fixed personas and an initial context, then enable variation to generate diverse conversations with those same characters.\n\n```bash\npython generate.py \\\n  --enable-variation \\\n  --fixed-persona1 \"Mick Jagger\" \\\n  --fixed-persona1-desc \"Iconic frontman...\" \\\n  --fixed-persona2 \"Ozzy Osbourne\" \\\n  --fixed-persona2-desc \"The Prince of Darkness...\" \\\n  --initial-topic \"Modern rock music and reality TV\" \\\n  --initial-scenario \"Backstage at an awards show\" \\\n  --initial-style \"Amusing clash...\" \\\n  --num-examples 20 \\\n  --output-file fixed_jagger_ozzy.jsonl \\\n  --load-in-4bit \\\n  # Note: --include-points can also be used here if desired\n```\n\n**4. Random Pairings Mode (with Character Pools)**\n\nGenerate conversations using random pairs of characters selected from predefined character pools (YAML files). Each conversation will feature a different pairing from your character pools.\n\n```bash\npython generate.py \\\n  --random-pairings \\\n  --character-pool character-config/got_characters.yaml \\\n  --persona-desc-pool character-config/got_descriptions.yaml \\\n  --initial-topic \"Discussing the Iron Throne succession\" \\\n  --initial-scenario \"In the Great Hall of Winterfell\" \\\n  --initial-style \"Tense strategic conversation with occasional wit\" \\\n  --num-examples 10 \\\n  --output-file got_random_pairings.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n**5. Random Pairings with Variation**\n\nCombine random character pairings with topic/scenario variation for maximum diversity. 
This generates conversations with different characters AND different topics/scenarios for each example.\n\n```bash\npython generate.py \\\n  --random-pairings \\\n  --enable-variation \\\n  --character-pool character-config/avengers_chars.yaml \\\n  --persona-desc-pool character-config/avengers_desc.yaml \\\n  --initial-topic \"Planning a team-building exercise for the Avengers\" \\\n  --initial-scenario \"In the Avengers Tower common room\" \\\n  --initial-style \"Humorous and character-driven conversation with friendly banter\" \\\n  --num-examples 20 \\\n  --output-file avengers_random_varied.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n(See Argument Reference below for all available options for `generate.py`)\n\n### Batch Generation (`batch_generate.py`)\n\nFor generating multiple datasets with different configurations efficiently, use the `batch_generate.py` script along with a YAML configuration file.\n\n**1. Create a YAML Configuration File**\n\nDefine the runs you want to perform. Each run corresponds to one execution of `generate.py`. You can mix modes (manual, brief, fixed persona) within a single YAML file. 
See the `examples/` directory for detailed configuration examples like `examples/batch_mixed_modes.yaml` and `examples/batch_rockstars_celebs.yaml`.\n\n**Key YAML Structure:**\n\n```yaml\n# Top-level settings (optional)\noutput_directory: \"./batch_output\" # Base directory for all output files\n# upload_repo: \"YourUser/GlobalRepo\" # Optional: Default repo if not set per-run\nforce_upload: false # Optional: Global force upload flag\n\n# List of runs to execute\nruns:\n  # Run 1: Creative Brief Example\n  - id: \"unique_run_id_1\"             # Optional: Identifier for logging\n    output_file: \"run1_output.jsonl\" # REQUIRED: Specific output for this run\n    num_examples: 50\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    creative_brief: \"Scenario description...\"\n    upload_repo: \"YourUser/Run1Dataset\" # Optional: Per-run upload destination\n    load_in_4bit: true\n\n  # Run 2: Fixed Persona + Variation Example\n  - id: \"unique_run_id_2\"\n    output_file: \"run2_output.jsonl\"\n    num_examples: 75\n    enable_variation: true             # REQUIRED for this mode\n    fixed_personas:\n      persona1: \"Persona Name\"\n      persona1_desc: \"Description...\"\n      persona2: \"Another Persona\"\n      persona2_desc: \"Description...\"\n    initial_context:\n      topic: \"Seed topic\"\n      scenario: \"Seed scenario\"\n      style: \"Seed style\"\n      # include_points: \"optional,keywords\"\n    load_in_4bit: true\n\n  # Run 3: Manual Mode Example\n  - id: \"unique_run_id_3\"\n    output_file: \"run3_output.jsonl\"\n    num_examples: 25\n    manual_args:                  # REQUIRED for this mode\n      topic: \"Manual topic\"\n      persona1: \"Manual Persona 1\"\n      persona1_desc: \"Desc...\"\n      persona2: \"Manual Persona 2\"\n      persona2_desc: \"Desc...\"\n      scenario: \"Manual scenario\"\n      style: \"Manual style\"\n      # include_points: \"optional,keywords\"\n    # No upload specified for this run\n\n  # ... 
add more runs as needed\n```\n\n**2. Run the Batch Script**\n\nExecute `batch_generate.py` and point it to your YAML configuration file.\n\n```bash\n# Activate your virtual environment first!\nsource venv/bin/activate \n\npython batch_generate.py path/to/your/config.yaml\n\n# Example using one of the provided configs:\npython batch_generate.py examples/batch_rockstars_celebs.yaml \n```\n\nThe script will iterate through each run defined in the YAML, construct the appropriate `generate.py` command, execute it, and log the progress and results.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n### Argument Reference (`generate.py`)\n\nTailor your generation precisely. Choose one generation mode per run: `--creative-brief`, the set of detailed manual arguments, Fixed Persona + Variation (`--enable-variation`), or Random Pairings (`--random-pairings`). Use `--delete-repo` only for deleting repositories.\n\n**Mode Selection**\n\n*   `--creative-brief STR`: Provide a high-level concept (e.g., *\"Godzilla ordering takeout sushi\"*). The script uses the LLM specified by `--model-id` to first generate the detailed arguments (topic, personas, etc.) automatically, potentially informed by web context if search terms are provided (see below). It also performs an **image search** for the personas. Then, for each requested example, it generates a *new, related* topic/scenario variation while keeping the initially generated personas consistent. If you provide this, any manual or fixed-persona arguments are ignored.\n*   `--delete-repo USERNAME/REPO_ID [USERNAME/REPO_ID ...]`: **DANGER ZONE.** Use this argument *instead of* generation arguments to permanently delete one or more Hugging Face Hub dataset repositories. **THIS ACTION IS IRREVERSIBLE.** You will be asked for confirmation. 
Accepts multiple space-separated repository IDs.\n\n**Creative Brief Web Context (Optional - Only used with `--creative-brief`)**\n\n*   `--persona1-search-term STR`: If provided along with `--creative-brief`, the script will perform a web search (via DuckDuckGo) using this exact term. The fetched text snippets will be added as context to the prompt used for generating the main arguments (including `--persona1-desc`), helping the LLM create a more informed persona. Ideal for less common or specific characters/individuals. Requires `duckduckgo-search` library.\n*   `--persona2-search-term STR`: Same as above, but for Persona 2.\n\n**Detailed Arguments (Manual Mode)**\n\n*(Required if not using `--creative-brief`, `--delete-repo`, or Fixed Persona Mode)*\n\n*   `--topic STR`: Central topic/subject of the conversation.\n*   `--persona1 STR`: Name of the first speaker (this name will map to the `human` role in the output data). An **image search** will be performed using this name.\n*   `--persona1-desc STR`: Detailed description of the first speaker's personality, background, speech patterns, quirks, etc. (Crucial for generation quality!).\n*   `--persona2 STR`: Name of the second speaker (maps to the `gpt` role). An **image search** will be performed using this name.\n*   `--persona2-desc STR`: Detailed description of the second speaker.\n*   `--scenario STR`: The setting, situation, or context for the conversation.\n*   `--style STR`: Desired tone, mood, and linguistic style (e.g., \"formal debate\", \"casual chat\", \"Shakespearean insults\", \"valley girl slang\", \"hardboiled detective noir\").\n*   `--include-points STR`: Optional comma-separated list of keywords or talking points the conversation should try to naturally incorporate (e.g., `\"time travel paradox,grandfather,temporal mechanics\"`). (Default: `None`)\n\n**Fixed Persona + Variation Mode Arguments**\n\n*(Required if using `--enable-variation`. 
Cannot be used with `--creative-brief` or manual persona/topic arguments)*\n\n*   `--enable-variation`: **Must be set** to activate this mode. Enables topic/scenario/style variation based on initial context while keeping personas fixed.\n*   `--fixed-persona1 STR`: Fixed name for Persona 1. An **image search** will be performed using this name.\n*   `--fixed-persona1-desc STR`: Fixed description for Persona 1.\n*   `--fixed-persona2 STR`: Fixed name for Persona 2. An **image search** will be performed using this name.\n*   `--fixed-persona2-desc STR`: Fixed description for Persona 2.\n*   `--initial-topic STR`: Seed topic used for the first example and as a basis for variations.\n*   `--initial-scenario STR`: Seed scenario used for the first example and as a basis for variations.\n*   `--initial-style STR`: Seed style used for the first example and as a basis for variations.\n*   `--include-points STR`: Optional comma-separated list of keywords or talking points, same as in Manual Mode. (Default: `None`)\n\n**Random Pairings Mode Arguments**\n\n*(Required if using `--random-pairings`. Cannot be used with `--creative-brief`, `--persona1`, or Fixed Persona arguments)*\n\n*   `--random-pairings`: **Must be set** to activate this mode. Enables selection of random character pairs from pools for each conversation.\n*   `--character-pool STR`: Path to a YAML file containing a list of character names under a `characters` key. File should be in the `character-config` directory or include a full path.\n*   `--persona-desc-pool STR`: Path to a YAML file containing a dictionary of character names to descriptions under a `descriptions` key. 
File should be in the `character-config` directory or include a full path.\n*   `--enable-variation`: Optional flag that, when combined with `--random-pairings`, enables topic/scenario/style variation for each conversation.\n*   `--initial-topic STR`: Base topic used for conversations (or as a seed for variations if `--enable-variation` is set).\n*   `--initial-scenario STR`: Base scenario used for conversations (or as a seed for variations if `--enable-variation` is set).\n*   `--initial-style STR`: Base style used for conversations (or as a seed for variations if `--enable-variation` is set).\n*   `--include-points STR`: Optional comma-separated list of keywords or talking points, same as in Manual Mode. (Default: `None`)\n\n**General Arguments (Applicable to Generation Modes)**\n\n*   `--num-examples INT`: How many distinct conversation examples to generate. (Default: 3)\n*   `--output-file PATH`: Path to save the output JSON Lines (`.jsonl`) file. (Default: `generated_data.jsonl`)\n*   `--model-id STR`: Hugging Face model ID (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`). **Crucially, this model is used for BOTH the conversation generation AND the argument/variation generation steps.** Choose a strong instruction-following model. (Default: `meta-llama/Meta-Llama-3-8B-Instruct`)\n*   `--max-new-tokens INT`: Max tokens the LLM can generate in the main conversation step. Adjust based on desired conversation length and model limits. (Default: 768)\n*   `--upload-to-hub STR`: Your Hugging Face Hub repository ID (e.g., `YourUsername/YourDatasetName`) to upload the results to. The script will create the repo if it doesn't exist. Requires prior login. (Default: None)\n*   `--force-upload`: Skip the confirmation prompt when uploading to the Hub. Use with caution! (Default: False)\n*   `--validate-local-save`: Perform basic checks on the locally saved `.jsonl` file after writing. (Currently placeholder, no checks implemented). 
(Default: False)\n*   `--load-in-4bit`: Enable 4-bit quantization (NF4) using `bitsandbytes` for model loading. Reduces memory usage and can speed up inference, especially on consumer GPUs. Requires the `bitsandbytes` library to be installed. (Default: False)\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n### Examples\n\nHere are various examples demonstrating how to generate data for specific goals using both modes:\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 1: Training a \"Sitcom Banter\" Style LoRA (Manual Mode)\u003c/summary\u003e\n\n*Goal: Create a LoRA that makes an LLM generate witty, observational dialogue reminiscent of a classic sitcom.*\n\n```bash\npython generate.py \\\n  --num-examples 1000 \\\n  --topic \"the absurdity of everyday errands\" \\\n  --persona1 \"Alex\" \\\n  --persona1-desc \"slightly neurotic, prone to overthinking, often uses rhetorical questions\" \\\n  --persona2 \"Sam\" \\\n  --persona2-desc \"more laid-back, often amused by Alex's antics, responds with dry wit\" \\\n  --scenario \"waiting in line at the post office\" \\\n  --style \"observational, witty, fast-paced banter, slightly absurd, like Seinfeld\" \\\n  --include-points \"long lines, confusing forms, questionable package handling, passive aggression\" \\\n  --output-file sitcom_style_dataset.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: High volume (`--num-examples 1000`), consistent detailed parameters (especially `--style` and descriptive personas) focus the data on the target style.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 2: Training a \"Helpful Coding Mentor\" Persona LoRA (Manual Mode)\u003c/summary\u003e\n\n*Goal: Fine-tune a model to act as a patient, encouraging coding mentor.*\n\n```bash\npython generate.py \\\n  --num-examples 500 \\\n  --topic \"debugging a common Python error (e.g., IndexError)\" \\\n  --persona1 
\"MentorBot\" \\\n  --persona1-desc \"a patient, knowledgeable, and encouraging Python tutor AI. Uses analogies, asks guiding questions rather than giving direct answers, celebrates small successes.\" \\\n  --persona2 \"Learner\" \\\n  --persona2-desc \"a beginner programmer feeling slightly stuck but eager to learn, expresses confusion clearly.\" \\\n  --scenario \"working through a coding problem together online via chat\" \\\n  --style \"supportive, clear, step-by-step, educational, positive reinforcement\" \\\n  --include-points \"traceback, variable scope, print debugging, list index, off-by-one, debugging process\" \\\n  --output-file mentor_persona_dataset.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: Focus is on detailed `--persona1-desc` capturing the desired mentor traits (patience, guiding questions) and a supportive `--style` to shape the mentor's voice.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 3: Training a Topic-Focused LoRA (\"Explaining Quantum Computing Simply\") (Manual Mode)\u003c/summary\u003e\n\n*Goal: Make the LLM more fluent and natural when explaining a complex topic conversationally.*\n\n```bash\npython generate.py \\\n  --num-examples 750 \\\n  --topic \"basic concepts of quantum computing\" \\\n  --persona1 \"QuantumGuru\" \\\n  --persona1-desc \"an expert simplifying quantum concepts using everyday analogies (like coin flips for superposition). 
Patient and enjoys teaching.\" \\\n  --persona2 \"CuriousChris\" \\\n  --persona2-desc \"intelligent but new to quantum, asks clarifying questions, tries to relate concepts to familiar things.\" \\\n  --scenario \"a casual conversation over coffee trying to understand new tech trends\" \\\n  --style \"simplified, analogy-driven, patient, engaging, avoiding deep jargon where possible\" \\\n  --include-points \"qubit, superposition, entanglement, potential applications, uncertainty, classical vs quantum\" \\\n  --output-file quantum_topic_dataset.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: Teaches the *conversational flow* of explaining the specific `--topic`, reinforced by simplifying personas, analogy-driven descriptions, and style.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 4: Enhancing Instruction Adherence (Specific Constraints) (Manual Mode)\u003c/summary\u003e\n\n*Goal: Train the model to better incorporate specific keywords or constraints during generation.*\n\n```bash\npython generate.py \\\n  --num-examples 800 \\\n  --topic \"benefits of renewable energy sources\" \\\n  --persona1 \"EcoAdvocate\" \\\n  --persona1-desc \"passionate environmental scientist, presents facts and figures clearly, optimistic tone.\" \\\n  --persona2 \"SkepticSam\" \\\n  --persona2-desc \"concerned about costs and grid reliability, asks challenging but fair questions, slightly pessimistic tone.\" \\\n  --scenario \"a public town hall meeting discussion about local energy policy\" \\\n  --style \"informative but persuasive debate, addressing counterarguments respectfully\" \\\n  --include-points \"solar panel efficiency, wind turbine placement, grid stability, battery storage, long-term cost savings, carbon emissions, job creation\" \\\n  --output-file instruction_adherence_dataset.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: Training on data where specific `--include-points` 
were required reinforces the model's ability to follow constraints within a natural dialogue structure.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 5: Generating Data for Creative Writing (Sci-Fi Pilot Scene) (Manual Mode)\u003c/summary\u003e\n\n*Goal: Draft dialogue for a specific scene in a science fiction TV pilot.*\n\n```bash\npython generate.py \\\n  --num-examples 20 \\\n  --topic \"analyzing strange readings from an unknown alien artifact\" \\\n  --persona1 \"Captain Eva Rostova\" \\\n  --persona1-desc \"experienced, cautious starship captain, focused on procedure and crew safety. Speaks formally.\" \\\n  --persona2 \"Dr. Aris Thorne\" \\\n  --persona2-desc \"brilliant but impulsive xeno-archaeologist, eager for discovery, sometimes disregards protocol. Speaks excitedly, uses technical jargon.\" \\\n  --scenario \"on the bridge of the starship 'Odyssey' examining scan results displayed on a large viewscreen\" \\\n  --style \"tense, suspenseful, professional sci-fi dialogue, sense of wonder mixed with potential danger\" \\\n  --include-points \"unknown energy signature, unusual material composition, potential risks, isolation, first contact protocol\" \\\n  --output-file scifi_scene_dialogue.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: All parameters work together to create dialogue for a specific fictional moment. 
A low `--num-examples` value is suitable for drafting multiple variations of the scene.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 6: Generating Varied Historical Banter from a Brief (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Quickly generate diverse dialogue between consistent historical figures without defining all details manually.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A philosophical debate between Leonardo da Vinci and Marie Curie about the nature of discovery.\" \\\n  --num-examples 25 \\\n  --output-file brief_historical_debate.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct \\\n  --upload-to-hub YourUser/VariedHistoricalDebate\n```\n\n*Explanation: The script uses the LLM to interpret the `--creative-brief` and generate initial detailed parameters (personas, topic, etc.). Then, for each of the 25 examples, it generates a *new, related* topic/scenario (e.g., discussing specific inventions, the ethics of science, the role of observation) while keeping the da Vinci/Curie personas consistent.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 7: Generating Dialogue for a Specific Person using Web Search (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Create dialogue involving a specific, possibly less famous individual by providing web search terms for context.*\n\n```bash\npython generate.py \\\n  --creative-brief \"Generate a conversation between tech reviewer Marques Brownlee (MKBHD) and legendary filmmaker Stanley Kubrick about the design philosophy of smartphones vs. cinema cameras.\" \\\n  --num-examples 5 \\\n  --persona1-search-term \"Marques Brownlee MKBHD tech review style personality\" \\\n  --persona2-search-term \"Stanley Kubrick filmmaker personality directing style meticulous\" \\\n  --output-file mkbhd_kubrick_web_terms_5.jsonl\n```\n\n*Explanation: The script uses the LLM for the overall brief interpretation and topic variation. 
However, it uses the provided `--personaX-search-term` arguments to fetch context from DuckDuckGo. This context helps the LLM generate more accurate `--personaX-desc` arguments, enabling conversations involving specific individuals the base model might not know well.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 8: Generating Fantasy Dialogue from a Brief (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Create diverse dialogue for a fantasy setting from a simple concept.*\n\n```bash\npython generate.py \\\n  --creative-brief \"An ancient, wise dragon trying to explain magic to a skeptical, pragmatic dwarf blacksmith.\" \\\n  --num-examples 50 \\\n  --output-file brief_fantasy_talk.jsonl \\\n  --validate-local-save\n```\n\n*Explanation: The script generates initial parameters from the brief, then varies the topic/scenario (e.g., explaining different types of magic, the cost of spells, magical artifacts vs. forged items) for each of the 50 examples, keeping the dragon and dwarf personas.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 9: Generating Absurdist Comedy Variations from a Brief (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Generate surreal, varied dialogue based on an unusual pairing.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A sentient existentialist toaster discussing the meaning of crumbs with a flock of nihilistic pigeons in a park.\" \\\n  --num-examples 10 \\\n  --output-file brief_toaster_pigeons.jsonl\n```\n\n*Explanation: Perfect for highly imaginative scenarios! 
The script generates varied crumb-related topics/scenarios (e.g., the futility of sweeping, the beauty of decay, pigeons judging bread types) for the toaster and pigeons across 10 examples.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 10: Generating Specific Genre Dialogue (Noir) from a Brief (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Quickly generate dialogue fitting a specific genre like Noir using only a brief.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A hardboiled detective interrogating a nervous informant about a stolen artifact in a smoky, rain-slicked alley.\" \\\n  --num-examples 10 \\\n  --output-file brief_noir_interrogation.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: The `--creative-brief` provides strong genre cues (hardboiled detective, nervous informant, smoky alley). The LLM generates appropriate personas, topics, scenarios, and a noir style, varying the specifics (e.g., the nature of the artifact, the informant's specific fear) across the examples.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 11: Generating Dialogue for Specific Historical Figures using Web Search (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Create dialogue between specific, potentially niche historical figures by providing web search terms for context.*\n\n```bash\npython generate.py \\\n  --creative-brief \"Conversation between pioneering computer scientist Grace Hopper and minimalist artist Donald Judd about optimizing naval logistics vs. arranging metal boxes.\" \\\n  --num-examples 5 \\\n  --persona1-search-term \"Grace Hopper admiral computer scientist personality nickname Amazing Grace COBOL\" \\\n  --persona2-search-term \"Donald Judd artist minimalism Marfa Texas personality meticulous\" \\\n  --output-file hopper_judd_web_search.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: The brief sets the stage. 
The `--personaX-search-term` arguments guide the LLM's argument generation step by providing specific web context for Grace Hopper and Donald Judd, helping capture their distinct personalities and fields, even if they aren't strongly represented in the base model's training.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 12: Generating Dialogue for Specific Fictional Characters using Web Search (Creative Brief Mode)\u003c/summary\u003e\n\n*Goal: Create dialogue between well-known but perhaps less common fictional characters using web search to solidify their personas.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A discussion between the AI assistant Clippy and the philosophical robot Marvin the Paranoid Android about the inherent suffering of existence vs. offering unsolicited help.\" \\\n  --num-examples 8 \\\n  --persona1-search-term \"Microsoft Clippy paperclip assistant personality annoying helpful interruption\" \\\n  --persona2-search-term \"Marvin the Paranoid Android Hitchhiker's Guide personality depressed intelligent brain the size of a planet\" \\\n  --output-file clippy_marvin_web_search.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct\n```\n\n*Explanation: Similar to the historical example, the brief provides the core idea, while the `--personaX-search-term` arguments provide specific context scraped from the web about Clippy and Marvin, ensuring their iconic (and contrasting) personalities are captured accurately during the initial argument generation, leading to more authentic dialogue.*\n\n\u003c/details\u003e\n\n**Leveraging Web Search for Current Events \u0026 Trending Topics**\n\nOne powerful application of Creative Brief mode with `--personaX-search-term` is generating dialogue grounded in current events, recent news, or ongoing public discussions involving specific individuals. 
By providing relevant search terms, you can create datasets reflecting timely controversies, collaborations, or statements.\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 13: Generating Dialogue around a Celebrity Controversy (Creative Brief + Search)\u003c/summary\u003e\n\n*Goal: Create varied conversations reflecting the public discourse surrounding a recent celebrity feud or controversial statement.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A discussion between Mickey Rourke and JoJo Siwa about the trending controversy on Celebrity Big Brother UK following homophobic remarks and subsequent apologies.\" \\\n  --persona1-search-term \"Mickey Rourke Celebrity Big Brother homophobic comments\" \\\n  --persona2-search-term \"JoJo Siwa response apology homophobic remark\" \\\n  --num-examples 100 \\\n  --output-file trending_MickeyRourke_JoJoSiwa_100.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct \\\n  --load-in-4bit\n```\n\n*Explanation: This uses a specific, current controversy as the brief. 
The `--personaX-search-term` arguments pull in recent context about the remarks and responses, enabling the LLM to generate varied, relevant conversations reflecting the situation.* \n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 14: Generating Dialogue around On-Set Tensions (Creative Brief + Search)\u003c/summary\u003e\n\n*Goal: Generate conversations reflecting reported tensions or rumors between actors on a popular show.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A discussion between Jason Isaacs and Walton Goggins about the trending on-set tensions and feud rumors during the filming of 'White Lotus'.\" \\\n  --persona1-search-term \"Jason Isaacs White Lotus arguments on set\" \\\n  --persona2-search-term \"Walton Goggins feud rumors White Lotus\" \\\n  --num-examples 100 \\\n  --output-file trending_JasonIsaacs_WaltonGoggins_100.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct \\\n  --load-in-4bit\n```\n\n*Explanation: Focuses on reported on-set dynamics. 
The search terms help ground the personas in the context of the show and the alleged feud, allowing for varied speculative conversations.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 15: Generating Dialogue around a Business/Legal Dispute (Creative Brief + Search)\u003c/summary\u003e\n\n*Goal: Create conversations reflecting a high-profile trademark battle or legal dispute between public figures.*\n\n```bash\npython generate.py \\\n  --creative-brief \"A discussion between Katy Perry and Katie Jane Taylor about the trending trademark battle over a clothing brand and intellectual property rights.\" \\\n  --persona1-search-term \"Katy Perry trademark battle clothing\" \\\n  --persona2-search-term \"Katie Jane Taylor trademark dispute Katy Perry\" \\\n  --num-examples 100 \\\n  --output-file trending_KatyPerry_KatieJaneTaylor_100.jsonl \\\n  --model-id meta-llama/Meta-Llama-3-8B-Instruct \\\n  --load-in-4bit\n```\n\n*Explanation: Uses a specific business dispute. Search terms provide context on the legal battle, enabling the LLM to generate conversations about intellectual property, brand identity, and the specifics of the case.*\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n  \u003csummary\u003eExample 16: Generating a Progressive AI Course Curriculum (Batch + Creative Brief + Search)\u003c/summary\u003e\n\n*Goal: Create a series of datasets representing levels in an AI programming course, where the learner persona evolves.* \n\n*Approach: Use `batch_generate.py` with a YAML config. Each run defines a course level using Creative Brief mode. A consistent tutor persona (`EnfuseBot`) guides the learner. 
The crucial part is using `--persona2-search-term` to simulate the learner's increasing knowledge and likely points of confusion at each level.* \n\n*YAML Configuration (`examples/ai_course_curriculum.yaml`):*\n```yaml\n# ai_course_curriculum.yaml\noutput_directory: \"ai_course_datasets\"\nforce_upload: true\n\nruns:\n  # Level 1: Intro\n  - id: \"level1_intro\"\n    output_file: \"ai_course_level1_intro.jsonl\"\n    upload_repo: \"cahlen/AICourse-Level1-Intro\"\n    num_examples: 500\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    load_in_4bit: true\n    creative_brief: \"EnfuseBot introduces fundamental AI/ML concepts...\"\n    persona2_search_term: \"Beginner Python programmer confused about AI...\"\n  # Level 2: Scikit-learn\n  - id: \"level2_sklearn\"\n    output_file: \"ai_course_level2_sklearn.jsonl\"\n    upload_repo: \"cahlen/AICourse-Level2-Sklearn\"\n    num_examples: 500\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    load_in_4bit: true\n    creative_brief: \"EnfuseBot explains core ML concepts and Scikit-learn...\"\n    persona2_search_term: \"Learner starting Scikit-learn confused about supervised...\"\n  # Level 3: Deep Learning\n  - id: \"level3_deeplearning\"\n    output_file: \"ai_course_level3_deeplearning.jsonl\"\n    upload_repo: \"cahlen/AICourse-Level3-DeepLearning\"\n    num_examples: 500\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    load_in_4bit: true\n    creative_brief: \"EnfuseBot introduces Deep Learning fundamentals...\"\n    persona2_search_term: \"Student confused about neural networks activation...\"\n  # Level 4: Computer Vision\n  - id: \"level4_computervision\"\n    output_file: \"ai_course_level4_computervision.jsonl\"\n    upload_repo: \"cahlen/AICourse-Level4-ComputerVision\"\n    num_examples: 500\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    load_in_4bit: true\n    creative_brief: \"EnfuseBot explains Computer Vision fundamentals...\"\n    persona2_search_term: \"Learner 
asking about Computer Vision CNNs...\"\n  # Level 5: NLP\n  - id: \"level5_nlp\"\n    output_file: \"ai_course_level5_nlp.jsonl\"\n    upload_repo: \"cahlen/AICourse-Level5-NLP\"\n    num_examples: 500\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    load_in_4bit: true\n    creative_brief: \"EnfuseBot covers basic NLP concepts...\"\n    persona2_search_term: \"Student learning NLP text representation embeddings...\"\n  # Level 6: Training\n  - id: \"level6_training\"\n    output_file: \"ai_course_level6_training.jsonl\"\n    upload_repo: \"cahlen/AICourse-Level6-Training\"\n    num_examples: 500\n    model_id: \"meta-llama/Meta-Llama-3-8B-Instruct\"\n    load_in_4bit: true\n    creative_brief: \"EnfuseBot guides Learner through training models...\"\n    persona2_search_term: \"Learner questions about model training loops evaluation...\"\n```\n\n*Command to Run:* \n```bash\npython batch_generate.py examples/ai_course_curriculum.yaml\n```\n\n*Explanation: This batch job generates 6 datasets, each simulating a stage in an AI course. By adjusting the `creative_brief` and `persona2_search_term` for each run, the conversations adapt to the expected learner level, creating targeted data for training level-specific chatbot LoRAs.* \n\n\u003c/details\u003e\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n### Output Format\n\nUnderstanding where your data goes:\n\n1.  **Local File (`.jsonl`):** The script always saves the generated data locally first to the path specified by `--output-file`. 
This is a **JSON Lines** file: each line is a complete JSON object representing a single turn.\n\n    ```json\n    {\"conversation_id\": 0, \"turn_number\": 0, \"role\": \"human\", \"speaker_name\": \"Alex\", \"topic\": \"the absurdity of everyday errands\", \"scenario\": \"waiting in line at the post office\", \"style\": \"observational, witty, fast-paced banter, slightly absurd, like Seinfeld\", \"include_points\": \"long lines, confusing forms, questionable package handling, passive aggression\", \"content\": \"Seriously, Sam, look at this line. Is time moving slower in here? Are we in some kind of bureaucratic vortex?\"}\n    {\"conversation_id\": 0, \"turn_number\": 1, \"role\": \"gpt\", \"speaker_name\": \"Sam\", \"topic\": \"the absurdity of everyday errands\", \"scenario\": \"waiting in line at the post office\", \"style\": \"observational, witty, fast-paced banter, slightly absurd, like Seinfeld\", \"include_points\": \"long lines, confusing forms, questionable package handling, passive aggression\", \"content\": \"Only if the vortex requires triplicate forms for entry. And possibly a blood sample. Did you fill out the 7B/Stroke-6 form for *existing* in the line?\"}\n    {\"conversation_id\": 1, \"turn_number\": 0, \"role\": \"human\", \"speaker_name\": \"Alex\", \"topic\": \"the existential dread of choosing coffee beans\", \"scenario\": \"staring blankly at a shelf in a grocery store\", \"style\": \"observational, witty, fast-paced banter, slightly absurd, like Seinfeld\", \"include_points\": \"origin, roast level, ethical sourcing, paralysis by analysis\", \"content\": \"Single origin Ethiopian Yirgacheffe... or the house blend... medium roast... dark roast... 
Sam, how do people *choose*?\"}\n    ```\n\n    Each row has the following keys:\n\n    *   `conversation_id` (int64): Identifier grouping turns within the dataset (0-indexed).\n    *   `turn_number` (int64): The sequence number of the turn within its conversation (0-indexed).\n    *   `role` (string): Speaker role (`human` or `gpt`, mapping from Persona 1 and Persona 2 respectively).\n    *   `speaker_name` (string): The actual name of the speaker for this turn (e.g., 'Alex', 'Sam').\n    *   `topic` (string): The specific topic generated/used for this conversation.\n    *   `scenario` (string): The specific scenario generated/used for this conversation.\n    *   `style` (string): The specific style generated/used for this conversation.\n    *   `include_points` (string): Comma-separated list of keywords requested for inclusion in this conversation (or empty string if none).\n    *   `content` (string): The text content of the turn.\n\n2.  **Hugging Face Hub Upload (Optional):** If you provide a repo ID via `--upload-to-hub`, the script performs a two-step upload after generation (and optional local validation):\n    *   **Step 1: Load \u0026 Push Dataset:** It loads the local `.jsonl` file into a Hugging Face `DatasetDict` object (`datasets.load_dataset('json', ...)`), ensuring features like `conversation_id` and `turn_number` are correctly typed (as `int64`). It then generates a detailed dataset card (README) using the run parameters (based on the *last successfully generated example* when using topic variation), including **any found persona images** and a description of the **generation mode used**, and attaches it to the `DatasetInfo`. Crucially, the `DatasetInfo` includes the `Features` definition matching the full schema. 
Finally, it pushes the `DatasetDict` to your Hub repository using `push_to_hub()`.\n    *   **Step 2: Upload Custom README:** It retrieves the generated dataset card content from the `DatasetInfo`, encodes it to bytes (`utf-8`), and uploads these bytes directly as the `README.md` file using `HfApi.upload_file`. This ensures your repository displays a rich, informative dataset card reflecting the generation parameters, found images, generation mode, and the full data schema.\n\nThe final dataset on the Hub will have the full `conversation_id`, `turn_number`, `role`, `speaker_name`, `topic`, `scenario`, `style`, `include_points`, `content` structure and should display correctly in the dataset previewer.\n\n**Loading the Dataset from the Hub:**\n\nOnce uploaded, the dataset can be easily loaded using the Hugging Face `datasets` library:\n\n```python\nfrom datasets import load_dataset\n\n# Replace with your actual repository ID\n# Ensure you are logged in (`huggingface-cli login`) if the dataset is private\ndataset_repo_id = \"YourUsername/YourDatasetName\" \nds = load_dataset(dataset_repo_id)\n\n# Access the data (e.g., the 'train' split)\nprint(ds['train'][0]) \n```\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n### Model \u0026 Fine-Tuning Notes\n\nLeveraging the generated data:\n\n*   **Generation Model:** Uses `meta-llama/Meta-Llama-3-8B-Instruct` by default for both argument generation (in brief mode) and conversation generation. You can change this with `--model-id` to any compatible Hugging Face text-generation model (results may vary!). Using larger/more capable models might yield better results, especially for complex briefs or nuanced styles.\n*   **Fine-Tuning Suitability:** This data is ideal for Parameter-Efficient Fine-Tuning (PEFT) methods like **LoRA**. 
You can create specialized LoRA adapters for style, persona, or topic without the cost of retraining the entire base model.\n*   **LoRA Benefits:** Smaller footprint, faster training, modular (mix and match adapters!), easily shareable.\n*   **Base Models:** For best results when fine-tuning, start with strong instruction-following base models like Llama 3 Instruct, Mistral Instruct, Mixtral Instruct, Qwen2 Instruct, Gemma Instruct, etc.\n*   **LoRA Training Example Dependencies:** If you plan to train LoRAs based on examples, note the required libraries: `peft`, `trl`, `bitsandbytes`. Install them with `pip install -U peft trl bitsandbytes`.\n*   **4-bit Quantization Dependency:** Using the `--load-in-4bit` flag requires the `bitsandbytes` library. Install it with `pip install bitsandbytes`.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- ROADMAP --\u003e\n## Roadmap\n\n*   [x] Add batch generation script (`batch_generate.py`) with YAML configuration.\n*   [x] Add various generation examples to documentation (Manual, Brief, Fixed Persona, Batch).\n*   [ ] Implement `--validate-local-save` checks (currently a placeholder).\n*   [ ] Explore adding more sophisticated topic/scenario variation techniques.\n*   [ ] Add an option for different output formats (e.g., conversational JSON).\n*   [ ] Improve error handling and reporting in the batch script.\n\nSee the [open issues](https://github.com/cahlen/conversation-dataset-generator/issues) for a full list of proposed features (and known issues).\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- CONTRIBUTING --\u003e\n## Contributing\n\nContributions are what make the open source community such an amazing place to learn, inspire, and create. 
Any contributions you make are **greatly appreciated**.\n\nIf you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag \"enhancement\".\nDon't forget to give the project a star! Thanks again!\n\n1.  Fork the Project\n2.  Create your Feature Branch (`git checkout -b feature/AmazingFeature`)\n3.  Commit your Changes (`git commit -m 'Add some AmazingFeature'`)\n4.  Push to the Branch (`git push origin feature/AmazingFeature`)\n5.  Open a Pull Request\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- LICENSE --\u003e\n## License\n\nDistributed under the MIT License. See the `LICENSE` file for more information.\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- CONTACT --\u003e\n## Contact\n\nCahlen Humphreys - [GitHub Profile](https://github.com/cahlen)\n\nProject Link: [https://github.com/cahlen/conversation-dataset-generator](https://github.com/cahlen/conversation-dataset-generator)\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- ACKNOWLEDGMENTS --\u003e\n## Acknowledgments\n\n*   This README format is based on the [Best-README-Template](https://github.com/othneildrew/Best-README-Template) by Othneil Drew.\n*   [Hugging Face](https://huggingface.co/) for the `transformers`, `datasets`, `accelerate`, and `hub` libraries.\n*   [PyTorch](https://pytorch.org/)\n*   [Pandas](https://pandas.pydata.org/)\n*   [DuckDuckGo Search Library](https://pypi.org/project/duckduckgo-search/)\n*   [Img Shields](https://shields.io)\n*   [TQDM](https://github.com/tqdm/tqdm)\n\n\u003cp align=\"right\"\u003e(\u003ca href=\"#readme-top\"\u003eback to top\u003c/a\u003e)\u003c/p\u003e\n\n\u003c!-- MARKDOWN LINKS \u0026 IMAGES --\u003e\n\u003c!-- 
https://www.markdownguide.org/basic-syntax/#reference-style-links --\u003e\n[license-shield]: https://img.shields.io/github/license/cahlen/conversation-dataset-generator.svg?style=for-the-badge\n[license-url]: https://github.com/cahlen/conversation-dataset-generator/blob/main/LICENSE\n[Python.org]: https://img.shields.io/badge/Python-3.8+-3776AB?style=for-the-badge\u0026logo=python\u0026logoColor=white\n[Python-url]: https://www.python.org/\n[PyTorch.org]: https://img.shields.io/badge/PyTorch-%23EE4C2C.svg?style=for-the-badge\u0026logo=PyTorch\u0026logoColor=white\n[PyTorch-url]: https://pytorch.org/\n[Transformers.co]: https://img.shields.io/badge/transformers-%F0%9F%A4%97-FFD000.svg?style=for-the-badge\n[Transformers-url]: https://huggingface.co/docs/transformers/index\n[Accelerate.co]: https://img.shields.io/badge/accelerate-%F0%9F%A4%97-brightgreen.svg?style=for-the-badge\n[Accelerate-url]: https://huggingface.co/docs/accelerate/index\n[Datasets.co]: https://img.shields.io/badge/datasets-%F0%9F%A4%97-blue.svg?style=for-the-badge\n[Datasets-url]: https://huggingface.co/docs/datasets/index\n[Huggingface.co]: https://img.shields.io/badge/HuggingFace%20Hub-🤗-yellow?style=for-the-badge\n[Huggingface-url]: https://huggingface.co/\n[Pandas.pydata]: https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge\u0026logo=pandas\u0026logoColor=white\n[Pandas-url]: https://pandas.pydata.org/\n[DuckDuckGo-Search-pypi]: https://img.shields.io/badge/DuckDuckGo%20Search-optional-grey?style=for-the-badge\u0026logo=duckduckgo\n[DuckDuckGo-Search-url]: https://pypi.org/project/duckduckgo-search/\n[BitsAndBytes-pypi]: https://img.shields.io/badge/bitsandbytes-optional-purple?style=for-the-badge\n[BitsAndBytes-url]: https://pypi.org/project/bitsandbytes/\n[TQDM-pypi]: https://img.shields.io/badge/tqdm-✓-green?style=for-the-badge\u0026logo=python\n[TQDM-url]: 
https://pypi.org/project/tqdm/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcahlen%2Fconversation-dataset-generator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcahlen%2Fconversation-dataset-generator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcahlen%2Fconversation-dataset-generator/lists"}