{"id":31753732,"url":"https://github.com/servicenow/fm2ds","last_synced_at":"2025-10-09T17:53:58.524Z","repository":{"id":267365247,"uuid":"899691865","full_name":"ServiceNow/FM2DS","owner":"ServiceNow","description":null,"archived":false,"fork":false,"pushed_at":"2025-09-17T02:37:37.000Z","size":198,"stargazers_count":7,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-17T04:24:03.540Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ServiceNow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2024-12-06T19:59:15.000Z","updated_at":"2025-09-17T02:37:41.000Z","dependencies_parsed_at":null,"dependency_job_id":"7c3e7f16-76a9-4a4a-a762-08ec870603b8","html_url":"https://github.com/ServiceNow/FM2DS","commit_stats":null,"previous_names":["servicenow/fm2ds"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ServiceNow/FM2DS","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FFM2DS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FFM2DS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FFM2DS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FFM2DS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/owners/ServiceNow","download_url":"https://codeload.github.com/ServiceNow/FM2DS/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FFM2DS/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279001805,"owners_count":26083197,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-09T02:00:07.460Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-10-09T17:53:52.738Z","updated_at":"2025-10-09T17:53:58.518Z","avatar_url":"https://github.com/ServiceNow.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cp align=\"center\"\u003eFM\u003csup\u003e2\u003c/sup\u003eDS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cbr\u003e\n  \u003ca href=\"https://www.arxiv.org/abs/2412.07030\"\u003e\u003cimg alt=\"Paper\" src=\"https://img.shields.io/badge/📃-Paper-808080\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://fm2ds.github.io/\"\u003e\u003cimg alt=\"Website\" src=\"https://img.shields.io/badge/%F0%9F%8C%90-Website-008080\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://huggingface.co/datasets/AmirhosseinAbaskohi/M2QA_Bench\"\u003e\u003cimg alt=\"Huggingface\" 
src=\"https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Benchmark-yellow\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n## Abstract\nMultimodal multihop question answering is a complex task that requires reasoning over multiple sources of information, such as images and text, to answer questions. While there has been significant progress in visual question answering,\nthe multihop setting remains unexplored due to the lack of high-quality datasets. Current methods focus on single-hop question answering or a single modality,\nwhich makes them unsuitable for real-world scenarios such as analyzing multimodal educational materials, summarizing lengthy academic articles, or interpreting scientific studies that combine charts, images,\nand text. To address this gap, we propose a novel methodology, introducing the first framework for creating a high-quality dataset that enables training models for multimodal multihop question answering.\nOur approach consists of a 5-stage pipeline that involves acquiring relevant multimodal documents from Wikipedia, synthetically generating high-level questions and answers, and validating them through rigorous criteria to ensure quality data.\nWe evaluate our methodology by training models on our synthesized dataset and testing on two benchmarks, our results demonstrate that, with an equal sample size,\nmodels trained on our synthesized data outperform those trained on human-collected data by 1.9 in exact match (EM) on average.\nWe believe our data synthesis method will serve as a strong foundation for training and evaluating multimodal multihop question answering models.\n\n\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/user-attachments/assets/d1c8fde6-d02f-4e6b-b224-e327acab7c93\" alt=\"image\"\u003e\n\u003c/div\u003e\n\n\nIn contrast to traditional datasets that depend on human annotators, templates, and information snippets as sources, FM\u003csup\u003e2\u003c/sup\u003eDS is a fully automated approach that 
utilizes complete documents as its sources.\nFM\u003csup\u003e2\u003c/sup\u003eDS incorporates validation steps to ensure that the generated questions are answerable, multimodal, and multihop.\n\n\n## FM\u003csup\u003e2\u003c/sup\u003eDS\n![image](https://github.com/user-attachments/assets/4c3c9afb-f4f6-43c3-8cf3-8a12fb2a1776)\n\nThe Five-Stage Pipeline for FM\u003csup\u003e2\u003c/sup\u003eDS. First, we retrieve relevant documents from the Wikipedia dataset to create a pool of related documents based on hyperlinks and topics (Stage 1).\nIn Stage 2, we select the few-shot samples from MultiModalQA (MMQA in the figure). Stage 3 focuses on generating and validating questions to make sure they are answerable, multihop, and multimodal.\nIn Stage 4, answers are generated and validated. Finally, in Stage 5 we generate queries related to the documents, which are also validated to ensure relevance and accuracy.\n\n## M\u003csup\u003e2\u003c/sup\u003eQA-Bench\nWe also propose a benchmark, M\u003csup\u003e2\u003c/sup\u003eQA, to assess LVLM performance on a more complicated MMQA task with full documents. M\u003csup\u003e2\u003c/sup\u003eQA consists of 500 Q\u0026A pairs,\neach designed to challenge the model's ability to perform a complex reasoning task. The questions are not templated into a specific structure (as in some existing works like MultiModalQA);\ninstead, they are diverse and challenging. Additionally, answering the questions requires access to full documents, where both information extraction and reasoning across different modalities (e.g., images and tables) are essential.\n\n![image](https://github.com/user-attachments/assets/7a666536-3c5e-4f8d-b391-cee77a771476)\n\nMultimodal multihop reasoning example from M\u003csup\u003e2\u003c/sup\u003eQA-Bench where the model compares the release dates of two albums, \"Music from Big Pink\" and \"Imagine,\"\nusing textual and visual cues. 
The documents are connected through their shared topic, \"music,\" and the answer is determined as the title of the earlier-released album.\n\nYou can use this [link](https://github.com/ServiceNow/FM2DS/blob/main/M2QA_Bench.json) to access this benchmark.\n\n## How to Run\n\nThis guide provides step-by-step instructions for running the FM²DS pipeline to synthesize multimodal multihop question answering data.\n\n### Overview\n\n**Important Note**: This project is designed specifically for **data synthesis**. The generated dataset can be used to train various multimodal models, but the actual model training is not included in this repository. For model training, please refer to each model's specific training approaches and documentation.\n\n### Prerequisites\n\n#### System Requirements\n- Python 3.8+\n- CUDA-compatible GPU (recommended for LVLM inference, especially for local Llama models)\n- Sufficient storage space for datasets (~50GB+)\n- 16GB+ RAM recommended for processing large datasets\n\n#### Installation\n\n1. **Clone the repository:**\n   ```bash\n   git clone https://github.com/ServiceNow/FM2DS.git\n   cd FM2DS\n   ```\n\n2. **Create and activate a virtual environment (recommended):**\n   ```bash\n   python -m venv fm2ds_env\n   source fm2ds_env/bin/activate  # On Windows: fm2ds_env\\Scripts\\activate\n   ```\n\n3. **Install dependencies:**\n   ```bash\n   pip install -r requirements.txt\n   ```\n\n4. 
**Download required language model:**\n   ```bash\n   python -m spacy download en_core_web_sm\n   ```\n\n#### Dependencies\n\n##### Quick Installation\nInstall all required dependencies using the provided requirements file:\n\n```bash\npip install -r requirements.txt\n```\n\n##### Manual Installation\nAlternatively, install the core dependencies manually:\n\n```bash\n# Core ML and data processing libraries\n# (version specifiers are quoted so the shell does not treat \u003e as output redirection)\npip install \"datasets\u003e=2.14.0\" \"transformers\u003e=4.30.0\" \"torch\u003e=2.0.0\" \"tensorflow\u003e=2.10.0\"\npip install \"numpy\u003e=1.21.0\" \"scikit-learn\u003e=1.0.0\" \"beautifulsoup4\u003e=4.9.0\" \"requests\u003e=2.25.0\"\n\n# Natural Language Processing\npip install \"spacy\u003e=3.4.0\"\n\n# Download spaCy language model\npython -m spacy download en_core_web_sm\n```\n\n##### Model API Dependencies\nFor specific model APIs, ensure you have the appropriate packages:\n- **OpenAI GPT**: `pip install \"openai\u003e=1.0.0\"`\n- **Anthropic Claude**: `pip install \"anthropic\u003e=0.7.0\"`\n- **Local Llama**: `pip install \"vllm\u003e=0.2.0\"` (requires CUDA-compatible GPU)\n\n##### Optional Dependencies\nFor development and enhanced functionality:\n```bash\npip install \"jupyter\u003e=1.0.0\" \"matplotlib\u003e=3.5.0\" \"pillow\u003e=8.0.0\"\n```\n\n#### Troubleshooting Installation\n\n**Common Issues:**\n\n1. **PyTorch CUDA compatibility**: If you have a CUDA-compatible GPU, install PyTorch with CUDA support:\n   ```bash\n   pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118\n   ```\n\n2. **TensorFlow GPU support**: For GPU acceleration with TensorFlow:\n   ```bash\n   pip install tensorflow[and-cuda]\n   ```\n\n3. **vLLM installation issues**: vLLM requires specific CUDA versions. Check the [vLLM installation guide](https://docs.vllm.ai/en/latest/getting_started/installation.html) for your system.\n\n4. 
**Memory issues**: If you encounter out-of-memory errors during dataset processing, consider:\n   - Reducing batch sizes in the configuration\n   - Using a machine with more RAM\n   - Processing smaller subsets of the data initially\n\n### Setup and Data Preparation\n\n#### Step 0: Install Dependencies and Download Models\n\nInstall all dependencies and download the required spaCy language model:\n\n```bash\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n\n#### Step 1: Download Required Datasets\n\n##### 1.1 Download WikiWeb2M Dataset\n```bash\ncd data/\nbash download_wikiweb2m.sh\n```\n\n##### 1.2 Download MultiModalQA Training Data\n```bash\ncd create_few_shot_samples/\nbash download_mmqa_train.sh\n```\n\n#### Step 2: Parse and Prepare Base Dataset\n\n```bash\n# Parse WikiWeb2M dataset and save as HuggingFace format\npython data/parse_and_save_dataset.py\n```\n\n#### Step 3: Create Few-Shot Examples\n\n```bash\n# Create few-shot examples from MultiModalQA\npython create_few_shot_samples/create_few_shot_from_multimodalqa.py\n```\n\n#### Step 4: Create Document Pool\n\n```bash\n# Create pools of related documents for multihop reasoning\npython data/create_document_pool.py\n```\n\n### Running Data Synthesis\n\n#### Configure Model Settings\n\nChoose one of the following language models:\n\n##### Option 1: OpenAI GPT (Recommended)\nSet your OpenAI API key:\n```bash\nexport OPENAI_API_KEY=\"your-api-key-here\"\n```\n\n##### Option 2: Anthropic Claude\nSet your Anthropic API key:\n```bash\nexport ANTHROPIC_API_KEY=\"your-api-key-here\"\n```\n\n##### Option 3: Local Llama Model\nStart the Llama server:\n```bash\n# For Llama 3.1\nbash lvlm/llama/host_llama_3_1.sh\n\n# For Llama 3.2\nbash lvlm/llama/host_llama_3_2.sh\n```\n\n#### Generate Synthetic Dataset\n\nRun the main data synthesis pipeline:\n\n```bash\npython src/create_dataset.py \\\n    --model gpt \\\n    --num-few-shot 1 \\\n    --num-examples 5000 \\\n    --output-dataset 
FM2DS/data/generated_data/synth\n```\n\n**Parameters:**\n- `--model`: Choose from `gpt`, `claude`, or `llama`\n- `--num-few-shot`: Number of few-shot examples (default: 1)\n- `--num-examples`: Total number of examples to generate (default: 5000)\n- `--output-dataset`: Output directory for generated dataset\n\n### Generated Data Format\n\nThe synthesized dataset contains the following structure:\n\n```json\n{\n    \"question\": \"Which country is ranked lower in EuroCup Basketball Performance...\",\n    \"answer\": \"France\",\n    \"documents\": [\n        {\n            \"title\": \"Document Title\",\n            \"content\": [\n                {\"type\": \"text\", \"value\": \"Text content here...\"},\n                {\"type\": \"image\", \"value\": \"http://example.com/image.jpg\"}\n            ]\n        }\n    ],\n    \"query\": [\"step-by-step reasoning process\", \"explanation of answer derivation\"]\n}\n```\n\n### Using the Data for Model Training\n\n#### Important Training Considerations\n\n**⚠️ Critical for Model Training**: When training multimodal models with this data, include **both the question-answer pairs AND the generated queries**. 
The queries contain step-by-step reasoning that is essential for teaching models multihop reasoning capabilities.\n\n#### Data Conversion Scripts\n\nBelow are example Python scripts to convert the FM²DS data format for specific model training:\n\n##### Example: Converting for InternVL2 Training\n\n```python\n# convert_for_internvl2.py\nimport json\nfrom datasets import load_from_disk\n\ndef convert_fm2ds_to_internvl2(input_dataset_path, output_file):\n    \"\"\"\n    Convert FM2DS dataset to InternVL2 training format\n    \"\"\"\n    dataset = load_from_disk(input_dataset_path)\n    converted_data = []\n    \n    for example in dataset:\n        # Extract images from documents\n        images = []\n        text_content = \"\"\n        \n        for doc in example['documents']:\n            for content in doc['content']:\n                if content['type'] == 'image':\n                    images.append(content['value'])\n                elif content['type'] == 'text':\n                    text_content += content['value'] + \" \"\n        \n        # Create InternVL2 format with question, answer, and reasoning\n        reasoning_steps = \" \".join(example['query']) if isinstance(example['query'], list) else example['query']\n        \n        internvl_example = {\n            \"id\": f\"fm2ds_{len(converted_data)}\",\n            \"image\": images[0] if images else None,  # InternVL2 typically uses single image\n            \"conversations\": [\n                {\n                    \"from\": \"human\",\n                    \"value\": f\"Context: {text_content.strip()}\\n\\nQuestion: {example['question']}\\n\\nPlease provide step-by-step reasoning and then the final answer.\"\n                },\n                {\n                    \"from\": \"gpt\", \n                    \"value\": f\"Reasoning: {reasoning_steps}\\n\\nAnswer: {example['answer']}\"\n                }\n            ]\n        }\n        converted_data.append(internvl_example)\n    \n    # Save in 
JSONL format\n    with open(output_file, 'w') as f:\n        for item in converted_data:\n            f.write(json.dumps(item) + '\\n')\n    \n    print(f\"Converted {len(converted_data)} examples to {output_file}\")\n\n# Usage\nconvert_fm2ds_to_internvl2(\"FM2DS/data/generated_data/synth\", \"internvl2_training_data.jsonl\")\n```\n\n##### Example: Converting for Generic VLM Training\n\n```python\n# convert_for_generic_vlm.py\nimport json\nfrom datasets import load_from_disk\n\ndef convert_fm2ds_to_generic_vlm(input_dataset_path, output_file):\n    \"\"\"\n    Convert FM2DS dataset to generic VLM training format\n    \"\"\"\n    dataset = load_from_disk(input_dataset_path)\n    converted_data = []\n    \n    for example in dataset:\n        # Prepare multimodal input\n        multimodal_input = {\n            \"text_documents\": [],\n            \"images\": [],\n            \"question\": example['question'],\n            \"reasoning_steps\": example['query'],\n            \"answer\": example['answer']\n        }\n        \n        for doc in example['documents']:\n            text_parts = []\n            for content in doc['content']:\n                if content['type'] == 'text':\n                    text_parts.append(content['value'])\n                elif content['type'] == 'image':\n                    multimodal_input['images'].append({\n                        \"url\": content['value'],\n                        \"caption\": \"\"  # Add caption if available\n                    })\n            \n            if text_parts:\n                multimodal_input['text_documents'].append({\n                    \"title\": doc['title'],\n                    \"content\": \" \".join(text_parts)\n                })\n        \n        converted_data.append(multimodal_input)\n    \n    with open(output_file, 'w') as f:\n        json.dump(converted_data, f, indent=2)\n    \n    print(f\"Converted {len(converted_data)} examples to {output_file}\")\n\n# 
Usage\nconvert_fm2ds_to_generic_vlm(\"FM2DS/data/generated_data/synth\", \"generic_vlm_training_data.json\")\n```\n\n### Training Recommendations\n\n1. **Include Reasoning Steps**: Always incorporate the generated queries/reasoning steps in your training data\n2. **Multimodal Alignment**: Ensure your model can process both text and images from the documents\n3. **Multihop Training**: Structure training to encourage step-by-step reasoning across multiple documents\n4. **Validation**: Use the provided M²QA-Bench (`M2QA_Bench.json`) for evaluation\n\n### Evaluation\n\nUse the M²QA-Bench for evaluating trained models:\n\n```python\nimport json\n\n# Load benchmark\nwith open('M2QA_Bench.json', 'r') as f:\n    benchmark = json.load(f)\n\n# Each item contains:\n# - question: The question to answer\n# - answer: Ground truth answer  \n# - modalities: Required modalities (text, image, table)\n# - pages: Source Wikipedia pages\n```\n\n#### Performance Tips\n\n- Use `--num-few-shot 3` for better generation quality\n- Start with smaller `--num-examples` for testing\n- Monitor validation success rates in the generation logs\n\n## Citation\n\n```\n@inproceedings{\n  abaskohi2025fmds,\n  title={{FM}2{DS}: Few-Shot Multimodal Multihop Data Synthesis with Knowledge Distillation for Question Answering},\n  author={Amirhossein Abaskohi and Spandana Gella and Giuseppe Carenini and Issam H. Laradji},\n  booktitle={The 2025 Conference on Empirical Methods in Natural Language Processing},\n  year={2025},\n  url={https://openreview.net/forum?id=esIjdsJQtC}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fservicenow%2Ffm2ds","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fservicenow%2Ffm2ds","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fservicenow%2Ffm2ds/lists"}