{"id":28722327,"url":"https://github.com/docling-project/docling-sdg","last_synced_at":"2025-09-15T09:27:54.516Z","repository":{"id":284204010,"uuid":"946031952","full_name":"docling-project/docling-sdg","owner":"docling-project","description":"A set of tools to create synthetically-generated data from documents","archived":false,"fork":false,"pushed_at":"2025-05-08T08:32:58.000Z","size":3634,"stargazers_count":15,"open_issues_count":5,"forks_count":5,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-09T09:35:32.686Z","etag":null,"topics":["ai","documents","llm-as-a-judge","question-answering","sdg"],"latest_commit_sha":null,"homepage":"https://docling-project.github.io/docling/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/docling-project.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-10T14:00:59.000Z","updated_at":"2025-06-03T02:11:01.000Z","dependencies_parsed_at":"2025-03-24T18:25:58.460Z","dependency_job_id":"79b39a6b-ad13-4f6c-972a-98ca9a57c2c9","html_url":"https://github.com/docling-project/docling-sdg","commit_stats":null,"previous_names":["docling-project/docling-sdg"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/docling-project/docling-sdg","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-sdg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-sdg/tags","releases_url":"h
ttps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-sdg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-sdg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/docling-project","download_url":"https://codeload.github.com/docling-project/docling-sdg/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/docling-project%2Fdocling-sdg/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259942797,"owners_count":22935330,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","documents","llm-as-a-judge","question-answering","sdg"],"created_at":"2025-06-15T08:08:58.214Z","updated_at":"2025-06-15T08:08:59.175Z","avatar_url":"https://github.com/docling-project.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/docling-project/docling-sdg\"\u003e\n    \u003cimg loading=\"lazy\" alt=\"Docling\" src=\"https://github.com/docling-project/docling-sdg/raw/main/docs/assets/docling-sdg-pic.png\" width=\"40%\"/\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n# Docling SDG\n\n[![Platforms](https://img.shields.io/badge/platform-macos%20|%20linux%20|%20windows-blue)](https://github.com/docling-project/docling-parse/)\n[![PyPI version](https://img.shields.io/pypi/v/docling-sdg)](https://pypi.org/project/docling-sdg/)\n[![PyPI - Python 
Version](https://img.shields.io/pypi/pyversions/docling-sdg)](https://pypi.org/project/docling-sdg/)\n[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://docs.pydantic.dev/latest/contributing/#badges)\n[![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit\u0026logoColor=white)](https://github.com/pre-commit/pre-commit)\n[![License MIT](https://img.shields.io/github/license/docling-project/docling-parse)](https://opensource.org/licenses/MIT)\n[![PyPI Downloads](https://static.pepy.tech/badge/docling-sdg/month)](https://pepy.tech/projects/docling-sdg)\n[![LF AI \u0026 Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation\u0026logoColor=fff\u0026color=0094ff\u0026labelColor=003778)](https://lfaidata.foundation/projects/)\n\nDocling for Synthetic Data Generation (SDG) provides a set of tools to create artificial data from documents, leveraging generative AI and Docling's parsing capabilities.\n\n## Features\n\n* 🧬 Generation of question-answering pairs from passages of [multiple document formats][supported_formats] including \nPDF, HTML, or DOCX, leveraging Docling's parsing capabilities\n* ⚖️ LLM as a judge for high quality question-answering pairs\n* 💻 Simple and convenient CLI\n\n### Coming soon\n\n* 📝 Integrations with Llama Stack and vLLM\n* 📝 SDG on tabular data\n* 📝 Documentation\n\n## Installation\n\nTo use Docling SDG, simply install `docling-sdg` from your package manager, e.g., pip:\n\n```bash\npip install docling-sdg\n```\n\nAlternatively, you can clone this repository and use 
[uv](https://docs.astral.sh/uv) for\ncreating a virtual environment, installing the packages, and running the project commands.\n\n```bash\ngit clone git@github.com:docling-project/docling-sdg.git\ncd docling-sdg\nuv sync\n```\n\n## Getting started\n\nYou can create synthetically-generated questions and answers from relevant parts of one or several documents.\nThese question-answer pairs may be used in AI applications, such as evaluating a RAG application or generating\nground truth to train a language model.\n\n\n### Sample\n\nGenerating and judging data with LLMs may be computationally intense. Since document collections may be large,\nyou may want to chunk the documents into passages, filter them based on length and content criteria, and sample\na subset of them to have a manageable dataset.\n\n```python\nfrom docling_sdg.qa.sample import PassageSampler\n\nsource = \"https://en.wikipedia.org/wiki/Duck\"\npassage_sampler = PassageSampler()\nprint(passage_sampler.sample(source))\n```\n\nBy default, the results will be exported to the file `docling_sdg_sample.jsonl`. Every line represents a document passage.\n\n### Generate\n\nFor each passage created in the previous step, we can prompt an LLM and generate 3 different questions of the following\ntypes: _simple fact_, _summary_, and _reasoning_.\n\nThe `GenerateOptions` class controls which model provider is used for Q\u0026A generation by setting the `provider` attribute, as shown below. Three options are available:\n\n* `LlmProvider.WATSONX` for [watsonx.ai](https://www.ibm.com/products/watsonx-ai); you will need to provide a watsonx.ai instance ID and an API key.\n* `LlmProvider.OPENAI` for OpenAI; you will need to provide an OpenAI API key.\n* `LlmProvider.OPENAI_LIKE` for any model provider with OpenAI-compatible APIs; if no API key is needed (such as when running against `ollama` locally), set `api_key` to any string, e.g. 
`\"fake\"`\n\n```python\nimport os\nfrom docling_sdg.qa.base import GenerateOptions, LlmProvider\nfrom docling_sdg.qa.generate import Generator\nfrom pathlib import Path\n\noptions = GenerateOptions(\n    provider=LlmProvider.WATSONX,\n    project_id=os.environ.get(\"WATSONX_PROJECT_ID\"),\n    api_key=os.environ.get(\"WATSONX_APIKEY\"),\n    url=os.environ.get(\"WATSONX_URL\"),\n)\n\ngenerator = Generator(generate_options=options)\nprint(generator.generate_from_sample(Path(\"docling_sdg_sample.jsonl\")))\n```\n\nBy default, the results will be exported to the file `docling_sdg_generated_qac.jsonl`. Every line represents a generated\nquestion-answer-context item with additional information like the question type.\n\n\n### Critique\n\nSome applications require a minimum level of quality in the generated data. The last step consists of using an LLM to judge\nthe generated data and provide both qualitative and quantitative evaluations of the question-answer-context items. Using\nthose evaluations, we can filter the generated dataset to the required quality levels.\n\n```python\nimport os\nfrom docling_sdg.qa.base import CritiqueOptions, LlmProvider\nfrom docling_sdg.qa.critique import Judge\nfrom pathlib import Path\n\noptions = CritiqueOptions(\n    provider=LlmProvider.WATSONX,\n    project_id=os.environ.get(\"WATSONX_PROJECT_ID\"),\n    api_key=os.environ.get(\"WATSONX_APIKEY\"),\n    url=os.environ.get(\"WATSONX_URL\"),\n)\n\njudge = Judge(critique_options=options)\nprint(judge.critique(Path(\"docling_sdg_generated_qac.jsonl\")))\n```\n\nBy default, the results will be exported to the file `docling_sdg_critiqued_qac.jsonl`. 
The file content is similar to \nthe one created in the [Generate](#generate) step, but it additionally contains the critique evaluation on several dimensions such as\n_question to context groundness_, _question feasibility_ or _context usefulness_.\n\n\n## CLI\n\nDocling SDG has a built-in CLI to run the 3 steps of the question-answering data generation.\n\n```bash\ndocling-sdg qa sample https://en.wikipedia.org/wiki/Duck\ndocling-sdg qa generate docling_sdg_sample.jsonl\ndocling-sdg qa critique docling_sdg_generated_qac.jsonl\n```\n\nFind out more about optional parameters with the help argument. For instance:\n\n```bash\ndocling-sdg qa generate --help\n```\n\n## Get help and support\n\nPlease feel free to connect with us using the [discussion section](https://github.com/docling-project/docling/discussions).\n\n## Technical report\n\nFor more details on Docling SDG's inner workings, check out the paper [Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems](https://aclanthology.org/2025.coling-industry.4.pdf), as well as the [Docling Technical Report](https://arxiv.org/abs/2408.09869).\n\n## Contributing\n\nPlease read [Contributing to Docling SDG](https://github.com/docling-project/docling-sdg/blob/main/CONTRIBUTING.md) for details.\n\n## References\n\nIf you use Docling SDG in your projects, please consider citing the following:\n\n```bib\n@inproceedings{teixeira-de-lima-etal-2025-know,\n    title={Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems}, \n    author={Rafael Teixeira de Lima and Shubham Gupta and Cesar Berrospi and Lokesh Mishra and Michele Dolfi and Peter Staar and Panagiotis Vagenas},\n    year={2025},\n    month={jan},\n    booktitle={Proceedings of the 31st International Conference on Computational Linguistics: Industry Track},\n    publisher={Association for Computational Linguistics},\n    url={https://aclanthology.org/2025.coling-industry.4/}\n}\n```\n\n## License\n\nThe Docling SDG 
codebase is under MIT license.\nFor individual model usage, please refer to the model licenses found in the original packages.\n\n## LF AI \u0026 Data\n\nDocling is hosted as a project in the [LF AI \u0026 Data Foundation](https://lfaidata.foundation/projects/).\n\n### IBM ❤️ Open Source AI\n\nThe project was started by the AI for knowledge team at IBM Research Zurich.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocling-project%2Fdocling-sdg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdocling-project%2Fdocling-sdg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdocling-project%2Fdocling-sdg/lists"}