{"id":28208444,"url":"https://github.com/vincentkoc/tiny_qa_benchmark_pp","last_synced_at":"2025-06-12T08:31:33.148Z","repository":{"id":293678143,"uuid":"984769683","full_name":"vincentkoc/tiny_qa_benchmark_pp","owner":"vincentkoc","description":"Tiny QA Benchmark++ a micro-benchmark suite (52-item gold + on-demand multilingual synthetic packs), generator CLI, and CI-ready eval harness for ultra-fast LLM smoke-testing \u0026 regression-catching.","archived":false,"fork":false,"pushed_at":"2025-05-20T12:28:40.000Z","size":313,"stargazers_count":34,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-05T10:53:50.803Z","etag":null,"topics":["benchmark","dataset","evaluation","huggingface-datasets","litellm","llm","llm-testing","llmops","qa-dataset","smoke-test","synthetic-data","tinybenchmarks"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vincentkoc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":"vincentkoc","buy_me_a_coffee":"vincentkoc"}},"created_at":"2025-05-16T13:30:19.000Z","updated_at":"2025-05-27T09:24:43.000Z","dependencies_parsed_at":"2025-05-17T06:15:48.293Z","dependency_job_id":null,"html_url":"https://github.com/vincentkoc/tiny_qa_benchmark_pp","commit_stats":null,"previous_names":["vincentkoc/tinyqa-benchmark-pp","vincentkoc/tinyqa_benchmark_pp"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/vincentkoc/tiny_qa_benchmar
k_pp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentkoc%2Ftiny_qa_benchmark_pp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentkoc%2Ftiny_qa_benchmark_pp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentkoc%2Ftiny_qa_benchmark_pp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentkoc%2Ftiny_qa_benchmark_pp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vincentkoc","download_url":"https://codeload.github.com/vincentkoc/tiny_qa_benchmark_pp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vincentkoc%2Ftiny_qa_benchmark_pp/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259429759,"owners_count":22856128,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","dataset","evaluation","huggingface-datasets","litellm","llm","llm-testing","llmops","qa-dataset","smoke-test","synthetic-data","tinybenchmarks"],"created_at":"2025-05-17T14:12:49.440Z","updated_at":"2025-06-12T08:31:33.143Z","avatar_url":"https://github.com/vincentkoc.png","language":"Python","funding_links":["https://github.com/sponsors/vincentkoc","https://buymeacoffee.com/vincentkoc"],"categories":[],"sub_categories":[],"readme":"\u003c!-- SPDX-License-Identifier: Apache-2.0 OR CC BY 4.0 OR other --\u003e\n\u003cdiv align=\"center\"\u003e\u003cb\u003e\u003ca href=\"README.md\"\u003eEnglish\u003c/a\u003e | \u003ca 
href=\"README_zh.md\"\u003e简体中文\u003c/a\u003e | \u003ca href=\"README_ja.md\"\u003e日本語\u003c/a\u003e | \u003ca href=\"README_es.md\"\u003eEspañol\u003c/a\u003e | \u003ca href=\"README_fr.md\"\u003eFrançais\u003c/a\u003e\u003c/b\u003e\u003c/div\u003e\n\n\u003ch1 align=\"center\" style=\"border: none\"\u003e\n    \u003cdiv style=\"border: none\"\u003e\n        \u003c!-- If you have a logo, you can add it here. Example:\n        \u003ca href=\"YOUR_PROJECT_LINK\"\u003e\u003cpicture\u003e\n            \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"PATH_TO_DARK_LOGO.svg\"\u003e\n            \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"PATH_TO_LIGHT_LOGO.svg\"\u003e\n            \u003cimg alt=\"Project Logo\" src=\"PATH_TO_LIGHT_LOGO.svg\" width=\"200\" /\u003e\n        \u003c/picture\u003e\u003c/a\u003e\n        \u003cbr\u003e\n        --\u003e\n        Tiny QA Benchmark++ (TQB++)\n    \u003c/div\u003e\n\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\nAn ultra-lightweight evaluation dataset and synthetic generator \u003cbr\u003eto expose critical LLM failures in seconds, ideal for CI/CD and LLMOps.\n\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n    \u003ca href=\"https://pypi.org/project/tinyqabenchmarkpp/\"\u003e\u003cimg alt=\"PyPI version\" src=\"https://img.shields.io/pypi/v/tinyqabenchmarkpp\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://github.com/vincentkoc/tiny_qa_benchmark_pp/blob/main/LICENSE\"\u003e\u003cimg alt=\"License\" src=\"https://img.shields.io/badge/Apache-2.0-green\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp\"\u003e\u003cimg alt=\"Hugging Face Dataset\" src=\"https://img.shields.io/badge/🤗%20Dataset-Tiny%20QA%20Benchmark%2B%2B-blue\"\u003e\u003c/a\u003e\n    \u003ca href=\"https://arxiv.org/abs/2505.12058\"\u003e\u003cimg alt=\"arXiv\" src=\"https://img.shields.io/badge/arXiv-2505.12058-b31b1b.svg\"\u003e\u003c/a\u003e\n    \u003c!-- Consider 
adding a GitHub Actions workflow badge if you have CI configured --\u003e\n    \u003c!-- e.g., \u003ca href=\"YOUR_WORKFLOW_LINK\"\u003e\u003cimg alt=\"Build Status\" src=\"YOUR_WORKFLOW_BADGE_SVG_LINK\"\u003e\u003c/a\u003e --\u003e\n\u003c/div\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/vincentkoc/tiny_qa_benchmark_pp\"\u003e\u003cb\u003eGitHub\u003c/b\u003e\u003c/a\u003e •\n    \u003ca href=\"https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp\"\u003e\u003cb\u003eHugging Face Dataset\u003c/b\u003e\u003c/a\u003e •\n    \u003ca href=\"https://arxiv.org/abs/2505.12058\"\u003e\u003cb\u003ePaper\u003c/b\u003e\u003c/a\u003e •\n    \u003ca href=\"https://pypi.org/project/tinyqabenchmarkpp/\"\u003e\u003cb\u003ePyPI\u003c/b\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003chr\u003e\n\u003c!-- Optional: If you have a project thumbnail image, you can add it here --\u003e\n\u003c!-- \u003cp align=\"center\"\u003e\u003cimg alt=\"TQB++ Thumbnail\" src=\"path/to/your/thumbnail.png\" width=\"700\"\u003e\u003c/p\u003e --\u003e\n\n**Tiny QA Benchmark++ (TQB++)** is an ultra-lightweight evaluation suite and python package designed to expose critical failures in Large Language Model (LLM) systems within seconds. 
It acts as a set of unit tests for LLM software: ideal for rapid CI/CD checks, prompt engineering, and continuous quality assurance in modern LLMOps, and intended to be used alongside existing LLM evaluation tooling such as [Opik](https://github.com/comet-ml/opik/).\n\nThis repository contains the official implementation and synthetic datasets for the paper: *Tiny QA Benchmark++: Micro Gold Dataset with Synthetic Multilingual Generation for Rapid LLMOps Smoke Tests*.\n\n- **Paper:** [arXiv:2505.12058](https://arxiv.org/abs/2505.12058)\n- **Hugging Face Hub:** [datasets/vincentkoc/tiny_qa_benchmark_pp](https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp)\n- **GitHub Repository:** [vincentkoc/tiny_qa_benchmark_pp](https://github.com/vincentkoc/tiny_qa_benchmark_pp)\n\n## Key Features\n\n*   **Immutable Gold Standard Core:** A 52-item hand-crafted English Question-Answering (QA) dataset (`core_en`) for deterministic regression testing, taken from the earlier [datasets/vincentkoc/tiny_qa_benchmark](https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark).\n*   **Synthetic Customisation Toolkit:** A Python script (`tools/generator`) using LiteLLM to generate bespoke micro-benchmarks on demand for any language, topic, or difficulty.\n*   **Standardised Metadata:** Artefacts packaged in Croissant JSON-LD format (`metadata/`) for discoverability and auto-loading by tools and search engines.\n*   **Open Science:** All code (generator, evaluation scripts) and the core English dataset are released under the Apache-2.0 license. 
Synthetically generated data packs have a custom evaluation-only license.\n*   **LLMOps Alignment:** Designed for easy integration into CI/CD pipelines, prompt-engineering workflows, cross-lingual drift detection, and observability dashboards.\n*   **Multilingual Packs:** Pre-built packs for languages including English, French, Spanish, Portuguese, German, Chinese, Japanese, Turkish, Arabic, and Russian.\n\n## Using the `tinyqabenchmarkpp` Python Package\n\nThe core synthetic generation capabilities of TQB++ are available as a Python package, `tinyqabenchmarkpp`, which can be installed from PyPI.\n\n### Installation\n\n```bash\npip install tinyqabenchmarkpp\n```\n\n(Note: Ensure you have Python 3.8+ and pip installed. If this command doesn't work, please check the official [PyPI project page](https://pypi.org/project/tinyqabenchmarkpp/) for the correct package name.)\n\n### Generating Synthetic Datasets via CLI\n\nOnce installed, you can use the `tinyqabenchmarkpp` command (or `python -m tinyqabenchmarkpp.generate`) to create custom QA datasets. \n\n**Example:**\n```bash\ntinyqabenchmarkpp --num 10 --languages \"en,es\" --categories \"science\" --output-file \"./science_pack.jsonl\"\n```\n\nThis will generate a small pack of 10 English and Spanish science questions.\n\nFor detailed instructions on all available parameters (like `--model`, `--context`, `--difficulty`, etc.), advanced usage, and examples for different LLM providers (OpenAI, OpenRouter, Ollama), please refer to the **[Generator Toolkit README](tools/generator/README.md)** or run `tinyqabenchmarkpp --help`.\n\nWhile the `tinyqabenchmarkpp` package focuses on dataset *generation*, the TQB++ project also provides pre-generated datasets and evaluation tools, as described below.\n\n## Loading Datasets with Hugging Face `datasets`\n\nThe TQB++ datasets are available on the Hugging Face Hub and can be easily loaded using the `datasets` library. 
This is the recommended way to access the data.\n\n```python\nfrom datasets import load_dataset, get_dataset_config_names\n\n# Discover available dataset configurations (e.g., core_en, pack_fr_40, etc.)\nconfigs = get_dataset_config_names(\"vincentkoc/tiny_qa_benchmark_pp\")\nprint(f\"Available configurations: {configs}\")\n\n# Load the core English dataset (assuming 'core_en' is a configuration)\nif \"core_en\" in configs:\n    core_dataset = load_dataset(\"vincentkoc/tiny_qa_benchmark_pp\", name=\"core_en\", split=\"train\")\n    print(f\"\\nLoaded {len(core_dataset)} examples from core_en:\")\n    # print(core_dataset[0]) # Print the first example\nelse:\n    print(\"\\n'core_en' configuration not found.\")\n\n# Load a specific synthetic pack (e.g., a French pack)\n# Replace 'pack_fr_40' with an actual configuration name from the `configs` list\nexample_pack_name = \"pack_fr_40\" # or another valid config name\nif example_pack_name in configs:\n    synthetic_pack = load_dataset(\"vincentkoc/tiny_qa_benchmark_pp\", name=example_pack_name, split=\"train\")\n    print(f\"\\nLoaded {len(synthetic_pack)} examples from {example_pack_name}:\")\n    # print(synthetic_pack[0]) # Print the first example\nelse:\n    print(f\"\\n'{example_pack_name}' configuration not found. 
Please choose from available configurations.\")\n\n```\n\nFor more detailed information on the datasets, including their structure and specific licenses, please see the README files within the `data/` directory (i.e., `data/README.md`, `data/core_en/README.md`, and `data/packs/README.md`).\n\n## Repository Structure\n\n*   `data/`: Contains the QA datasets.\n    *   `core_en/`: The original 52-item human-curated English core dataset.\n    *   `packs/`: Synthetically generated multilingual and topical dataset packs.\n*   `tools/`: Contains scripts for dataset generation and evaluation.\n    *   `generator/`: The synthetic QA dataset generator.\n    *   `eval/`: Scripts and utilities for evaluating models against TQB++ datasets.\n*   `paper/`: The LaTeX source and associated files for the research paper.\n*   `metadata/`: Croissant JSON-LD metadata files for the datasets.\n*   `LICENSE`: Main license for the codebase (Apache-2.0).\n*   `LICENCE.data_packs.md`: Custom license for synthetically generated data packs.\n*   `LICENCE.paper.md`: License for the paper content.\n\n## Usage Scenarios\n\nTQB++ is designed for various LLMOps and evaluation workflows:\n\n*   **CI/CD Pipeline Testing:** Use as a unit test with LLM testing tooling for LLM services to catch regressions.\n*   **Prompt Engineering \u0026 Agent Development:** Get rapid feedback when iterating on prompts or agent designs.\n*   **Evaluation Harness Integration:** Designed for seamless use with evaluation harnesses. Encode as an OpenAI Evals YAML (see `intergrations/openai-evals/README.md`) or an Opik dataset for dashboard tracking and robust LLM assessment. 
The `intergrations/` folder provides further details on available out-of-the-box support.\n*   **Cross-Lingual Drift Detection:** Monitor localization regressions using multilingual TQB++ packs.\n*   **Adaptive Testing:** Synthesize new micro-benchmarks on-the-fly tailored to specific features or data drifts.\n*   **Monitoring Fine-tuning Dynamics:** Track knowledge erosion or unintended capability changes during fine-tuning.\n\n## Citation\n\nIf you use TQB++ in your research or work, please cite the original TQB and the TQB++ paper:\n\n```bibtex\n% This synthetic dataset and generator\n@misc{koctinyqabenchmarkpp,\n  author       = {Vincent Koc},\n  title        = {Tiny QA Benchmark++ (TQB++) Datasets and Toolkit},\n  year         = {2025},\n  publisher    = {Hugging Face \u0026 GitHub},\n  doi          = {10.57967/hf/5531},\n  howpublished = {\\url{https://huggingface.co/datasets/vincentkoc/tiny_qa_benchmark_pp}},\n  note         = {See also: \\url{https://github.com/vincentkoc/tiny_qa_benchmark_pp}}\n}\n\n% This is the paper\n@misc{koc2025tinyqabenchmarkultralightweight,\n  title={Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation \u0026 Smoke-Tests for Continuous LLM Evaluation}, \n  author={Vincent Koc},\n  year={2025},\n  eprint={2505.12058},\n  archivePrefix={arXiv},\n  primaryClass={cs.AI},\n  url={https://arxiv.org/abs/2505.12058}\n}\n```\n\n## License\nThe code in this repository (including the generator and evaluation scripts) and the `data/core_en` dataset and anything else not mentioned with a licence are licensed under the Apache License 2.0. See the `LICENSE` file for details.\n\nThe synthetically generated dataset packs in `data/packs/` are distributed under a custom \"Eval-Only, Non-Commercial, No-Derivatives\" license. 
See `LICENCE.data_packs.md` for details.\n\nThe Croissant JSON-LD metadata files in `metadata/` are available under CC0-1.0.\n\nThe paper content in `paper/` is subject to its own license terms, detailed in `LICENCE.paper.md`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvincentkoc%2Ftiny_qa_benchmark_pp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvincentkoc%2Ftiny_qa_benchmark_pp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvincentkoc%2Ftiny_qa_benchmark_pp/lists"}