{"id":27629942,"url":"https://github.com/ofa-sys/instag","last_synced_at":"2025-08-03T19:38:17.012Z","repository":{"id":188386699,"uuid":"678271577","full_name":"OFA-Sys/InsTag","owner":"OFA-Sys","description":"InsTag: A Tool for Data Analysis in LLM Supervised Fine-tuning","archived":false,"fork":false,"pushed_at":"2023-08-20T14:57:08.000Z","size":2065,"stargazers_count":252,"open_issues_count":8,"forks_count":7,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-23T16:16:27.450Z","etag":null,"topics":["alignment","large-language-models","llama","llama2","natural-language-processing","nlp","tagging"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OFA-Sys.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-14T06:56:16.000Z","updated_at":"2025-04-21T02:07:11.000Z","dependencies_parsed_at":"2024-08-03T09:01:13.795Z","dependency_job_id":"537e9a55-a2cd-4eaf-8ad5-3a89d64085d0","html_url":"https://github.com/OFA-Sys/InsTag","commit_stats":{"total_commits":16,"total_committers":2,"mean_commits":8.0,"dds":0.0625,"last_synced_commit":"aac2f27e153f4e8b1eca3603a7a52676f5d6d89c"},"previous_names":["ofa-sys/instag"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FInsTag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FInsTag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FInsTag/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OFA-Sys%2FInsTag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OFA-Sys","download_url":"https://codeload.github.com/OFA-Sys/InsTag/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250468277,"owners_count":21435453,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","large-language-models","llama","llama2","natural-language-processing","nlp","tagging"],"created_at":"2025-04-23T16:16:32.676Z","updated_at":"2025-04-23T16:16:33.248Z","avatar_url":"https://github.com/OFA-Sys.png","language":null,"readme":"# **InsTag**: A Tool for Data Analysis in LLM Supervised Fine-tuning\n\n\n\nWe introduce **InsTag**, a tool for analyzing supervised fine-tuning (SFT) data used to align LLMs with human preference. For local tagging deployment, we release **InsTagger**, fine-tuned on **InsTag** results, to tag the queries in SFT data.\nThrough the lens of tags, we sample a 6K subset of open-source SFT data to fine-tune LLaMA and LLaMA-2, and the fine-tuned models **TagLM-13B-v1.0** and **TagLM-13B-v2.0** outperform many open-source LLMs on MT-Bench. 
\n\n\u003cp align=\"center\"\u003e\n🤗 \u003ca href=\"https://huggingface.co/OFA-Sys/InsTagger\" target=\"_blank\"\u003eInsTagger Checkpoint\u003c/a\u003e • 👉 \u003ca href=\"https://www.modelscope.cn/studios/lukeminglkm/instagger_demo/summary\" target=\"_blank\"\u003eOnline LocalTagger Demo\u003c/a\u003e • 📖 \u003ca href=\"https://arxiv.org/pdf/2308.07074.pdf\" target=\"_blank\"\u003ePaper\u003c/a\u003e  \u003cbr\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n🤖️ \u003ca href=\"https://huggingface.co/OFA-Sys/TagLM-13b-v1.0\" target=\"_blank\"\u003eTagLM-13B-v1.0 Checkpoint\u003c/a\u003e 🤖️ \u003ca href=\"https://huggingface.co/OFA-Sys/TagLM-13b-v2.0\" target=\"_blank\"\u003eTagLM-13B-v2.0 Checkpoint\u003c/a\u003e\u003cbr\u003e\n\u003c/p\u003e\n\n\n**What is *InsTag*?**\n\nFoundation language models acquire instruction-following ability through supervised fine-tuning (SFT).\nDiversity and complexity are considered critical factors of a successful SFT dataset, yet their definitions remain obscure and lack quantitative analysis.\nIn this work, we propose *InsTag*, an open-set fine-grained tagger, to tag samples within SFT datasets based on semantics and intentions, and we define instruction diversity and complexity in terms of tags.\nWe obtain 6.6K tags that describe comprehensive user queries.\nWe analyze popular open-source SFT datasets and find that model ability grows with more diverse and complex data.\nBased on this observation, we propose a data selector based on *InsTag* to select 6K diverse and complex samples from open-source datasets and fine-tune models on *InsTag*-selected data.\nThese models outperform open-source models trained on considerably larger SFT datasets, as evaluated by MT-Bench, echoing the importance of query diversity and complexity.\n\n\u003cp align=\"center\" width=\"100%\"\u003e\n\u003ca \u003e\u003cimg src=\"assets/main_figure.png\" alt=\"InsTag\" style=\"width: 80%; min-width: 300px; display: block; margin: 
auto;\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\n## News\n\n- [08/2023] 🔥 We have an online demo of InsTagger hosted by ModelScope. Please refer to the link at the top. Thanks, ModelScope!\n\n- [08/2023] 🔥 We released the aligned LLMs **TagLM-13B-v1.0** and **TagLM-13B-v2.0**, based on LLaMA and LLaMA-2 respectively. Both are fine-tuned on SFT data sub-sampled according to ***InsTag***. Download [v1.0](https://huggingface.co/OFA-Sys/TagLM-13b-v1.0) and [v2.0](https://huggingface.co/OFA-Sys/TagLM-13b-v2.0). \n\n- [08/2023] 🔥 We released **InsTagger**, an LLM fine-tuned on our tagging results for local tagging deployment. Download the [weights](https://huggingface.co/OFA-Sys/InsTagger). \n\n- [08/2023] 🔥 We introduced ***InsTag***, our SFT data analysis tool. Check out the [paper](https://arxiv.org/pdf/2308.07074.pdf). \n\n## Contents\n\n- [Model Checkpoints](#model-checkpoints)\n- [Citation](#citation)\n\n## InsTagger\n\nInsTagger is a LLaMA-2-based SFT model trained with FastChat using the Vicuna template. You can download the weights from the [HuggingFace Model Hub](https://huggingface.co/OFA-Sys/InsTagger) and then use [FastChat](https://github.com/lm-sys/FastChat) for serving and inference. Demo code will be released soon.\n\n## Model Checkpoints\n\n- **InsTagger** for local query tagging:\n\n    **InsTagger** is a tagging LLM fine-tuned on **InsTag**'s tagging results on open-source SFT data. 
The model is based on the 7B version of LLaMA-2.\n\n    Download the model checkpoint below:\n\n    | Model | Checkpoint | Exact Match F1 | Semantic-based Fuzzy Match F1 | License |\n    | ----- | ------ | ------- | ------- | ----- |\n    | LocalTagger | 🤗 \u003ca href=\"https://huggingface.co/OFA-Sys/InsTagger\" target=\"_blank\"\u003eHF Link\u003c/a\u003e | **31.8%** | **73.4%** | \u003ca href=\"https://ai.meta.com/resources/models-and-libraries/llama-downloads/\" target=\"_blank\"\u003eLLaMA 2 License\u003c/a\u003e |\n\n- **TagLM**, fine-tuned on our SFT data sub-sampled by a complexity-first diverse sampling procedure:\n\n    With only 6K samples from current open-source SFT datasets, **TagLM** outperforms many open-source LLMs on MT-Bench using GPT-4 as a judge. \n\n    Download the model checkpoints below:\n\n    | Model | Checkpoint | MT-Bench | License |\n    | ----- | ------ | ------- | ----- |\n    | TagLM-13B-v1.0 | 🤗 \u003ca href=\"https://huggingface.co/OFA-Sys/TagLM-13b-v1.0\" target=\"_blank\"\u003eHF Hub Link\u003c/a\u003e | **6.44** | \u003ca href=\"https://ai.meta.com/resources/models-and-libraries/llama-downloads/\" target=\"_blank\"\u003eLLaMA License\u003c/a\u003e |\n    | TagLM-13B-v2.0 | 🤗 \u003ca href=\"https://huggingface.co/OFA-Sys/TagLM-13b-v2.0\" target=\"_blank\"\u003eHF Hub Link\u003c/a\u003e | **6.55** | \u003ca href=\"https://ai.meta.com/resources/models-and-libraries/llama-downloads/\" target=\"_blank\"\u003eLLaMA 2 License\u003c/a\u003e |\n\n    All models are based on either LLaMA or LLaMA-2 and should be used under their licenses accordingly. All models are fine-tuned using the [FastChat](https://github.com/lm-sys/FastChat) codebase, and we apply the Vicuna V1.1 system template. 
\n\n\n## Citation \n\nPlease cite our work if you find the repository helpful.\n\n```\n@misc{lu2023instag,\n      title={#InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large Language Models}, \n      author={Keming Lu and Hongyi Yuan and Zheng Yuan and Runji Lin and Junyang Lin and Chuanqi Tan and Chang Zhou and Jingren Zhou},\n      year={2023},\n      eprint={2308.07074},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofa-sys%2Finstag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fofa-sys%2Finstag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fofa-sys%2Finstag/lists"}