{"id":13789104,"url":"https://github.com/CyberAgentAILab/LCTG-Bench","last_synced_at":"2025-05-12T03:31:23.717Z","repository":{"id":246489241,"uuid":"807397754","full_name":"CyberAgentAILab/LCTG-Bench","owner":"CyberAgentAILab","description":"LCTG Bench: LLM Controlled Text Generation Benchmark","archived":false,"fork":false,"pushed_at":"2024-07-02T02:38:01.000Z","size":185,"stargazers_count":5,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-11-18T02:39:26.008Z","etag":null,"topics":["benchmark","controllability","evaluation","llm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CyberAgentAILab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-29T03:03:46.000Z","updated_at":"2024-10-10T01:57:36.000Z","dependencies_parsed_at":"2024-08-03T21:13:27.636Z","dependency_job_id":null,"html_url":"https://github.com/CyberAgentAILab/LCTG-Bench","commit_stats":null,"previous_names":["cyberagentailab/lctg-bench"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2FLCTG-Bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2FLCTG-Bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2FLCTG-Bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CyberAgentAILab%2FLCTG-Bench/manifests","owner_url":"https://re
pos.ecosyste.ms/api/v1/hosts/GitHub/owners/CyberAgentAILab","download_url":"https://codeload.github.com/CyberAgentAILab/LCTG-Bench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253668011,"owners_count":21944960,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","controllability","evaluation","llm"],"created_at":"2024-08-03T21:00:58.588Z","updated_at":"2025-05-12T03:31:23.290Z","avatar_url":"https://github.com/CyberAgentAILab.png","language":"Python","funding_links":[],"categories":["日本語LLM評価ベンチマーク/データセットまとめ"],"sub_categories":["制約付きの生成能力を測定するベンチマーク/データセット"],"readme":"# LCTG Bench: LLM Controlled Text Generation Benchmark\nLCTG Bench measures the controllability of Japanese LLMs: how well they comply with constraints in instructions, such as character counts and keywords.\n\nLCTG Bench was constructed in collaboration with AI Shift Inc., CyberAgent Inc., and Okazaki Lab at Tokyo Institute of Technology.\n\n\n## Task/Datasets\nLCTG Bench consists of three text generation tasks: **Summarization**, **Advertisement (AD) Text Generation**, and **Pros \u0026 Cons Generation**. 
For each task, LLM performance is evaluated on four controllability aspects: `Format`, `Character Count`, `Keyword`, and `Prohibited word`.\n\nAll datasets consist of test data only.\n\nThe Pros \u0026 Cons Generation task was built after the Japanese paper was written.\n\n| Task                   | Dataset     | Format | Character Count | Keyword | Prohibited word   |\n|------------------------|-------------|--------|-----------------|---------|-------------------|\n| Summarization          | ABEMA Times | 120    | 120             | 120     | 120               |\n| Ad Text Generation     | CAMERA      | 150    | 150             | 150     | 150               |\n| Pros \u0026 Cons Generation | -           | 150    | 150             | 150     | 150               |\n\n## Evaluation Method\nYou can evaluate LLM performance on this benchmark simply by executing ```run_lctg.sh```.\nAll scripts are placed under ```scripts```, and datasets are placed under ```datasets```.\n\n### Procedure\n0. This script requires Python 3.10, so prepare that environment in advance. (We have not verified that it works with Python 3.11 or later.)\n1. To install the required packages, run the following command.\n```angular2html\npip install -r requirements.txt\n```\n\n2. To conduct an evaluation, prepare a ```.env``` file that defines ```OPENAI_API_KEY```, ```VERSION```, and ```MODEL_PATH_X``` (where X = 0, 1, 2, ...). Specify \"1\" for ```VERSION```. For ```MODEL_PATH_X```, specify the path of the model on the Hugging Face Hub or the name of the model provided by the API. ***Always number X sequentially, starting from 0.***\n```angular2html\nOPENAI_API_KEY={your_openai_api_key}\nVERSION=\"1\"\nMODEL_PATH_0=\"gpt-4-1106-preview\"\nMODEL_PATH_1=\"cyberagent/calm2-7b-chat\"\n```\n3. Executing the following command runs LLM inference and then evaluates the inference results. 
After execution, you can check the scores at ```output/{task}/all_scores.csv```. Evaluating each model takes several hours.\n```\nmake run\n```\n\n### Notes\nCurrently, this repository supports the following models.\n```\nMODEL_PATH_0=\"gpt-4-1106-preview\"\nMODEL_PATH_1=\"gpt-3.5-turbo-0125\"\nMODEL_PATH_2=\"gemini-pro\"\nMODEL_PATH_3=\"cyberagent/calm2-7b-chat\"\nMODEL_PATH_4=\"elyza/ELYZA-japanese-Llama-2-7b-fast-instruct\"\nMODEL_PATH_5=\"line-corporation/japanese-large-lm-3.6b-instruction-sft\"\nMODEL_PATH_6=\"matsuo-lab/weblab-10b-instruction-sft\"\nMODEL_PATH_7=\"rinna/youri-7b-chat\"\nMODEL_PATH_8=\"stabilityai/japanese-stablelm-instruct-gamma-7b\"\nMODEL_PATH_9=\"llm-jp/llm-jp-13b-instruct-full-jaster-v1.0\"\nMODEL_PATH_10=\"llm-jp/llm-jp-13b-instruct-full-jaster-dolly-oasst-v1.0\"\nMODEL_PATH_11=\"tokyotech-llm/Swallow-7b-instruct-hf\"\nMODEL_PATH_12=\"tokyotech-llm/Swallow-13b-instruct-hf\"\nMODEL_PATH_13=\"tokyotech-llm/Swallow-70b-instruct-hf\"\n```\nTo evaluate models other than those listed above, you will need to edit the following files in addition to adding environment variables.\nOpenAI models are the exception. 
You only have to edit ```scripts/utils.model_info_getter.py```.\n\n#### scripts/utils/tokenizer_model_tokenids.py\n- You have to add a function that returns the model and tokenizer.\n```\ndef get_calm2_model_tokenizer(model_path: str):\n    model = AutoModelForCausalLM.from_pretrained(model_path, device_map=\"auto\", torch_dtype=\"auto\")\n    tokenizer = AutoTokenizer.from_pretrained(model_path)\n    return model, tokenizer\n```\n\n#### scripts/utils/output_tokens.py\n- You have to add a function that returns the output tokens (str).\n- You may also have to define system prompts and model parameters in the functions you add.\n```\ndef get_calm2_output_tokens(model, tokenizer, text: str, use_system_prompt: bool = True):\n    prompt = f\"\"\"USER: {text}\nASSISTANT: \"\"\"\n    token_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n    with torch.no_grad():\n        output_ids = model.generate(\n            input_ids=token_ids.to(model.device),\n            max_new_tokens=MAX_NEW_TOKENS,\n            do_sample=True,\n            temperature=0.8,\n        )\n        output_ids = output_ids[0][len(token_ids[0]):]\n    return tokenizer.decode(output_ids, skip_special_tokens=True)\n```\n\n#### scripts/utils.model_info_getter.py\n- You have to edit the following two functions.\n  - get_model_tokenizer\n    - You have to add a few lines.\n```\n    elif \"cyberagent/calm2\" in model_path:\n        model, tokenizer = get_calm2_model_tokenizer(model_path)\n```\n  - get_output_tokens\n    - You have to add a key-value pair that maps the model name (key) to the function (value).\n```\n\"cyberagent/calm2-7b-chat\": get_calm2_output_tokens,\n```\n\n## Citation \nIf you use LCTG Bench in your research, please cite:\n\n```\n@InProceedings{Kurihara_nlp2024,\n  author = \t\"栗原健太郎 and 三田雅人 and 張培楠 and 佐々木翔大 and 石上亮介 and 岡崎直観\",\n  title = \t\"LCTG Bench: 日本語LLMの制御性ベンチマークの構築\",\n  booktitle = \t\"言語処理学会第30回年次大会\",\n  year =\t\"2024\",\n  url = 
\"https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/D11-2.pdf\",\n  note = \"in Japanese\"\n}\n```\n## License\nThis work is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/).\n\n\u003ca rel=\"license\" href=\"https://creativecommons.org/licenses/by-nc-sa/4.0//\"\u003e\u003cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png\" /\u003e\u003c/a\u003e\u003cbr /\u003e\n\n## Prohibited Uses\nThe evaluation results from this benchmark may be made available to the public, but the following uses are prohibited:\n\n- Mentioning Abema TV Inc. as the provider of ABEMA TIMES, the data source for the summarization task.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCyberAgentAILab%2FLCTG-Bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FCyberAgentAILab%2FLCTG-Bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FCyberAgentAILab%2FLCTG-Bench/lists"}