# CTRLsum

This is the PyTorch implementation of the [paper](https://arxiv.org/abs/2012.04281):

```
CTRLsum: Towards Generic Controllable Text Summarization
Junxian He, Wojciech Kryściński, Bryan McCann, Nazneen Rajani, Caiming Xiong
arXiv 2020
```

This repo includes instructions for [using pretrained CTRLsum models](#example-usage-of-pretrained-models) as well as [training new models](#train-ctrlsum).

CTRLsum is a generic controllable summarization system that manipulates text summaries given control tokens in the form of keywords or a prefix. CTRLsum also achieves strong summarization performance in the uncontrolled setting (e.g. state-of-the-art on CNN/DailyMail).
🎥 Demo1: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/ctrl-sum) (interactively generate with the pretrained model)

🎥 [Demo2](https://share.streamlit.io/jxhe/ctrlsum-demo/ctrlsum_demo.py) (navigate the CTRLsum outputs used in our experiments)


## Model checkpoints

Dataset | Download
---|---
CNN/DailyMail | [download (.tar.gz)](https://storage.googleapis.com/sfr-control-summ-data-research/cnndm_ctrlsum.tar.gz)
arXiv | [download (.tar.gz)](https://storage.googleapis.com/sfr-control-summ-data-research/arxiv_ctrlsum.tar.gz)
BIGPATENT | [download (.tar.gz)](https://storage.googleapis.com/sfr-control-summ-data-research/big_patent_ctrlsum.tar.gz)

These checkpoints are also available in [huggingface transformers](https://github.com/huggingface/transformers); see details [below](#option-3-through-huggingface-transformers).

## Updates

**April 09, 2022**

[@aliencaocao](https://github.com/aliencaocao) made a repo [here](https://github.com/aliencaocao/CTRLSum-tagger) that converts our pretrained taggers into ONNX, making them much faster to load and run inference with.

**October 07, 2021**

Integrated into [Huggingface Spaces](https://huggingface.co/spaces) with [Gradio](https://github.com/gradio-app/gradio). See demo: [![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/akhaliq/ctrl-sum)

**June 18, 2021**

We released another Web UI demo ([here](https://share.streamlit.io/jxhe/ctrlsum-demo/ctrlsum_demo.py)) to navigate most of the CTRLsum outputs generated in the experiments of the paper.

**Mar 22, 2021**

[Hyunwoong Ko](https://github.com/hyunwoongko) made a Python package, [summarizers](https://github.com/hyunwoongko/summarizers), based on CTRLsum. CTRLsum is now also supported in [huggingface transformers](https://github.com/huggingface/transformers), credited to Hyunwoong Ko.
Currently CTRLsum can be used with a few lines of code through these packages. See an [example](#option-3-through-huggingface-transformers) using huggingface transformers.


## Dependencies

The code requires Python 3, [PyTorch](https://pytorch.org/) (>=1.4.0), and [fairseq](https://github.com/pytorch/fairseq) (the code is tested on this [commit](https://github.com/pytorch/fairseq/commit/fad3cf0769843e767155f4d0af18a61b9a804f59)).

Install dependencies:
```bash
# manually install fairseq
git clone https://github.com/pytorch/fairseq

# this repo is tested on a commit of fairseq from May 2020:
# fad3cf0769843e767155f4d0af18a61b9a804f59
cd fairseq
git reset --hard fad3cf07

# the BART interface in fairseq does not support prefix-constrained decoding
# as of creating this README, thus we need to make several modifications to
# fairseq before installing it
cp ../ctrlsum/fairseq_task.py fairseq/tasks/fairseq_task.py
cp ../ctrlsum/sequence_generator.py fairseq/
cp ../ctrlsum/hub_interface.py fairseq/models/bart/

# install fairseq
pip install --editable ./

cd ..

# install other requirements
pip install -r requirements.txt
```

## Example Usage of Pretrained Models


### Option 1. Generate summaries interactively; users can specify the control tokens (keywords, prompts, or both):

```bash
CUDA_VISIBLE_DEVICES=xx python scripts/generate_bart_interactive.py --exp [checkpoint directory] \
	--dataset example_dataset \
	--src test.oraclewordnssource
```

The command above reads source articles from `datasets/example_dataset/test.oraclewordnssource`. Users can then interact with the system on the command line by inputting the id of the example to be shown, as well as the control tokens:

![ctrlsum](gif/ctrlsum.gif)
### Option 2. Generate summaries from a file that includes keywords:

```bash
# the following command generates summaries from `datasets/example_dataset/test.oraclewordnssource`
# the input data format is keywords and source concatenated with a sep token; please refer to the
# given example data files for examples
# the predicted summaries are saved into the checkpoint directory
CUDA_VISIBLE_DEVICES=xx python scripts/generate_bart.py --exp [checkpoint directory] \
	--dataset example_dataset \
	--src test.oraclewordnssource
```

### Option 3. Through Huggingface Transformers

Our pretrained model checkpoints are available in [huggingface transformers](https://github.com/huggingface/transformers) under the model names `hyunwoongko/ctrlsum-cnndm`, `hyunwoongko/ctrlsum-arxiv`, and `hyunwoongko/ctrlsum-bigpatent`. An example code snippet (quoted from [here](https://github.com/huggingface/transformers/issues/9001#issuecomment-803613963)):

> ### 1. Create models and tokenizers
> ```python
> >>> from transformers import AutoModelForSeq2SeqLM, PreTrainedTokenizerFast
>
> >>> model = AutoModelForSeq2SeqLM.from_pretrained("hyunwoongko/ctrlsum-cnndm")
> >>> # model = AutoModelForSeq2SeqLM.from_pretrained("hyunwoongko/ctrlsum-arxiv")
> >>> # model = AutoModelForSeq2SeqLM.from_pretrained("hyunwoongko/ctrlsum-bigpatent")
>
> >>> tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/ctrlsum-cnndm")
> >>> # tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/ctrlsum-arxiv")
> >>> # tokenizer = PreTrainedTokenizerFast.from_pretrained("hyunwoongko/ctrlsum-bigpatent")
> ```
>
> ### 2. Unconditioned summarization
> ```python
> >>> data = tokenizer("My name is Kevin. I love dogs. I loved dogs from 1996. Today, I'm going to walk on street with my dogs", return_tensors="pt")
> >>> input_ids, attention_mask = data["input_ids"], data["attention_mask"]
> >>> tokenizer.batch_decode(model.generate(input_ids, attention_mask=attention_mask, num_beams=5))[0]
> '</s>My name is Kevin. I loved dogs from 1996.</s>'
> ```
>
> ### 3. Conditioned summarization
> * You can input a condition token using the `TOKEN => CONTENTS` structure.
>
> ```python
> >>> data = tokenizer("today plan => My name is Kevin. I love dogs. I loved dogs from 1996. Today, I'm going to walk on street with my dogs", return_tensors="pt")
> >>> input_ids, attention_mask = data["input_ids"], data["attention_mask"]
> >>> tokenizer.batch_decode(model.generate(input_ids, attention_mask=attention_mask, num_beams=5))[0]
> "</s> Today, I'm going to walk on street with my dogs. I loved dogs from 1996</s>"
> ```
>
> ### 4. Prompt summarization
> * You can also input `decoder_input_ids` as an input prompt.
>
> ```python
> >>> data = tokenizer("Q:What is my name? A: => My name is Kevin. I love dogs. I loved dogs from 1996. Today, I'm going to walk on street with my dogs", return_tensors="pt")
> >>> input_ids, attention_mask = data["input_ids"], data["attention_mask"]
> >>> tokenizer.batch_decode(model.generate(input_ids, attention_mask=attention_mask, num_beams=5, decoder_input_ids=tokenizer("Q:What is My name? A:", return_tensors="pt")["input_ids"][:, :-1]))[0]
> '<s>Q:What is My name? A: Kevin.</s>'
> ```
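In all of the conditioned examples above, the control tokens (keywords and/or a prompt) and the source article are simply joined by the `" => "` separator before tokenization. A minimal sketch of building such inputs (the helper name `build_ctrlsum_input` is our own, not part of this repo):

```python
def build_ctrlsum_input(source: str, control: str = "") -> str:
    """Join control tokens (keywords and/or a prompt) and the source
    article with the ' => ' separator used by the huggingface
    CTRLsum checkpoints; no control tokens means uncontrolled mode."""
    return f"{control} => {source}" if control else source


article = "My name is Kevin. I love dogs. I loved dogs from 1996."
# keyword control, as in the conditioned-summarization example:
build_ctrlsum_input(article, "today plan")
# -> 'today plan => My name is Kevin. I love dogs. I loved dogs from 1996.'
```

The resulting string is what you would pass to `tokenizer(...)` in the snippets above.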
### Option 4. Through the Summarizers Python Package

The Python package [summarizers](https://github.com/hyunwoongko/summarizers) lets you use pretrained CTRLsum with a few lines of code.


## Train CTRLsum

### Data Processing

Prepare your data files in `datasets/[dataset name]`, which should consist of six data files named `[train/val/test].[source/target]`. These data files are raw text with each row representing one example. We take the `cnndm` dataset as an example for preprocessing (see [here](https://github.com/pytorch/fairseq/blob/master/examples/bart/README.summarization.md) for instructions on obtaining the cnndm dataset):

```bash
# this command runs the preprocessing pipeline including tokenization, truncation, and
# keyword extraction. It generates all data files required to train CTRLsum into
# `datasets/cnndm`. Example output files can be found in `datasets/example_dataset`.
# Some optional arguments can be found in preprocess.py
python scripts/preprocess.py cnndm --mode pipeline

# gpt2 encoding
bash scripts/gpt2_encode.sh cnndm

# binarize dataset for fairseq
bash scripts/binarize_dataset.sh cnndm
```

Among the generated files in `datasets/cnndm`, the suffix `oracleword` denotes the keywords file (after keyword dropout), `oraclewordsource` the concatenated keywords and source, and `oraclewordns` the original keywords without keyword dropout. The `.jsonl` files are potentially used to train the tagger later.

### Train the summarization model on multiple GPUs:

```bash
bash scripts/train_bart.sh -g [GPUs] -d [dataset name] -b [bart checkpoint path (.pt file)]
```

`GPUs` are GPU ids separated by `,`. All our experiments run on 8 GPUs accumulating 8 gradient steps, for an effective batch size of 1024x8x8 tokens in total. You probably need to increase the `update_freq` variable in `train_bart.sh` if you use fewer GPUs, to match the effective batch size.
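Concretely, the effective batch size is `tokens per GPU x number of GPUs x update_freq`, so fewer GPUs must be compensated with more gradient-accumulation steps. A small sketch of that arithmetic (the helper name is ours, for illustration only):

```python
def update_freq_for(target_tokens: int, tokens_per_gpu: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed so that
    tokens_per_gpu * num_gpus * update_freq reaches the target
    effective batch size (rounded up, at least 1)."""
    return max(1, -(-target_tokens // (tokens_per_gpu * num_gpus)))  # ceiling division


# the setting above: 1024 tokens/GPU x 8 GPUs x 8 accumulation steps
target = 1024 * 8 * 8
update_freq_for(target, 1024, 8)  # -> 8
update_freq_for(target, 1024, 2)  # -> 32, i.e. 4x more accumulation on 2 GPUs
```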
The saved models are in the `checkpoint` directory. The training arguments can be found in `train_bart.sh`.


### Train the keyword tagger (optional):

Note that the keyword tagger is required only in the uncontrolled summarization setting and in certain control settings that require automatic keywords (like length control in the paper).

```bash
# this uses 4 GPUs for training by default;
# you need to change the --nproc_per_node value if you
# train with a different number of GPUs
bash scripts/train_seqlabel.sh -g [GPUs] -d [dataset name]
```

The effective batch size we used for different datasets can be found in the training script as `number of gpus x batch x update_freq`.


## Evaluate CTRLsum

Here we include evaluation for the uncontrolled summarization setting.

### Obtain automatic keywords from a trained tagger:

```bash
# run prediction from the tagger, which outputs confidence values for every token
# `checkpoint directory` is the directory that contains the `pytorch_model.bin` checkpoint.
# the results are saved in the checkpoint directory as test_predictions.txt
bash scripts/train_seqlabel.sh -g [GPUs] -d [dataset name] -p [checkpoint directory]


# obtain keywords by selecting confident words; `threshold`, `maximum-word`, and `summary-size`
# are three hyperparameters in this step. Please check Appendix A in the paper for the specific
# values we used for different datasets; the performance is relatively robust.
# this command yields a file `.predwordsource` in `datasets/[dataset name]` which can be
# used as input to the summarization model to obtain uncontrolled summaries
python scripts/preprocess.py [dataset name] \
		--split test \
		--mode process_tagger_prediction \
		--tag-pred [the tagger prediction file path, named as test_predictions.txt] \
		--threshold [confidence threshold] \
		--maximum-word [maximum number of keywords] \
		--summary-size [number of sentences from which to identify keywords]
```
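The core of this step is keeping the tokens whose tagger confidence exceeds the threshold, capped at a maximum number of keywords. A rough sketch of that filtering idea (our own illustration, not the actual `preprocess.py` implementation; the threshold value here is arbitrary):

```python
def select_keywords(tokens, confidences, threshold=0.3, maximum_word=30):
    """Keep tokens whose tagger confidence exceeds `threshold`,
    in document order, capped at `maximum_word` keywords."""
    selected = [t for t, c in zip(tokens, confidences) if c > threshold]
    return selected[:maximum_word]


tokens = ["Kevin", "loves", "dogs", "since", "1996"]
conf = [0.91, 0.12, 0.85, 0.05, 0.64]
select_keywords(tokens, conf, threshold=0.3, maximum_word=2)  # -> ['Kevin', 'dogs']
```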
### Metrics:

We report ROUGE scores and [BERTScore](https://github.com/Tiiiger/bert_score) in the paper. The ROUGE scores in the paper are computed using [files2rouge](https://github.com/pltrdy/files2rouge), which is a wrapper of a wrapper of the original ROUGE perl scripts. Please refer to `scripts/test_bart.sh` for our evaluation script:

```bash
# you will need the Stanford CoreNLP java toolkit to run this; we use it for tokenization
# this script computes ROUGE and (optionally) BERTScore.
bash scripts/test_bart.sh -g [GPUs] -s [source file name, NOT full path] -d [dataset] -p [ctrlsum checkpoint directory]
```

## Citation

```
@article{he2020ctrlsum,
  title={{CTRL}sum: Towards Generic Controllable Text Summarization},
  author={He, Junxian and Kry{\'s}ci{\'n}ski, Wojciech and McCann, Bryan and Rajani, Nazneen and Xiong, Caiming},
  journal={arXiv},
  year={2020}
}
```