{"id":22381938,"url":"https://github.com/googleinterns/smart-content-summary","last_synced_at":"2025-07-31T02:33:14.302Z","repository":{"id":37633293,"uuid":"266863262","full_name":"googleinterns/smart-content-summary","owner":"googleinterns","description":"Improvement of the LaserTagger model for text summarization.","archived":false,"fork":false,"pushed_at":"2024-10-25T20:38:17.000Z","size":501,"stargazers_count":6,"open_issues_count":11,"forks_count":2,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-04-06T18:50:37.346Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/googleinterns.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":"docs/code-of-conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-05-25T19:27:44.000Z","updated_at":"2022-01-15T08:57:47.000Z","dependencies_parsed_at":"2024-12-05T00:11:09.163Z","dependency_job_id":"2f07fef9-9ce3-41f4-805d-41be754ff165","html_url":"https://github.com/googleinterns/smart-content-summary","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/googleinterns/smart-content-summary","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/googleinterns%2Fsmart-content-summary","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/googleinterns%2Fsmart-content-summary/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/googleinterns%2Fsmart-content-summary/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/googleinterns%2Fsmart-content-summary/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/googleinterns","download_url":"https://codeload.github.com/googleinterns/smart-content-summary/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/googleinterns%2Fsmart-content-summary/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267977983,"owners_count":24175239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-31T02:00:08.723Z","response_time":66,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-05T00:11:06.720Z","updated_at":"2025-07-31T02:33:13.845Z","avatar_url":"https://github.com/googleinterns.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Text Summarization based on LaserTagger\n\nBased on a text-editing model called [LaserTagger](\nhttps://github.com/google-research/lasertagger), this project aims to train \na machine learning model that rewrites short sentences and phrases to a more \nconcise form. 
\n\nThe applications of this project include:\n- Creative suggestion: When the creatives provided by the customers or the \nsuggested creatives from existing models exceed the word limit, this model can \nprovide an automatic summary of the text.\n- Keyword/category clustering: This model can generate a shorter version for \neach category and keyword, and cluster those with the same shortened version \ntogether. \n\nThe LaserTagger model, developed and trained by Google Research, \ntransforms a source text into a target text by predicting a sequence of \ntoken-level edit operations. The goal of this project is to improve the \nperformance of this model specifically for the short-sentence-and-phrase \nsummarization task. \n\nOur improvements to the model include adding part-of-speech (POS) tags to \nBERT embeddings (which involves pre-training BERT with POS tags), \ncustomizing the loss function and loss weights, applying masks, and \nhyperparameter tuning. \n\nTo address the lack of grammar evaluation in existing performance metrics \nfor text summarization, we designed and trained a grammar [checker](classifier). We \nprovide the code and instructions for training the grammar checker.\n\nThe end-to-end process of this model can be deployed on Google Cloud \nPlatform, where the web interface accepts a text input and returns its \nsummarized version along with a grammar rating of the summary. The \n[code](GCP_deploy) for the deployment is also provided.\n\n## Modified LaserTagger\nThe modified LaserTagger is built on Python 3, TensorFlow, and BERT. It works \nwith CPU, GPU, and Cloud TPU. In addition to improving the model performance, \nwe also provide code to streamline the training and exporting process, and \nto make predictions faster by running inference in batches.\n\nThe LaserTagger model uses BERT as the encoder. There are pre-trained BERT \nmodels online, but adding part-of-speech (POS) tags to the embeddings involves \nretraining the BERT model. 
We provide two pretrained BERT models trained on \nthe [OpenSubtitles](https://www.opensubtitles.org/en/search/subs) dataset. The \nBERT model with POS tags can be found at \ngs://bert_traning_yechen/trained_bert_uncased/bert_POS. The BERT model with \nPOS-concise tags can be found at \ngs://bert_traning_yechen/trained_bert_uncased/bert_POS_concise. If you plan to \nuse \"Normal\" embeddings or \"Sentence\" embeddings, which do not include POS tags, \nyou can download a pretrained BERT model from the [official repository](\nhttps://github.com/google-research/bert#pre-trained-models). You can use \neither the 12-layer ''BERT-Base, Cased'' model or the 12-layer ''BERT-Base, \nUncased'' model. \n\n### Usage Instructions\n\n**1. Data Preprocessing**\n\nThe dataset we train the model on is the \n[Microsoft Abstractive Text Compression Dataset](\nhttps://www.microsoft.com/en-us/download/confirmation.aspx?id=54262) (MSF dataset). \nTo preprocess the dataset and split it into train, tune, and test sets, run the \nfollowing command:\n\n```\npython preprocess_MS_dataset_main.py path/to/raw/data num_of_tuning num_of_testing\n```\nWe use 3,000 samples for tuning and 3,000 samples for testing in this project. \nThe preprocessed and split dataset will be saved in three tsv files named \ntrain_MS_dataset, tune_MS_dataset, and test_MS_dataset for the training, tuning, and \ntesting sets, respectively.\n\nWe also provide preprocessing scripts for three other datasets: \n[news summary dataset](https://www.kaggle.com/sunnysai12345/news-summary), \n[Google sentence compression dataset](\nhttps://github.com/google-research-datasets/sentence-compression), and \n[reddit_tifu dataset](https://www.tensorflow.org/datasets/catalog/reddit_tifu). 
\nSee [preprocess_news_dataset_main.py](preprocess_news_dataset_main.py), \n[preprocess_google_dataset_main.py](preprocess_google_dataset_main.py), and \n[preprocess_reddit_dataset_main.py](preprocess_reddit_dataset_main.py) for code and instructions.\n\nThe preprocessing script also computes basic statistics of the dataset. A \nsample output when preprocessing the MSF dataset is\n```\nNumber of samples is 26119\nTotal number of excluded sample is 304\nAverage word count of original sentence is 32.08 ( std: 10.79 )\nMax word count is 145\nMin word count is 7\nAverage word count of shortened sentence is 22.28 ( std: 8.54 )\nMax Length is 108\nMin Length is 3\nOn average, there are 1.00 sentences in each original text ( std: 0.00 )\nOn average, there are 1.91 words in each shortened sentence that are not in the original sentence. ( std: 3.76 )\nThe average compression ratio is 0.70 ( std: 0.19 )\n```\n\n**2. Training \u0026 Export**\n\nThis streamlined process covers the steps of phrase vocabulary optimization, \npreparing data for training, model training, and model export. The script is \ndesigned for training on the Google Cloud Platform. Before running the script, \nthere are several prerequisites:\n- Create or have access to a Google Storage Bucket. Currently, the GCP bucket \npath is set to gs://trained_models_yechen/. If you create another bucket, change \nthe path by changing the GCP_BUCKET variable in [streamline_training.py](streamline_training.py).\n-  Set up a virtual machine to run the script on. Follow this [guide](\nhttps://docs.google.com/document/d/1oV8Swp_BDfmDHkhSkWb2wo_ZhC9jIP-Lk7kCbYvdYTM/edit#heading=h.o18hkt51hrci) \nto set up a virtual machine on GCP.\n- If you plan to train with a Cloud TPU, follow this [guide](\nhttps://docs.google.com/document/d/1PlCB6DOH8LUBsN8UcgPxzds9MqPRFIPs_rht9fWqbAA/edit?usp=sharing) \nto set up a TPU on GCP. Make sure that your TPU has the same name as your VM. 
\n- Download the pre-trained BERT model from the sources suggested above. To copy a \nfolder in the GCP bucket to your virtual machine, use the `gsutil cp -r` command.\n\nAfter satisfying the prerequisites, you can run the streamline script \n[streamline_training.py](streamline_training.py). The usage is:\n```\npython streamline_training.py \\\n[-vocab_size VOCAB_SIZE] [-train_batch_size TRAIN_BATCH_SIZE] \\\n[-learning_rate LEARNING_RATE] [-num_train_epochs NUM_TRAIN_EPOCHS] \\\n[-warmup_proportion WARMUP_PROPORTION] \\\n[-max_input_examples MAX_INPUT_EXAMPLES] \\\n[-train] [-export] \\\n[-use_tpu] [-gbucket GBUCKET] \\\n[-t2t T2T] [-number_layer NUMBER_LAYER] \\\n[-hidden_size HIDDEN_SIZE] [-num_attention_head NUM_ATTENTION_HEAD] \\\n[-filter_size FILTER_SIZE] [-full_attention FULL_ATTENTION] \\\n[-add_tag_loss_weight ADD_TAG_LOSS_WEIGHT] \\\n[-delete_tag_loss_weight DELETE_TAG_LOSS_WEIGHT] \\\n[-keep_tag_loss_weight KEEP_TAG_LOSS_WEIGHT] \\\nmodel/output/dir abs/path/to/lasertagger abs/path/to/bert \\\npath/to/training/file path/to/tuning/file \\\nembedding_type\n```\nThe positional arguments are:\n- `model/output/dir`: the directory of the model output\n- `abs/path/to/lasertagger`: absolute path to the folder where the lasertagger \nscripts are located\n- `abs/path/to/bert`: absolute path to the folder where the pretrained BERT is \nlocated\n- `path/to/training/file`: path to training samples\n- `path/to/tuning/file`: path to tuning samples\n- `embedding_type`: type of embedding. Must be one of [Normal, POS, \nPOS_concise, Sentence]. Normal: segment ids are all zero. POS: part-of-speech tagging. \nPOS_concise: POS tagging with a smaller set of tags. 
Sentence: sentence tagging.\n\nThe general optional arguments are:\n- `-train`: if added, skip preprocessing and start training.\n- `-export`: if added, skip preprocessing and training, and start exporting to \nthe bucket.\n\nThe optional arguments relevant to the data preprocessing step are:\n- `-vocab_size VOCAB_SIZE`: the size of the vocabulary for the adding tag. \ndefault = 500\n- `-max_input_examples MAX_INPUT_EXAMPLES`: number of training examples to use \nin the vocab optimization. default is all training data\n- `-masking`: if added, numbers and symbols will be masked. All numbers are \nreplaced with the `[NUMBER]` token, and all special characters other than `.`, `!`, \n`?`, `;`, and `,` are replaced with the `[SYMBOL]` token.\n\nThe optional arguments relevant to the training step are:\n- `-train_batch_size TRAIN_BATCH_SIZE`: batch size during training. default \n= 32\n- `-learning_rate LEARNING_RATE`: the initial learning rate for Adam. default \n= 3e-5\n- `-num_train_epochs NUM_TRAIN_EPOCHS`: total number of training epochs to perform. \ndefault = 3\n- `-warmup_proportion WARMUP_PROPORTION`: proportion of training to perform linear \nlearning rate warmup for. default = 0.1\n- `-use_tpu`: if added, use a Cloud TPU for training.\n- `-gbucket GBUCKET`: the GCP bucket where the Cloud TPU stores intermediate \noutputs.\n- `-verb_deletion_loss VERB_DELETION_LOSS`: the weight of verb deletion loss. Must \nbe \u003e= 0. default=0. Cannot be set to a number other than 0 unless the \nembedding_type is POS or POS_concise.\n- `-add_tag_loss_weight ADD_TAG_LOSS_WEIGHT`: the weight of loss for adding tags. default=1\n- `-delete_tag_loss_weight DELETE_TAG_LOSS_WEIGHT`: the weight of loss for deleting tags. default=1\n- `-keep_tag_loss_weight KEEP_TAG_LOSS_WEIGHT`: the weight of loss for keeping tags. default=1\n\nThe optional arguments relevant to the model architecture are:\n- `-t2t T2T`: if True, use the autoregressive version of LaserTagger. 
If False, use the \nfeed-forward version of LaserTagger. default = True\n- `-number_layer NUMBER_LAYER`: number of hidden layers in the decoder. default\n= 1\n- `-hidden_size HIDDEN_SIZE`: the size of the hidden layer in the decoder. \ndefault=768\n- `-num_attention_head NUM_ATTENTION_HEAD`: the number of attention heads in the \ndecoder. default=4\n- `-filter_size FILTER_SIZE`: the size of the filter in the decoder. default = \n3072\n- `-full_attention FULL_ATTENTION`: whether to use full attention in the decoder. \ndefault = False\n\nThe trained and exported model will be saved at the local directory specified by \n`model/output/dir` and in the GCP bucket in a folder whose name is the last folder \nname of model/output/dir. Currently, the GCP bucket is set to be \ngs://trained_models_yechen/. If you would like to save to another bucket, change \nthe GCP_BUCKET variable in [streamline_training.py](streamline_training.py). \n\n**3. Prediction**\n\nThe [custom_predict.py](custom_predict.py) script runs prediction on the input and computes the SARI score and exact \nscore if applicable. The usage is:\n```\npython custom_predict.py [-score] \\\npath/to/input/file path/to/lasertagger path/to/bert \\\nmodels [models ...] \\\nembedding_type\n```\n\nThe positional arguments are:\n\n- `path/to/input/file`: the path to the tsv file with inputs. If the scores do not \nneed to be computed, the tsv should have one column which contains the inputs. If \nscores need to be computed, then the tsv file needs to have two columns, with the \nfirst column being the inputs and the second column being the targets.\n- `path/to/lasertagger`: path to the folder where the lasertagger scripts are located\n- `path/to/bert`: path to the folder where the pretrained BERT is located\n- `models`: the names of trained LaserTagger models. Provide at least one \nmodel name. 
The model name should be the folder name of the LaserTagger model in the \nGCP bucket.\n- `embedding_type`: type of embedding. Must be one of [Normal, POS, POS_concise, \nSentence].\n\nThe optional arguments are:\n- `-score`: if added, compute scores for the predictions.\n- `-grammar`: if added, automatically apply grammar correction on predictions \nusing LanguageTool.\n- `-masking`: if added, numbers and symbols will be masked.\n- `-batch_size`: the batch size of prediction. default=1\n\nIf you add the `-grammar` flag for automatic grammar correction, you need to install \nLanguageTool using the following commands:\n```\npip install 3to2\nsudo apt update\nsudo apt install default-jre\npip install language-tool-python\n```\nAll predictions are written to a tsv file named pred.tsv. The first column is the \noriginal input. The last column is the targets if targets are provided. All other \ncolumns are predictions from the different models specified by the `models` argument. \nIf the `-score` flag is added, another tsv file named score.tsv will also be generated. \nThis file contains six rows. The first row is the model names; the second row is the \nexact scores; the third row is the SARI scores; the fourth to sixth rows are the keep, \naddition, and delete scores (of the SARI score). Each column corresponds to the scores \nof a model. \n\n## Grammar \u0026 Meaning Preservation Checker\nExisting text summarization metrics lack grammar evaluation. \nTherefore, we designed a model that can classify a text as grammatically correct or \nincorrect. Follow the instructions in [classifier](classifier) to preprocess the MSF \ndataset, train the model, and make predictions. This model can also be used for \nchecking whether a summary preserves the most important meaning in the source text. \n\n## Deployment to Google Cloud Platform\nThe LaserTagger model can be deployed on GCP with a web interface. The process \ninvolves three main steps: 1. 
exporting the LaserTagger model to a format acceptable to the \nGCP AI Platform, 2. deploying the model to the AI Platform, and 3. deploying the web \napplication to the GCP App Engine.\n\n### 1. Re-export the LaserTagger Model\nThe GCP app engine only accepts an exported model with one metagraph. However, if the \nmodel is trained on GPU or TPU, the exported version from above very likely contains \nmore than one metagraph. To re-export the model, run the \n[export_model_for_gcp.py](GCP_deploy/export_model_for_gcp.py) script in \nthe [GCP_deploy](GCP_deploy) folder.\n```\npython export_model_for_gcp.py exported/model/dir output/dir\n```\nThe `exported/model/dir` is the path to a folder containing an exported LaserTagger \nmodel. The `output/dir` specifies where you would like the re-exported model to be \nsaved. After re-exporting the model, check that the model indeed only has one \nmetagraph by running the following command:\n```\nsaved_model_cli show --dir=output/dir --all\n```\nYou should only see one `signature_def['serving_default']` in the output. \n\n### 2. Deploy to GCP AI Platform\nFirst, use the `gsutil cp -r` command to copy the exported LaserTagger model to a GCP \nbucket. 
Then, create a model on the AI Platform by running:\n```\nexport MODEL_NAME=lasertagger\ngcloud ai-platform models create $MODEL_NAME --enable-logging\n```\nThen, run the following commands to deploy the exported model as a new version:\n```\nexport MODEL_DIR=path/to/model/in/GCP/bucket\nexport VERSION_NAME=version_name\nexport FRAMEWORK=TENSORFLOW\ngcloud beta ai-platform versions create $VERSION_NAME \\\n--model $MODEL_NAME \\\n--origin $MODEL_DIR \\\n--runtime-version=1.15 \\\n--framework $FRAMEWORK --python-version=3.7 \\\n--accelerator=count=4,type=nvidia-tesla-v100 \\\n--machine-type=n1-standard-32\n```\nLater, we can send our inputs as a JSON object to this deployed model, and \ninferences will be made by the AI Platform on a virtual machine with 4 GPUs, which \ncan significantly decrease the inference time.\n\n### 3. Build the Web Application on GCP App Engine\nOpen a shell terminal in the GCP console. Use `git clone` to copy the repository to \nthe shell terminal. In addition, copy the following files and folders in the \n[lasertagger](lasertagger) folder to the [GCP_deploy](GCP_deploy) folder: [bert](lasertagger/bert),\n[bert_example.py](lasertagger/bert_example.py), [custom_utils.py](lasertagger/custom_utils.py), \n[tagging.py](lasertagger/tagging.py), [tagging_converter.py](lasertagger/tagging_converter.py), \nand [utils.py](lasertagger/utils.py). Then run the following to deploy the web app:\n```\ncd GCP_deploy\ngcloud app deploy\n```\nThis deployment will take a few minutes. After it is successfully built, run \n`gcloud app browse` to find the link to the web page. 
\n## License\n\nApache 2.0; see [LICENSE](LICENSE) for details.\n\n## Disclaimer\n\n**This is not an officially supported Google product.**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleinterns%2Fsmart-content-summary","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogleinterns%2Fsmart-content-summary","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleinterns%2Fsmart-content-summary/lists"}