{"id":13609966,"url":"https://github.com/spring-media/headliner","last_synced_at":"2025-04-12T22:32:18.441Z","repository":{"id":82412710,"uuid":"211845076","full_name":"spring-media/headliner","owner":"spring-media","description":"🏖 Easy training and deployment of seq2seq models.","archived":false,"fork":false,"pushed_at":"2021-03-26T07:19:57.000Z","size":2851,"stargazers_count":228,"open_issues_count":2,"forks_count":41,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-03-22T22:39:23.921Z","etag":null,"topics":["neural-network","nlp","python","seq2seq","tensorflow"],"latest_commit_sha":null,"homepage":"https://as-ideas.github.io/headliner/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/spring-media.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2019-09-30T11:33:28.000Z","updated_at":"2025-02-20T12:08:33.000Z","dependencies_parsed_at":"2023-06-15T14:00:15.920Z","dependency_job_id":null,"html_url":"https://github.com/spring-media/headliner","commit_stats":null,"previous_names":["as-ideas/headliner"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spring-media%2Fheadliner","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spring-media%2Fheadliner/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spring-media%2Fheadliner/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/spring-media%2Fheadliner/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/spring-media","download_url":"https://codeload.github.com/spring-media/headliner/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248641004,"owners_count":21138127,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["neural-network","nlp","python","seq2seq","tensorflow"],"created_at":"2024-08-01T19:01:39.804Z","updated_at":"2025-04-12T22:32:17.559Z","avatar_url":"https://github.com/spring-media.png","language":"Python","funding_links":[],"categories":["文本数据和NLP"],"sub_categories":[],"readme":"# Headliner\n\n[![Build Status](https://dev.azure.com/axelspringerai/Public/_apis/build/status/as-ideas.headliner?branchName=master)](https://dev.azure.com/axelspringerai/Public/_build/latest?definitionId=2\u0026branchName=master)\n[![Build Status](https://travis-ci.org/as-ideas/headliner.svg?branch=master)](https://travis-ci.org/as-ideas/headliner)\n[![Docs](https://img.shields.io/badge/docs-online-brightgreen)](https://as-ideas.github.io/headliner/)\n[![codecov](https://codecov.io/gh/as-ideas/headliner/branch/master/graph/badge.svg)](https://codecov.io/gh/as-ideas/headliner)\n[![PyPI Version](https://img.shields.io/pypi/v/headliner)](https://pypi.org/project/headliner/)\n[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/as-ideas/headliner/blob/master/LICENSE)\n\nHeadliner is a sequence modeling library that eases the training and **in particular, the deployment of custom sequence models**\nfor both researchers and developers. 
You can deploy your models in a few lines of code. It was originally\nbuilt for our own research to generate headlines from [Welt news articles](https://www.welt.de/) (see figure 1). That's why we chose the name, Headliner.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/headline_generator.png\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003cb\u003eFigure 1:\u003c/b\u003e One example from our Welt.de headline generator.\n\u003c/p\u003e\n\n## Update 21.01.2020\nThe library now supports fine-tuning pre-trained BERT models with \ncustom preprocessing as in [Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf)!\n\nCheck out \n[this](https://colab.research.google.com/github/as-ideas/headliner/blob/master/notebooks/BERT_Translation_Example.ipynb)\ntutorial on Colab!\n\n## 🧠 Internals\nWe use sequence-to-sequence (seq2seq) under the hood,\nan encoder-decoder framework (see figure 2). We provide a very simple interface to train\nand deploy seq2seq models. Although this library was created internally to\ngenerate headlines, you can also use it for **other tasks like machine translation,\ntext summarization and many more.**\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"figures/seq2seq.jpg\" /\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003cb\u003eFigure 2:\u003c/b\u003e Encoder-decoder sequence-to-sequence model.\n\u003c/p\u003e\n\n### Why Headliner?\n\nYou may ask: why another seq2seq library? There are a couple of them out there already.\nFor example, Facebook has [fairseq](https://github.com/pytorch/fairseq), Google has [seq2seq](https://github.com/google/seq2seq)\nand there is also [OpenNMT](http://opennmt.net/).\nAlthough those libraries are great, they have a few drawbacks for our use case: fairseq doesn't focus much on production,\nwhereas Google's seq2seq is not actively maintained. 
OpenNMT came closest to matching our requirements, i.e.\nit has a strong focus on production. However, we didn't like that their workflow\n(preparing data, training and evaluation) is mainly done via the command line.\nThey also expose a well-defined API, but its complexity is still too high, with too much custom code\n(see their [minimal transformer training example](https://github.com/OpenNMT/OpenNMT-tf/blob/master/examples/library/minimal_transformer_training.py)).\n\nTherefore, we built this library for ourselves with the following goals in mind:\n\n* Easy-to-use API for training and deployment (only a few lines of code)\n* Uses TensorFlow 2.0 with all its new features (`tf.function`, `tf.keras.layers` etc.)\n* Modular classes: text preprocessing, modeling, evaluation\n* Extensible for different encoder-decoder models\n* Works on large text data\n\nFor more details on the library, read the documentation at: [https://as-ideas.github.io/headliner/](https://as-ideas.github.io/headliner/)\n\nHeadliner is compatible with Python 3.6 and is distributed under the MIT license.\n\n## ⚙️ Installation\n\u003e ⚠️ Before installing Headliner, you need to install TensorFlow, as we use it as our deep learning framework. For more\n\u003e details on how to install it, have a look at the [TensorFlow installation instructions](https://www.tensorflow.org/install/).\n\nThen you can install Headliner itself. There are two ways to install Headliner:\n\n* Install Headliner from PyPI (recommended):\n\n```bash\npip install headliner\n```\n\n* Install Headliner from the GitHub source:\n\n```bash\ngit clone https://github.com/as-ideas/headliner.git\ncd headliner\npython setup.py install\n```\n\n## 📖 Usage\n\n### Training\nFor the training, you need to import one of our provided models or create your own custom one. 
Then you need to\ncreate the dataset, a list of input-output `tuple`s, and then train the model:\n\n```python\nfrom headliner.trainer import Trainer\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\n\ndata = [('You are the stars, earth and sky for me!', 'I love you.'),\n        ('You are great, but I have other plans.', 'I like you.')]\n\nsummarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=20)\ntrainer = Trainer(batch_size=2, steps_per_epoch=100)\ntrainer.train(summarizer, data, num_epochs=2)\nsummarizer.save('/tmp/summarizer')\n```\n\n### Prediction\nThe prediction can be done in a few lines of code:\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\n\nsummarizer = TransformerSummarizer.load('/tmp/summarizer')\nsummarizer.predict('You are the stars, earth and sky for me!')\n```\n\n### Models\nCurrently available models include a basic encoder-decoder, \nan encoder-decoder with Luong attention, the transformer and \na transformer on top of a pre-trained BERT model:\n\n```python\nfrom headliner.model.basic_summarizer import BasicSummarizer\nfrom headliner.model.attention_summarizer import AttentionSummarizer\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\nfrom headliner.model.bert_summarizer import BertSummarizer\n\nbasic_summarizer = BasicSummarizer()\nattention_summarizer = AttentionSummarizer()\ntransformer_summarizer = TransformerSummarizer()\nbert_summarizer = BertSummarizer()\n```\n\n### Advanced training\nTraining using a validation split and model checkpointing:\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\nfrom headliner.trainer import Trainer\n\ntrain_data = [('You are the stars, earth and sky for me!', 'I love you.'),\n              ('You are great, but I have other plans.', 'I like you.')]\nval_data = [('You are great, but I have other plans.', 'I like you.')]\n\nsummarizer = 
TransformerSummarizer(num_heads=1,\n                                   feed_forward_dim=512,\n                                   num_layers=1,\n                                   embedding_size=64,\n                                   max_prediction_len=50)\ntrainer = Trainer(batch_size=8,\n                  steps_per_epoch=50,\n                  max_vocab_size_encoder=10000,\n                  max_vocab_size_decoder=10000,\n                  tensorboard_dir='/tmp/tensorboard',\n                  model_save_path='/tmp/summarizer')\n\ntrainer.train(summarizer, train_data, val_data=val_data, num_epochs=3)\n```\n\n### Advanced prediction\nPrediction information such as attention weights and logits can be accessed via `predict_vectors`, which returns a dictionary:\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\n\nsummarizer = TransformerSummarizer.load('/tmp/summarizer')\nsummarizer.predict_vectors('You are the stars, earth and sky for me!')\n```\n\n### Resume training\nA previously trained summarizer can be loaded and then retrained. In this case the data preprocessing and vectorization are loaded from the model.\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\nfrom headliner.trainer import Trainer\n\ntrain_data = [('Some new training data.', 'New data.')] * 10\n\nsummarizer_loaded = TransformerSummarizer.load('/tmp/summarizer')\ntrainer = Trainer(batch_size=2)\ntrainer.train(summarizer_loaded, train_data)\nsummarizer_loaded.save('/tmp/summarizer_retrained')\n```\n\n### Use pretrained GloVe embeddings\nEmbeddings in GloVe format can be injected into the trainer as follows. 
Optionally, set the embedding to non-trainable.\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\nfrom headliner.trainer import Trainer\n\ntrainer = Trainer(embedding_path_encoder='/tmp/embedding_encoder.txt',\n                  embedding_path_decoder='/tmp/embedding_decoder.txt')\n\n# make sure embedding_size matches the vector dimension of the embedding files\nsummarizer = TransformerSummarizer(embedding_size=64,\n                                   embedding_encoder_trainable=False,\n                                   embedding_decoder_trainable=False)\n```\n\n### Custom preprocessing\nA model can be initialized with custom preprocessing and tokenization:\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\nfrom headliner.preprocessing.preprocessor import Preprocessor\nfrom headliner.preprocessing.vectorizer import Vectorizer\nfrom headliner.trainer import Trainer\nfrom tensorflow_datasets.core.features.text import SubwordTextEncoder\n\ntrain_data = [('Some inputs.', 'Some outputs.')] * 10\n\npreprocessor = Preprocessor(filter_pattern='',\n                            lower_case=True,\n                            hash_numbers=False)\ntrain_prep = [preprocessor(t) for t in train_data]\ninputs_prep = [t[0] for t in train_prep]\ntargets_prep = [t[1] for t in train_prep]\n\n# Build tf subword tokenizers. Other custom tokenizers can be implemented\n# by subclassing headliner.preprocessing.Tokenizer\ntokenizer_input = SubwordTextEncoder.build_from_corpus(\n    inputs_prep, target_vocab_size=2**13, reserved_tokens=[preprocessor.start_token, preprocessor.end_token])\ntokenizer_target = SubwordTextEncoder.build_from_corpus(\n    targets_prep, target_vocab_size=2**13, reserved_tokens=[preprocessor.start_token, preprocessor.end_token])\n\nvectorizer = Vectorizer(tokenizer_input, tokenizer_target)\nsummarizer = TransformerSummarizer(embedding_size=64, max_prediction_len=50)\nsummarizer.init_model(preprocessor, vectorizer)\n\ntrainer = Trainer(batch_size=2)\ntrainer.train(summarizer, train_data, num_epochs=3)\n```\n\n\n### Use pre-trained BERT embeddings\nPre-trained BERT models can be included as follows. 
\nBe aware that pre-trained BERT models are expensive to train and require custom preprocessing!\n\n```python\nfrom headliner.model.bert_summarizer import BertSummarizer\nfrom headliner.preprocessing.bert_preprocessor import BertPreprocessor\nfrom headliner.preprocessing.bert_vectorizer import BertVectorizer\nfrom headliner.trainer import Trainer\nfrom spacy.lang.en import English\nfrom tensorflow_datasets.core.features.text import SubwordTextEncoder\nfrom transformers import BertTokenizer\n\ntrain_data = [('Some inputs.', 'Some outputs.')] * 10\n\n# use BERT-specific start and end token\npreprocessor = BertPreprocessor(nlp=English())\ntrain_prep = [preprocessor(t) for t in train_data]\ntargets_prep = [t[1] for t in train_prep]\n\n# Use a pre-trained BERT embedding and BERT tokenizer for the encoder\ntokenizer_input = BertTokenizer.from_pretrained('bert-base-uncased')\ntokenizer_target = SubwordTextEncoder.build_from_corpus(\n    targets_prep, target_vocab_size=2**13, reserved_tokens=[preprocessor.start_token, preprocessor.end_token])\n\nvectorizer = BertVectorizer(tokenizer_input, tokenizer_target)\nsummarizer = BertSummarizer(num_heads=2,\n                            feed_forward_dim=512,\n                            num_layers_encoder=0,\n                            num_layers_decoder=4,\n                            bert_embedding_encoder='bert-base-uncased',\n                            embedding_size_encoder=768,\n                            embedding_size_decoder=768,\n                            dropout_rate=0.1,\n                            max_prediction_len=50)\nsummarizer.init_model(preprocessor, vectorizer)\n\ntrainer = Trainer(batch_size=2)\ntrainer.train(summarizer, train_data, num_epochs=3)\n```\n\n\n### Training on large datasets\nLarge datasets can be handled by using an iterator:\n\n```python\nfrom headliner.model.transformer_summarizer import TransformerSummarizer\nfrom headliner.trainer import Trainer\n\ndef read_data_iteratively():\n    return (('Some inputs.', 'Some outputs.') for _ in range(1000))\n\nclass DataIterator:\n    def __iter__(self):\n        return read_data_iteratively()\n\ndata_iter = DataIterator()\n\nsummarizer = 
TransformerSummarizer(embedding_size=10, max_prediction_len=20)\ntrainer = Trainer(batch_size=16, steps_per_epoch=1000)\ntrainer.train(summarizer, data_iter, num_epochs=3)\n```\n\n## 🤝 Contribute\nWe welcome all kinds of contributions such as new models, new examples and many more.\nSee the [Contribution](CONTRIBUTING.md) guide for more details.\n\n## 📝 Cite this work\nPlease cite Headliner in your publications if it is useful for your research. Here is an example BibTeX entry:\n```BibTeX\n@misc{axelspringerai2019headliners,\n  title={Headliner},\n  author={Christian Schäfer and Dat Tran},\n  year={2019},\n  howpublished={\\url{https://github.com/as-ideas/headliner}},\n}\n```\n\n## 🏗 Maintainers\n* Christian Schäfer, github: [cschaefer26](https://github.com/cschaefer26)\n* Dat Tran, github: [datitran](https://github.com/datitran)\n\n## © Copyright\n\nSee [LICENSE](LICENSE) for details.\n\n## References\n\n[Text Summarization with Pretrained Encoders](https://arxiv.org/pdf/1908.08345.pdf)\n\n[Effective Approaches to Attention-based Neural Machine Translation](https://arxiv.org/abs/1508.04025)\n\n## Acknowledgements\n\nhttps://www.tensorflow.org/tutorials/text/transformer\n\nhttps://github.com/huggingface/transformers\n\nhttps://machinetalk.org/2019/03/29/neural-machine-translation-with-attention-mechanism/\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspring-media%2Fheadliner","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fspring-media%2Fheadliner","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fspring-media%2Fheadliner/lists"}