{"id":13456697,"url":"https://github.com/pszemraj/textsum","last_synced_at":"2026-03-06T16:33:54.777Z","repository":{"id":65348512,"uuid":"579525076","full_name":"pszemraj/textsum","owner":"pszemraj","description":"CLI \u0026 Python API to easily summarize text-based files with transformers","archived":false,"fork":false,"pushed_at":"2024-11-02T23:35:40.000Z","size":72,"stargazers_count":129,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-05-17T06:01:53.296Z","etag":null,"topics":["batch-processing","inference","inference-api","pipeline","summarization","summary","text","text-to-text-transformer","transformer","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pszemraj.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"AUTHORS.md","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-18T01:06:09.000Z","updated_at":"2025-05-09T05:20:44.000Z","dependencies_parsed_at":"2024-01-13T17:48:21.964Z","dependency_job_id":"11e56139-df52-4bcf-ae60-1b1b48f4dd86","html_url":"https://github.com/pszemraj/textsum","commit_stats":{"total_commits":18,"total_committers":2,"mean_commits":9.0,"dds":"0.33333333333333337","last_synced_commit":"8ecf04b1f6740e3c9ea25814363016ec72d7bb3e"},"previous_names":[],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/pszemraj/textsum","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Ftextsum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/pszemraj%2Ftextsum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Ftextsum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Ftextsum/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pszemraj","download_url":"https://codeload.github.com/pszemraj/textsum/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pszemraj%2Ftextsum/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30185534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-06T14:42:24.748Z","status":"ssl_error","status_checked_at":"2026-03-06T14:42:14.925Z","response_time":250,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["batch-processing","inference","inference-api","pipeline","summarization","summary","text","text-to-text-transformer","transformer","transformers"],"created_at":"2024-07-31T08:01:26.257Z","updated_at":"2026-03-06T16:33:54.741Z","avatar_url":"https://github.com/pszemraj.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# textsum\n\n \u003ca href=\"https://colab.research.google.com/gist/pszemraj/ff8a8486dc3303199fe9c9790a606fff/textsum-summarize-text-files-example.ipynb\"\u003e\n  
\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\n\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/textsum/\"\u003e \u003cimg src=\"https://img.shields.io/pypi/v/textsum.svg\" alt=\"PyPI-Server\"/\u003e\u003c/a\u003e\n\n\u003cbr\u003e\n\n\u003e a utility for using transformers summarization models on text docs 🖇\n\nThis package provides easy-to-use interfaces for using summarization models on text documents of arbitrary length. Currently implemented interfaces include a python API, CLI, and a shareable demo app.\n\n\u003e [!TIP]\n\u003e For additional details, explanations, and docs, see the [wiki](https://github.com/pszemraj/textsum/wiki)\n\n---\n\n- [textsum](#textsum)\n  - [🔦 Quick Start Guide](#-quick-start-guide)\n  - [Installation](#installation)\n    - [Full Installation](#full-installation)\n    - [Extra Features](#extra-features)\n  - [Usage](#usage)\n    - [Python API](#python-api)\n    - [CLI](#cli)\n    - [Demo App](#demo-app)\n  - [Models](#models)\n  - [Advanced Configuration](#advanced-configuration)\n    - [Parameters](#parameters)\n    - [8-bit Quantization \\\u0026 TensorFloat32](#8-bit-quantization--tensorfloat32)\n    - [Using Optimum ONNX Runtime](#using-optimum-onnx-runtime)\n    - [Force Cache](#force-cache)\n    - [Compile Model](#compile-model)\n  - [Contributing](#contributing)\n  - [Road Map](#road-map)\n\n---\n\n## 🔦 Quick Start Guide\n\n1. Install the package with pip:\n\n```bash\npip install textsum\n```\n\n2. Import the package and create a summarizer:\n\n```python\nfrom textsum.summarize import Summarizer\nsummarizer = Summarizer() # loads default model and parameters\n```\n\n3. 
Summarize a text string:\n\n```python\ntext = \"This is a long string of text that will be summarized.\"\nsummary = summarizer.summarize_string(text)\nprint(f'Summary: {summary}')\n```\n\n---\n\n## Installation\n\nInstall using pip with Python 3.8 or later (_after creating a virtual environment_):\n\n```bash\npip install textsum\n```\n\nThe `textsum` package is now installed in your virtual environment. [CLI commands](#cli) are available in your terminal, and the [python API](#python-api) is available in your python environment.\n\n### Full Installation\n\nFor a full installation, which includes additional features such as PDF OCR, Gradio UI demo, and Optimum, run the following commands:\n\n```bash\ngit clone https://github.com/pszemraj/textsum.git\ncd textsum\n# create a virtual environment (optional)\npip install -e .[all]\n```\n\n### Extra Features\n\nThe package also supports a number of optional extra features, which can be installed as follows:\n\n- `8bit`: Install with `pip install \"textsum[8bit]\"`\n- `optimum`: Install with `pip install \"textsum[optimum]\"`\n- `PDF`: Install with `pip install \"textsum[PDF]\"`\n- `app`: Install with `pip install \"textsum[app]\"`\n- `unidecode`: Install with `pip install \"textsum[unidecode]\"`\n\nIf installing from source, replace `textsum` in the command with `.` and add `-e` for an editable install (for example, `pip install -e \".[8bit]\"`). Read below for more details on how to use these features.\n\n\u003e [!TIP]\n\u003e The `unidecode` extra is a GPL-licensed dependency not included by default with the `clean-text` package. Installing it should improve the cleaning of noisy input text, but it should not make a significant difference in most use cases.\n\n## Usage\n\nThere are three ways to use this package:\n\n1. [python API](#python-api)\n2. [CLI](#cli)\n3. [Demo App](#demo-app)\n\n### Python API\n\nTo use the python API, import the `Summarizer` class and instantiate it. 
This will load the default model and parameters.\n\nYou can then use the `summarize_string` method to summarize a long text string.\n\n```python\nfrom textsum.summarize import Summarizer\n\nsummarizer = Summarizer() # loads default model and parameters\n\n# summarize a long string\nout_str = summarizer.summarize_string('This is a long string of text that will be summarized.')\nprint(f'summary: {out_str}')\n```\n\nyou can also directly summarize a file:\n\n```python\nout_path = summarizer.summarize_file('/path/to/file.txt')\nprint(f'summary saved to {out_path}')\n```\n\n### CLI\n\nTo summarize a directory of text files, run the following command in your terminal:\n\n```bash\ntextsum-dir /path/to/dir\n```\n\nThere are many CLI flags available. A full list:\n\n\n\u003cdetails\u003e\n  \u003csummary\u003eClick to expand table\u003c/summary\u003e\n\n  | Flag                             | Description                              |\n  | -------------------------------- | ---------------------------------------- |\n  | `--output_dir`                   | Specify the output directory             |\n  | `--model`                        | Specify the model to use                 |\n  | `--no_cuda`                      | Disable CUDA                             |\n  | `--tf32`                         | Use TF32 precision                       |\n  | `--force_cache`                  | Force cache usage                        |\n  | `--load_in_8bit`                 | Load in 8-bit mode                       |\n  | `--compile`                      | Compile the model                        |\n  | `--optimum_onnx`                 | Use optimum ONNX                         |\n  | `--batch_length`                 | Specify the batch length                 |\n  | `--batch_stride`                 | Specify the batch stride                 |\n  | `--num_beams`                    | Specify the number of beams              |\n  | `--length_penalty`               | Specify the length 
penalty               |\n  | `--repetition_penalty`           | Specify the repetition penalty           |\n  | `--max_length_ratio`             | Specify the maximum length ratio         |\n  | `--min_length`                   | Specify the minimum length               |\n  | `--encoder_no_repeat_ngram_size` | Specify the encoder no repeat ngram size |\n  | `--no_repeat_ngram_size`         | Specify the no repeat ngram size         |\n  | `--early_stopping`               | Enable early stopping                    |\n  | `--shuffle`                      | Shuffle the input data                   |\n  | `--lowercase`                    | Convert input to lowercase               |\n  | `--loglevel`                     | Specify the log level                    |\n  | `--logfile`                      | Specify the log file                     |\n  | `--file_extension`               | Specify the file extension               |\n  | `--skip_completed`               | Skip completed files                     |\n\n\u003c/details\u003e\n\n\nSome useful options are:\n\nArguments:\n\n- `--model`: model name or path to use for summarization. (Optional)\n- `--shuffle`: Shuffle the input files before processing. (Optional)\n- `--skip_completed`: Skip already completed files in the output directory. (Optional)\n- `--batch_length`: The maximum length of each input batch. Default is 4096. (Optional)\n- `--output_dir`: The directory to write the summarized output files. Default is `./summarized/`. (Optional)\n\nTo see all available options, run the following command:\n\n```bash\ntextsum-dir --help\n```\n\n### Demo App\n\nFor convenience, a UI demo[^1] is provided using [gradio](https://gradio.app/). 
To ensure you have the dependencies installed, run the following command:\n\n```bash\npip install textsum[app]\n```\n\nTo launch the demo, run:\n\n```bash\ntextsum-ui\n```\n\nThis will start a local server that you can access in your browser \u0026 a shareable link will be printed to the console.\n\n[^1]: The demo is minimal but will be expanded to accept other arguments and options.\n\n## Models\n\nSummarization is a memory-intensive task, and the [default model is relatively small and efficient](https://huggingface.co/BEE-spoke-data/pegasus-x-base-synthsumm_open-16k) for long-form text summarization. If you want to use a different model, you can specify the `model_name_or_path` argument when instantiating the `Summarizer` class.\n\n```python\nsummarizer = Summarizer(model_name_or_path='pszemraj/long-t5-tglobal-xl-16384-book-summary')\n```\n\nYou can also use the `-m` argument when using the CLI:\n\n```bash\ntextsum-dir /path/to/dir -m pszemraj/long-t5-tglobal-xl-16384-book-summary\n```\n\nAny [text-to-text](https://huggingface.co/models?filter=text2text) or [summarization](https://huggingface.co/models?filter=summarization) model from the [HuggingFace model hub](https://huggingface.co/models) can be used. Models are automatically downloaded and cached in `~/.cache/huggingface/hub`.\n\n---\n\n## Advanced Configuration\n\n### Parameters\n\nMemory usage can also be reduced by adjusting the [parameters for inference](https://huggingface.co/docs/transformers/generation_strategies#beam-search-decoding). 
This is discussed in detail in the [project wiki](https://github.com/pszemraj/textsum/wiki).\n\n\u003e [!IMPORTANT]\n\u003e tl;dr: use the `summarizer.set_inference_params()` and `summarizer.get_inference_params()` methods to adjust the inference parameters, passing either a python `dict` or a JSON file.\n\nSupport for `GenerationConfig` as the primary method to adjust inference parameters is planned for a future release.\n\n### 8-bit Quantization \u0026 TensorFloat32\n\nSome methods of efficient inference[^2] include loading the model in 8-bit precision via [LLM.int8](https://arxiv.org/abs/2208.07339) (_reduces memory usage_) and enabling TensorFloat32 precision  in the torch backend (_reduces latency_). See the [transformers docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one#efficient-inference-on-a-single-gpu) for more details. Using LLM.int8 requires the [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) package, which can either be installed directly or via the `textsum[8bit]` extra:\n\n[^2]: if you have compatible hardware. In general, ampere (RTX 30XX) and newer GPUs are recommended.\n\n```bash\npip install textsum[8bit]\n```\n\nTo use these options, use the `--load_in_8bit` and `--tf32` flags when using the CLI:\n\n```bash\ntextsum-dir /path/to/dir --load_in_8bit --tf32\n```\n\nOr in Python, using the `load_in_8bit` argument:\n\n```python\nsummarizer = Summarizer(load_in_8bit=True)\n```\n\nIf using the Python API, either [manually activate tf32](https://huggingface.co/docs/transformers/perf_train_gpu_one#tf32) or use the `check_ampere_gpu()` function from `textsum.utils` **before initializing the `Summarizer` class**:\n\n```python\nfrom textsum.utils import check_ampere_gpu\ncheck_ampere_gpu() # automatically enables TF32 if Ampere+ available\nsummarizer = Summarizer(load_in_8bit=True)\n```\n\n### Using Optimum ONNX Runtime\n\n\u003e [!CAUTION]\n\u003e This feature is experimental and might not work as expected. Use at your own risk. 
⚠️🧪\n\nONNX Runtime is a performance-oriented inference engine for ONNX models. It can be used to increase the speed of model inference, especially on Windows and in environments where GPU acceleration is not available. If you want to use ONNX runtime for inference, you need to set `optimum_onnx=True` when initializing the `Summarizer` class.\n\nFirst, install with `pip install textsum[optimum]`. Then initialize the `Summarizer` class with ONNX runtime:\n\n```python\nsummarizer = Summarizer(model_name_or_path=\"onnx-compatible-model-name\", optimum_onnx=True)\n```\n\nIt will automatically convert the model if it has not been converted to ONNX yet.\n\n**Notes:**\n\n1. ONNX runtime+cuda needs an additional package. Manually install `onnxruntime-gpu` if you plan to use ONNX with GPU.\n2. Using ONNX runtime might lead to different behavior in certain models. It is recommended to test the model with and without ONNX runtime on **the same input text** before using it for anything important.\n\n### Force Cache\n\n\u003e [!CAUTION]\n\u003e Setting `force_cache=True` might lead to different behavior in certain models. Test the model with and without `force_cache` on **the same input text** before using it for anything important.\n\nUsing the cache speeds up autoregressive generation by avoiding recomputing attention for tokens that have already been generated. 
If you want to force the model to always use cache irrespective of the model's default behavior[^3], you can set `force_cache=True` when initializing the `Summarizer` class.\n\n[^3]: `use_cache` can sometimes be disabled during training (for example, when gradient checkpointing is enabled), and if not re-enabled will result in slower inference times.\n\n```python\nsummarizer = Summarizer(force_cache=True)\n```\n\n\n### Compile Model\n\nIf you want to [compile the model](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for faster inference times, you can set `compile_model=True` when initializing the `Summarizer` class.\n\n```python\nsummarizer = Summarizer(compile_model=True)\n```\n\n\u003e [!NOTE]\n\u003e Compiling the model might not be supported on all platforms and requires PyTorch 2.0 or later.\n\n---\n\n## Contributing\n\nContributions are welcome! Please open an issue or PR if you have any ideas or suggestions.\n\nSee the [CONTRIBUTING.md](CONTRIBUTING.md) file for details on how to contribute.\n\n## Road Map\n\n- [x] add CLI for summarization of all text files in a directory\n- [x] python API for summarization of text docs\n- [ ] add argparse CLI for UI demo\n- [x] put on PyPI\n- [x] LLM.int8 inference\n- [x] optimum inference integration\n- [ ] better documentation [in the wiki](https://github.com/pszemraj/textsum/wiki), details on improving performance (speed, quality, memory usage, etc.)\n  - [x] in-progress\n- [ ] improvements to the PDF OCR helper module (_TBD - may focus more on being a summarization tool_)\n\n_Other ideas? 
Open an issue or PR!_\n\n---\n\n[![Project generated with PyScaffold](https://img.shields.io/badge/-PyScaffold-005CA0?logo=pyscaffold)](https://pyscaffold.org/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpszemraj%2Ftextsum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpszemraj%2Ftextsum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpszemraj%2Ftextsum/lists"}