{"id":23840420,"url":"https://github.com/admk/sembr","last_synced_at":"2025-09-07T17:31:44.615Z","repository":{"id":210197893,"uuid":"723281521","full_name":"admk/sembr","owner":"admk","description":"⚡️ A semantic line breaker that truly breaks lines semantically. Powered by Transformers.","archived":false,"fork":false,"pushed_at":"2025-08-03T09:50:52.000Z","size":158,"stargazers_count":26,"open_issues_count":2,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-08-03T10:08:41.745Z","etag":null,"topics":["formatter","latex","markdown","semantic-line-breaks"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/admk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-11-25T06:52:56.000Z","updated_at":"2025-08-03T08:09:48.000Z","dependencies_parsed_at":"2024-09-12T15:54:45.536Z","dependency_job_id":"71031b90-9af4-4a06-866d-4ce2330c34c2","html_url":"https://github.com/admk/sembr","commit_stats":null,"previous_names":["admk/sembr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/admk/sembr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/admk%2Fsembr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/admk%2Fsembr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/admk%2Fsembr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/admk%2Fsembr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/owners/admk","download_url":"https://codeload.github.com/admk/sembr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/admk%2Fsembr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":274068098,"owners_count":25216847,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-07T02:00:09.463Z","response_time":67,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["formatter","latex","markdown","semantic-line-breaks"],"created_at":"2025-01-02T17:49:09.105Z","updated_at":"2025-09-07T17:31:44.604Z","avatar_url":"https://github.com/admk.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Semantic Line Breaker (SemBr)\n\n[![GitHub](https://img.shields.io/github/license/admk/sembr)](LICENSE)\n[![python](https://img.shields.io/badge/Python-3.10-3776AB.svg?style=flat\u0026logo=python\u0026logoColor=white)](https://www.python.org)\n[![pytorch](https://img.shields.io/badge/PyTorch-2.1.0-EE4C2C.svg?style=flat\u0026logo=pytorch)](https://pytorch.org)\n[![PyPI](https://badge.fury.io/py/sembr.svg)](https://pypi.org/project/sembr)\n\n```\n\u003e When writing text\n\u003e with a compatible markup language,\n\u003e add a line break\n\u003e after each substantial unit of thought.\n```\n\n\n## What is SemBr?\n\nSemBr is a command-line tool\npowered by 
[Transformer][transformers1] [models][transformers2]\nthat uses [semantic line breaks](#what-are-semantic-line-breaks)\nto break lines in a text file at semantic boundaries.\nIt supports multiple file types\nincluding LaTeX, Markdown, and plain text,\nwith automatic file type detection.\n\n### Installation\n\nSemBr is available as a [Python package on PyPI][pypi].\nTo install it,\nsimply run the following command in your terminal,\nassuming that you have Python 3.10 or later installed:\n```shell\npip install sembr\n```\nAlternatively,\nwith [`uv`][uv]:\n```shell\n# either\nuv tool install sembr  # install\nsembr  # run\n\n# or\nuvx sembr  # install and run directly\n```\n\n#### From GitHub (Latest Development Version)\n\nTo install the latest development version directly from GitHub:\n\n```shell\n# Install from GitHub main branch\nuv tool install git+https://github.com/admk/sembr.git\n\n# Run directly without installing\nuvx --from git+https://github.com/admk/sembr.git sembr\n```\n\nAlternatively, clone and install in development mode:\n\n```shell\n# Clone the repository\ngit clone https://github.com/admk/sembr.git\ncd sembr\n\n# Install in development mode\npip install -e .\n\n# Or with uv\nuv pip install -e .\n```\n\nNote that the development version may include experimental features and could be less stable than the PyPI release.\n\n### Supported Platforms\n\nSemBr is supported on Linux, macOS, and Windows.\nOn machines with CUDA devices,\nor on Apple Silicon Macs,\nSemBr will use the GPU / Apple Neural Engine\nto accelerate inference.\n\n### Usage\n\n#### Command Line Interface\n\nTo use SemBr,\nrun the following command in your terminal:\n```shell\nsembr -i \u003cinput_file\u003e -o \u003coutput_file\u003e\n```\nwhere `\u003cinput_file\u003e` and `\u003coutput_file\u003e`\nare the paths to the input and output files respectively.\n\nOn the first run,\nit will download the SemBr model\nand cache it in `~/.cache/huggingface`.\nSubsequent runs will check for 
updates\nand use the cached model if it is up-to-date.\n\nAlternatively,\nyou can pipe the input into `sembr`,\nand the output can also be printed to the terminal:\n```shell\ncat \u003cinput_file\u003e | sembr\n```\nThis is especially useful if you want to use SemBr\nwith clipboard managers, for instance, on a Mac:\n```shell\npbpaste | sembr | pbcopy\n```\nOr on Linux:\n```shell\nxclip -o | sembr | xclip -i\n```\n\nAdditionally,\nyou can specify the following options\nto customize the behavior of SemBr:\n\n* `-m \u003cmodel_name\u003e`, `--model-name \u003cmodel_name\u003e`:\n  The name of the Hugging Face model to use.\n  - The default is\n    [`admko/sembr2023-bert-small`][sembr-bert-small].\n  - To use it offline,\n    you can download the model from Hugging Face,\n    and then specify the path to the model directory,\n    or prepend `TRANSFORMERS_OFFLINE=1` to the command\n    to use the cached model.\n* `-l`, `--listen`:\n  Serves the SemBr API on a local server.\n  - Each instance of `sembr` run\n    will detect if the API is accessible,\n    and if not it will run the model on its own.\n  - This option is useful\n    to avoid the time taken to initialize the model\n    by keeping it in memory in a separate process.\n* `-p \u003cport\u003e`, `--port \u003cport\u003e`:\n  The port to serve the SemBr API on.\n  - The default is `8384`.\n* `-s \u003cip\u003e`, `--server \u003cip\u003e`:\n  The IP address to serve the SemBr API on.\n  - The default is `127.0.0.1`.\n* `-b \u003cint\u003e`, `--batch_size \u003cint\u003e`:\n  The number of lines to process in a batch.\n  Default is `8`.\n* `-d \u003cint\u003e`, `--overlap-divisor \u003cint\u003e`:\n  The overlap divisor for tiled inference.\n  Default is `8`.\n* `-f \u003cfunc\u003e`, `--predict-func \u003cfunc\u003e`:\n  The prediction function to use.\n  Options are `argmax`, `logit_adjustment`, `greedy_line_breaks`.\n  Default is `argmax`.\n* `-t \u003cint\u003e`, `--tokens-per-line \u003cint\u003e`:\n  Maximum 
tokens per line for greedy line breaking.\n  This is only effective\n  when using the `greedy_line_breaks` prediction function.\n* `--bits \u003c4|8\u003e`:\n  Quantization bits for model weights (4 or 8).\n  Requires CUDA. Not supported on MPS.\n* `--dtype \u003cdtype\u003e`:\n  Data type for model weights (e.g. `float16`, `bfloat16`).\n  Default is `float32`.\n* `--file-type \u003ctype\u003e`:\n  File type (`plaintext`, `latex`, `markdown`, etc.).\n  Auto-detected using [Magika][magika] if not provided.\n* `--mcp`:\n  Start MCP server mode instead of processing text.\n\n\n#### MCP Server\n\nAlternatively,\nyou can run `sembr` as an [MCP server][mcp].\nSimply add the following entry\nto your MCP server configuration:\n```json\n\"mcpServers\": {\n  \"sembr\": {\n    \"type\": \"stdio\",\n    \"command\": \"uvx\",\n    \"args\": [\n      \"sembr\",\n      \"--mcp\"\n    ]\n  }\n}\n```\n\nThe server also supports the formatting options described above.\nIt will expose a `wrap_text` tool\nfor the MCP client to use.\n\n## What are Semantic Line Breaks?\n\n[Semantic Line Breaks][sembr]\nor [Semantic Linefeeds][semlf]\ndescribe a set of conventions\nfor using insensitive vertical whitespace\nto structure prose along semantic boundaries.\n\n\n## Why use Semantic Line Breaks?\n\nSemantic Line Breaks have the following advantages:\n\n* Breaking lines by splitting clauses\n  reflects the logical, grammatical and semantic structure\n  of the text.\n\n* They make a text file easier to edit\n  and to keep under version control.\n  Merge conflicts are less likely to occur\n  when small changes are made,\n  and the changes are easier to identify.\n\n* Documents written with semantic line breaks\n  are easier to navigate and edit\n  with Vim and other text editors\n  that use Vim keybindings.\n\n* Semantic line breaks\n  are invisible to readers.\n  The final rendered output\n  shows no changes to the source text.\n\n\n## Why SemBr?\n\nConverting existing text not 
written\nwith semantic line breaks\ntakes a long time to do manually,\nand it is surprisingly difficult\nto do automatically with rule-based methods.\n\n### Challenges of rule-based methods\n\nRule-based heuristics do not work well\nwith the actual semantic structure of the text,\noften leading to incorrect semantic boundaries.\nMoreover,\nthese boundaries are hierarchical and nested,\nand a rule-based approach\ncannot capture this structure.\nA semantic line break\nmay occur after a dependent clause,\nbut where to break clauses into lines\nis challenging to determine\nwithout syntactic and semantic reasoning capabilities.\nFor example:\n\n* A rule that breaks lines at punctuation marks\n  will not work well with sentences\n  that contain periods\n  in abbreviations or mathematical expressions.\n\n* Syntactic or semantic structures\n  are not always easy to determine.\n  \"I like to eat apples and oranges\n  because they are healthy.\"\n  should be broken into lines as follows:\n  ```\n  \u003e I like to eat apples and oranges\n  \u003e because they are healthy.\n  ```\n  rather than:\n  ```\n  \u003e I like to eat apples\n  \u003e and oranges because they are healthy.\n  ```\n\nFor this reason,\nI have created SemBr,\nwhich uses finetuned Transformer models\nto predict line breaks at semantic boundaries.\n\n\n## How does SemBr work?\n\nSemBr uses a Transformer model\nto predict line breaks at semantic boundaries.\n\nA small dataset of text with semantic line breaks\nwas created from my existing LaTeX documents.\nThe dataset was split into training\n(46,295 lines, 170,681 words and 1,492,952 characters)\nand test\n(2,187 lines, 7,564 words and 72,231 characters)\ndatasets.\n\nThe data was prepared\nby extracting line breaks and indent levels\nfrom the files,\nand then converting the result\ninto strings of paragraphs with line breaks removed.\nThe data can then be tokenized using the tokenizer\nand converted into a dataset with tokens,\nwhere each token has 
a label\ndenoting whether there is a line break before it,\nand the indent level of the token.\n\nFor LaTeX documents,\nthere are two types of line breaks:\none with a normal line break\nthat adds implicit spacing (e.g. `line a⏎line b`)\nand one with no spacing (e.g. `line a%⏎line b`).\nThe data processor\nalso tries to preserve the LaTeX syntax of the text\nby adding and removing comment symbols (`%`),\nif necessary.\n\nThe pretrained masked language model\nis then finetuned as a token classifier\non the training dataset\nto predict the labels of the tokens.\nWe save the model with the best F1 score\non correctly predicting the existence of a line break\non the test set.\nThe finetuning logs for the following models\ncan be found on this [WandB][wandb] report:\n\n* `distilbert-base-uncased`\n  [[Pretrained]][distilbert-bu]\n  [[Finetuned]][sembr-distilbert-bu]\n* `distilbert-base-cased`\n  [[Pretrained]][distilbert-bc]\n  [[Finetuned]][sembr-distilbert-bc]\n* `distilbert-base-uncased-finetuned-sst-2-english`\n  [[Pretrained]][distilbert-bufs2e]\n  [[Finetuned]][sembr-distilbert-bufs2e]\n* `prajjwal1/bert-tiny`\n  [[Pretrained]][bert-tiny]\n  [[Finetuned]][sembr-bert-tiny]\n* `prajjwal1/bert-mini`\n  [[Pretrained]][bert-mini]\n  [[Finetuned]][sembr-bert-mini]\n* `prajjwal1/bert-small`\n  [[Pretrained]][bert-small]\n  [[Finetuned]][sembr-bert-small]\n\n\n## Performance\n\nCurrent inference speed on an M2 MacBook Pro\nis about 850 words per second\non `bert-small` with the default options,\nand the memory usage is about 1.70 GB.\n\nThe line breaking accuracy is difficult to measure,\nand the locations of line breaks\ncould also be subjective.\nOn the test set,\nthe per-token line break accuracy\nof the models is \u003e95%,\nwith ~80% F1 scores.\nBecause of the sparse nature of line breaks,\nthe accuracy is not a good metric\nto measure the performance of the model,\nand I used the F1 score instead\nto select the best models.\n\n\n## Improvements and TODOs\n\n- Features:\n  - Natural 
language support:\n    - [ ] Support natural languages other than English.\n  - Typesetting languages support:\n    - [x] ~~Markdown.~~\n    - [ ] Typst.\n  - Usability:\n    - [ ] Inference queue.\n    - [ ] Daemon with model unloading.\n  - Editor integration:\n    - [x] ~~NeoVim plugin.~~\n    - [x] ~~VSCode extension.~~\n    - [x] MCP server.\n  - [x] ~~Use the [Hugging Face API][hfapi] for inference.~~\n- Accuracy:\n  - Some lines are too short or too long:\n    - [x] Long lines can be penalized greedily\n          by breaking lines with token counts\n          more than `--tokens-per-line`.\n    - [ ] Support `--words-per-line`.\n    - [ ] Improve the algorithm to penalize short and long lines\n          with a more sophisticated method.\n  - [ ] Improve indent level prediction.\n  - [ ] Performance and accuracy benchmarking,\n        and comparisons with related works.\n- Performance:\n  - [x] Improve inference speed.\n  - [x] Reduce memory usage.\n\n\n## Related Projects and References\n\nSentence splitting:\n* https://code.google.com/archive/p/splitta/\n* https://en.wikipedia.org/wiki/Sentence_boundary_disambiguation\n* https://github.com/nipunsadvilkar/pySBD\n* https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html\n\nSemantic line breaking:\n* https://github.com/sembr/specification\n* https://github.com/waldyrious/semantic-linebreaker\n* https://github.com/bobheadxi/readable ([blog post][readable-blog-post])\n* https://github.com/chrisgrieser/obsidian-sembr\n* https://github.com/cllns/semantic_linefeeds\n\n\n[transformers1]: https://huggingface.co/learn/nlp-course/chapter1/4\n[transformers2]: https://lilianweng.github.io/posts/2020-04-07-the-transformer-family/\n\n[pypi]: https://pypi.org/project/sembr\n[uv]: https://github.com/astral-sh/uv\n[mcp]: https://modelcontextprotocol.io/overview\n[magika]: https://github.com/google/magika\n\n[sembr]: https://sembr.org\n[semlf]: https://rhodesmill.org/brandon/2012/one-sentence-per-line\n\n[wandb]: 
https://api.wandb.ai/links/admk/efvui9f4\n\n[distilbert-bu]: https://huggingface.co/distilbert-base-uncased\n[distilbert-bc]: https://huggingface.co/distilbert-base-cased\n[distilbert-bufs2e]: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english\n[bert-tiny]: https://huggingface.co/prajjwal1/bert-tiny\n[bert-mini]: https://huggingface.co/prajjwal1/bert-mini\n[bert-small]: https://huggingface.co/prajjwal1/bert-small\n[sembr-distilbert-bu]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased\n[sembr-distilbert-bc]: https://huggingface.co/admko/sembr2023-distilbert-base-cased\n[sembr-distilbert-bufs2e]: https://huggingface.co/admko/sembr2023-distilbert-base-uncased-finetuned-sst-2-english\n[sembr-bert-tiny]: https://huggingface.co/admko/sembr2023-bert-tiny\n[sembr-bert-mini]: https://huggingface.co/admko/sembr2023-bert-mini\n[sembr-bert-small]: https://huggingface.co/admko/sembr2023-bert-small\n\n[hfapi]: https://huggingface.co/docs/api-inference/detailed_parameters#token-classification-task\n\n[readable-blog-post]: https://bobheadxi.dev/semantic-line-breaks\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadmk%2Fsembr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadmk%2Fsembr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadmk%2Fsembr/lists"}