{"id":13715316,"url":"https://github.com/FerdinandZhong/punctuator","last_synced_at":"2025-05-07T04:30:41.049Z","repository":{"id":40739314,"uuid":"314250913","full_name":"FerdinandZhong/punctuator","owner":"FerdinandZhong","description":"A small seq2seq punctuator tool based on DistilBERT","archived":false,"fork":false,"pushed_at":"2024-12-23T00:26:20.000Z","size":124306,"stargazers_count":51,"open_issues_count":2,"forks_count":8,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-06T19:11:36.386Z","etag":null,"topics":["bert","bert-ner","chinese-nlp","deep-learning","nlp","punctuation","pytorch","seq2seq"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FerdinandZhong.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-11-19T13:07:34.000Z","updated_at":"2025-03-31T02:40:44.000Z","dependencies_parsed_at":"2024-01-14T22:03:33.559Z","dependency_job_id":"9559bc03-31b1-41ac-ac82-eb54f68b6561","html_url":"https://github.com/FerdinandZhong/punctuator","commit_stats":{"total_commits":54,"total_committers":4,"mean_commits":13.5,"dds":"0.18518518518518523","last_synced_commit":"5d072d0320154f6557f35d46b1e491278f0d09ec"},"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FerdinandZhong%2Fpunctuator","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FerdinandZhong%2Fpunctuator/tags","releases_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/repositories/FerdinandZhong%2Fpunctuator/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FerdinandZhong%2Fpunctuator/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FerdinandZhong","download_url":"https://codeload.github.com/FerdinandZhong/punctuator/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252813627,"owners_count":21808361,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bert","bert-ner","chinese-nlp","deep-learning","nlp","punctuation","pytorch","seq2seq"],"created_at":"2024-08-03T00:00:57.423Z","updated_at":"2025-05-07T04:30:40.569Z","avatar_url":"https://github.com/FerdinandZhong.png","language":"Python","funding_links":[],"categories":["List"],"sub_categories":[],"readme":"# Distilbert-punctuator\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/distilbert-punctuator/\"\u003e\n      \u003cimg src=\"https://badge.fury.io/py/distilbert-punctuator.svg\" alt=\"PyPI version\" height=\"20\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/project/distilbert-punctuator\"\u003e\n      \u003cimg src=\"https://pepy.tech/badge/distilbert-punctuator\" alt=\"PyPi Downloads\" height=\"20\"\u003e\n  \u003c/a\u003e\n  \u003ca href=\"https://pepy.tech/project/distilbert-punctuator\"\u003e\n      \u003cimg src=\"https://pepy.tech/badge/distilbert-punctuator/month\" alt=\"PyPi Latest Month Downloads\" height=\"20\"\u003e\n  \u003c/a\u003e\n  \u003ca 
href=\"https://tldrlegal.com/license/apache-license-2.0-(apache-2.0)\"\u003e\n      \u003cimg src=\"https://img.shields.io/github/license/mosecorg/mosec\" alt=\"License\" height=\"20\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n## Introduction\nDistilbert-punctuator is a Python package that provides a BERT-based punctuator (a fine-tuned version of the pretrained Hugging Face `DistilBertForTokenClassification` model) with the following three components:\n\n* **data process**: functions for processing your own data to prepare it for training, if you prefer to fine-tune the model on your own data.\n* **training**: training pipeline and evaluation. You can fine-tune your own punctuator with the pipeline.\n* **inference**: easy-to-use interface for running a trained punctuator.\n* If you don't want to train a punctuator yourself, two pre-fine-tuned models are available from the Hugging Face model hub:\n  * `Qishuai/distilbert_punctuator_en` 📎 [Model details](https://huggingface.co/Qishuai/distilbert_punctuator_en)\n  * `Qishuai/distilbert_punctuator_zh` 📎 [Model details](https://huggingface.co/Qishuai/distilbert_punctuator_zh)\n* Model examples on the Hugging Face web page:\n  * English model\n  \u003cfigure\u003e\n    \u003cimg src=\"./docs/static/english_model_example.png\" width=\"600\" /\u003e\n  \u003c/figure\u003e\n\n  * Simplified Chinese model\n  \u003cfigure\u003e\n    \u003cimg src=\"./docs/static/chinese_model_example.png\" width=\"600\" /\u003e\n  \u003c/figure\u003e\n\n## Installation\n* Install the package from PyPI for direct use of the punctuator: `pip install distilbert-punctuator`.\n* Install the package with the option to do data processing: `pip install distilbert-punctuator[data_process]`.\n* Install the package with the option to train and validate your own model: `pip install distilbert-punctuator[training]`.\n* For development and contribution:\n  * Clone the repo\n  * `make install`\n\n## Data Process\nComponent for pre-processing the training data. 
To use this component, install the package with `pip install distilbert-punctuator[data_process]`.\n\nThe package provides a simple pipeline for generating `NER`-format training data.\n\n### Example\n`examples/data_sample.py`\n\n## Train\nComponent that provides a training pipeline for fine-tuning a pretrained `DistilBertForTokenClassification` model from `huggingface`.\nThe latest version implements **`R-Drop`**-enhanced training. \n[R-Drop GitHub repo](https://github.com/dropreg/R-Drop)\n[R-Drop paper](https://arxiv.org/abs/2106.14448)\n\n### Example\n`examples/english_train_sample.py`\n\n### Training_arguments:\nArguments required for the training pipeline.\n\n- # basic arguments\n  - `training_corpus(List[List[str]])`: list of sequences for training; the longest sequence should be no longer than the pretrained LM's max_position_embedding(512)\n  - `validation_corpus(List[List[str]])`: list of sequences for validation; the longest sequence should be no longer than the pretrained LM's max_position_embedding(512)\n  - `training_tags(List[List[int]])`: tags(int) for training\n  - `validation_tags(List[List[int]])`: tags(int) for validation\n  - `model_name_or_path(str)`: name or path of the pretrained model\n  - `tokenizer_name(str)`: name of the pretrained tokenizer\n\n- # training arguments\n  - `epoch(int)`: number of epochs\n  - `batch_size(int)`: batch size\n  - `model_storage_dir(str)`: storage path for the fine-tuned model\n  - `label2id(Dict)`: mapping from tag labels to ids\n  - `early_stop_count(int)`: number of epochs without a decrease in validation loss after which training stops early. 
Default is 3.\n  - `gpu_device(int)`: index of the GPU card to use; defaults to `CUDA_VISIBLE_DEVICES` from the environment\n  - `warm_up_steps(int)`: number of warm-up steps\n  - `r_drop(bool)`: whether to train with R-Drop\n  - `r_alpha(int)`: alpha value (weight) for the KL divergence term in the loss, default is 0\n  - `plot_steps(int)`: number of steps between recordings of training status to TensorBoard\n  - `tensorboard_log_dir(Optional[str])`: output directory for the TensorBoard logs, default is \"runs\"\n\n- # model arguments\n  - `addtional_model_config(Optional[Dict])`: additional configuration for the model\n\nYou can also train your own NER models with the trainer provided in this repo. \nAn example can be found in `notebooks/R-drop NER.ipynb`\n\n## Evaluation\nValidation of the fine-tuned model\n\n### Example\n`examples/train_sample.py`\n\n### Validation_arguments:\n- `evaluation_corpus(List[List[str]])`: list of sequences for evaluation; the longest sequence should be no longer than the pretrained LM's max_position_embedding(512)\n- `evaluation_tags(List[List[int]])`: tags(int) for evaluation (the ground truth)\n- `model_name_or_path(str)`: name or path of the fine-tuned model\n- `tokenizer_name(str)`: name of the tokenizer\n- `batch_size(int)`: batch size\n- `label2id(Optional[Dict])`: label-to-id mapping. The default is taken from the model config. 
Pass in this argument if your model doesn't have a label2id inside its config\n- `gpu_device(int)`: index of the GPU card to use; defaults to `CUDA_VISIBLE_DEVICES` from the environment\n\n## Inference\nComponent that provides an inference interface for using the punctuator.\n\n### Architecture\n```\n +----------------------+              (child process)\n |   user application   |             +-------------------+\n +                      + \u003c----------\u003e| punctuator server |\n |   +inference object  |             +-------------------+\n +----------------------+\n```\n\nThe punctuator is deployed in a child process that communicates with the main process through a pipe connection.\nYou can therefore initialize an inference object and call its `punctuation` function when needed. The punctuator never blocks the main process except while performing punctuation.\nThe punctuator has a `graceful shutdown` mechanism, so you don't need to handle the shutdown yourself.\n\n### Example\n`examples/inference_sample.py`\n\n### Inference_arguments\nArguments required for the inference pipeline.\n\n- `model_name_or_path(str)`: name or path of the pretrained model\n- `tokenizer_name(str)`: name of the pretrained tokenizer\n- `tag2punctuator(Dict[str, tuple])`: tag-to-punctuator mapping.\n   `dbpunctuator.utils` provides two default mappings, for English and Chinese:\n   ```python\n   NORMAL_TOKEN_TAG = \"O\"\n   DEFAULT_ENGLISH_TAG_PUNCTUATOR_MAP = {\n       NORMAL_TOKEN_TAG: (\"\", False),\n       \"COMMA\": (\",\", False),\n       \"PERIOD\": (\".\", True),\n       \"QUESTIONMARK\": (\"?\", True),\n       \"EXLAMATIONMARK\": (\"!\", True),\n   }\n\n   DEFAULT_CHINESE_TAG_PUNCTUATOR_MAP = {\n       NORMAL_TOKEN_TAG: (\"\", False),\n       \"C_COMMA\": (\"，\", False),\n       \"C_PERIOD\": (\"。\", True),\n       \"C_QUESTIONMARK\": (\"? \", True),\n       \"C_EXLAMATIONMARK\": (\"! 
\", True),\n       \"C_DUNHAO\": (\"、\", False),\n   }\n   ```\n   for own fine-tuned model with different tags, pass in your own mapping\n- `tag2id_storage_path(Optional[str])`: tag2id storage path. Default one is from model config. Pass in this argument if your model doesn't have a tag2id inside config\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFerdinandZhong%2Fpunctuator","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FFerdinandZhong%2Fpunctuator","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FFerdinandZhong%2Fpunctuator/lists"}