{"id":23237011,"url":"https://github.com/nicolay-r/vilongt5","last_synced_at":"2025-09-01T00:48:06.494Z","repository":{"id":168982285,"uuid":"613865495","full_name":"nicolay-r/ViLongT5","owner":"nicolay-r","description":"LongT5-based model pre-trained on a large amount of unlabeled Vietnamese news texts and fine-tuned with ViMS and VMDS collections","archived":false,"fork":false,"pushed_at":"2024-11-09T20:36:33.000Z","size":3541,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-11-09T21:20:07.064Z","etag":null,"topics":["language-model","multi-document-summarization","nlp","t5","t5-model","textsummarization","transformer","vietnamese","vietnamese-nlp"],"latest_commit_sha":null,"homepage":"https://link.springer.com/article/10.1007/s10958-024-07435-z","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nicolay-r.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-14T12:32:17.000Z","updated_at":"2024-11-09T20:36:36.000Z","dependencies_parsed_at":"2024-11-09T21:19:21.259Z","dependency_job_id":"bd504bb0-4d2c-4b1d-8c6a-86e5084a5cb3","html_url":"https://github.com/nicolay-r/ViLongT5","commit_stats":null,"previous_names":["nicolay-r/vilongt5"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r%2FViLongT5","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r%2FViLongT5/tags","releases_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/nicolay-r%2FViLongT5/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nicolay-r%2FViLongT5/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nicolay-r","download_url":"https://codeload.github.com/nicolay-r/ViLongT5/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230374271,"owners_count":18216044,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","multi-document-summarization","nlp","t5","t5-model","textsummarization","transformer","vietnamese","vietnamese-nlp"],"created_at":"2024-12-19T04:13:24.312Z","updated_at":"2024-12-19T04:13:24.891Z","avatar_url":"https://github.com/nicolay-r.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ViLongT5 • [![twitter](https://img.shields.io/twitter/url/https/shields.io.svg?style=social)](https://x.com/nicolayr_/status/1855348255153861026)\n![](https://img.shields.io/badge/Python-3.8+-lightgreen.svg)\n[![PRs welcome!](https://img.shields.io/badge/PRs-welcome-brightgreen.svg)]()\n\nA pretrained [Transformer-based encoder-decoder model](https://arxiv.org/pdf/2112.07916.pdf) for the\nmulti-document text-summarization\ntask in the Vietnamese language.\nThe code represents a non-framework implementation, which \ncombines \n[flaxformer](https://github.com/google/flaxformer), 
\n[t5x](https://github.com/google-research/t5x)\nand is based purely on the [JAX library](https://github.com/google/jax).\n\n`ViLongT5` is trained on a large NewsCorpus of Vietnamese news texts.\nWe benchmark `ViLongT5` on multi-document text-summarization tasks,\nAbstractive Text Summarization and Named Entity Recognition.\nAll experiments are described in our paper\n**[Pre-training LongT5 for Vietnamese Mass-Media\nMulti-document Summarization Task](https://link.springer.com/article/10.1007/s10958-024-07435-z)**\n\n\n# Pretrained Models\n**Vocabulary:**\n[ViLongT5_vocab](sentencepiece/model/vietnam.vocab) / [training-script](sentencepiece/readme.md)\n\nModel        | Gin File Location                                                                  | Checkpoint Location|\n------------ | ---------------------------------------------------------------------------------- | -------------------|\nViLongT5-Large | [ViLongT5_large.gin](https://www.dropbox.com/s/nu3hgkz36zra3qq/config.gin?dl=1) | [ViLongt5-finetuned-large.tar.gz](https://www.dropbox.com/s/gl4vxpie7s3liqm/longt5-finetuned-vims-vmds-vlsp-large.tar.gz?dl=1) |\n\n📄 Example scripts for the model, based on the `Flaxformer` library: \n    [fine-tuning](usage/finetunning.md) / \n    [inferring](usage/inferring.md) / \n    [evaluating](usage/evaluating.md)\n\n### Results\n\n![image](https://user-images.githubusercontent.com/14871187/233701416-af11f6ff-40fd-4575-9727-fbb932cc76ed.png)\n\n### Datasets\nDatasets used in the experiments:\n- [NewsCorpus](https://github.com/binhvq/news-corpus)\n- [VMDS](https://github.com/lupanh/VietnameseMDS)\n- [ViMS](https://github.com/CLC-HCMUS/ViMs-Dataset)\n\n# Installation\n\n\u003e **NOTE:** a `GPU` is assumed as the computational device.\nThis project has been tested under the following [configuration](misc/nvidia-smi.txt).\n\n* Python 3.8+\n* Python packages listed in `dependencies.txt`\n    * The [complete list of packages](misc/pip_freeze.txt) this project has been 
tested under `venv`.\n* CUDA Compiler `nvcc`\n    * [Installation details](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html)\n* CuDNN toolkit `cudnn`\n    * [Installation details](https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html)\n\n### Local Installation\n\n* Initialize and activate a virtual environment, then install the project dependencies:\n```\nvirtualenv env --python=/usr/bin/python3.9\nsource env/bin/activate\npip install -r dependencies.txt\n```\n* [Re-install JAX with GPU support](usage/jax-gpu-support-tutorial.md).\n\n### Kaggle Installation\n\nFor testing under [Kaggle](https://www.kaggle.com/), [there is a separate tutorial](usage/kaggle.md).\n\n# Fine-tuning\n\n* [Fine-tuning (`t5x` tutorial)](usage/finetunning.md)\n\nWe fine-tune the model on the training part of `vims+vmds+vlsp` as follows:\n```\npython -m t5x.train --gin_file=\"longt5_finetune_vims_vmds_vlsp_large.gin\" --gin_search_paths='./configs'\n```\n\n# Inferring\n* [Inferring (`t5x` tutorial)](usage/inferring.md)\n\n# Evaluation\n\nEvaluation on `vims+vmds+vlsp` (test part) is run as follows:\n```\npython -m t5x.eval --gin_file=\"longt5_eval_vims_vmds_vlsp_large.gin\" --gin_search_paths='./configs'\n```\n\nEvaluation on `vlsp` (validation part) is run as follows:\n```\npython -m t5x.eval --gin_file=\"configs/longt5_infer_vlsp_validation_large.gin\" --gin_search_paths='./configs'\n```\n\n# References\n```bibtex\n@inproceedings{rusnachenko2023pretraining,\n    title = \"Pre-training {LongT5} for Vietnamese Mass-Media Multi-document Summarization Task\",\n    author = \"Rusnachenko, Nicolay and Le, The Anh and Nguyen, Ngoc Diep\",\n    booktitle = \"Proceedings of Artificial Intelligence and Natural Language\",\n    year = 
\"2023\"\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicolay-r%2Fvilongt5","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnicolay-r%2Fvilongt5","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnicolay-r%2Fvilongt5/lists"}