{"id":15131407,"url":"https://github.com/petrochukm/pytorch-nlp","last_synced_at":"2025-09-28T23:30:36.084Z","repository":{"id":41508594,"uuid":"122806629","full_name":"PetrochukM/PyTorch-NLP","owner":"PetrochukM","description":"Basic Utilities for PyTorch Natural Language Processing (NLP)","archived":true,"fork":false,"pushed_at":"2023-07-04T21:11:26.000Z","size":13169,"stargazers_count":2212,"open_issues_count":24,"forks_count":256,"subscribers_count":56,"default_branch":"master","last_synced_at":"2024-10-30T16:39:59.685Z","etag":null,"topics":["data-loader","dataset","deep-learning","embeddings","machine-learning","metrics","natural-language-processing","neural-network","nlp","python","pytorch","pytorch-nlp","sru","torchnlp","word-vectors"],"latest_commit_sha":null,"homepage":"https://pytorchnlp.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PetrochukM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-02-25T05:00:36.000Z","updated_at":"2024-10-09T02:12:00.000Z","dependencies_parsed_at":"2022-08-10T02:35:15.864Z","dependency_job_id":"424528be-8d76-4907-858e-47c8ef1352d0","html_url":"https://github.com/PetrochukM/PyTorch-NLP","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PetrochukM%2FPyTorch-NLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PetrochukM%2FPyTorch-NLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PetrochukM%2FPyTorch-NLP/releas
es","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PetrochukM%2FPyTorch-NLP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PetrochukM","download_url":"https://codeload.github.com/PetrochukM/PyTorch-NLP/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":234569773,"owners_count":18854133,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-loader","dataset","deep-learning","embeddings","machine-learning","metrics","natural-language-processing","neural-network","nlp","python","pytorch","pytorch-nlp","sru","torchnlp","word-vectors"],"created_at":"2024-09-26T03:41:40.904Z","updated_at":"2025-09-28T23:30:30.165Z","avatar_url":"https://github.com/PetrochukM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## :two_hearts: Now Archived :two_hearts:\n\nWith the PyTorch toolchain maturing, it's time to archive repos like this one. 
You'll be able to find more developed options for every part of this toolkit: \n\n- [Hugging Face Datasets (Datasets)](https://github.com/huggingface/datasets)\n- [Hugging Face Tokenizers (Encoders)](https://github.com/huggingface/tokenizers)\n- [Hugging Face Metrics (Metrics)](https://github.com/huggingface/evaluate)\n- [PyTorch Datapipes (Download \u0026 Samplers)](https://github.com/pytorch/data)\n- [Hugging Face Embeddings (Word Vectors)](https://huggingface.co/blog/getting-started-with-embeddings)\n- [PyTorch NN (NN)](https://pytorch.org/docs/stable/nn.html)\n- [PyTorch TorchText (All-In-One)](https://pytorch.org/text/stable/transforms.html)\n\nHappy developing! :sparkles:\n\nFeel free to contact me if anyone wants to unarchive this repo and continue developing it. You can reach me at \"petrochukm [at] gmail.com\".\n\n------\n\n\u003cp align=\"center\"\u003e\u003cimg width=\"55%\" src=\"docs/_static/img/logo.svg\" /\u003e\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003eBasic Utilities for PyTorch Natural Language Processing (NLP)\u003c/h3\u003e\n\nPyTorch-NLP, or `torchnlp` for short, is a library of basic utilities for PyTorch\nNLP. 
`torchnlp` extends PyTorch to provide you with\nbasic text data processing functions.\n\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pytorch-nlp.svg?style=flat-square)\n[![Codecov](https://img.shields.io/codecov/c/github/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://codecov.io/gh/PetrochukM/PyTorch-NLP)\n[![Downloads](http://pepy.tech/badge/pytorch-nlp)](http://pepy.tech/project/pytorch-nlp)\n[![Documentation Status](https://img.shields.io/readthedocs/pytorchnlp/latest.svg?style=flat-square)](http://pytorchnlp.readthedocs.io/en/latest/?badge=latest\u0026style=flat-square)\n[![Build Status](https://img.shields.io/travis/PetrochukM/PyTorch-NLP/master.svg?style=flat-square)](https://travis-ci.org/PetrochukM/PyTorch-NLP)\n[![Twitter: PetrochukM](https://img.shields.io/twitter/follow/MPetrochuk.svg?style=social)](https://twitter.com/MPetrochuk)\n\n_Logo by [Chloe Yeo](http://www.yeochloe.com/), Corporate Sponsorship by [WellSaid Labs](https://wellsaidlabs.com/)_\n\n## Installation 🐾\n\nMake sure you have Python 3.6+ and PyTorch 1.0+. You can then install `pytorch-nlp` using\npip:\n\n```bash\npip install pytorch-nlp\n```\n\nOr install the latest code directly from GitHub:\n\n```bash\npip install git+https://github.com/PetrochukM/PyTorch-NLP.git\n```\n\n## Docs\n\nThe complete documentation for PyTorch-NLP is available\nvia [our ReadTheDocs website](https://pytorchnlp.readthedocs.io).\n\n## Get Started\n\nWithin an NLP data pipeline, you'll want to implement these basic steps:\n\n### 1. 
Load your Data 🐿\n\nLoad the IMDB dataset, for example:\n\n```python\nfrom torchnlp.datasets import imdb_dataset\n\n# Load the imdb training dataset\ntrain = imdb_dataset(train=True)\ntrain[0]  # RETURNS: {'text': 'For a movie that gets..', 'sentiment': 'pos'}\n```\n\nLoad a custom dataset, for example:\n\n```python\nfrom pathlib import Path\n\nfrom torchnlp.download import download_file_maybe_extract\n\ndirectory_path = Path('data/')\ntrain_file_path = Path('trees/train.txt')\n\ndownload_file_maybe_extract(\n    url='http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip',\n    directory=directory_path,\n    check_files=[train_file_path])\n\nopen(directory_path / train_file_path)\n```\n\nDon't worry, we'll handle caching for you!\n\n### 2. Text to Tensor\n\nTokenize and encode your text as a tensor.\n\nFor example, a `WhitespaceEncoder` breaks\ntext into tokens whenever it encounters a whitespace character.\n\n```python\nfrom torchnlp.encoders.text import WhitespaceEncoder\n\nloaded_data = [\"now this ain't funny\", \"so don't you dare laugh\"]\nencoder = WhitespaceEncoder(loaded_data)\nencoded_data = [encoder.encode(example) for example in loaded_data]\n```\n\n### 3. 
Tensor to Batch\n\nWith your loaded and encoded data in hand, you'll want to batch your dataset.\n\n```python\nimport torch\nfrom torchnlp.samplers import BucketBatchSampler\nfrom torchnlp.utils import collate_tensors\nfrom torchnlp.encoders.text import stack_and_pad_tensors\n\nencoded_data = [torch.randn(2), torch.randn(3), torch.randn(4), torch.randn(5)]\n\ntrain_sampler = torch.utils.data.sampler.SequentialSampler(encoded_data)\ntrain_batch_sampler = BucketBatchSampler(\n    train_sampler, batch_size=2, drop_last=False, sort_key=lambda i: encoded_data[i].shape[0])\n\nbatches = [[encoded_data[i] for i in batch] for batch in train_batch_sampler]\nbatches = [collate_tensors(batch, stack_tensors=stack_and_pad_tensors) for batch in batches]\n```\n\nPyTorch-NLP builds on top of PyTorch's existing `torch.utils.data.sampler`, `torch.stack`\nand `default_collate` to support sequential inputs of varying lengths!\n\n### 4. Training and Inference\n\nWith your batch in hand, you can use PyTorch to develop and train your model using gradient descent.\nFor example, check out [this example code](examples/snli/train.py) for training on the Stanford\nNatural Language Inference (SNLI) Corpus.\n\n## Last But Not Least\n\nPyTorch-NLP has a couple more NLP-focused utility packages to support you! 
🤗\n\n### Deterministic Functions\n\nNow that you've set up your pipeline, you may want to ensure that some functions run deterministically.\nWrap any code that's random with `fork_rng`, and you'll be good to go, like so:\n\n```python\nimport random\nimport numpy\nimport torch\n\nfrom torchnlp.random import fork_rng\n\nwith fork_rng(seed=123):  # Ensure determinism\n    print('Random:', random.randint(1, 2**31))\n    print('Numpy:', numpy.random.randint(1, 2**31))\n    print('Torch:', int(torch.randint(1, 2**31, (1,))))\n```\n\nThis will always print:\n\n```text\nRandom: 224899943\nNumpy: 843828735\nTorch: 843828736\n```\n\n### Pre-Trained Word Vectors\n\nNow that you've computed your vocabulary, you may want to make use of\npre-trained word vectors to set your embeddings, like so:\n\n```python\nimport torch\nfrom torchnlp.encoders.text import WhitespaceEncoder\nfrom torchnlp.word_to_vector import GloVe\n\nencoder = WhitespaceEncoder([\"now this ain't funny\", \"so don't you dare laugh\"])\n\nvocab_set = set(encoder.vocab)\npretrained_embedding = GloVe(name='6B', dim=100, is_include=lambda w: w in vocab_set)\nembedding_weights = torch.Tensor(encoder.vocab_size, pretrained_embedding.dim)\nfor i, token in enumerate(encoder.vocab):\n    embedding_weights[i] = pretrained_embedding[token]\n```\n\n### Neural Network Layers\n\nFor example, from the neural network package, apply the state-of-the-art `LockedDropout`:\n\n```python\nimport torch\nfrom torchnlp.nn import LockedDropout\n\ninput_ = torch.randn(6, 3, 10)\ndropout = LockedDropout(0.5)\n\n# Apply a LockedDropout to `input_`\ndropout(input_) # RETURNS: torch.FloatTensor (6x3x10)\n```\n\n### Metrics\n\nCompute common NLP metrics such as the BLEU score.\n\n```python\nfrom torchnlp.metrics import get_moses_multi_bleu\n\nhypotheses = [\"The brown fox jumps over the dog 笑\"]\nreferences = [\"The quick brown fox jumps over the lazy dog 笑\"]\n\n# Compute BLEU score with the official BLEU perl 
script\nget_moses_multi_bleu(hypotheses, references, lowercase=True)  # RETURNS: 47.9\n```\n\n### Help :question:\n\nLooking at the longer examples in [`examples/`](examples/) may help.\n\nNeed more help? We are happy to answer your questions via [Gitter Chat](https://gitter.im/PyTorch-NLP).\n\n## Contributing\n\nWe've released PyTorch-NLP because we found a lack of basic toolkits for NLP in PyTorch. We hope\nthat other organizations can benefit from the project. We are thankful for any contributions from\nthe community.\n\n### Contributing Guide\n\nRead our [contributing guide](https://github.com/PetrochukM/PyTorch-NLP/blob/master/CONTRIBUTING.md)\nto learn about our development process, how to propose bugfixes and improvements, and how to build\nand test your changes to PyTorch-NLP.\n\n## Related Work\n\n### [torchtext](https://github.com/pytorch/text)\n\ntorchtext and PyTorch-NLP differ in architecture and feature set; otherwise, they are similar.\ntorchtext and PyTorch-NLP provide pre-trained word vectors, datasets, iterators and text encoders.\nPyTorch-NLP also provides neural network modules and metrics. From an architecture standpoint,\ntorchtext is object-oriented with external coupling while PyTorch-NLP is object-oriented with\nlow coupling.\n\n### [AllenNLP](https://github.com/allenai/allennlp)\n\nAllenNLP is designed to be a platform for research. 
PyTorch-NLP is designed to be a lightweight toolkit.\n\n## Authors\n\n- [Michael Petrochuk](https://github.com/PetrochukM/) — Developer\n- [Chloe Yeo](http://www.yeochloe.com/) — Logo Design\n\n## Citing\n\nIf you find PyTorch-NLP useful for an academic publication, then please use the following BibTeX to\ncite it:\n\n```\n@misc{pytorch-nlp,\n  author = {Petrochuk, Michael},\n  title = {PyTorch-NLP: Rapid Prototyping with PyTorch Natural Language Processing (NLP) Tools},\n  year = {2018},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/PetrochukM/PyTorch-NLP}},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetrochukm%2Fpytorch-nlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpetrochukm%2Fpytorch-nlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetrochukm%2Fpytorch-nlp/lists"}